# MemTable Design

## Overview
MemTable is an in-memory write buffer that:
- Stores recent writes in sorted order
- Uses SkipList for efficient lookups and iteration
- Manages memory with Arena allocator
- Supports MVCC through InternalKey encoding
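Before the internals, a minimal usage sketch of the interface this page covers. The MemTable, LookupKey, and ValueType calls below are LevelDB's real API; the `Demo` wrapper itself is illustrative:

```cpp
#include "db/dbformat.h"
#include "db/memtable.h"

// Illustrative walkthrough of the MemTable lifecycle (Demo is not LevelDB code).
void Demo(const leveldb::InternalKeyComparator& icmp) {
  using namespace leveldb;

  MemTable* mem = new MemTable(icmp);
  mem->Ref();                                      // manual reference counting

  mem->Add(/*seq=*/100, kTypeValue, Slice("foo"), Slice("bar"));

  std::string value;
  Status s;
  LookupKey lkey(Slice("foo"), /*sequence=*/200);  // read at snapshot seq 200
  if (mem->Get(lkey, &value, &s)) {
    // value == "bar": the newest version with seq <= 200
  }

  mem->Unref();                                    // refs_ drops to 0 → deleted
}
```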
```mermaid
graph TB
    subgraph MemTable["MemTable"]
        A["Arena (memory pool)"]
        B["SkipList<const char*, KeyComparator>"]
        C["Reference counting"]
        D["InternalKeyComparator"]
    end
    E1["Entry 1: const char*"]
    E2["Entry 2: const char*"]
    E3["Entry N: const char*"]
    D1["[encoded data in Arena]"]
    D2["[encoded data in Arena]"]
    D3["[encoded data in Arena]"]
    MemTable --> E1
    MemTable --> E2
    MemTable --> E3
    E1 --> D1
    E2 --> D2
    E3 --> D3
    style MemTable fill:#4ECDC4,stroke:#333,stroke-width:2px
    style E1 fill:#95E1D3,stroke:#333
    style E2 fill:#95E1D3,stroke:#333
    style E3 fill:#95E1D3,stroke:#333
```
## Entry Encoding Format
Each entry in MemTable is encoded as a single byte array allocated from Arena.
```mermaid
---
config:
  packet:
    bitsPerRow: 32
---
packet-beta
0-31: "internal_key_len: varint32"
32-255: "user_key: bytes[user_key_len]"
256-319: "tag: uint64 (seq << 8 | type)"
320-351: "value_len: varint32"
352-511: "value: bytes[value_len]"
```

(The bit widths in the diagram are illustrative; the varint lengths and the key/value fields are variable-length in practice.)
Entry Layout (stored in Arena):

```
┌─────────────────┬──────────────┬─────┬─────────────┬──────────────┐
│ internal_key_len│   user_key   │ tag │  value_len  │    value     │
│   (varint32)    │(user_key_len)│(u64)│  (varint32) │ (value_len)  │
└─────────────────┴──────────────┴─────┴─────────────┴──────────────┘
↑                                                                   ↑
buf (inserted into SkipList)                                      end
```

where:

- `internal_key_len = user_key_len + 8`
- `tag = (sequence << 8) | type`
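The tag arithmetic is the same packing that `PackSequenceAndType()` performs in dbformat.h. A small self-contained sketch (`UnpackTag` is an illustrative helper, not a LevelDB function):

```cpp
#include <cassert>
#include <cstdint>

enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// Pack a 56-bit sequence number and a 1-byte type into one uint64
// (mirrors PackSequenceAndType in dbformat.h).
uint64_t PackSequenceAndType(uint64_t seq, ValueType t) {
  assert(seq <= ((1ull << 56) - 1));  // sequence numbers fit in 56 bits
  return (seq << 8) | t;
}

// Illustrative inverse: the low byte is the type, the high 56 bits the sequence.
void UnpackTag(uint64_t tag, uint64_t* seq, ValueType* t) {
  *seq = tag >> 8;
  *t = static_cast<ValueType>(tag & 0xff);
}
```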
Size calculation:
```cpp
size_t encoded_len =
    VarintLength(internal_key_size) +  // 1-5 bytes
    internal_key_size +                // user_key_len + 8
    VarintLength(value_size) +         // 1-5 bytes
    value_size;                        // value_len
```
## MemTable::Add() - Insert Operation
```cpp
void MemTable::Add(SequenceNumber seq, ValueType type,
                   const Slice& key, const Slice& value) {
  // Step 1: Calculate sizes
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  size_t encoded_len = VarintLength(internal_key_size) +
                       internal_key_size +
                       VarintLength(val_size) + val_size;

  // Step 2: Allocate from Arena (single allocation)
  char* buf = arena_.Allocate(encoded_len);

  // Step 3: Encode entry
  char* p = EncodeVarint32(buf, internal_key_size);
  std::memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (seq << 8) | type);
  p += 8;
  p = EncodeVarint32(p, val_size);
  std::memcpy(p, value.data(), val_size);

  // Step 4: Insert pointer into SkipList
  table_.Insert(buf);
}
```
Example:

```
Input:
  sequence = 100
  type     = kTypeValue (0x1)
  key      = "foo" (3 bytes)
  value    = "bar" (3 bytes)

Encoded entry (hex, tag stored little-endian):
  0B                         // internal_key_len = 11 (varint32)
  66 6F 6F                   // "foo"
  01 64 00 00 00 00 00 00    // tag = (100 << 8) | 1 = 0x6401
  03                         // value_len = 3 (varint32)
  62 61 72                   // "bar"

Total: 1 + 3 + 8 + 1 + 3 = 16 bytes
```
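Decoding runs the same layout in reverse. A hedged sketch, assuming LevelDB's `GetVarint32Ptr` and `DecodeFixed64` helpers from util/coding.h (`ParsedEntry` and `ParseEntry` are illustrative names):

```cpp
// Illustrative decoder for one Arena-resident entry (inverse of Add()).
struct ParsedEntry {
  Slice user_key;
  uint64_t sequence;
  ValueType type;
  Slice value;
};

ParsedEntry ParseEntry(const char* entry) {
  ParsedEntry e;
  uint32_t internal_key_len;
  const char* p = GetVarint32Ptr(entry, entry + 5, &internal_key_len);

  e.user_key = Slice(p, internal_key_len - 8);             // strip 8-byte tag
  uint64_t tag = DecodeFixed64(p + internal_key_len - 8);  // little-endian
  e.sequence = tag >> 8;
  e.type = static_cast<ValueType>(tag & 0xff);

  uint32_t value_len;
  const char* v = GetVarint32Ptr(p + internal_key_len,
                                 p + internal_key_len + 5, &value_len);
  e.value = Slice(v, value_len);
  return e;
}
```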
## MemTable::Get() - Lookup Operation
```cpp
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  // Step 1: Get memtable_key from LookupKey
  Slice memkey = key.memtable_key();

  // Step 2: Seek to first entry >= memkey
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());

  if (iter.Valid()) {
    const char* entry = iter.key();

    // Step 3: Parse entry and extract internal_key
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);

    // Step 4: Compare user_key
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8),  // extract user_key
            key.user_key()) == 0) {
      // Step 5: Check type in tag
      uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      ValueType type = static_cast<ValueType>(tag & 0xff);
      switch (type) {
        case kTypeValue: {
          // Step 6: Extract value
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;  // Key not found in this MemTable
}
```
Search example:

```
Query: user_key = "foo", snapshot sequence = 200

LookupKey encoding (tag stored little-endian, low byte first):
  0B 66 6F 6F [01 C8 00 00 00 00 00 00]
  │  └─ "foo"  └─ tag = (200 << 8) | kValueTypeForSeek = 0xC801
  └─ internal_key_len = 11

SkipList contents, in sorted order (InternalKeyComparator orders by
user_key ascending, then sequence descending, so newest comes first):
  Entry3: 0B 66 6F 6F [01 2C 01 00 00 00 00 00] 03 62 61 78   (seq=300, "bax")
  Entry2: 0B 66 6F 6F [01 C8 00 00 00 00 00 00] 03 62 61 7A   (seq=200, "baz")
  Entry1: 0B 66 6F 6F [01 64 00 00 00 00 00 00] 03 62 61 72   (seq=100, "bar")

Seek(LookupKey) skips Entry3, whose internal key (seq=300) sorts before
the search key, and stops at Entry2: the first entry >= the search key,
and therefore the newest version with seq <= 200. Get() returns "baz".
```
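The LookupKey used above builds exactly this prefix. A simplified sketch of the class from dbformat.h (the real one falls back to heap allocation when the key outgrows the inline buffer; that path is omitted here):

```cpp
// Simplified LookupKey: one buffer holding
//   varint32(user_key_len + 8) | user_key | tag
class LookupKey {
 public:
  LookupKey(const Slice& user_key, SequenceNumber sequence) {
    char* dst = space_;  // simplified: assume the key fits the inline buffer
    start_ = dst;
    dst = EncodeVarint32(dst, user_key.size() + 8);
    kstart_ = dst;
    std::memcpy(dst, user_key.data(), user_key.size());
    dst += user_key.size();
    // kValueTypeForSeek == kTypeValue, the highest-numbered type
    EncodeFixed64(dst, (sequence << 8) | kValueTypeForSeek);
    end_ = dst + 8;
  }

  // Full length-prefixed form, comparable against SkipList entries.
  Slice memtable_key() const { return Slice(start_, end_ - start_); }
  // user_key + tag, without the varint prefix.
  Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }
  // Just the user key (drops the 8-byte tag).
  Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }

 private:
  const char* start_;
  const char* kstart_;
  const char* end_;
  char space_[200];  // avoids heap allocation for short keys
};
```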
## KeyComparator Design

MemTable’s SkipList compares `const char*` pointers, not InternalKey objects.
```cpp
struct KeyComparator {
  const InternalKeyComparator comparator;
  explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) {}

  // Compare two entry pointers
  int operator()(const char* aptr, const char* bptr) const {
    // Extract InternalKey from length-prefixed encoding
    Slice a = GetLengthPrefixedSlice(aptr);
    Slice b = GetLengthPrefixedSlice(bptr);
    // Use InternalKeyComparator
    return comparator.Compare(a, b);
  }
};
```
How it works:

```
aptr → [0B 66 6F 6F 01 64 00 00 00 00 00 00 03 62 61 72]
        └─ internal_key_len = 11
                  │
                  v
         GetLengthPrefixedSlice
                  │
                  v
         Slice("foo\x01\x64\x00\x00\x00\x00\x00\x00", 11)
                  │
                  v
         InternalKeyComparator::Compare()
```
## Reference Counting

MemTable uses manual reference counting to manage its lifetime.
```cpp
class MemTable {
 public:
  explicit MemTable(const InternalKeyComparator& cmp)
      : comparator_(cmp), refs_(0), table_(comparator_, &arena_) {}

  void Ref() { ++refs_; }

  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;  // Self-destruct
    }
  }

 private:
  ~MemTable() { assert(refs_ == 0); }  // Private destructor

  int refs_;
  // ...
};
```
Usage pattern:
```cpp
// DBImpl holds MemTable
MemTable* mem_ = new MemTable(comparator);
mem_->Ref();     // refs_ = 1

// When switching to immutable MemTable
MemTable* imm_ = mem_;
imm_->Ref();     // refs_ = 2
mem_ = new MemTable(comparator);
mem_->Ref();     // refs_ = 1

// After compaction completes
imm_->Unref();   // refs_ = 1
imm_->Unref();   // refs_ = 0 → delete this
```
Why reference counting?
- Multiple components may hold MemTable (DBImpl, iterators, compaction)
- Allows safe concurrent reads during compaction
- Prevents premature deletion
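Iterators are the canonical example: each one pins the MemTable it reads. A sketch of the pattern DBImpl::NewInternalIterator uses (`UnrefMem` and `NewPinnedMemIterator` are illustrative names; `RegisterCleanup` is the real Iterator hook):

```cpp
// Cleanup callback that runs when the iterator is destroyed.
static void UnrefMem(void* arg1, void* /*arg2*/) {
  reinterpret_cast<MemTable*>(arg1)->Unref();
}

Iterator* NewPinnedMemIterator(MemTable* mem) {
  mem->Ref();  // keep the MemTable alive for the iterator's lifetime
  Iterator* iter = mem->NewIterator();
  iter->RegisterCleanup(UnrefMem, mem, nullptr);  // Unref on destruction
  return iter;
}
```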
## Memory Management

### Arena Integration
```cpp
class MemTable {
 private:
  Arena arena_;
  Table table_;  // SkipList uses &arena_
};

// SkipList constructor
SkipList(Comparator cmp, Arena* arena)
    : arena_(arena), ... {}

// Node allocation
Node* SkipList::NewNode(const Key& key, int height) {
  char* mem = arena_->AllocateAligned(
      sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  return new (mem) Node(key);
}
```
Memory layout:
```
Arena blocks:
┌──────────────────────────────────────────────┐
│ Block 1 (4KB)                                │
├──────────────────────────────────────────────┤
│ Entry1 [key+value data]                      │
│ SkipList Node1 [pointers]                    │
│ Entry2 [key+value data]                      │
│ SkipList Node2 [pointers]                    │
│ ...                                          │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│ Block 2 (4KB)                                │
├──────────────────────────────────────────────┤
│ Entry N [key+value data]                     │
│ ...                                          │
└──────────────────────────────────────────────┘
```
Benefits:
- No per-object allocation overhead
- Better cache locality
- Fast allocation (bump-pointer in most cases)
- Bulk deallocation (free entire MemTable at once)
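The bump-pointer fast path from the list above, sketched after util/arena.h (the body of `AllocateFallback`, which grabs a fresh block, is omitted):

```cpp
class Arena {
 public:
  char* Allocate(size_t bytes) {
    if (bytes <= alloc_bytes_remaining_) {  // fast path: bump the pointer
      char* result = alloc_ptr_;
      alloc_ptr_ += bytes;
      alloc_bytes_remaining_ -= bytes;
      return result;
    }
    return AllocateFallback(bytes);  // slow path: start a new block
  }

 private:
  char* AllocateFallback(size_t bytes);  // allocates a new 4KB (or dedicated) block

  char* alloc_ptr_ = nullptr;
  size_t alloc_bytes_remaining_ = 0;
};
```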
## Iterator Design
```cpp
class MemTableIterator : public Iterator {
 public:
  explicit MemTableIterator(MemTable::Table* table) : iter_(table) {}

  bool Valid() const override { return iter_.Valid(); }
  void Next() override { iter_.Next(); }
  void Prev() override { iter_.Prev(); }
  void SeekToFirst() override { iter_.SeekToFirst(); }
  void SeekToLast() override { iter_.SeekToLast(); }

  void Seek(const Slice& target) override {
    iter_.Seek(EncodeKey(&tmp_, target));
  }

  Slice key() const override {
    return GetLengthPrefixedSlice(iter_.key());
  }

  Slice value() const override {
    Slice key_slice = GetLengthPrefixedSlice(iter_.key());
    return GetLengthPrefixedSlice(key_slice.data() + key_slice.size());
  }

 private:
  MemTable::Table::Iterator iter_;  // SkipList::Iterator
  std::string tmp_;                 // Scratch space for encoding
};
```
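Seek() depends on EncodeKey(), which wraps the bare internal key in the same length-prefixed form the stored entries use, so KeyComparator can decode both sides identically. It is a small static helper in db/memtable.cc, roughly:

```cpp
// Wrap target as [varint32 len | bytes] so it compares like a stored entry.
static const char* EncodeKey(std::string* scratch, const Slice& target) {
  scratch->clear();
  PutVarint32(scratch, target.size());
  scratch->append(target.data(), target.size());
  return scratch->data();
}
```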
## Approximate Memory Usage
```cpp
size_t MemTable::ApproximateMemoryUsage() {
  return arena_.MemoryUsage();
}
```
What’s counted:

- All Arena blocks (entry data + SkipList nodes)
- Block overhead (`sizeof(char*)` per block)

What’s NOT counted:

- MemTable object itself (`sizeof(MemTable)`)
- KeyComparator
- SkipList metadata (head node is in Arena)
Usage:
```cpp
// DBImpl decides when to create immutable MemTable
if (mem_->ApproximateMemoryUsage() > options_.write_buffer_size) {
  // Switch to new MemTable
  imm_ = mem_;
  mem_ = new MemTable(comparator_);
  // Trigger compaction...
}
```
## Design Decisions

### Why const char* instead of Key objects?
Rejected design:
```cpp
using Table = SkipList<InternalKey, Comparator>;
```
Problems:
- Each node stores a copy of InternalKey
- Expensive key copies during insertion
- Separate allocation for value
Chosen design:
```cpp
using Table = SkipList<const char*, KeyComparator>;
```
Benefits:
- Single allocation for key + value + metadata
- Zero-copy insertion (just pointer)
- Arena-friendly (variable-size data)
### Why length-prefixed encoding?
Alternative: Fixed struct
```cpp
struct Entry {
  uint32_t key_len;
  uint32_t value_len;
  char data[];
};
```
Problems:
- Wastes space for small keys/values
- Alignment padding
- Harder to extend format
Length-prefixed (varint) benefits:
- Compact (1 byte for lengths < 128)
- Self-describing
- Easy to parse forward
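The scheme behind those benefits, as a compact sketch (LevelDB's EncodeVarint32 in util/coding.cc unrolls this loop but is byte-for-byte equivalent):

```cpp
#include <cstdint>

// 7 data bits per byte; the high bit means "more bytes follow".
// Values < 128 therefore encode in a single byte.
char* EncodeVarint32(char* dst, uint32_t v) {
  uint8_t* ptr = reinterpret_cast<uint8_t*>(dst);
  while (v >= 128) {
    *(ptr++) = static_cast<uint8_t>(v) | 128;  // low 7 bits + continuation bit
    v >>= 7;
  }
  *(ptr++) = static_cast<uint8_t>(v);          // final byte, high bit clear
  return reinterpret_cast<char*>(ptr);
}
```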
### Why Arena for MemTable?
Benefits:
- Fast allocation (no malloc overhead)
- No fragmentation
- Bulk deallocation
- Cache-friendly (sequential allocations)
Trade-offs:
- Cannot delete individual entries
- Memory held until entire MemTable is freed
- OK for MemTable because it’s immutable after becoming `imm_`
## Performance Characteristics
| Operation | Time Complexity | Notes |
|---|---|---|
| Add | O(log N) | SkipList insertion |
| Get | O(log N) | SkipList search |
| Iterator creation | O(1) | Lightweight wrapper |
| Iterator advance | O(1) amortized | SkipList traversal |
| Memory usage | O(N) | Linear in data size |
| Destruction | O(B) | B = number of Arena blocks |
Space overhead per entry:

- Varint lengths: 2-10 bytes (typically 2)
- Tag: 8 bytes
- SkipList node: 16-32 bytes (depends on height)

Total overhead: ~26-50 bytes per entry.

For 1M entries of 100 bytes each:

```
Data:     100 MB
Overhead: ~26-50 MB
Total:    ~126-150 MB
```