MemTable Design
Overview
MemTable is an in-memory write buffer that:
- Stores recent writes in sorted order
- Uses SkipList for efficient lookups and iteration
- Manages memory with Arena allocator
- Supports MVCC through InternalKey encoding
graph TB
subgraph MemTable["MemTable"]
A["Arena (memory pool)"]
B["SkipList<const char*, KeyComparator>"]
C["Reference counting"]
D["InternalKeyComparator"]
end
E1["Entry 1: const char*"]
E2["Entry 2: const char*"]
E3["Entry N: const char*"]
D1["[encoded data in Arena]"]
D2["[encoded data in Arena]"]
D3["[encoded data in Arena]"]
MemTable --> E1
MemTable --> E2
MemTable --> E3
E1 --> D1
E2 --> D2
E3 --> D3
style MemTable fill:#4ECDC4,stroke:#333,stroke-width:2px
style E1 fill:#95E1D3,stroke:#333
style E2 fill:#95E1D3,stroke:#333
style E3 fill:#95E1D3,stroke:#333
Entry Encoding Format
Each entry in MemTable is encoded as a single byte array allocated from Arena.
---
config:
packet:
bitsPerRow: 32
---
packet-beta
0-31: "internal_key_len: varint32"
32-255: "user_key: bytes[user_key_len]"
256-319: "tag: uint64 (seq << 8 | type)"
320-351: "value_len: varint32"
352-511: "value: bytes[value_len]"
Entry Layout (stored in Arena):┌─────────────────┬──────────────┬─────┬─────────────┬──────────────┐│ internal_key_len│ user_key │ tag │ value_len │ value ││ (varint32) │(user_key_len)│(u64)│ (varint32) │ (value_len) │└─────────────────┴──────────────┴─────┴─────────────┴──────────────┘ ↑ ↑ buf (inserted into SkipList) end
where: internal_key_len = user_key_len + 8 tag = (sequence << 8) | typeSize calculation:
size_t encoded_len = VarintLength(internal_key_size) + // 1-5 bytes internal_key_size + // user_key_len + 8 VarintLength(value_size) + // 1-5 bytes value_size; // value_lenMemTable::Add() - Insert Operation
void MemTable::Add(SequenceNumber seq, ValueType type, const Slice& key, const Slice& value){ // Step 1: Calculate sizes size_t key_size = key.size(); size_t val_size = value.size(); size_t internal_key_size = key_size + 8; size_t encoded_len = VarintLength(internal_key_size) + internal_key_size + VarintLength(val_size) + val_size;
// Step 2: Allocate from Arena (single allocation) char* buf = arena_.Allocate(encoded_len);
// Step 3: Encode entry char* p = EncodeVarint32(buf, internal_key_size); std::memcpy(p, key.data(), key_size); p += key_size; EncodeFixed64(p, (seq << 8) | type); p += 8; p = EncodeVarint32(p, val_size); std::memcpy(p, value.data(), val_size);
// Step 4: Insert pointer into SkipList table_.Insert(buf);}Example:
Input: sequence = 100 type = kTypeValue (0x1) key = "foo" (3 bytes) value = "bar" (3 bytes)
Encoded entry (hex): 0B // internal_key_len = 11 (varint32) 66 6F 6F // "foo" 01 64 00 00 00 00 00 00 // tag = (100 << 8) | 1 03 // value_len = 3 (varint32) 62 61 72 // "bar"
Total: 1 + 3 + 8 + 1 + 3 = 16 bytesMemTable::Get() - Lookup Operation
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s){ // Step 1: Get memtable_key from LookupKey Slice memkey = key.memtable_key();
// Step 2: Seek to first entry >= memkey Table::Iterator iter(&table_); iter.Seek(memkey.data());
if (iter.Valid()) { const char* entry = iter.key();
// Step 3: Parse entry and extract internal_key uint32_t key_length; const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
// Step 4: Compare user_key if (comparator_.user_comparator()->Compare( Slice(key_ptr, key_length - 8), // Extract user_key key.user_key()) == 0) {
// Step 5: Check type in tag uint64_t tag = DecodeFixed64(key_ptr + key_length - 8); ValueType type = static_cast<ValueType>(tag & 0xff);
switch (type) { case kTypeValue: { // Step 6: Extract value Slice v = GetLengthPrefixedSlice(key_ptr + key_length); value->assign(v.data(), v.size()); return true; } case kTypeDeletion: *s = Status::NotFound(); return true; } } } return false; // Key not found}Search example:
Query: user_key = "foo", sequence = 200
LookupKey encoding: 0B 66 6F 6F [C8 00 00 00 00 00 00 01] │ └─ "foo" └─ tag = (200 << 8) | kTypeValue └─ internal_key_len = 11
SkipList has: Entry1: 0B 66 6F 6F [64 00 00 00 00 00 00 01] 03 62 61 72 (seq=100) Entry2: 0B 66 6F 6F [C8 00 00 00 00 00 00 01] 03 62 61 7A (seq=200) Entry3: 0B 66 6F 6F [2C 01 00 00 00 00 00 01] 03 62 61 78 (seq=300)
Seek(LookupKey) → finds Entry2 (first entry >= search key)Since InternalKeyComparator sorts by (user_key ASC, seq DESC),Entry3 (seq=300) comes before Entry2 (seq=200)
Corrected SkipList order: Entry3: seq=300 (newest) Entry2: seq=200 Entry1: seq=100 (oldest)
Seek with seq=200 → should find Entry2 or Entry3 depending on exact comparisonKeyComparator Design
MemTable’s SkipList compares const char* pointers, not InternalKey objects.
struct KeyComparator { const InternalKeyComparator comparator;
explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) {}
// Compare two entry pointers int operator()(const char* aptr, const char* bptr) const { // Extract InternalKey from length-prefixed encoding Slice a = GetLengthPrefixedSlice(aptr); Slice b = GetLengthPrefixedSlice(bptr);
// Use InternalKeyComparator return comparator.Compare(a, b); }};How it works:
aptr → [0B 66 6F 6F 64 00 00 00 00 00 00 01 03 62 61 72] └─ internal_key_len=11 │ v GetLengthPrefixedSlice │ v Slice("foo\x64\x00\x00\x00\x00\x00\x00\x01", 11) │ v InternalKeyComparator::Compare()Reference Counting
MemTable uses manual reference counting to manage lifetime.
class MemTable {public: explicit MemTable(const InternalKeyComparator& cmp) : comparator_(cmp), refs_(0), table_(comparator_, &arena_) {}
void Ref() { ++refs_; }
void Unref() { --refs_; assert(refs_ >= 0); if (refs_ <= 0) { delete this; // Self-destruct } }
private: ~MemTable() { assert(refs_ == 0); } // Private destructor
int refs_; // ...};Usage pattern:
// DBImpl holds MemTableMemTable* mem_ = new MemTable(comparator);mem_->Ref(); // refs_ = 1
// When switching to immutable MemTableMemTable* imm_ = mem_;imm_->Ref(); // refs_ = 2
mem_ = new MemTable(comparator);mem_->Ref(); // refs_ = 1
// After compaction completesimm_->Unref(); // refs_ = 1imm_->Unref(); // refs_ = 0 → delete thisWhy reference counting?
- Multiple components may hold MemTable (DBImpl, iterators, compaction)
- Allows safe concurrent reads during compaction
- Prevents premature deletion
Memory Management
Arena Integration
class MemTable {private: Arena arena_; Table table_; // SkipList uses &arena_};
// SkipList constructorSkipList(Comparator cmp, Arena* arena) : arena_(arena), ... {}
// Node allocationNode* SkipList::NewNode(const Key& key, int height) { char* mem = arena_->AllocateAligned( sizeof(Node) + sizeof(atomic<Node*>) * (height - 1)); return new (mem) Node(key);}Memory layout:
Arena blocks:┌──────────────────────────────────────────────┐│ Block 1 (4KB) │├──────────────────────────────────────────────┤│ Entry1 [key+value data] ││ SkipList Node1 [pointers] ││ Entry2 [key+value data] ││ SkipList Node2 [pointers] ││ ... │└──────────────────────────────────────────────┘┌──────────────────────────────────────────────┐│ Block 2 (4KB) │├──────────────────────────────────────────────┤│ Entry N [key+value data] ││ ... │└──────────────────────────────────────────────┘Benefits:
- No per-object allocation overhead
- Better cache locality
- Fast allocation (bump-pointer in most cases)
- Bulk deallocation (free entire MemTable at once)
Iterator Design
class MemTableIterator : public Iterator {public: explicit MemTableIterator(MemTable::Table* table) : iter_(table) {}
bool Valid() const override { return iter_.Valid(); } void Next() override { iter_.Next(); } void Prev() override { iter_.Prev(); } void SeekToFirst() override { iter_.SeekToFirst(); } void SeekToLast() override { iter_.SeekToLast(); }
void Seek(const Slice& target) override { iter_.Seek(EncodeKey(&tmp_, target)); }
Slice key() const override { return GetLengthPrefixedSlice(iter_.key()); }
Slice value() const override { Slice key_slice = GetLengthPrefixedSlice(iter_.key()); return GetLengthPrefixedSlice(key_slice.data() + key_slice.size()); }
private: MemTable::Table::Iterator iter_; // SkipList::Iterator std::string tmp_; // Scratch space for encoding};Approximate Memory Usage
size_t MemTable::ApproximateMemoryUsage() { return arena_.MemoryUsage();}What’s counted:
- All Arena blocks (entry data + SkipList nodes)
- Block overhead (
sizeof(char*)per block)
What’s NOT counted:
- MemTable object itself (
sizeof(MemTable)) - KeyComparator
- SkipList metadata (head node is in Arena)
Usage:
// DBImpl decides when to create immutable MemTableif (mem_->ApproximateMemoryUsage() > options_.write_buffer_size) { // Switch to new MemTable imm_ = mem_; mem_ = new MemTable(comparator_); // Trigger compaction...}Design Decisions
Why const char* instead of Key objects?
Rejected design:
using Table = SkipList<InternalKey, Comparator>;Problems:
- Each node stores a copy of InternalKey
- Expensive key copies during insertion
- Separate allocation for value
Chosen design:
using Table = SkipList<const char*, KeyComparator>;Benefits:
- Single allocation for key + value + metadata
- Zero-copy insertion (just pointer)
- Arena-friendly (variable-size data)
Why length-prefixed encoding?
Alternative: Fixed struct
struct Entry { uint32_t key_len; uint32_t value_len; char data[];};Problems:
- Wastes space for small keys/values
- Alignment padding
- Harder to extend format
Length-prefixed (varint) benefits:
- Compact (1 byte for lengths < 128)
- Self-describing
- Easy to parse forward
Why Arena for MemTable?
Benefits:
- Fast allocation (no malloc overhead)
- No fragmentation
- Bulk deallocation
- Cache-friendly (sequential allocations)
Trade-offs:
- Cannot delete individual entries
- Memory held until entire MemTable is freed
- OK for MemTable because it’s immutable after becoming
imm_
Performance Characteristics
| Operation | Time Complexity | Notes |
|---|---|---|
| Add | O(log N) | SkipList insertion |
| Get | O(log N) | SkipList search |
| Iterator creation | O(1) | Lightweight wrapper |
| Iterator advance | O(1) amortized | SkipList traversal |
| Memory usage | O(N) | Linear in data size |
| Destruction | O(B) | B = number of Arena blocks |
Space overhead:
Per entry: - Varint lengths: 2-10 bytes (typically 2) - Tag: 8 bytes - SkipList node: 16-32 bytes (depends on height) Total overhead: ~26-50 bytes per entry
For 1M entries of 100 bytes each: Data: 100 MB Overhead: ~26-50 MB Total: ~126-150 MB