Database Format Design

Internal Key Format

LevelDB uses InternalKey instead of raw user keys to support MVCC (Multi-Version Concurrency Control) and snapshots.

--- config: packet: bitsPerRow: 32 --- packet-beta 0-255: "user_key: bytes[user_key_len]" 256-319: "sequence: uint56 (bits 8-63)" 320-327: "type: uint8 (bits 0-7)"

InternalKey :=
   user_key: bytes[user_key_len]
   tag: uint64
   
tag := (sequence << 8) | type

where:
   sequence: uint56 (0 to 2^56 - 1)
   type: uint8 (kTypeDeletion = 0x0, kTypeValue = 0x1)

InternalKey Comparison Rules

InternalKey comparison follows these rules (in order):

Compare user_key (ascending order)
- If user_key_a < user_key_b, then InternalKey_a < InternalKey_b
If user_key equal, compare sequence (descending order)
- Newer entries (larger sequence) come first
- If seq_a > seq_b, then InternalKey_a < InternalKey_b
If sequence equal, compare type (descending order)
- kTypeValue (0x1) < kTypeDeletion (0x0)
- Values come before deletions

Why descending sequence?

When searching for a key, we want to find the newest version first
SkipList returns the first match, so newer versions must sort earlier

LookupKey Format

LookupKey is used for querying in MemTable. It encodes the search target.

--- config: packet: bitsPerRow: 32 --- packet-beta 0-31: "internal_key_len: varint32" 32-255: "user_key: bytes[user_key_len]" 256-319: "tag: uint64 (seq << 8 | type)"

LookupKey Layout:
┌─────────────────┬──────────────────────┬─────────────────┐
│ internal_key_len│      user_key        │      tag        │
│   (varint32)    │   (user_key_len)     │    (uint64)     │
└─────────────────┴──────────────────────┴─────────────────┘
     ↑                 ↑                                    ↑
   start_           kstart_                               end_

Three key pointers:

start_: Points to internal_key_len (for MemTable queries)
kstart_: Points to user_key (for InternalKey extraction)
end_: Points to end of tag

Usage:

LookupKey lkey(user_key, sequence);
Slice memtable_key = lkey.memtable_key();  // [start_, end_)
Slice internal_key = lkey.internal_key();   // [kstart_, end_)
Slice user_key = lkey.user_key();           // [kstart_, end_ - 8)

Value Types

enum ValueType { 
    kTypeDeletion = 0x0,  // Tombstone marker
    kTypeValue = 0x1      // Actual value
};

Why store deletions?

Deletions are tombstone markers in LSM-tree
They mark keys as deleted without physically removing old versions
Compaction will eventually remove tombstones and their older versions

Sequence Numbers

using SequenceNumber = uint64_t;
static const SequenceNumber kMaxSequenceNumber = ((0x1ull << 56) - 1);

Range: 0 to 2^56 - 1 (leaves 8 bits for ValueType)
Purpose: MVCC version control
Monotonically increasing: Each write operation increments the sequence
Snapshot isolation: Readers can specify a sequence to read consistent data

Comparator Design

InternalKeyComparator

class InternalKeyComparator {
public:
    explicit InternalKeyComparator(const Comparator* user_cmp);
    
    // Compare two InternalKeys
    int Compare(const Slice& a, const Slice& b) const;
    
private:
    const Comparator* user_comparator_;
};

Comparison logic:

int InternalKeyComparator::Compare(const Slice& a, const Slice& b) const {
    // 1. Extract user keys (strip last 8 bytes)
    Slice user_a = ExtractUserKey(a);
    Slice user_b = ExtractUserKey(b);
    
    // 2. Compare user keys (ascending)
    int r = user_comparator_->Compare(user_a, user_b);
    if (r != 0) return r;
    
    // 3. Extract tags (last 8 bytes)
    uint64_t tag_a = DecodeFixed64(a.data() + a.size() - 8);
    uint64_t tag_b = DecodeFixed64(b.data() + b.size() - 8);
    
    // 4. Compare tags (descending - larger tag comes first)
    if (tag_a > tag_b) return -1;
    if (tag_a < tag_b) return +1;
    return 0;
}

Design Rationale

Why InternalKey?

MVCC Support: Multiple versions of the same key can coexist
Snapshot Reads: Readers can see a consistent view at a specific sequence
Efficient Updates: No need to physically delete old versions immediately
Conflict Resolution: Newer writes naturally override older ones

Why Length-Prefixed Encoding?

MemTable stores const char* pointers, not Key objects
Length-prefixed format allows decoding without knowing the key size beforehand
Efficient for variable-length keys

Memory Layout Considerations

Traditional approach (not used):
  Node { Key key; Value value; Node* next[]; }
  
LevelDB approach:
  Node { const char* entry; Node* next[]; }
  
  entry points to:
  ┌──────────────┬─────────┬─────┬───────────┬───────┐
  │ internal_key │ user_key│ tag │ value_len │ value │
  │     _len     │         │     │           │       │
  └──────────────┴─────────┴─────┴───────────┴───────┘

Benefits:

Single memory allocation for key+value+metadata
Better cache locality
Arena-friendly (no small object allocations)
Pointer-sized node overhead

LevelDB DBformat

The Design of LevelDB's Database Format