Database Format Design
Internal Key Format
LevelDB uses InternalKey instead of raw user keys to support MVCC (Multi-Version Concurrency Control) and snapshots.
---
config:
packet:
bitsPerRow: 32
---
packet-beta
0-255: "user_key: bytes[user_key_len]"
256-319: "sequence: uint56 (bits 8-63)"
320-327: "type: uint8 (bits 0-7)"
InternalKey := user_key: bytes[user_key_len] tag: uint64
tag := (sequence << 8) | type
where: sequence: uint56 (0 to 2^56 - 1) type: uint8 (kTypeDeletion = 0x0, kTypeValue = 0x1)InternalKey Comparison Rules
InternalKey comparison follows these rules (in order):
-
Compare user_key (ascending order)
- If
user_key_a < user_key_b, thenInternalKey_a < InternalKey_b
- If
-
If user_key equal, compare sequence (descending order)
- Newer entries (larger sequence) come first
- If
seq_a > seq_b, thenInternalKey_a < InternalKey_b
-
If sequence equal, compare type (descending order)
kTypeValue(0x1) <kTypeDeletion(0x0)- Values come before deletions
Why descending sequence?
- When searching for a key, we want to find the newest version first
- SkipList returns the first match, so newer versions must sort earlier
LookupKey Format
LookupKey is used for querying in MemTable. It encodes the search target.
---
config:
packet:
bitsPerRow: 32
---
packet-beta
0-31: "internal_key_len: varint32"
32-255: "user_key: bytes[user_key_len]"
256-319: "tag: uint64 (seq << 8 | type)"
LookupKey Layout:┌─────────────────┬──────────────────────┬─────────────────┐│ internal_key_len│ user_key │ tag ││ (varint32) │ (user_key_len) │ (uint64) │└─────────────────┴──────────────────────┴─────────────────┘ ↑ ↑ ↑ start_ kstart_ end_Three key pointers:
start_: Points tointernal_key_len(for MemTable queries)kstart_: Points touser_key(for InternalKey extraction)end_: Points to end of tag
Usage:
LookupKey lkey(user_key, sequence);Slice memtable_key = lkey.memtable_key(); // [start_, end_)Slice internal_key = lkey.internal_key(); // [kstart_, end_)Slice user_key = lkey.user_key(); // [kstart_, end_ - 8)Value Types
enum ValueType { kTypeDeletion = 0x0, // Tombstone marker kTypeValue = 0x1 // Actual value};Why store deletions?
- Deletions are tombstone markers in LSM-tree
- They mark keys as deleted without physically removing old versions
- Compaction will eventually remove tombstones and their older versions
Sequence Numbers
using SequenceNumber = uint64_t;static const SequenceNumber kMaxSequenceNumber = ((0x1ull << 56) - 1);- Range: 0 to 2^56 - 1 (leaves 8 bits for
ValueType) - Purpose: MVCC version control
- Monotonically increasing: Each write operation increments the sequence
- Snapshot isolation: Readers can specify a sequence to read consistent data
Comparator Design
InternalKeyComparator
class InternalKeyComparator {public: explicit InternalKeyComparator(const Comparator* user_cmp);
// Compare two InternalKeys int Compare(const Slice& a, const Slice& b) const;
private: const Comparator* user_comparator_;};Comparison logic:
int InternalKeyComparator::Compare(const Slice& a, const Slice& b) const { // 1. Extract user keys (strip last 8 bytes) Slice user_a = ExtractUserKey(a); Slice user_b = ExtractUserKey(b);
// 2. Compare user keys (ascending) int r = user_comparator_->Compare(user_a, user_b); if (r != 0) return r;
// 3. Extract tags (last 8 bytes) uint64_t tag_a = DecodeFixed64(a.data() + a.size() - 8); uint64_t tag_b = DecodeFixed64(b.data() + b.size() - 8);
// 4. Compare tags (descending - larger tag comes first) if (tag_a > tag_b) return -1; if (tag_a < tag_b) return +1; return 0;}Design Rationale
Why InternalKey?
- MVCC Support: Multiple versions of the same key can coexist
- Snapshot Reads: Readers can see a consistent view at a specific sequence
- Efficient Updates: No need to physically delete old versions immediately
- Conflict Resolution: Newer writes naturally override older ones
Why Length-Prefixed Encoding?
- MemTable stores
const char*pointers, not Key objects - Length-prefixed format allows decoding without knowing the key size beforehand
- Efficient for variable-length keys
Memory Layout Considerations
Traditional approach (not used): Node { Key key; Value value; Node* next[]; }
LevelDB approach: Node { const char* entry; Node* next[]; }
entry points to: ┌──────────────┬─────────┬─────┬───────────┬───────┐ │ internal_key │ user_key│ tag │ value_len │ value │ │ _len │ │ │ │ │ └──────────────┴─────────┴─────┴───────────┴───────┘Benefits:
- Single memory allocation for key+value+metadata
- Better cache locality
- Arena-friendly (no small object allocations)
- Pointer-sized node overhead