SSTable Format
SSTable (Sorted String Table) is the on-disk storage format for Prism. It stores sorted key-value pairs in an immutable, block-based structure optimized for both sequential scans and random lookups.
File Layout
Structure:
<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1] (optional, e.g., filter block)
...
[meta block K] (optional, e.g., stats block)
[metaindex block]
[index block]
[footer] (fixed 48 bytes)
<end_of_file>
BlockHandle
A BlockHandle is a pointer to a block within the file:
struct BlockHandle {
uint64_t offset; // varint64: byte offset in file
uint64_t size; // varint64: block content size (excludes trailer)
};
Encoding: Two consecutive varint64 values.
Usage:
- Index block entries point to data blocks
- MetaIndex entries point to meta blocks (e.g., filter)
- Footer contains handles for metaindex and index blocks
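To make the two-varint encoding concrete, here is a minimal self-contained sketch; the PutVarint64 helper is written out inline for illustration and may not match Prism's actual coding utilities:

#include <cstdint>
#include <string>

// Append v as a varint64: 7 bits per byte, high bit set on every byte except the last.
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// A BlockHandle is just two consecutive varint64 values: offset, then size.
void EncodeBlockHandle(std::string* dst, uint64_t offset, uint64_t size) {
  PutVarint64(dst, offset);
  PutVarint64(dst, size);
}

Since a varint64 occupies at most 10 bytes, an encoded handle is at most 20 bytes; this is why the footer reserves 40 bytes for its two handles.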
Data Block Format
Data blocks store sorted key-value pairs with prefix compression.
Block Structure
[Record 1 (shared=0, full key)]
[Record 2 (shared=3)]
[Record 3 (shared=5)]
...
[Record N (shared=4)]
[Restart Array (4 bytes per restart offset)]
[Restart Count (4 bytes)]
Entry Encoding
Each key-value entry is encoded as:
Entry := shared (varint32) // bytes shared with previous key
| non_shared (varint32) // bytes following shared prefix
| value_length (varint32) // length of value
| key_delta [non_shared] // unshared part of key
| value [value_length]
Example:
Keys: "apple", "apply", "application"
Entry 1: shared=0, non_shared=5, value_len=10
key_delta="apple", value="<10 bytes>"
Entry 2: shared=4, non_shared=1, value_len=8
key_delta="y", value="<8 bytes>"
(reconstructed key = "appl" + "y" = "apply")
Entry 3: shared=4, non_shared=7, value_len=12
key_delta="ication", value="<12 bytes>"
(reconstructed key = "appl" + "ication" = "application")
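A minimal sketch of how a block builder could emit one such entry; PutVarint32 follows the same scheme as the varint sketch above, and the names are illustrative rather than Prism's actual API:

#include <algorithm>
#include <cstdint>
#include <string>

void PutVarint32(std::string* dst, uint32_t v);  // same scheme as PutVarint64 above, 32-bit value

// Append one prefix-compressed entry to the block buffer.
void AddEntry(std::string* block, const std::string& prev_key,
              const std::string& key, const std::string& value) {
  // Length of the prefix shared with the previous key in this block.
  size_t shared = 0;
  const size_t min_len = std::min(prev_key.size(), key.size());
  while (shared < min_len && prev_key[shared] == key[shared]) ++shared;

  PutVarint32(block, static_cast<uint32_t>(shared));               // shared
  PutVarint32(block, static_cast<uint32_t>(key.size() - shared));  // non_shared
  PutVarint32(block, static_cast<uint32_t>(value.size()));         // value_length
  block->append(key, shared, key.size() - shared);                 // key_delta
  block->append(value);                                            // value
}

One way to implement the restart points described below is for the caller to reset prev_key to an empty string every block_restart_interval entries, which forces shared = 0.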
Restart Points
Every block_restart_interval entries (default 16), a restart point is created:
- shared = 0 (the full key is stored)
- The entry's offset is recorded in the restart array
Purpose: Enable binary search within block without decoding all entries.
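A sketch of that binary search, shown here over an in-memory list of the full keys stored at the restart points; the real block reader would decode these keys on the fly from the restart-array offsets:

#include <string>
#include <vector>

// Return the index of the last restart point whose key is <= target;
// the caller then scans entries linearly forward from that restart point.
size_t SeekToRestartPoint(const std::vector<std::string>& restart_keys,
                          const std::string& target) {
  if (restart_keys.empty()) return 0;
  size_t left = 0, right = restart_keys.size() - 1;
  while (left < right) {
    const size_t mid = (left + right + 1) / 2;
    if (restart_keys[mid] <= target) {
      left = mid;        // target may still lie at or after restart point mid
    } else {
      right = mid - 1;   // target lies strictly before restart point mid
    }
  }
  return left;
}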
Block Trailer
Block Trailer := compression_type (1 byte)
| crc32c (4 bytes)
- Compression type: 0x00 = none, 0x01 = Snappy
- CRC32C: checksum of block_contents + compression_type
Index Block Format
The index block maps keys to data block locations.
Entry 1: Key = separator1, Value = BlockHandle1 → Data Block 1
Entry 2: Key = separator2, Value = BlockHandle2 → Data Block 2
...
Entry N: Key = separatorN, Value = BlockHandleN → Data Block N
Index Entry:
Key := shortest separator between blocks
(last_key_in_block[i] <= separator < first_key_in_block[i+1])
Value := BlockHandle (varint64 offset | varint64 size)
Separator Generation:
For block i, the index key is:
if (i < num_blocks - 1) {
  // Shorten block i's last key toward block i+1's first key
  comparator->FindShortestSeparator(&last_key_of_block_i, first_key_of_next_block);
} else {
  // Last block: use a short successor of its last key
  comparator->FindShortSuccessor(&last_key_of_block_i);
}
Benefits:
- Reduced index size (shorter keys)
- Maintains correctness (separator still routes lookups correctly)
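For the default bytewise ordering, FindShortestSeparator can be as simple as the following sketch (modeled on LevelDB's default comparator; Prism's comparator interface may differ):

#include <algorithm>
#include <cstdint>
#include <string>

// Shorten *start to a short key k with *start <= k < limit, when possible.
void FindShortestSeparator(std::string* start, const std::string& limit) {
  const size_t min_len = std::min(start->size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && (*start)[diff] == limit[diff]) ++diff;

  if (diff >= min_len) return;  // *start is a prefix of limit: leave it unchanged

  const uint8_t byte = static_cast<uint8_t>((*start)[diff]);
  if (byte < 0xff && byte + 1 < static_cast<uint8_t>(limit[diff])) {
    (*start)[diff] = static_cast<char>(byte + 1);  // bump the first differing byte
    start->resize(diff + 1);                       // and drop everything after it
  }
}

For example, with last key "blockbuster" and next first key "blue", the index key shrinks to "blp" (3 bytes instead of 11), and "blockbuster" <= "blp" < "blue" still holds.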
Meta Blocks
Meta blocks store auxiliary data that helps optimize table operations. Unlike data blocks which contain actual key-value pairs, meta blocks contain metadata.
Common meta block types:
- Filter block - Bloom filters for fast negative lookups
- Stats block - Statistics about the table (future)
- Compression dictionary - Shared dictionary for compression (future)
Properties:
- Stored after all data blocks
- Each has a unique name (e.g., "filter.bloom", "stats")
- Mapped by the MetaIndex block - readers use the MetaIndex to locate specific meta blocks
- Formatted using BlockBuilder (can have restart points and a block trailer)
- Optional (the table is valid without them)
Distinction from Index Block:
While Index block maps data block separators to data blocks, Meta blocks + MetaIndex work as a pair:
- Meta blocks = storage for metadata (e.g., filter data)
- MetaIndex block = index/directory of meta blocks
Think of it as: Data blocks → indexed by → Index block, while Meta blocks → indexed by → MetaIndex block.
MetaIndex Block Format
The MetaIndex block maps meta block names to their locations:
Entry: Key = meta_block_name (e.g., "filter.bloom", "stats")
Value = BlockHandle (varint64 offset | varint64 size)
Purpose: Allow readers to locate specific meta blocks by name without scanning.
Example entries:
Key="filter.bloom" → Value=BlockHandle(offset=12345, size=4096)
Key="stats" → Value=BlockHandle(offset=16441, size=256)
Key="dict" → Value=BlockHandle(offset=16697, size=1024)
Format Characteristics:
The MetaIndex block uses the same BlockBuilder format as every other block (restart points and a trailer) and plays an index-like role (name → BlockHandle), but its practical characteristics differ from the Index block:
| Characteristic | Index Block | MetaIndex Block |
|---|---|---|
| Entry count | Medium (10s-100s) | Small (typically < 10) |
| Prefix compression | Effective (data block separators share prefixes) | Ineffective (meta block names have no common prefixes) |
| Restart points | Multiple (one per 16 entries) | Usually only 1 (at offset 0) |
| Search method | Binary search | Linear scan (enough for small count) |
Typical structure:
Entry 1: shared=0, key="dict", value=BlockHandle(...)
Entry 2: shared=0, key="filter.bloom", value=BlockHandle(...)
Entry 3: shared=0, key="stats", value=BlockHandle(...)
Restart Array: [0] (only one restart point)
Restart Count: 1
Block Trailer: compression_type | crc32c
Why use BlockBuilder if prefix compression is mostly ineffective here?
- Code reuse: Single block format for all block types
- Consistency: Unified reading/verification logic for all blocks
- Future-proof: If meta block names grow or share prefixes, compression becomes useful
Block Types Comparison
To clarify the three main block types and their relationships:
Data Blocks 1..N (KV pairs)   → indexed by → Index Block (separator → BlockHandle)
Filter Block (filter data)    → indexed by → MetaIndex Block (name → BlockHandle)
Stats Block (statistics)      → indexed by → MetaIndex Block (name → BlockHandle)
Detailed Comparison
| Aspect | Data Block | Index Block | MetaIndex Block |
|---|---|---|---|
| Purpose | Store actual KV pairs | Index data blocks | Index meta blocks |
| Indexed by | Index block | Footer | Footer |
| Entry content | Key → Value | Separator → BlockHandle | Meta name → BlockHandle |
| Entry count | Large (100s-1000s) | Medium (10s-100s) | Small (typically < 10) |
| Prefix compression | Very effective | Effective | Ineffective (names differ) |
| Restart interval | Every 16 entries | Every 16 entries | Usually only 1 (all shared=0) |
| Must have | Yes | Yes | Only if meta blocks exist |
| Format type | BlockBuilder output | BlockBuilder output | BlockBuilder output |
| Typical size | ~4 KB each | A few KB | < 1 KB |
Storage Hierarchy
SSTable File Structure:
├── Layer 1: Data
│ ├── Data Block 1 (4 KB)
│ ├── Data Block 2 (4 KB)
│ └── ...
├── Layer 2: Metadata
│ ├── Filter Block (2 KB)
│ ├── Stats Block (256 B)
│ └── ...
├── Layer 3: Indices
│ ├── MetaIndex Block (200 B) ← indexes Layer 2
│ └── Index Block (4 KB) ← indexes Layer 1
└── Layer 4: Navigation
└── Footer (48 B) ← indexes Layer 3
Query Flow Examples
Example 1: Find data value
Footer → Index Block location
→ Read Index Block
→ Binary search for key
→ Get Data Block location
→ Read Data Block
→ Seek/Linear scan for value
Example 2: Use filter block
Footer → MetaIndex Block location
→ Read MetaIndex Block
→ Lookup "filter.bloom"
→ Get Filter Block location
→ Read Filter Block
→ Check if key might exist
→ If no → return NotFound (fast path)
→ If yes → proceed with Example 1
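A pseudocode-level C++ sketch of that combined flow; FindBlock, KeyMayMatch, ReadBlock, and SearchInBlock are illustrative placeholder names, not a committed Prism API:

// Filter fast path first, then the index/data-block path from Example 1.
Status TableGet(const Slice& key, std::string* value) {
  BlockHandle handle = index_block.FindBlock(key);        // binary search over separators
  if (filter != nullptr && !filter->KeyMayMatch(handle.offset, key)) {
    return Status::NotFound(key);                         // key definitely absent
  }
  Block data = ReadBlock(file, handle);                   // read contents, verify CRC32C
  return SearchInBlock(data, key, value);                 // restart-point search + linear scan
}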
Filter Block Format
Stores Bloom filters for data blocks.
[filter 0] // For keys in data blocks [0*base, 1*base)
[filter 1] // For keys in data blocks [1*base, 2*base)
...
[filter N-1]
[offset of filter 0] // fixed32
[offset of filter 1] // fixed32
...
[offset of filter N-1] // fixed32
[offset of offset array] // fixed32
lg(base) // 1 byte (default: lg(2048) = 11)
Base: 2KB (2048 bytes). All keys in data blocks whose file offset falls in [i*2KB, (i+1)*2KB) are combined into filter i.
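A small sketch of the offset-to-filter mapping implied by lg(base) = 11; the helper name is illustrative:

#include <cstdint>

// With filter_base = 2 KB, keys of the data block starting at block_offset
// are covered by filter number block_offset / 2048.
uint64_t FilterIndex(uint64_t block_offset, uint8_t lg_base = 11) {
  return block_offset >> lg_base;
}

At read time, filter i occupies the byte range [offset[i], offset[i+1]) within the filter block, with the offset of the offset array delimiting the last filter.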
Footer Format
The footer is fixed 48 bytes at the end of the file:
Footer := metaindex_handle (varint64 offset | varint64 size, at most 20 bytes)
        | index_handle (varint64 offset | varint64 size, at most 20 bytes)
        | padding (fill the two handles up to 40 bytes total)
        | magic (0xdb4775248b80fb57, fixed64, 8 bytes)
Total: 40 bytes (handles + padding) + 8 bytes (magic) = 48 bytes.
Magic number: 0xdb4775248b80fb57 (little-endian)
Reading:
- Seek to file_size - 48
- Read 48 bytes
- Verify the magic number
- Decode metaindex_handle and index_handle
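A sketch of those steps, assuming LevelDB-style helpers (RandomAccessFile::Read, Slice, DecodeFixed64) and a hypothetical DecodeBlockHandle that reads two varint64 values; error handling is abbreviated:

// Read and validate the fixed 48-byte footer at the end of the file.
Status ReadFooter(RandomAccessFile* file, uint64_t file_size,
                  BlockHandle* metaindex_handle, BlockHandle* index_handle) {
  char buf[48];
  Slice footer;
  Status s = file->Read(file_size - 48, 48, &footer, buf);
  if (!s.ok()) return s;

  const uint64_t magic = DecodeFixed64(footer.data() + 40);  // last 8 bytes
  if (magic != 0xdb4775248b80fb57ull) {
    return Status::Corruption("not an SSTable (bad magic number)");
  }

  Slice handles(footer.data(), 40);                          // two handles + padding
  if (!DecodeBlockHandle(&handles, metaindex_handle) ||
      !DecodeBlockHandle(&handles, index_handle)) {
    return Status::Corruption("bad block handle in footer");
  }
  return Status::OK();
}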
Building an SSTable
Key steps:
- Add entries: Call TableBuilder::Add(key, value) in sorted order
- Flush data blocks: When size threshold reached, flush to file
- Build index: Record separator key → BlockHandle for each data block
- Write filter: Optionally write Bloom filter
- Write metaindex: Map “filter.bloom” → filter BlockHandle
- Write index: Flush index block
- Write footer: Record metaindex/index BlockHandles + magic
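Put together, a flush routine built on TableBuilder might look like the following sketch; Options, WritableFile, and the Iterator interface follow LevelDB-style conventions and are assumptions here:

// Write every entry of `iter` (already in sorted order) into one SSTable file.
Status BuildTable(const Options& options, WritableFile* file, Iterator* iter) {
  TableBuilder builder(options, file);
  for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
    builder.Add(iter->key(), iter->value());  // add entries in sorted order
  }
  Status s = builder.Finish();  // flushes remaining data, filter, metaindex, index, footer
  if (s.ok()) s = file->Sync(); // make the new table durable
  return s;
}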
Reading from SSTable
Two-level lookup:
- Index block: Binary search to find data block containing key
- Data block: Binary search on restart points, then linear scan
Iterator Implementation
Two-Level Iterator
Index Block Iterator (outer) ──current entry's BlockHandle──> Data Block Iterator (inner)
Seek/Next/Prev operations drive the outer (index) iterator, which positions the inner (data block) iterator.
Algorithm:
TwoLevelIterator:
outer: index_block_iterator
inner: data_block_iterator (lazy loaded)
Seek(target):
outer.Seek(target)
InitInner() // Read data block at outer.value()
inner.Seek(target)
Next():
inner.Next()
if !inner.Valid():
outer.Next()
if outer.Valid():
InitInner()
inner.SeekToFirst()
key() := inner.key()
value() := inner.value()
Compression
Supported types:
- kNoCompression (0x00)
- kSnappyCompression (0x01)
Applied to: Block contents (before appending trailer)
Trade-off:
- Pros: Reduced disk I/O, storage space
- Cons: CPU overhead on read/write
Default: Snappy (fast compression/decompression)
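A sketch of the per-block compression decision using the real snappy::Compress call; the "keep the raw block unless compression saves about 12.5%" threshold is a common heuristic assumed here, not a stated Prism policy:

#include <snappy.h>
#include <string>

// Compress block contents (before the trailer is appended).
// Returns the compression_type byte to record in the trailer.
char CompressBlock(const std::string& raw, std::string* out) {
  snappy::Compress(raw.data(), raw.size(), out);
  if (out->size() < raw.size() - raw.size() / 8) {
    return 0x01;   // kSnappyCompression: compressed form is worth keeping
  }
  *out = raw;      // too little gain: store the block uncompressed
  return 0x00;     // kNoCompression
}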
CRC32C Checksum
Purpose: Detect data corruption
Coverage: block_contents + compression_type
Algorithm: CRC32C (Castagnoli polynomial)
Location: Last 4 bytes of block trailer
Verification:
uint32_t expected_crc = DecodeFixed32(trailer + 1);               // trailer = [type:1][crc32c:4]
uint32_t actual_crc = crc32c::Value(block_contents, block_size);  // checksum the contents...
actual_crc = crc32c::Extend(actual_crc, &compression_type, 1);    // ...then the type byte
if (actual_crc != expected_crc) {
  return Status::Corruption("Block checksum mismatch");
}
Note: Unlike WAL, table block CRCs are not masked.
InternalKey Encoding in SSTable
Keys stored in SSTable are InternalKeys:
InternalKey := user_key (variable length)
             | tag (8 bytes, fixed64, little-endian)

Tag: (sequence_number << 8) | value_type
  - sequence_number: 56 bits
  - value_type: 1 byte (kTypeValue=1, kTypeDeletion=0)
Comparison order:
- User key: ascending (lexicographic)
- Sequence number: descending (newer first)
- Value type: descending (value before deletion)
Example:
User puts: key="foo", value="v1" at seq=10
User puts: key="foo", value="v2" at seq=20
User deletes: key="foo" at seq=30
SSTable stores (sorted by InternalKey):
InternalKey("foo", 30, kTypeDeletion) → "" (empty value)
InternalKey("foo", 20, kTypeValue) → "v2"
InternalKey("foo", 10, kTypeValue) → "v1"
Get("foo", snapshot_seq=25) → "v2" (ignores seq=30)
Get("foo", snapshot_seq=35) → NotFound (sees deletion at seq=30)
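A sketch of the resulting comparator, assuming the default bytewise user-key ordering and LevelDB-style Slice/DecodeFixed64 helpers:

// Order: user key ascending, then the 64-bit tag ((sequence << 8) | type)
// descending, so for equal user keys the newest entry sorts first.
int InternalKeyCompare(const Slice& a, const Slice& b) {
  const Slice a_user(a.data(), a.size() - 8);  // strip the trailing 8-byte tag
  const Slice b_user(b.data(), b.size() - 8);
  const int r = a_user.compare(b_user);        // ascending user-key order
  if (r != 0) return r;

  const uint64_t a_tag = DecodeFixed64(a.data() + a.size() - 8);
  const uint64_t b_tag = DecodeFixed64(b.data() + b.size() - 8);
  if (a_tag > b_tag) return -1;  // newer sequence (or value over deletion) first
  if (a_tag < b_tag) return +1;
  return 0;
}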
Size Limits and Thresholds
| Parameter | Default | Purpose |
|---|---|---|
| block_size | 4 KB | Target size for data blocks |
| block_restart_interval | 16 | Entries between restart points |
| write_buffer_size | 4 MB | MemTable flush threshold |
| filter_base | 2 KB | Filter granularity |
Why 4KB blocks?
- Matches typical filesystem/SSD page size
- Good balance: small enough for random reads, large enough for compression
TableCache (Future)
Cache entry 1 → Table 1 (index in memory)
Cache entry 2 → Table 2 (index in memory)
...
Cache entry N → Table N (index in memory)
Purpose:
- Keep recently-used tables open (file handles + index blocks)
- Avoid repeated file open/close overhead
- Amortize index block parsing
Eviction: LRU (Least Recently Used)
Integration with DBImpl
Minor Compaction (MemTable → SSTable)
Read Path
Start → Check MemTable → Check Immutable MemTable → Check L0 SSTables → Check L1+ SSTables → Return NotFound
(a hit at any stage returns the value immediately)
Search order:
- mem_ (current MemTable)
- imm_ (immutable MemTable being compacted)
- Level 0 SSTables (may overlap)
- Level 1+ SSTables (non-overlapping within level)
File Naming Convention
<dbname>/
├── CURRENT # Points to current MANIFEST
├── MANIFEST-000001 # Version metadata
├── 000003.log # WAL (Write-Ahead Log)
├── 000005.sst # SSTable files
├── 000007.sst
└── LOG # Human-readable log
SSTable naming: <file_number>.sst or <file_number>.ldb
- File number: monotonically increasing uint64
- Managed by VersionSet
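A sketch of how such a name could be generated; the helper name and six-digit zero padding are assumptions consistent with the listing above:

#include <cstdint>
#include <cstdio>
#include <string>

// e.g., TableFileName("/data/db", 5) -> "/data/db/000005.sst"
std::string TableFileName(const std::string& dbname, uint64_t file_number) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "/%06llu.sst",
                static_cast<unsigned long long>(file_number));
  return dbname + buf;
}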
References
- LevelDB Table Format
- LevelDB Implementation Notes
- LSM-Tree Paper - O’Neil et al., 1996
Last Updated: 2025-01-16
Status: Design documentation (implementation pending)