LSM Trees and Write-Optimized Storage: Why Cassandra and RocksDB Chose Append-Only
How LSM trees turn random writes into sequential I/O through memtables, SSTables, and compaction, and why write-heavy databases chose them over B-trees.
Terminology
| Term | Definition | Trade-off / Gotcha |
|---|---|---|
| LSM Tree | Log-Structured Merge Tree: a write-optimized storage structure that buffers writes in memory and flushes sorted, immutable files to disk. | Reads are slower than B-trees because data may span multiple levels; bloom filters help but do not eliminate the cost. |
| Memtable | An in-memory sorted data structure (skip list or red-black tree) that absorbs all incoming writes before they reach disk. | If the process crashes before flushing, data in the memtable is lost unless a WAL protects it. |
| SSTable | Sorted String Table: an immutable, sorted file on disk containing key-value pairs with an index block and optional bloom filter. | Immutability means updates create new entries rather than modifying old ones, increasing space usage until compaction runs. |
| Write-Ahead Log | A sequential, append-only log on disk that records every write before it enters the memtable, ensuring durability on crash. | WAL writes are sequential and fast, but they add write amplification (every write hits disk twice: WAL + SSTable). |
| Compaction | A background process that merges overlapping SSTables into fewer, larger files, removing duplicates and tombstones. | Compaction consumes CPU and disk I/O; if it falls behind the write rate, read performance degrades as unmerged files accumulate. |
| Write Amplification | The ratio of total bytes written to disk versus bytes written by the application; compaction rewrites data multiple times across levels. | Leveled compaction has $O(L \cdot T)$ write amplification where $L$ is levels and $T$ is the size ratio between levels. |
| Read Amplification | The number of SSTables (or disk reads) that must be checked to satisfy a single point query. | Bloom filters reduce expected reads to $O(1)$ per level, but worst-case is still proportional to the number of levels. |
| Space Amplification | The ratio of actual disk usage to the logical data size, caused by duplicate keys existing across multiple levels before compaction. | Leveled compaction keeps space amplification near 1.1x; tiered compaction can reach 2x or more. |
| Tombstone | A special delete marker written to the LSM tree; the actual key-value pair is removed only when compaction processes the tombstone. | Tombstones consume space and slow reads until compaction clears them; range scans over deleted ranges are especially affected. |
| Bloom Filter | A probabilistic data structure attached to each SSTable that can definitively say "key not here" but may give false positives. | False positive rate drops exponentially with bits per key; 10 bits/key gives roughly 1% false positives. |
| Level | A tier in the LSM tree's disk hierarchy; L0 holds recently flushed SSTables, and each deeper level holds progressively larger, merged files. | More levels mean higher read and write amplification but lower space amplification (leveled) or vice versa (tiered). |
What & Why
The previous post covered B-tree indexes: the default read-optimized structure that updates pages in place on disk. B-trees are excellent for read-heavy workloads because any key can be found in 3-4 disk reads. But every write to a B-tree is a random I/O: find the correct leaf page, read it, modify it, write it back. For workloads that are write-heavy (event logging, time-series ingestion, messaging, IoT telemetry), this random write pattern becomes the bottleneck.
LSM trees solve this by never updating data in place. Instead, all writes go to an in-memory buffer (the memtable), which periodically flushes to disk as a sorted, immutable file (an SSTable). This converts random writes into sequential writes, which are 10-100x faster on both HDDs and SSDs. The trade-off is that reads must check multiple files across multiple levels, making point lookups and range scans more expensive than in a B-tree.
This is not a theoretical trade-off. It is the reason Cassandra, RocksDB, LevelDB, HBase, ScyllaDB, and CockroachDB all chose LSM trees as their storage engine. These systems prioritize write throughput and are willing to pay the read cost, mitigated by bloom filters and caching.
How It Works
The LSM Write Path
Every write follows the same path: WAL first, then memtable, then eventually disk. The WAL ensures durability (if the process crashes, replay the WAL to recover the memtable). The memtable provides fast sorted insertion. When the memtable reaches a size threshold (typically 64 MB in RocksDB), it is frozen, a new empty memtable takes its place, and a background thread flushes the frozen memtable to disk as an L0 SSTable.
The critical insight: the memtable is a sorted in-memory structure. When it flushes, the output SSTable is already sorted, so the flush is a single sequential write. No random I/O at all. This is why LSM trees achieve 10-100x higher write throughput than B-trees on the same hardware.
SSTable On-Disk Layout
An SSTable is not just a flat file of sorted key-value pairs. It has internal structure that enables efficient lookups without reading the entire file.
To look up a key in an SSTable: (1) read the footer to find the bloom filter and index offsets; (2) check the bloom filter, and if it says "definitely not here," skip this file entirely; (3) binary search the index block to find which data block may contain the key; (4) read that single 4 KB data block and scan for the key. In practice the footer, bloom filter, and index block are cached in memory, so a point lookup typically costs one disk read (the data block).
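The lookup path can be sketched in Python. This is a toy model, not any real engine's file format: the "bloom filter" is a plain set (so it never false-positives), data blocks are small in-memory dicts, and the sparse index is a list of each block's first key:

```python
import bisect

class ToySSTable:
    """Toy SSTable: sorted data blocks plus a sparse index of first keys."""
    def __init__(self, pairs, block_size=4):
        items = sorted(pairs.items())
        # Split the sorted pairs into fixed-size "data blocks".
        self.blocks = [dict(items[i:i + block_size])
                       for i in range(0, len(items), block_size)]
        # Sparse index: the first key of each block (kept in memory).
        self.index = [min(b) for b in self.blocks]
        # Stand-in for a bloom filter: a plain set of keys.
        self.bloom = set(pairs)

    def get(self, key):
        # Step 2: bloom filter says "definitely not here" -> skip the file.
        if key not in self.bloom:
            return None
        # Step 3: binary search the index for the candidate block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        # Step 4: read (here: scan) that single data block.
        return self.blocks[i].get(key)

table = ToySSTable({f"k{i:03d}": i for i in range(10)})
print(table.get("k007"))  # -> 7
print(table.get("zzz"))   # -> None (bloom filter skips the file)
```

Only one block is touched per lookup; everything else is resolved from the cached index and filter, which is the point of the layout.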
Memtable Structure: Skip List in Memory
The memtable needs to support fast sorted insertion and fast sequential iteration (for flushing). Both skip lists and red-black trees work, but most modern LSM implementations (RocksDB, LevelDB, Cassandra) use skip lists because they are simpler to implement lock-free for concurrent writes.
The flush is a simple in-order traversal of the skip list, writing each key-value pair sequentially to a new SSTable file. Because the skip list is already sorted, no sorting step is needed. The entire flush is one sequential write, which is why it is so fast.
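A minimal sketch of that flush, with a sorted key list standing in for a real skip list (Python's standard library has no skip list; `bisect.insort` gives the same sorted-on-insert behavior at O(n) instead of O(log n) per insert):

```python
import bisect, io

class ToyMemtable:
    """Keeps keys sorted on insert so a flush is one in-order pass."""
    def __init__(self):
        self.keys = []   # sorted key list (stand-in for a skip list)
        self.vals = {}

    def put(self, key, value):
        if key not in self.vals:
            bisect.insort(self.keys, key)  # a skip list does this in O(log n)
        self.vals[key] = value

    def flush(self, out):
        # In-order traversal -> one sequential write; no sort step needed.
        for k in self.keys:
            out.write(f"{k}\t{self.vals[k]}\n")

mt = ToyMemtable()
for k, v in [("b", 2), ("a", 1), ("c", 3)]:
    mt.put(k, v)
buf = io.StringIO()
mt.flush(buf)
print(buf.getvalue())  # keys come out already sorted: a, b, c
```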
Compaction Strategies: Leveled vs Tiered
Compaction is where the real engineering trade-offs live. The choice of compaction strategy determines the balance between write amplification, read amplification, and space amplification. No strategy wins on all three: you pick two.
Read and Write Amplification Across Levels
As data moves deeper through LSM levels, both read and write amplification grow. In leveled compaction with a size ratio $T = 10$, each level is 10x larger than the previous one. A key written to L0 may be rewritten $T$ times when it is compacted into L1, another $T$ times into L2, and so on.
Without bloom filters, a point lookup must check every level from L0 down to the level where the key actually lives. With bloom filters (10 bits per key, ~1% false positive rate), the expected number of SSTables actually read drops to roughly 1 plus a tiny false-positive overhead. This is why bloom filters are not optional in production LSM trees: they are essential for acceptable read latency.
B-Tree vs LSM: Write Throughput and Read Latency
The fundamental trade-off between B-trees and LSM trees shows up clearly when you vary the read/write mix of a workload.
At 100% writes, LSM trees achieve 10x or more write throughput because every write is sequential. At 100% reads, B-trees win because a point lookup is a single tree traversal (3-4 random reads) versus checking multiple LSM levels. The crossover point depends on hardware, but on SSDs with fast random reads, LSM trees remain competitive for mixed workloads up to about 50/50 read/write.
Compaction Strategy Comparison
FIFO compaction is the simplest: SSTables are never merged, just deleted when they exceed a time-to-live (TTL). This works only for time-series data where old data is never queried and can be dropped entirely. Universal compaction (a RocksDB-specific strategy) dynamically chooses between leveled-like and tiered-like behavior based on the current state of the LSM tree.
Complexity Analysis
All complexities assume an LSM tree with $L$ levels, size ratio $T$ between levels (typically 10), $N$ total key-value pairs, memtable size $M$, and bloom filters with false positive rate $p$ (typically 0.01).
| Operation | Time Complexity | Disk I/Os | Notes |
|---|---|---|---|
| Write (to memtable) | $O(\log M)$ | 1 sequential (WAL) | Skip-list insert is $O(\log M)$; effectively constant because the memtable size is bounded |
| Point read (with bloom) | $O(L + \log(N/L))$ | $\approx 1 + (L-1) \cdot p$ random reads | Bloom checks are in-memory; most levels are skipped without any I/O |
| Point read (no bloom) | $O(L \cdot \log(N/L))$ | $L$ random reads | Must binary search one SSTable per level in worst case |
| Range scan ($k$ keys) | $O(L \cdot \log(N/L) + k)$ | $O(k/B + L)$ | Merge $L$ sorted runs; $B$ = keys per data block |
| Flush (memtable to L0) | $O(M)$ | 1 sequential write | In-order traversal of memtable, single sequential file write |
| Compaction (leveled) | $O(T \cdot S_i)$ per level | Sequential I/O | $S_i$ = size of level $i$; merge-sort of overlapping ranges |
Write amplification for leveled compaction across all levels:
$W_{amp} = O(T \cdot L) = O\left(T \cdot \log_T\frac{N}{M}\right)$
For $T = 10$, $N = 10^9$ keys, $M = 10^6$ keys (64 MB memtable):
$W_{amp} \approx 10 \times \log_{10}\frac{10^9}{10^6} = 10 \times 3 = 30$
This means every byte written by the application results in roughly 30 bytes written to disk due to compaction. For tiered compaction, write amplification drops to $O(L)$ (roughly 3 in this example), but read amplification increases proportionally.
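The same arithmetic as a small Python check; the function names are illustrative, not from any real engine:

```python
import math

def leveled_write_amp(T, N, M):
    """W_amp ~= T * log_T(N / M): each key is rewritten ~T times per level."""
    levels = math.log(N / M, T)
    return T * levels

def tiered_write_amp(T, N, M):
    """Tiered compaction rewrites each key roughly once per level."""
    return math.log(N / M, T)

print(leveled_write_amp(10, 10**9, 10**6))  # ~30
print(tiered_write_amp(10, 10**9, 10**6))   # ~3
```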
The expected number of disk reads for a point lookup with bloom filters:
$R_{expected} = 1 + (L - 1) \cdot p \approx 1 + 6 \times 0.01 = 1.06$
With 7 levels and 1% false positive bloom filters, a point lookup reads on average 1.06 SSTables. This is why bloom filters are the single most important optimization in an LSM tree: they reduce read amplification from $O(L)$ to effectively $O(1)$.
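Both numbers can be verified with a few lines of Python, using the standard bloom filter false-positive approximation $p \approx (1 - e^{-kn/m})^k$ with the optimal hash count $k = (m/n)\ln 2$:

```python
import math

def bloom_fp_rate(bits_per_key):
    """False positive rate with the optimal number of hash functions."""
    k = round(bits_per_key * math.log(2))      # optimal k = (m/n) * ln 2
    return (1 - math.exp(-k / bits_per_key)) ** k

def expected_sstable_reads(levels, p):
    """One real read plus a false-positive read per remaining level."""
    return 1 + (levels - 1) * p

print(f"{bloom_fp_rate(10):.4f}")          # ~0.008, i.e. under 1%
print(expected_sstable_reads(7, 0.01))     # ~1.06
```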
Implementation
Pseudocode: LSM Write Path
ALGORITHM LSMWrite(key, value)
INPUT: key: the key to write
value: the value to associate with the key
OUTPUT: acknowledgment that the write is durable
BEGIN
// Step 1: Write to WAL for durability
WAL.Append(key, value)
WAL.Sync() // fsync to guarantee durability
// Step 2: Insert into active memtable
active_memtable.Insert(key, value)
// Step 3: Check if memtable exceeds size threshold
IF active_memtable.Size() >= MEMTABLE_THRESHOLD THEN
frozen_memtable = active_memtable
active_memtable = new SkipList()
// Schedule background flush; frozen_memtable is cleared once the flush completes
BackgroundFlush(frozen_memtable)
END IF
RETURN ACK
END
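The write path can be sketched as a toy Python class. Unlike a real engine, the flush here is synchronous and the "memtable" is a plain dict sorted at flush time; the WAL-then-memtable-then-threshold sequence is the part being illustrated:

```python
import os, tempfile

MEMTABLE_THRESHOLD = 4  # tiny threshold so the example triggers a flush

class ToyLSM:
    """Minimal write path: append to WAL, insert into memtable, flush on threshold."""
    def __init__(self, wal_path):
        self.wal = open(wal_path, "a")
        self.memtable = {}   # stand-in for a sorted skip list
        self.sstables = []   # flushed, sorted runs (newest first)

    def write(self, key, value):
        # Step 1: WAL first, fsync for durability.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # Step 2: insert into the active memtable.
        self.memtable[key] = value
        # Step 3: freeze and flush when the threshold is reached
        # (synchronous here; a real engine hands this to a background thread).
        if len(self.memtable) >= MEMTABLE_THRESHOLD:
            frozen = dict(sorted(self.memtable.items()))  # one in-order pass
            self.sstables.insert(0, frozen)
            self.memtable = {}

wal = tempfile.NamedTemporaryFile(delete=False, suffix=".wal")
wal.close()
db = ToyLSM(wal.name)
for i in range(5):
    db.write(f"k{i}", i)
print(len(db.sstables), len(db.memtable))  # one flushed run, one key left over
```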
Pseudocode: LSM Point Read
ALGORITHM LSMRead(key)
INPUT: key: the key to look up
OUTPUT: the value associated with key, or NOT_FOUND
BEGIN
// Step 1: Check active memtable (most recent writes)
result = active_memtable.Get(key)
IF result IS NOT NULL THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
// Step 2: Check frozen memtable (if flush in progress)
IF frozen_memtable IS NOT NULL THEN
result = frozen_memtable.Get(key)
IF result IS NOT NULL THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
END IF
// Step 3: Check each level from L0 to Lmax
FOR level = 0 TO max_level DO
FOR EACH sstable IN level.SSTables() DO
// Bloom filter check: skip if key definitely not present
IF NOT sstable.BloomFilter.MayContain(key) THEN
CONTINUE
END IF
// Binary search the index block
block_offset = sstable.Index.Search(key)
IF block_offset IS NOT NULL THEN
result = sstable.ReadDataBlock(block_offset).Search(key)
IF result IS NOT NULL THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
END IF
END FOR
END FOR
RETURN NOT_FOUND
END
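The essential shadowing logic of the read path, as a Python sketch: sources are consulted newest first, and the first hit wins, whether it is a live value or a tombstone (bloom filters and per-level indexes are omitted here):

```python
TOMBSTONE = object()  # sentinel delete marker

def lsm_get(key, memtable, sstables):
    """Check the memtable, then each sorted run from newest to oldest."""
    for source in [memtable, *sstables]:   # newest data shadows older data
        if key in source:
            value = source[key]
            return None if value is TOMBSTONE else value
    return None

memtable = {"a": 10, "b": TOMBSTONE}
sstables = [{"b": 2, "c": 3}, {"a": 1, "c": 30}]  # newest run first
print(lsm_get("a", memtable, sstables))  # 10   (memtable wins over SSTable)
print(lsm_get("b", memtable, sstables))  # None (tombstone shadows old value)
print(lsm_get("c", memtable, sstables))  # 3    (newer run shadows older run)
```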
Pseudocode: Leveled Compaction
ALGORITHM LeveledCompaction(level_n)
INPUT: level_n: the level that has exceeded its size limit
OUTPUT: merged SSTables promoted to level_n+1
BEGIN
// Pick one SSTable from level_n
source = PickSSTable(level_n) // round-robin or oldest
// Find all overlapping SSTables in level_n+1
key_range = source.MinKey() .. source.MaxKey()
targets = level_n_plus_1.GetOverlapping(key_range)
// Merge-sort source + targets into new SSTables
merged_iterator = MergeSortIterator(source, targets)
new_sstables = empty list
current_block = new DataBlock()
previous_key = NULL
WHILE merged_iterator.HasNext() DO
(key, value) = merged_iterator.Next()
// Keep only the newest version of each key (the iterator yields newest first)
IF key = previous_key THEN CONTINUE
previous_key = key
// Drop tombstones only at the deepest level, where no older version survives below.
// This check must come after duplicate suppression: a tombstone that is the
// newest version still has to shadow the older versions it is deleting.
IF value IS TOMBSTONE AND level_n+1 = max_level THEN
CONTINUE
END IF
current_block.Add(key, value)
IF current_block.Size() >= BLOCK_SIZE THEN
new_sstables.Append(current_block.Flush())
current_block = new DataBlock()
END IF
END WHILE
// Flush the final partial block
IF current_block.Size() > 0 THEN
new_sstables.Append(current_block.Flush())
END IF
// Atomic swap: install new SSTables, remove old ones
level_n.Remove(source)
level_n_plus_1.Remove(targets)
level_n_plus_1.Add(new_sstables)
// Update manifest (metadata file tracking all SSTables)
Manifest.Update()
END
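The merge step can be sketched with Python's `heapq.merge`. Each run is tagged with its age so the newest version of a key sorts first, and duplicate suppression runs before the tombstone drop, so a tombstone that is the newest version still shadows the older versions it is deleting:

```python
import heapq

TOMBSTONE = None  # use None as the delete marker in this sketch

def compact(runs, is_bottom_level):
    """Merge sorted runs (newest run first); keep the newest version of each
    key, and drop tombstones only at the bottom level."""
    # Tag each entry with its run's age so, for equal keys, the newest
    # version sorts first in the merged stream.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    out, prev = [], object()
    for key, age, value in heapq.merge(*tagged):
        if key == prev:
            continue   # an older version of a key we already emitted
        prev = key
        if value is TOMBSTONE and is_bottom_level:
            continue   # safe to drop: nothing older remains below this level
        out.append((key, value))
    return out

newer = [("a", None), ("b", 2)]   # "a" was deleted in the newer run
older = [("a", 1), ("c", 3)]
print(compact([newer, older], is_bottom_level=True))   # deleted "a" is gone
print(compact([newer, older], is_bottom_level=False))  # tombstone is kept
```

At the bottom level the tombstone and the shadowed old value both disappear; at any other level the tombstone survives so it can cancel older versions in deeper levels later.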
Real-World Applications
Cassandra uses tiered compaction (Size-Tiered Compaction Strategy, STCS) by default because its primary use case is high-throughput writes: event logging, time-series, and messaging. Cassandra also offers Leveled Compaction Strategy (LCS) for read-heavy workloads and a newer Unified Compaction Strategy (UCS) that adapts dynamically.
RocksDB (developed by Facebook, now Meta) uses leveled compaction by default and is the storage engine behind many systems: MySQL's MyRocks engine, CockroachDB, TiKV (the storage layer of TiDB), and Kafka Streams' state stores. RocksDB also supports universal compaction for workloads that need lower write amplification.
LevelDB (developed by Google, the predecessor to RocksDB) introduced the leveled compaction design that gives it its name. It is used in Chrome's IndexedDB implementation and was the original storage engine for Bitcoin Core.
HBase (the Hadoop database) uses LSM trees with tiered compaction for its RegionServer storage. It is designed for massive write throughput on HDFS, where sequential writes are critical because the underlying distributed file system penalizes random I/O heavily.
ScyllaDB reimplements Cassandra's data model in C++ with a custom LSM engine that supports both tiered and leveled compaction, achieving significantly lower tail latencies through careful memory management and a shared-nothing architecture.
SQLite's LSM extension (lsm1) adds an optional LSM storage backend to SQLite for write-heavy embedded use cases, demonstrating that even traditionally B-tree-only databases recognize the value of LSM trees for specific workloads.
Key Takeaways
- LSM trees convert random writes into sequential writes by buffering in a memtable and flushing sorted SSTables, achieving 10x+ write throughput over B-trees.
- The three amplification factors (write, read, space) are in tension: leveled compaction minimizes read and space amplification at the cost of write amplification; tiered compaction does the opposite.
- Bloom filters are essential, not optional: they reduce point read amplification from $O(L)$ levels to effectively $O(1)$ expected disk reads.
- Compaction strategy is the most important tuning knob in an LSM-based database: leveled for read-heavy, tiered for write-heavy, FIFO for TTL-based time-series.
- Cassandra, RocksDB, LevelDB, HBase, and CockroachDB all chose LSM trees because their workloads are write-dominated or require high ingestion throughput, and they mitigate the read cost with bloom filters, caching, and careful compaction tuning.
- SSTables are not flat files: they have internal structure (data blocks, index blocks, bloom filters, metadata footers) that enables efficient point lookups and range scans without reading the entire file.
Read More
2025-10-04
Query Execution and Optimization: How Databases Turn SQL into Fast Plans
How the query optimizer transforms SQL into execution plans, how cost-based optimization picks the cheapest path, and why nested loop, hash join, and merge join exist for different workloads.
2025-10-05
Transaction Isolation and MVCC: How Postgres Reads Without Locking
How transaction isolation levels prevent anomalies, how Postgres uses Multi-Version Concurrency Control to let readers and writers coexist, and why write skew still sneaks through snapshot isolation.
2025-10-15
The Physical I/O Path: How Bytes Hit the Disk
fsync semantics, O_DIRECT vs buffered I/O, the OS page cache, SSD vs HDD access patterns, torn write protection, and the fsync-gate incident that shook PostgreSQL.