LSM Trees: Write-Optimized Storage with Log-Structured Merge
How LSM Trees turn random writes into sequential I/O, powering the storage engines behind Cassandra, RocksDB, and LevelDB.
Terminology
- LSM Tree: Log-Structured Merge Tree, a data structure that buffers writes in memory and periodically flushes them to disk as sorted, immutable files
- Memtable: the in-memory write buffer (typically a balanced BST or skip list) that accepts all incoming writes
- SSTable: Sorted String Table, an immutable, sorted file on disk containing key-value pairs
- Write-Ahead Log (WAL): a sequential log on disk that records every write before it enters the memtable, ensuring durability if the process crashes
- Compaction: the background process of merging multiple SSTables into fewer, larger SSTables, removing duplicates and deleted entries
- Level: a tier in the LSM Tree's disk hierarchy; level 0 holds recently flushed SSTables, higher levels hold progressively larger merged files
- Tombstone: a special marker written to indicate that a key has been deleted; the actual removal happens during compaction
- Write amplification: the ratio of total bytes written to disk versus bytes written by the application; compaction increases this ratio
- Read amplification: the number of disk reads needed to satisfy a single query, since data may be spread across multiple levels
- Space amplification: the ratio of disk space used versus the logical size of the data, caused by duplicate keys across levels before compaction
- Bloom filter: a probabilistic data structure used per SSTable to quickly rule out keys that are not present, reducing unnecessary disk reads
What & Why
An LSM Tree is a storage architecture designed for workloads where writes vastly outnumber reads, or where write throughput is the primary bottleneck. Instead of updating data in place on disk (like a B-Tree), an LSM Tree buffers all writes in memory and periodically flushes them as sorted, immutable files. This converts random writes into sequential writes, which is dramatically faster on both HDDs and SSDs.
Why not just use a B-Tree? B-Trees update pages in place, so every write eventually becomes a random disk I/O when the modified page is rewritten. For write-heavy workloads (logging, time-series data, messaging), this becomes the bottleneck. An LSM Tree absorbs thousands of writes in memory, then writes them all at once in a single sequential flush. The trade-off: reads are slower because a key's latest value may live in any of several files on disk.
Key properties:
- Writes go to an in-memory buffer (memtable), making them extremely fast
- When the memtable fills up, it is flushed to disk as an immutable SSTable
- Background compaction merges SSTables to keep read performance manageable
- Deletes are handled by writing tombstone markers, not by erasing data
- Bloom filters on each SSTable reduce unnecessary disk reads during queries
How It Works
Write Path
Every write first goes to the WAL (for crash recovery), then to the memtable. When the memtable reaches a size threshold, it is frozen and flushed to disk as a new SSTable at level 0. A fresh memtable takes its place.
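The WAL's role in crash recovery can be sketched in a few lines of Python. This is a hypothetical record format (one JSON array per line); real engines use binary framing with checksums, but the append-then-fsync-then-replay cycle is the same idea.

```python
import json
import os

class WAL:
    """Minimal write-ahead log sketch: append before applying, replay on restart."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        # Persist the record before it touches the memtable, forcing it
        # to stable storage so a crash cannot lose an acknowledged write.
        self.f.write(json.dumps([key, value]) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self):
        # On restart, rebuild the memtable from every logged record.
        with open(self.path) as f:
            for line in f:
                key, value = json.loads(line)
                yield key, value

wal = WAL("demo.wal")
wal.append("user:1", "alice")
wal.append("user:2", "bob")
memtable = dict(wal.replay())  # crash recovery: memtable rebuilt from the log
```

Once the memtable is safely flushed as an SSTable, the log can be truncated, since every record it protects is now durable on disk.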
Read Path
To read a key, check the memtable first (most recent data). If not found, check level 0 SSTables (most recent flushes), then level 1, level 2, and so on. At each SSTable, use the Bloom filter to skip files that definitely do not contain the key, then binary search within the SSTable if the Bloom filter says "maybe."
Compaction
Over time, SSTables accumulate. Compaction merges overlapping SSTables from one level into the next, producing larger sorted files with duplicates and tombstones removed. Two common strategies:
- Size-tiered compaction: merge SSTables of similar size; simpler, better write throughput, higher space amplification
- Leveled compaction: each level has a size limit; when exceeded, SSTables are merged into the next level; better read performance, lower space amplification, higher write amplification
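The size-tiered trigger can be illustrated with a small sketch. The bucketing heuristic below (group SSTables within 0.5x to 1.5x of a bucket's average size, compact a bucket once it holds at least four tables) is hypothetical but similar in spirit to Cassandra's defaults.

```python
def pick_compaction_bucket(table_sizes, min_threshold=4, low=0.5, high=1.5):
    """Group SSTable sizes into buckets of similar size; return the
    first bucket large enough to be worth merging, else None."""
    buckets = []
    for size in sorted(table_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)  # close enough in size: same tier
                break
        else:
            buckets.append([size])   # no similar bucket: start a new tier
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None

# Four ~100 MB tables trigger a merge; the lone 1000 MB table is left alone.
print(pick_compaction_bucket([100, 105, 98, 102, 1000]))
```

Leveled compaction uses a different trigger entirely: a level is compacted when its total size exceeds its budget (typically $T$ times the previous level's), regardless of individual table sizes.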
Complexity Analysis
Let $n$ be the number of key-value pairs, $B$ be the page size in entries, and $T$ be the size ratio between levels (typically 10).
| Operation | Leveled Compaction | Size-Tiered | Notes |
|---|---|---|---|
| Write (amortized) | $O\!\left(\frac{T \log_T n}{B}\right)$ | $O\!\left(\frac{\log_T n}{B}\right)$ | Leveled has higher write amplification |
| Point read | $O(\log_T n)$ | $O(T \log_T n)$ | Bloom filters reduce this in practice |
| Range scan ($k$ results) | $O(\log_T n + k/B)$ | $O(T \log_T n + k/B)$ | Merge sorted iterators across levels |
| Space amplification | $O(1/T)$ overhead | $O(T)$ overhead | Leveled is more space-efficient |
The fundamental trade-off in LSM Trees is between write amplification, read amplification, and space amplification. You can optimize for any two at the expense of the third. Leveled compaction favors reads and space; size-tiered favors writes.
With Bloom filters, the practical read cost drops significantly. A well-tuned Bloom filter with 10 bits per key gives a ~1% false positive rate, meaning only about 1 in 100 lookups for a key absent from an SSTable triggers an unnecessary disk read.
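The ~1% figure falls out of the standard Bloom filter false-positive formula $p = \left(1 - e^{-kn/m}\right)^k$, minimized when the number of hash functions is $k = (m/n)\ln 2$. A quick check:

```python
import math

def bloom_fp_rate(bits_per_key):
    """False-positive rate for a Bloom filter with the optimal hash count."""
    k = round(bits_per_key * math.log(2))        # optimal k = (m/n) * ln 2
    return (1 - math.exp(-k / bits_per_key)) ** k

print(round(bloom_fp_rate(10), 4))  # ~0.0082, i.e. under 1%
```

At 10 bits per key the optimal hash count is 7, giving roughly a 0.8% false-positive rate; doubling to 20 bits per key drops it below 0.01%.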
Implementation
STRUCTURE LSMTree
memtable: sorted in-memory structure (e.g., skip list)
memtableSize: current byte size of memtable
maxMemtableSize: threshold for flushing
wal: write-ahead log file
levels: array of lists of SSTables, one per level
END STRUCTURE
ALGORITHM LSMWrite(tree, key, value)
INPUT: tree: LSMTree, key, value: data to write
BEGIN
Append (key, value) to tree.wal
tree.memtable.Insert(key, value)
tree.memtableSize <- tree.memtableSize + Size(key, value)
IF tree.memtableSize >= tree.maxMemtableSize THEN
FlushMemtable(tree)
END IF
END
ALGORITHM FlushMemtable(tree)
BEGIN
sortedEntries <- tree.memtable.GetAllSorted()
newSSTable <- WriteSSTable(sortedEntries)
bloomFilter <- BuildBloomFilter(sortedEntries)
Attach bloomFilter to newSSTable
Append newSSTable to tree.levels[0]
tree.memtable <- new empty sorted structure
tree.memtableSize <- 0
Delete tree.wal
tree.wal <- new empty WAL
IF ShouldCompact(tree.levels[0]) THEN
Compact(tree, 0)
END IF
END
ALGORITHM LSMRead(tree, key)
INPUT: tree: LSMTree, key: key to look up
OUTPUT: value or NOT_FOUND
BEGIN
result <- tree.memtable.Search(key)
IF result IS NOT NOT_FOUND THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
FOR level <- 0 TO tree.maxLevel DO
FOR EACH sstable IN tree.levels[level] (newest first) DO
IF sstable.bloomFilter.MayContain(key) THEN
result <- sstable.BinarySearch(key)
IF result IS NOT NOT_FOUND THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
END IF
END FOR
END FOR
RETURN NOT_FOUND
END
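The write and read paths above can be condensed into a toy in-memory sketch. `TinyLSM`, its tiny flush threshold, and the `TOMBSTONE` sentinel are all illustrative inventions, not a real engine's API; the WAL, Bloom filters, and on-disk files are omitted to keep the skeleton visible.

```python
TOMBSTONE = object()  # sentinel marking a deleted key

class TinyLSM:
    def __init__(self, max_memtable=2):
        self.memtable = {}
        self.sstables = []              # frozen tables, newest first
        self.max_memtable = max_memtable

    def _flush(self):
        # Freeze the memtable as an immutable sorted "SSTable".
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.max_memtable:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)        # deletes are just special writes

    def get(self, key):
        # Check the memtable, then each SSTable from newest to oldest;
        # the first hit wins, and a tombstone hides older values.
        for table in [self.memtable] + self.sstables:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)       # second write fills the memtable and triggers a flush
db.delete("a")       # tombstone in the memtable shadows the flushed value
print(db.get("a"), db.get("b"))  # None 2
```

Note how the delete works: the flushed SSTable still contains `("a", 1)`, but the newer tombstone is found first, so the read correctly reports the key as gone.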
ALGORITHM Compact(tree, level)
INPUT: tree: LSMTree, level: the level to compact from
BEGIN
tablesToMerge <- SelectTables(tree.levels[level])
overlapping <- FindOverlapping(tree.levels[level + 1], tablesToMerge)
merged <- MergeSorted(tablesToMerge + overlapping)
Drop tombstones whose keys do not appear in any deeper level (a tombstone must survive as long as older versions of its key exist)
Remove duplicate keys, keeping only the newest version
newSSTables <- SplitIntoSSTables(merged, targetSize)
Replace overlapping in tree.levels[level + 1] with newSSTables
Remove tablesToMerge from tree.levels[level]
IF ShouldCompact(tree.levels[level + 1]) THEN
Compact(tree, level + 1)
END IF
END
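The heart of Compact is a k-way merge of sorted runs that keeps only the newest version of each key. A sketch using Python's `heapq.merge`, with `"__DEL__"` standing in for a tombstone (names and input format are illustrative):

```python
import heapq

def merge_runs(runs, drop_tombstones=False):
    """Merge sorted runs of (key, value) pairs, runs ordered newest first.
    Keeps the newest version of each key; optionally drops tombstones."""
    # Tag entries with the run's age so that for a duplicated key the
    # newest version sorts first in the merged stream.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    out, last_key = [], None
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                 # older duplicate: discard
        last_key = key
        if drop_tombstones and value == "__DEL__":
            continue                 # safe only when no deeper level holds the key
        out.append((key, value))
    return out

newer = [("a", "__DEL__"), ("c", 3)]
older = [("a", 1), ("b", 2)]
print(merge_runs([newer, older], drop_tombstones=True))
```

Because every run is already sorted, the merge streams through the inputs sequentially, which is exactly why compaction, like the original flush, remains sequential I/O.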
Real-World Applications
- RocksDB / LevelDB: the canonical LSM Tree implementations, used as embedded storage engines by dozens of systems including MySQL (MyRocks), CockroachDB, and TiKV
- Apache Cassandra: uses LSM Trees with size-tiered compaction by default, optimized for high write throughput across distributed nodes
- Apache HBase: the Hadoop-ecosystem wide-column store uses LSM Trees for its storage layer
- ScyllaDB: a high-performance Cassandra-compatible database built on LSM Trees with advanced compaction strategies
- InfluxDB: its storage engine (the Time-Structured Merge tree) is an LSM variant; time-series workloads suit LSM Trees because they are almost entirely append-only writes
- SQLite (WAL mode): while not a full LSM Tree, SQLite's WAL mode uses similar principles of buffering writes and merging them later
Key Takeaways
- LSM Trees convert random writes into sequential I/O by buffering in memory and flushing sorted files to disk
- The write path is simple and fast: WAL for durability, memtable for speed, flush when full
- The read path checks memtable first, then each disk level; Bloom filters make this practical by skipping irrelevant SSTables
- Compaction is the critical background process that keeps read performance in check by merging and deduplicating SSTables
- The three-way trade-off between write amplification, read amplification, and space amplification is the central design decision for any LSM-based system