LSM Trees: Write-Optimized Storage with Log-Structured Merge
How LSM Trees turn random writes into sequential I/O, powering the storage engines behind Cassandra, RocksDB, and LevelDB.
Terminology
- LSM Tree: Log-Structured Merge Tree, a data structure that buffers writes in memory and periodically flushes them to disk as sorted, immutable files
- Memtable: the in-memory write buffer (typically a balanced BST or skip list) that accepts all incoming writes
- SSTable: Sorted String Table, an immutable, sorted file on disk containing key-value pairs
- Write-Ahead Log (WAL): a sequential log on disk that records every write before it enters the memtable, ensuring durability if the process crashes
- Compaction: the background process of merging multiple SSTables into fewer, larger SSTables, removing duplicates and deleted entries
- Level: a tier in the LSM Tree's disk hierarchy; level 0 holds recently flushed SSTables, higher levels hold progressively larger merged files
- Tombstone: a special marker written to indicate that a key has been deleted; the actual removal happens during compaction
- Write amplification: the ratio of total bytes written to disk versus bytes written by the application; compaction increases this ratio
- Read amplification: the number of disk reads needed to satisfy a single query, since data may be spread across multiple levels
- Space amplification: the ratio of disk space used versus the logical size of the data, caused by duplicate keys across levels before compaction
- Bloom filter: a probabilistic data structure used per SSTable to quickly rule out keys that are not present, reducing unnecessary disk reads
What & Why
An LSM Tree is a storage architecture designed for workloads where writes vastly outnumber reads, or where write throughput is the primary bottleneck. Instead of updating data in place on disk (like a B-Tree), an LSM Tree buffers all writes in memory and periodically flushes them as sorted, immutable files. This converts random writes into sequential writes, which is dramatically faster on both HDDs and SSDs.
Why not just use a B-Tree? B-Trees update pages in place, so every write eventually becomes a random disk I/O when the modified page is rewritten. For write-heavy workloads (logging, time-series data, messaging), this becomes the bottleneck. An LSM Tree absorbs thousands of writes in memory, then writes them all at once in a single sequential flush. The trade-off: reads are slower because a key's latest value may live in any of several files on disk.
Key properties:
- Writes go to an in-memory buffer (memtable), making them extremely fast
- When the memtable fills up, it is flushed to disk as an immutable SSTable
- Background compaction merges SSTables to keep read performance manageable
- Deletes are handled by writing tombstone markers, not by erasing data
- Bloom filters on each SSTable reduce unnecessary disk reads during queries
How It Works
Write Path
Every write first goes to the WAL (for crash recovery), then to the memtable. When the memtable reaches a size threshold, it is frozen and flushed to disk as a new SSTable at level 0. A fresh memtable takes its place.
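The WAL's role in crash recovery can be sketched in a few lines of Python. This is a hypothetical record format (one JSON array per line); real engines use binary framing with checksums, but the append-then-fsync-then-replay cycle is the same idea.

```python
import json
import os

class WAL:
    """Minimal write-ahead log sketch: append before applying, replay on restart."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        # Persist the record before it touches the memtable, forcing it
        # to stable storage so a crash cannot lose an acknowledged write.
        self.f.write(json.dumps([key, value]) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self):
        # On restart, rebuild the memtable from every logged record.
        with open(self.path) as f:
            for line in f:
                key, value = json.loads(line)
                yield key, value

wal = WAL("demo.wal")
wal.append("user:1", "alice")
wal.append("user:2", "bob")
memtable = dict(wal.replay())  # crash recovery: memtable rebuilt from the log
```

Once the memtable is safely flushed as an SSTable, the log can be truncated, since every record it protects is now durable on disk.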
Read Path
To read a key, check the memtable first (most recent data). If not found, check level 0 SSTables (most recent flushes), then level 1, level 2, and so on. At each SSTable, use the Bloom filter to skip files that definitely do not contain the key, then binary search within the SSTable if the Bloom filter says "maybe."
Compaction
Over time, SSTables accumulate. Compaction merges overlapping SSTables from one level into the next, producing larger sorted files with duplicates and tombstones removed. Two common strategies:
- Size-tiered compaction: merge SSTables of similar size; simpler, better write throughput, higher space amplification
- Leveled compaction: each level has a size limit; when exceeded, SSTables are merged into the next level; better read performance, lower space amplification, higher write amplification
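The size-tiered trigger can be illustrated with a small sketch. The bucketing heuristic below (group SSTables within 0.5x to 1.5x of a bucket's average size, compact a bucket once it holds at least four tables) is hypothetical but similar in spirit to Cassandra's defaults.

```python
def pick_compaction_bucket(table_sizes, min_threshold=4, low=0.5, high=1.5):
    """Group SSTable sizes into buckets of similar size; return the
    first bucket large enough to be worth merging, else None."""
    buckets = []
    for size in sorted(table_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)  # close enough in size: same tier
                break
        else:
            buckets.append([size])   # no similar bucket: start a new tier
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None

# Four ~100 MB tables trigger a merge; the lone 1000 MB table is left alone.
print(pick_compaction_bucket([100, 105, 98, 102, 1000]))
```

Leveled compaction uses a different trigger entirely: a level is compacted when its total size exceeds its budget (typically $T$ times the previous level's), regardless of individual table sizes.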
Complexity Analysis
Let $n$ be the number of key-value pairs, $B$ be the page size in entries, and $T$ be the size ratio between levels (typically 10).
| Operation | Leveled Compaction | Size-Tiered | Notes |
|---|---|---|---|
| Write (amortized) | $O\!\left(\frac{T \log_T n}{B}\right)$ | $O\!\left(\frac{\log_T n}{B}\right)$ | Leveled has higher write amplification |
| Point read | $O(\log_T n)$ | $O(T \log_T n)$ | Bloom filters reduce this in practice |
| Range scan ($k$ results) | $O(\log_T n + k/B)$ | $O(T \log_T n + k/B)$ | Merge sorted iterators across levels |
| Space amplification | $O(1/T)$ overhead | $O(T)$ overhead | Leveled is more space-efficient |
The fundamental trade-off in LSM Trees is between write amplification, read amplification, and space amplification. You can optimize for any two at the expense of the third. Leveled compaction favors reads and space; size-tiered favors writes.
With Bloom filters, the practical read cost drops significantly. A well-tuned Bloom filter with 10 bits per key gives a ~1% false positive rate, meaning only about 1 in 100 lookups for a key absent from an SSTable triggers an unnecessary disk read.
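The ~1% figure falls out of the standard Bloom filter false-positive formula $p = \left(1 - e^{-kn/m}\right)^k$, minimized when the number of hash functions is $k = (m/n)\ln 2$. A quick check:

```python
import math

def bloom_fp_rate(bits_per_key):
    """False-positive rate for a Bloom filter with the optimal hash count."""
    k = round(bits_per_key * math.log(2))        # optimal k = (m/n) * ln 2
    return (1 - math.exp(-k / bits_per_key)) ** k

print(round(bloom_fp_rate(10), 4))  # ~0.0082, i.e. under 1%
```

At 10 bits per key the optimal hash count is 7, giving roughly a 0.8% false-positive rate; doubling to 20 bits per key drops it below 0.01%.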
Implementation
STRUCTURE LSMTree
memtable: sorted in-memory structure (e.g., skip list)
memtableSize: current byte size of memtable
maxMemtableSize: threshold for flushing
wal: write-ahead log file
levels: array of lists of SSTables, one per level
END STRUCTURE
ALGORITHM LSMWrite(tree, key, value)
INPUT: tree: LSMTree, key, value: data to write
BEGIN
Append (key, value) to tree.wal
tree.memtable.Insert(key, value)
tree.memtableSize <- tree.memtableSize + Size(key, value)
IF tree.memtableSize >= tree.maxMemtableSize THEN
FlushMemtable(tree)
END IF
END
ALGORITHM FlushMemtable(tree)
BEGIN
sortedEntries <- tree.memtable.GetAllSorted()
newSSTable <- WriteSSTable(sortedEntries)
bloomFilter <- BuildBloomFilter(sortedEntries)
Attach bloomFilter to newSSTable
Append newSSTable to tree.levels[0]
tree.memtable <- new empty sorted structure
tree.memtableSize <- 0
Delete tree.wal
tree.wal <- new empty WAL
IF ShouldCompact(tree.levels[0]) THEN
Compact(tree, 0)
END IF
END
ALGORITHM LSMRead(tree, key)
INPUT: tree: LSMTree, key: key to look up
OUTPUT: value or NOT_FOUND
BEGIN
result <- tree.memtable.Search(key)
IF result IS NOT NOT_FOUND THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
FOR level <- 0 TO tree.maxLevel DO
FOR EACH sstable IN tree.levels[level] (newest first) DO
IF sstable.bloomFilter.MayContain(key) THEN
result <- sstable.BinarySearch(key)
IF result IS NOT NOT_FOUND THEN
IF result IS TOMBSTONE THEN RETURN NOT_FOUND
RETURN result
END IF
END IF
END FOR
END FOR
RETURN NOT_FOUND
END
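The write and read paths above can be condensed into a toy in-memory sketch. `TinyLSM`, its tiny flush threshold, and the `TOMBSTONE` sentinel are all illustrative inventions, not a real engine's API; the WAL, Bloom filters, and on-disk files are omitted to keep the skeleton visible.

```python
TOMBSTONE = object()  # sentinel marking a deleted key

class TinyLSM:
    def __init__(self, max_memtable=2):
        self.memtable = {}
        self.sstables = []              # frozen tables, newest first
        self.max_memtable = max_memtable

    def _flush(self):
        # Freeze the memtable as an immutable sorted "SSTable".
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.max_memtable:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)        # deletes are just special writes

    def get(self, key):
        # Check the memtable, then each SSTable from newest to oldest;
        # the first hit wins, and a tombstone hides older values.
        for table in [self.memtable] + self.sstables:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)       # second write fills the memtable and triggers a flush
db.delete("a")       # tombstone in the memtable shadows the flushed value
print(db.get("a"), db.get("b"))  # None 2
```

Note how the delete works: the flushed SSTable still contains `("a", 1)`, but the newer tombstone is found first, so the read correctly reports the key as gone.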
ALGORITHM Compact(tree, level)
INPUT: tree: LSMTree, level: the level to compact from
BEGIN
tablesToMerge <- SelectTables(tree.levels[level])
overlapping <- FindOverlapping(tree.levels[level + 1], tablesToMerge)
merged <- MergeSorted(tablesToMerge + overlapping)
Drop tombstones whose keys do not appear in any deeper level (a tombstone must survive as long as older versions of its key exist)
Remove duplicate keys, keeping only the newest version
newSSTables <- SplitIntoSSTables(merged, targetSize)
Replace overlapping in tree.levels[level + 1] with newSSTables
Remove tablesToMerge from tree.levels[level]
IF ShouldCompact(tree.levels[level + 1]) THEN
Compact(tree, level + 1)
END IF
END
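The heart of Compact is a k-way merge of sorted runs that keeps only the newest version of each key. A sketch using Python's `heapq.merge`, with `"__DEL__"` standing in for a tombstone (names and input format are illustrative):

```python
import heapq

def merge_runs(runs, drop_tombstones=False):
    """Merge sorted runs of (key, value) pairs, runs ordered newest first.
    Keeps the newest version of each key; optionally drops tombstones."""
    # Tag entries with the run's age so that for a duplicated key the
    # newest version sorts first in the merged stream.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    out, last_key = [], None
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                 # older duplicate: discard
        last_key = key
        if drop_tombstones and value == "__DEL__":
            continue                 # safe only when no deeper level holds the key
        out.append((key, value))
    return out

newer = [("a", "__DEL__"), ("c", 3)]
older = [("a", 1), ("b", 2)]
print(merge_runs([newer, older], drop_tombstones=True))
```

Because every run is already sorted, the merge streams through the inputs sequentially, which is exactly why compaction, like the original flush, remains sequential I/O.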
Real-World Applications
- RocksDB / LevelDB: the canonical LSM Tree implementations, used as embedded storage engines by dozens of systems including MySQL (MyRocks), CockroachDB, and TiKV
- Apache Cassandra: uses LSM Trees with size-tiered compaction by default, optimized for high write throughput across distributed nodes
- Apache HBase: the Hadoop-ecosystem wide-column store uses LSM Trees for its storage layer
- ScyllaDB: a high-performance Cassandra-compatible database built on LSM Trees with advanced compaction strategies
- InfluxDB: its storage engine (the Time-Structured Merge tree) is an LSM variant; time-series workloads suit LSM Trees because they are almost entirely append-only writes
- SQLite (WAL mode): while not a full LSM Tree, SQLite's WAL mode uses similar principles of buffering writes and merging them later
Key Takeaways
- LSM Trees convert random writes into sequential I/O by buffering in memory and flushing sorted files to disk
- The write path is simple and fast: WAL for durability, memtable for speed, flush when full
- The read path checks memtable first, then each disk level; Bloom filters make this practical by skipping irrelevant SSTables
- Compaction is the critical background process that keeps read performance in check by merging and deduplicating SSTables
- The three-way trade-off between write amplification, read amplification, and space amplification is the central design decision for any LSM-based system