
LSM Trees: Write-Optimized Storage with Log-Structured Merge

How LSM Trees turn random writes into sequential I/O, powering the storage engines behind Cassandra, RocksDB, and LevelDB.

2021-02-06
Advanced Data Structures · lsm-trees · data-structures

Terminology

  • LSM Tree: Log-Structured Merge Tree, a data structure that buffers writes in memory and periodically flushes them to disk as sorted, immutable files
  • Memtable: the in-memory write buffer (typically a balanced BST or skip list) that accepts all incoming writes
  • SSTable: Sorted String Table, an immutable, sorted file on disk containing key-value pairs
  • Write-Ahead Log (WAL): a sequential log on disk that records every write before it enters the memtable, ensuring durability if the process crashes
  • Compaction: the background process of merging multiple SSTables into fewer, larger SSTables, removing duplicates and deleted entries
  • Level: a tier in the LSM Tree's disk hierarchy; level 0 holds recently flushed SSTables, higher levels hold progressively larger merged files
  • Tombstone: a special marker written to indicate that a key has been deleted; the actual removal happens during compaction
  • Write amplification: the ratio of total bytes written to disk versus bytes written by the application; compaction increases this ratio
  • Read amplification: the number of disk reads needed to satisfy a single query, since data may be spread across multiple levels
  • Space amplification: the ratio of disk space used versus the logical size of the data, caused by duplicate keys across levels before compaction
  • Bloom filter: a probabilistic data structure used per SSTable to quickly rule out keys that are not present, reducing unnecessary disk reads

What & Why

An LSM Tree is a storage architecture designed for workloads where writes vastly outnumber reads, or where write throughput is the primary bottleneck. Instead of updating data in place on disk (like a B-Tree), an LSM Tree buffers all writes in memory and periodically flushes them as sorted, immutable files. This converts random writes into sequential writes, which is dramatically faster on both HDDs and SSDs.

Why not just use a B-Tree? B-Trees update pages in place, which means every write is a random disk I/O. For write-heavy workloads (logging, time-series data, messaging), this becomes the bottleneck. An LSM Tree absorbs thousands of writes in memory, then writes them all at once in a single sequential flush. The trade-off: reads are slower because data may be spread across multiple files on disk.

Key properties:

  • Writes go to an in-memory buffer (memtable), making them extremely fast
  • When the memtable fills up, it is flushed to disk as an immutable SSTable
  • Background compaction merges SSTables to keep read performance manageable
  • Deletes are handled by writing tombstone markers, not by erasing data
  • Bloom filters on each SSTable reduce unnecessary disk reads during queries

How It Works

Write Path

Every write first goes to the WAL (for crash recovery), then to the memtable. When the memtable reaches a size threshold, it is frozen and flushed to disk as a new SSTable at level 0. A fresh memtable takes its place.
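The steps above can be sketched in a few lines of Python. This is a toy, not a real engine: a plain dict stands in for the sorted memtable, and the class and field names (`MemtableSketch`, `level0`) are our own.

```python
import json

class MemtableSketch:
    """Toy LSM write path: WAL first, then memtable, flush on a threshold."""

    def __init__(self, wal_path, max_entries=1024):
        self.wal = open(wal_path, "a")   # sequential, append-only log
        self.memtable = {}               # stands in for a skip list / BST
        self.max_entries = max_entries
        self.level0 = []                 # flushed, immutable sorted runs

    def put(self, key, value):
        # 1. Durability: record the write in the WAL before anything else.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Speed: apply the write to the in-memory buffer.
        self.memtable[key] = value
        # 3. Freeze and flush once the buffer hits its threshold.
        if len(self.memtable) >= self.max_entries:
            self.flush()

    def flush(self):
        # Sorting at flush time produces an immutable sorted run (an SSTable).
        self.level0.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
```

Note that the WAL write happens before the memtable insert: if the process crashes between the two, replaying the WAL on restart reconstructs the lost memtable.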

Read Path

To read a key, check the memtable first (most recent data). If not found, check level 0 SSTables (most recent flushes), then level 1, level 2, and so on. At each SSTable, use the Bloom filter to skip files that definitely do not contain the key, then binary search within the SSTable if the Bloom filter says "maybe."
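A minimal sketch of that lookup order, assuming each SSTable is a sorted list of (key, value) pairs; Bloom filters are omitted here to keep the order of checks visible.

```python
from bisect import bisect_left

TOMBSTONE = object()  # sentinel marking a deleted key

def read_key(key, memtable, levels):
    """Return the value for `key`, or None if absent or deleted.
    `levels` is a list of levels; each level is a list of SSTables
    (newest first), and each SSTable is a sorted list of (k, v) pairs."""
    # Memtable holds the most recent data, so it is checked first.
    if key in memtable:
        v = memtable[key]
        return None if v is TOMBSTONE else v
    # Then level 0, level 1, ... in order of decreasing recency.
    for level in levels:
        for sstable in level:               # newest flush first
            keys = [k for k, _ in sstable]
            i = bisect_left(keys, key)      # binary search within the run
            if i < len(keys) and keys[i] == key:
                v = sstable[i][1]
                return None if v is TOMBSTONE else v
    return None
```

The first match wins: because newer data always sits in a shallower level, stopping at the first hit returns the most recent version of the key.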

Compaction

Over time, SSTables accumulate. Compaction merges overlapping SSTables from one level into the next, producing larger sorted files with duplicates and tombstones removed. Two common strategies:

  • Size-tiered compaction: merge SSTables of similar size; simpler, better write throughput, higher space amplification
  • Leveled compaction: each level has a size limit; when exceeded, SSTables are merged into the next level; better read performance, lower space amplification, higher write amplification
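The core of either strategy is a k-way merge of sorted runs. A sketch using `heapq.merge`, where `runs[0]` is the newest run so its values win on duplicate keys (`merge_runs` is an illustrative name, not from any real engine):

```python
import heapq

TOMBSTONE = None  # hypothetical deletion marker

def merge_runs(runs, drop_tombstones=True):
    """Merge sorted (key, value) runs into one sorted, deduplicated run.
    Tombstones may only be dropped when merging into the bottom level,
    where no older version of the key can exist below."""
    merged = []
    # Tag each entry with its run index ("age") so that on equal keys,
    # entries from newer runs sort first.
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    for key, age, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue                         # older duplicate: discard
        merged.append((key, value))
    if drop_tombstones:
        merged = [(k, v) for k, v in merged if v is not TOMBSTONE]
    return merged
```

Because the inputs are sorted, the merge streams through them in a single pass, which is why compaction is itself sequential I/O.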
[Diagram: the write path flows from the WAL to the in-memory memtable, which flushes to Level 0 SSTables on disk; compaction merges Level 0 into the larger Levels 1 and 2. The read path checks the memtable, then L0, L1, L2 in order until the key is found, with Bloom filters skipping irrelevant SSTables.]

Complexity Analysis

Let $n$ be the number of key-value pairs, $B$ be the page size in entries, and $T$ be the size ratio between levels (typically 10).

Each entry compares leveled compaction against size-tiered compaction:

  • Write (amortized): $O\!\left(\frac{T \log_T n}{B}\right)$ vs. $O\!\left(\frac{\log_T n}{B}\right)$; leveled has higher write amplification
  • Point read: $O(\log_T n)$ vs. $O(T \log_T n)$; Bloom filters reduce this in practice
  • Range scan ($k$ results): $O(\log_T n + k/B)$ vs. $O(T \log_T n + k/B)$; merge sorted iterators across levels
  • Space amplification: $O(1/T)$ overhead vs. $O(T)$ overhead; leveled is more space-efficient

The fundamental trade-off in LSM Trees is between write amplification, read amplification, and space amplification. You can optimize for any two at the expense of the third. Leveled compaction favors reads and space; size-tiered favors writes.
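As a back-of-envelope illustration of that trade-off, here is a rough model of leveled write amplification (our own simplification, not a formula from any particular engine): a key is rewritten about $T$ times at each of roughly $\log_T$ levels before it settles.

```python
from math import ceil, log

def leveled_write_amp(total_keys, memtable_keys, T=10):
    """Rough leveled-compaction write amplification: one WAL write, one
    flush, then about T rewrites per level as runs are pushed down.
    Back-of-envelope only; real engines vary."""
    levels = max(1, ceil(log(total_keys / memtable_keys, T)))
    return 1 + 1 + T * levels   # WAL + initial flush + per-level rewrites

# e.g. 10^9 keys with a 10^6-key memtable and T = 10 gives 3 levels,
# so each logical byte costs roughly 32 bytes of physical writes
```

Numbers in this range (20x to 40x) are why write amplification is a first-class tuning concern on SSDs, where total bytes written determines device lifetime.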

With Bloom filters, the practical read cost drops significantly. A well-tuned Bloom filter with 10 bits per key gives a ~1% false positive rate, meaning only 1 in 100 SSTable lookups results in an unnecessary disk read.
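That figure follows from the standard Bloom filter false-positive formula $(1 - e^{-kn/m})^k$ with the optimal hash count $k = (m/n)\ln 2$. A quick check (the helper name is ours):

```python
from math import exp, log

def bloom_fpr(bits_per_key):
    """False-positive rate of a Bloom filter using the optimal number of
    hash functions k = (m/n) * ln 2, rounded to the nearest integer."""
    k = round(bits_per_key * log(2))        # optimal hash count
    return (1 - exp(-k / bits_per_key)) ** k

# 10 bits per key -> k = 7 hashes -> roughly a 0.8% false-positive rate
```

Doubling the bits per key drops the rate dramatically, which is why engines such as RocksDB expose bits-per-key as a tuning knob.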

Implementation

STRUCTURE LSMTree
  memtable: sorted in-memory structure (e.g., skip list)
  memtableSize: current byte size of memtable
  maxMemtableSize: threshold for flushing
  wal: write-ahead log file
  levels: array of lists of SSTables, one per level
  maxLevel: index of the deepest non-empty level
END STRUCTURE

ALGORITHM LSMWrite(tree, key, value)
INPUT: tree: LSMTree, key, value: data to write
BEGIN
  Append (key, value) to tree.wal
  tree.memtable.Insert(key, value)
  tree.memtableSize <- tree.memtableSize + Size(key, value)

  IF tree.memtableSize >= tree.maxMemtableSize THEN
    FlushMemtable(tree)
  END IF
END

ALGORITHM FlushMemtable(tree)
BEGIN
  sortedEntries <- tree.memtable.GetAllSorted()
  newSSTable <- WriteSSTable(sortedEntries)
  bloomFilter <- BuildBloomFilter(sortedEntries)
  Attach bloomFilter to newSSTable

  Append newSSTable to tree.levels[0]
  tree.memtable <- new empty sorted structure
  tree.memtableSize <- 0
  Delete tree.wal
  tree.wal <- new empty WAL

  IF ShouldCompact(tree.levels[0]) THEN
    Compact(tree, 0)
  END IF
END

ALGORITHM LSMRead(tree, key)
INPUT: tree: LSMTree, key: key to look up
OUTPUT: value or NOT_FOUND
BEGIN
  result <- tree.memtable.Search(key)
  IF result IS NOT NOT_FOUND THEN
    IF result IS TOMBSTONE THEN RETURN NOT_FOUND
    RETURN result
  END IF

  FOR level <- 0 TO tree.maxLevel DO
    FOR EACH sstable IN tree.levels[level] (newest first) DO
      IF sstable.bloomFilter.MayContain(key) THEN
        result <- sstable.BinarySearch(key)
        IF result IS NOT NOT_FOUND THEN
          IF result IS TOMBSTONE THEN RETURN NOT_FOUND
          RETURN result
        END IF
      END IF
    END FOR
  END FOR

  RETURN NOT_FOUND
END

ALGORITHM Compact(tree, level)
INPUT: tree: LSMTree, level: the level to compact from
BEGIN
  tablesToMerge <- SelectTables(tree.levels[level])
  overlapping <- FindOverlapping(tree.levels[level + 1], tablesToMerge)

  merged <- MergeSorted(tablesToMerge + overlapping)
  Remove tombstones for keys not present in higher levels
  Remove duplicate keys (keep newest)

  newSSTables <- SplitIntoSSTables(merged, targetSize)
  Replace overlapping in tree.levels[level + 1] with newSSTables
  Remove tablesToMerge from tree.levels[level]

  IF ShouldCompact(tree.levels[level + 1]) THEN
    Compact(tree, level + 1)
  END IF
END
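Tying the pseudocode together, here is a deliberately tiny in-memory version: no WAL, no Bloom filters, linear scans instead of binary search, and a single level whose runs are fully re-merged on compaction. `ToyLSM` and its thresholds are illustrative only.

```python
class ToyLSM:
    """Minimal single-level LSM sketch for demonstration purposes."""
    TOMBSTONE = object()

    def __init__(self, max_memtable=2):
        self.memtable = {}
        self.max_memtable = max_memtable
        self.runs = []                       # newest-first sorted runs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.max_memtable:
            self._flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)        # deletes are tombstone writes

    def get(self, key, default=None):
        if key in self.memtable:
            v = self.memtable[key]
            return default if v is self.TOMBSTONE else v
        for run in self.runs:                # newest run first
            for k, v in run:                 # real code binary-searches
                if k == key:
                    return default if v is self.TOMBSTONE else v
        return default

    def _flush(self):
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.runs) > 2:               # crude compaction trigger
            self._compact()

    def _compact(self):
        # Merge all runs, newest value winning, then drop tombstones
        # (safe here because nothing older survives the full merge).
        merged = {}
        for run in reversed(self.runs):      # apply oldest first
            merged.update(dict(run))
        self.runs = [sorted((k, v) for k, v in merged.items()
                            if v is not self.TOMBSTONE)]
```

Even at this scale the shape of the real systems is visible: writes never touch old data, reads walk from newest to oldest, and compaction is the only place data is physically rewritten or removed.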

Real-World Applications

  • RocksDB / LevelDB: the canonical LSM Tree implementations, used as embedded storage engines by dozens of systems including MySQL (MyRocks), CockroachDB, and TiKV
  • Apache Cassandra: uses LSM Trees with size-tiered compaction by default, optimized for high write throughput across distributed nodes
  • Apache HBase: the Hadoop-ecosystem wide-column store uses LSM Trees for its storage layer
  • ScyllaDB: a high-performance Cassandra-compatible database built on LSM Trees with advanced compaction strategies
  • InfluxDB: time-series databases are a natural fit for LSM Trees because time-series data is almost entirely append-only writes
  • SQLite (WAL mode): while not a full LSM Tree, SQLite's WAL mode uses similar principles of buffering writes and merging them later

Key Takeaways

  • LSM Trees convert random writes into sequential I/O by buffering in memory and flushing sorted files to disk
  • The write path is simple and fast: WAL for durability, memtable for speed, flush when full
  • The read path checks memtable first, then each disk level; Bloom filters make this practical by skipping irrelevant SSTables
  • Compaction is the critical background process that keeps read performance in check by merging and deduplicating SSTables
  • The three-way trade-off between write amplification, read amplification, and space amplification is the central design decision for any LSM-based system