B-Trees: Balanced Multi-Way Trees for Disk-Optimized Search
How B-Trees keep data sorted and balanced across multiple children per node, minimizing disk I/O for databases and file systems.
Terminology
- Node: a container that holds one or more keys and pointers to child nodes
- Key: a value stored in a node, kept in sorted order within that node
- Order (m): the maximum number of children a node can have; a node holds at most $m - 1$ keys
- Leaf: a node with no children, sitting at the bottom level of the tree
- Internal node: any node that is not a leaf; it has at least one child
- Root: the topmost node in the tree; the entry point for all operations
- Balanced: all leaf nodes sit at the same depth, so no path from root to leaf is longer than another
- Disk I/O: a read or write operation to secondary storage (hard drive, SSD); orders of magnitude slower than RAM access
- Page: a fixed-size block of data (commonly 4 KB or 8 KB) that the OS reads from or writes to disk in one operation
- Branching factor: the number of children per node; higher branching factor means a shallower tree
- Splitting: the process of dividing an overfull node into two nodes and promoting the median key to the parent
- Merging: combining two underfull sibling nodes into one during deletion
What & Why
A B-Tree is a self-balancing search tree designed for systems that read and write large blocks of data, like databases and file systems. Unlike a binary search tree where each node has at most 2 children, a B-Tree node can have hundreds or thousands of children. This wide, shallow shape is the whole point: it minimizes the number of disk reads needed to find a key.
Why not just use a binary search tree? On disk, every node access is a separate I/O operation. A balanced BST with $n$ keys has height $O(\log_2 n)$, so searching 1 billion keys requires about 30 disk reads. A B-Tree of order 1000 stores the same billion keys in a tree of height 3, requiring only 3 disk reads. That difference is enormous when each disk read takes milliseconds.
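The height comparison above is easy to verify. This is a minimal sketch that counts how many levels a balanced tree with a given branching factor needs to cover $n$ keys, using integer arithmetic to avoid floating-point rounding:

```python
def tree_height(n_keys: int, branching: int) -> int:
    """Smallest height h such that branching**h >= n_keys,
    i.e. the number of levels a search must visit."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching
        height += 1
    return height

n = 10**9
print(tree_height(n, 2))     # balanced BST: 30 levels
print(tree_height(n, 1000))  # B-Tree of order 1000: 3 levels
```

With one disk read per level, the order-1000 tree needs a tenth of the I/O of the binary tree for the same billion keys.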
Key properties of a B-Tree of order m:
- Every node has at most $m$ children and $m - 1$ keys
- Every non-root internal node has at least $\lceil m/2 \rceil$ children
- The root has at least 2 children (unless it is a leaf)
- All leaves appear at the same depth
- Keys within each node are sorted
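The occupancy rules above can be summarized per node. The following is an illustrative helper (the function name `node_bounds` is my own, not from the source) that derives the key and child limits for a given order:

```python
import math

def node_bounds(m: int) -> dict:
    """Key/child count limits for a non-root internal node
    in a B-Tree of order m."""
    min_children = math.ceil(m / 2)
    return {
        "max_children": m,
        "max_keys": m - 1,
        "min_children": min_children,
        "min_keys": min_children - 1,
    }

print(node_bounds(1000))
```

For order 1000, every non-root internal node is at least half full: between 499 and 999 keys.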
How It Works
Structure
Each internal node stores keys that act as dividers. If a node contains keys $k_1, k_2, \ldots, k_j$, it has $j + 1$ child pointers. All keys in the first child are less than $k_1$, all keys in the second child are between $k_1$ and $k_2$, and so on.
As a small example, consider a B-Tree of order 3 (each node holds at most 2 keys and has at most 3 children) whose root contains keys 30 and 60, splitting the key space into three ranges: keys below 30, keys between 30 and 60, and keys above 60.
Search
Searching a B-Tree is a top-down traversal. At each node, scan the keys to find which child pointer to follow. Because keys within a node are sorted, you can use binary search within the node for large $m$.
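The per-node step can be sketched with a standard-library binary search. This is a hedged illustration (the name `locate` is mine): it decides whether the key is in this node or which child pointer to follow next.

```python
from bisect import bisect_left

def locate(keys: list, key) -> tuple:
    """Within one node with sorted keys: return ('found', i) if
    key == keys[i], otherwise ('child', i) for the child to descend into."""
    i = bisect_left(keys, key)  # binary search over the node's keys
    if i < len(keys) and keys[i] == key:
        return ("found", i)
    return ("child", i)

# Node with keys 30 and 60: three child ranges
print(locate([30, 60], 45))  # ('child', 1): middle subtree
print(locate([30, 60], 60))  # ('found', 1)
```

For large $m$ this binary scan costs $O(\log_2 m)$ comparisons per node instead of $O(m)$ for a linear scan, though both are dwarfed by the disk read.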
Insertion
Insert always targets a leaf. Walk down from the root to find the correct leaf, then insert the key in sorted position. If the leaf now has $m$ keys (overfull), split it: take the median key, push it up to the parent, and divide the remaining keys into two new nodes. If the parent also overflows, split it too. This can cascade up to the root, which is the only way the tree grows taller.
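The split step for a leaf is simple to show in isolation. This sketch (the helper name `split_leaf` is mine) takes an overfull key list and returns the two halves plus the median destined for the parent:

```python
def split_leaf(keys: list) -> tuple:
    """Split an overfull sorted leaf: return (left_keys, median, right_keys).
    The median key is promoted to the parent node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Order-3 leaf overflows at 3 keys: 20 moves up, 10 and 30 become siblings
print(split_leaf([10, 20, 30]))  # ([10], 20, [30])
```

If the promotion overfills the parent, the same split applies one level up, which is how a cascade reaches the root.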
Deletion
Deletion is more involved. If the key is in a leaf, remove it directly. If it is in an internal node, replace it with its in-order predecessor or successor (which is always in a leaf), then delete from the leaf. After removal, if a node has fewer than $\lceil m/2 \rceil - 1$ keys, rebalance by borrowing a key from a sibling or merging with a sibling.
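The rebalancing decision after a removal can be expressed compactly. This is an illustrative sketch of the decision only (not a full deletion routine; the name `rebalance_action` is mine):

```python
import math

def rebalance_action(node_keys: int, sibling_keys: int, m: int) -> str:
    """After a deletion in a B-Tree of order m, decide how to fix
    an underfull node given an adjacent sibling's key count."""
    min_keys = math.ceil(m / 2) - 1
    if node_keys >= min_keys:
        return "ok"        # no underflow; nothing to do
    if sibling_keys > min_keys:
        return "borrow"    # sibling can spare a key via the parent
    return "merge"         # both at minimum: combine into one node

# Order 5: minimum is 2 keys per non-root node
print(rebalance_action(1, 3, 5))  # 'borrow'
print(rebalance_action(1, 2, 5))  # 'merge'
```

A merge removes a separator key from the parent, so merges, like splits, can cascade toward the root; the tree only shrinks in height when the root's last key is absorbed.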
Complexity Analysis
For a B-Tree of order $m$ containing $n$ keys, the height $h$ satisfies $h \le \log_{\lceil m/2 \rceil}\!\left(\frac{n+1}{2}\right)$, which gives the following costs:
| Operation | Disk I/O | CPU Time | Notes |
|---|---|---|---|
| Search | $O(\log_m n)$ | $O(m \log_m n)$ | One disk read per level; linear or binary scan within node |
| Insert | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible splits up |
| Delete | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible merges up |
| Space | - | $O(n)$ | Each key stored exactly once |
The critical insight: disk I/O is $O(\log_m n)$, not $O(\log_2 n)$. With $m = 1000$ and $n = 10^9$, that is $\log_{1000}(10^9) = 3$ disk reads versus $\log_2(10^9) \approx 30$ for a BST.
Implementation
The pseudocode below follows the common textbook convention of a minimum degree $t$: every node holds between $t - 1$ and $2t - 1$ keys, and a full node has exactly $2t - 1$. In terms of order, this corresponds to $m = 2t$ children.
ALGORITHM BTreeSearch(node, key)
INPUT: node: a B-Tree node, key: the value to find
OUTPUT: (foundNode, index) or NOT_FOUND
BEGIN
i <- 0
WHILE i < node.numKeys AND key > node.keys[i] DO
i <- i + 1
END WHILE
IF i < node.numKeys AND key = node.keys[i] THEN
RETURN (node, i)
END IF
IF node.isLeaf THEN
RETURN NOT_FOUND
END IF
DiskRead(node.children[i])
RETURN BTreeSearch(node.children[i], key)
END
ALGORITHM BTreeInsert(tree, key)
INPUT: tree: a B-Tree, key: the value to insert
OUTPUT: tree with key inserted
BEGIN
root <- tree.root
IF root.numKeys = 2 * t - 1 THEN
newRoot <- AllocateNode()
newRoot.isLeaf <- FALSE
newRoot.children[0] <- root
SplitChild(newRoot, 0)
tree.root <- newRoot
InsertNonFull(newRoot, key)
ELSE
InsertNonFull(root, key)
END IF
END
ALGORITHM InsertNonFull(node, key)
BEGIN
i <- node.numKeys - 1
IF node.isLeaf THEN
WHILE i >= 0 AND key < node.keys[i] DO
node.keys[i + 1] <- node.keys[i]
i <- i - 1
END WHILE
node.keys[i + 1] <- key
node.numKeys <- node.numKeys + 1
DiskWrite(node)
ELSE
WHILE i >= 0 AND key < node.keys[i] DO
i <- i - 1
END WHILE
i <- i + 1
DiskRead(node.children[i])
IF node.children[i].numKeys = 2 * t - 1 THEN
SplitChild(node, i)
IF key > node.keys[i] THEN
i <- i + 1
END IF
END IF
InsertNonFull(node.children[i], key)
END IF
END
ALGORITHM SplitChild(parent, i)
INPUT: parent node, index i of the full child to split
BEGIN
fullChild <- parent.children[i]
newChild <- AllocateNode()
newChild.isLeaf <- fullChild.isLeaf
newChild.numKeys <- t - 1
FOR j <- 0 TO t - 2 DO
newChild.keys[j] <- fullChild.keys[j + t]
END FOR
IF NOT fullChild.isLeaf THEN
FOR j <- 0 TO t - 1 DO
newChild.children[j] <- fullChild.children[j + t]
END FOR
END IF
fullChild.numKeys <- t - 1
FOR j <- parent.numKeys DOWNTO i + 1 DO
parent.children[j + 1] <- parent.children[j]
END FOR
parent.children[i + 1] <- newChild
FOR j <- parent.numKeys - 1 DOWNTO i DO
parent.keys[j + 1] <- parent.keys[j]
END FOR
parent.keys[i] <- fullChild.keys[t - 1]
parent.numKeys <- parent.numKeys + 1
DiskWrite(fullChild)
DiskWrite(newChild)
DiskWrite(parent)
END
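The pseudocode above can be sketched as runnable Python. This is a minimal in-memory version (no `DiskRead`/`DiskWrite`, insert and search only) that mirrors the same structure, using minimum degree `t`:

```python
class Node:
    def __init__(self, leaf: bool = True):
        self.keys = []       # sorted keys in this node
        self.children = []   # len(children) == len(keys) + 1 when internal
        self.leaf = leaf

class BTree:
    """B-Tree with minimum degree t: nodes hold t-1 to 2t-1 keys."""
    def __init__(self, t: int = 2):
        self.t = t
        self.root = Node(leaf=True)

    def search(self, key, node=None):
        """Return (node, index) holding key, or None if absent."""
        if node is None:
            node = self.root
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if node.leaf:
            return None
        return self.search(key, node.children[i])

    def insert(self, key):
        root = self.root
        if len(root.keys) == 2 * self.t - 1:   # full root: grow taller
            new_root = Node(leaf=False)
            new_root.children.append(root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        i = 0
        while i < len(node.keys) and node.keys[i] < key:
            i += 1
        if node.leaf:
            node.keys.insert(i, key)
        else:
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)     # split before descending
                if key > node.keys[i]:
                    i += 1
            self._insert_nonfull(node.children[i], key)

    def _split_child(self, parent, i):
        t = self.t
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        right.keys = full.keys[t:]             # upper t-1 keys
        median = full.keys[t - 1]              # promoted to parent
        full.keys = full.keys[:t - 1]          # lower t-1 keys
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.children.insert(i + 1, right)
        parent.keys.insert(i, median)
```

Production implementations add node serialization to fixed-size pages and the deletion/rebalancing logic described earlier; this sketch only demonstrates the split-before-descend insertion strategy.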
Real-World Applications
- Database indexing: most relational databases (PostgreSQL, MySQL/InnoDB, Oracle) use B-Tree variants for their primary indexes
- File systems: NTFS, HFS+, ext4 (via htree), and Btrfs all use B-Trees or B+ Trees to organize directory entries and file metadata
- Key-value stores: embedded databases like SQLite and BerkeleyDB use B-Trees as their core storage structure
- Operating systems: the Linux kernel uses B-Tree variants in several subsystems, including the maple tree (a B-Tree derivative) for tracking memory mappings
- Copy-on-write storage: Btrfs and ZFS use copy-on-write B-Trees for snapshot and transaction support
Key Takeaways
- B-Trees are designed to minimize disk I/O by packing many keys into each node, creating wide, shallow trees
- A B-Tree of order $m$ has height $O(\log_m n)$, so increasing $m$ dramatically reduces the number of disk reads
- All leaves sit at the same depth, guaranteeing worst-case $O(\log_m n)$ for search, insert, and delete
- Insertions may trigger splits that propagate upward; deletions may trigger merges or borrows from siblings
- B-Trees are the foundation for B+ Trees (used in most production databases), which add leaf-level linking for efficient range queries