
B-Trees: Balanced Multi-Way Trees for Disk-Optimized Search

How B-Trees keep data sorted and balanced across multiple children per node, minimizing disk I/O for databases and file systems.

2021-02-01
Tags: Advanced Data Structures, b-trees, data-structures

Terminology

  • Node: a container that holds one or more keys and pointers to child nodes
  • Key: a value stored in a node, kept in sorted order within that node
  • Order (m): the maximum number of children a node can have; a node holds at most $m - 1$ keys
  • Leaf: a node with no children, sitting at the bottom level of the tree
  • Internal node: any node that is not a leaf; it has at least one child
  • Root: the topmost node in the tree; the entry point for all operations
  • Balanced: all leaf nodes sit at the same depth, so no path from root to leaf is longer than another
  • Disk I/O: a read or write operation to secondary storage (hard drive, SSD); orders of magnitude slower than RAM access
  • Page: a fixed-size block of data (commonly 4 KB or 8 KB) that the OS reads from or writes to disk in one operation
  • Branching factor: the number of children per node; higher branching factor means a shallower tree
  • Splitting: the process of dividing an overfull node into two nodes and promoting the median key to the parent
  • Merging: combining two underfull sibling nodes into one during deletion

What & Why

A B-Tree is a self-balancing search tree designed for systems that read and write large blocks of data, like databases and file systems. Unlike a binary search tree where each node has at most 2 children, a B-Tree node can have hundreds or thousands of children. This wide, shallow shape is the whole point: it minimizes the number of disk reads needed to find a key.

Why not just use a binary search tree? On disk, every node access is a separate I/O operation. A balanced BST with $n$ keys has height $O(\log_2 n)$, so searching 1 billion keys requires about 30 disk reads. A B-Tree of order 1000 stores the same billion keys in a tree of height 3, requiring only 3 disk reads. That difference is enormous when each disk read takes milliseconds.
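A quick back-of-the-envelope check of these numbers in Python:

```python
import math

n = 10**9  # one billion keys

# Balanced BST: one node (and one disk read) per level, height ~ log2(n)
bst_reads = round(math.log2(n))         # ~ 30

# B-Tree of order 1000: height ~ log_1000(n)
btree_reads = round(math.log(n, 1000))  # ~ 3

print(bst_reads, btree_reads)
```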

Key properties of a B-Tree of order m:

  • Every node has at most m children and m - 1 keys
  • Every non-root internal node has at least $\lceil m/2 \rceil$ children
  • The root has at least 2 children (unless it is a leaf)
  • All leaves appear at the same depth
  • Keys within each node are sorted

How It Works

Structure

Each internal node stores keys that act as dividers. If a node contains keys $k_1, k_2, \ldots, k_j$, it has $j + 1$ child pointers. All keys in the first child are less than $k_1$, all keys in the second child are between $k_1$ and $k_2$, and so on.

                      [30 | 60]
               /          |          \
        [10 | 20]     [40 | 50]      [70]
        /   |   \     /   |   \      /  \
      [5] [15] [25] [35] [45] [55] [65] [75]

The diagram above shows a B-Tree of order 3 (each node holds at most 2 keys and has at most 3 children). The root contains keys 30 and 60, splitting the key space into three ranges.

Search

Searching a B-Tree is a top-down traversal. At each node, scan the keys to find which child pointer to follow. Because keys within a node are sorted, you can use binary search within the node for large $m$.
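As a sketch, the within-node step can be done with Python's `bisect` module: `bisect_left` finds either the key's own position or the index of the child pointer to follow (the variable names here are illustrative, not from the pseudocode later in this post).

```python
import bisect

keys = [30, 60]  # keys of the root node from the diagram above
key = 45

# Binary search within the node: first index i with keys[i] >= key
i = bisect.bisect_left(keys, key)

if i < len(keys) and keys[i] == key:
    print("found at index", i)
else:
    # Not in this node: descend into children[i], the subtree
    # covering the range 30 < k < 60 in this example
    print("descend into child", i)
```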

Insertion

Insert always targets a leaf. Walk down from the root to find the correct leaf, then insert the key in sorted position. If the leaf now has $m$ keys (overfull), split it: take the median key, push it up to the parent, and divide the remaining keys into two new nodes. If the parent also overflows, split it too. This can cascade up to the root, which is the only way the tree grows taller.
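The split step can be sketched on a plain Python list (an assumed order of $m = 5$, so a leaf overflows at 5 keys):

```python
keys = [10, 20, 30, 40, 50]  # overfull leaf: order m = 5 allows only 4 keys
mid = len(keys) // 2

# The median is promoted to the parent; the halves become sibling leaves
left, median, right = keys[:mid], keys[mid], keys[mid + 1:]
print(left, median, right)  # [10, 20] 30 [40, 50]
```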

Deletion

Deletion is more involved. If the key is in a leaf, remove it directly. If it is in an internal node, replace it with its in-order predecessor or successor (which is always in a leaf), then delete from the leaf. After removal, if a node has fewer than $\lceil m/2 \rceil - 1$ keys, rebalance by borrowing a key from a sibling or merging with a sibling.
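Borrowing is a rotation through the parent: the separator key moves down into the underfull node, and the sibling's nearest key moves up to replace it. A sketch on plain lists, again assuming order $m = 5$ (minimum 2 keys per non-root node):

```python
parent_keys = [30]  # separator between the two siblings
left = [10, 20]     # left sibling has a key to spare
right = [40]        # underfull: holds 1 key, needs at least 2

# Rotate: the separator moves down, the sibling's largest key moves up
right.insert(0, parent_keys[0])  # 30 moves down into the underfull node
parent_keys[0] = left.pop()      # 20 moves up to become the new separator

print(parent_keys, left, right)  # [20] [10] [30, 40]
```

Note the order of operations: the separator is copied down before it is overwritten. Sorted order is preserved at every step.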

Complexity Analysis

For a B-Tree of order $m$ containing $n$ keys, the height $h$ satisfies:

$h \leq \log_{\lceil m/2 \rceil} \frac{n+1}{2}$
| Operation | Disk I/O | CPU Time | Notes |
|---|---|---|---|
| Search | $O(\log_m n)$ | $O(m \log_m n)$ | One disk read per level; linear or binary scan within node |
| Insert | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible splits up |
| Delete | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible merges up |
| Space | - | $O(n)$ | Each key stored exactly once |

The critical insight: disk I/O is $O(\log_m n)$, not $O(\log_2 n)$. With $m = 1000$ and $n = 10^9$, that is $\log_{1000}(10^9) = 3$ disk reads versus $\log_2(10^9) \approx 30$ for a BST.
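Plugging these numbers into the height bound above confirms the claim:

```python
import math

m, n = 1000, 10**9
t = math.ceil(m / 2)              # 500: minimum children per internal node

# h <= log_t((n + 1) / 2), from the bound given earlier
bound = math.log((n + 1) / 2, t)  # ~ 3.22
print(math.floor(bound))          # 3: the height is at most 3
```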

Implementation

The pseudocode below follows the common textbook convention of a minimum degree $t = \lceil m/2 \rceil$ rather than the order $m$: every node except the root holds between $t - 1$ and $2t - 1$ keys, so a node with $2t - 1$ keys is full.

ALGORITHM BTreeSearch(node, key)
INPUT: node: a B-Tree node, key: the value to find
OUTPUT: (foundNode, index) or NOT_FOUND

BEGIN
  i <- 0
  WHILE i < node.numKeys AND key > node.keys[i] DO
    i <- i + 1
  END WHILE

  IF i < node.numKeys AND key = node.keys[i] THEN
    RETURN (node, i)
  END IF

  IF node.isLeaf THEN
    RETURN NOT_FOUND
  END IF

  DiskRead(node.children[i])
  RETURN BTreeSearch(node.children[i], key)
END

ALGORITHM BTreeInsert(tree, key)
INPUT: tree: a B-Tree, key: the value to insert
OUTPUT: tree with key inserted

BEGIN
  root <- tree.root

  IF root.numKeys = 2 * t - 1 THEN
    newRoot <- AllocateNode()
    newRoot.isLeaf <- FALSE
    newRoot.children[0] <- root
    SplitChild(newRoot, 0)
    tree.root <- newRoot
    InsertNonFull(newRoot, key)
  ELSE
    InsertNonFull(root, key)
  END IF
END

ALGORITHM InsertNonFull(node, key)
BEGIN
  i <- node.numKeys - 1

  IF node.isLeaf THEN
    WHILE i >= 0 AND key < node.keys[i] DO
      node.keys[i + 1] <- node.keys[i]
      i <- i - 1
    END WHILE
    node.keys[i + 1] <- key
    node.numKeys <- node.numKeys + 1
    DiskWrite(node)
  ELSE
    WHILE i >= 0 AND key < node.keys[i] DO
      i <- i - 1
    END WHILE
    i <- i + 1
    DiskRead(node.children[i])
    IF node.children[i].numKeys = 2 * t - 1 THEN
      SplitChild(node, i)
      IF key > node.keys[i] THEN
        i <- i + 1
      END IF
    END IF
    InsertNonFull(node.children[i], key)
  END IF
END

ALGORITHM SplitChild(parent, i)
INPUT: parent node, index i of the full child to split
BEGIN
  fullChild <- parent.children[i]
  newChild <- AllocateNode()
  newChild.isLeaf <- fullChild.isLeaf
  newChild.numKeys <- t - 1

  FOR j <- 0 TO t - 2 DO
    newChild.keys[j] <- fullChild.keys[j + t]
  END FOR

  IF NOT fullChild.isLeaf THEN
    FOR j <- 0 TO t - 1 DO
      newChild.children[j] <- fullChild.children[j + t]
    END FOR
  END IF

  fullChild.numKeys <- t - 1

  FOR j <- parent.numKeys DOWNTO i + 1 DO
    parent.children[j + 1] <- parent.children[j]
  END FOR
  parent.children[i + 1] <- newChild

  FOR j <- parent.numKeys - 1 DOWNTO i DO
    parent.keys[j + 1] <- parent.keys[j]
  END FOR
  parent.keys[i] <- fullChild.keys[t - 1]
  parent.numKeys <- parent.numKeys + 1

  DiskWrite(fullChild)
  DiskWrite(newChild)
  DiskWrite(parent)
END
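For readers who prefer running code, here is a compact Python translation of the pseudocode above. It is a sketch, not a production implementation: in-memory lists stand in for disk pages, so the DiskRead/DiskWrite calls have no equivalent, and it uses the same minimum-degree convention ($t = \lceil m/2 \rceil$) as the pseudocode.

```python
import bisect

class Node:
    def __init__(self, leaf=True):
        self.keys = []      # sorted keys held by this node
        self.children = []  # len(children) == len(keys) + 1 for internal nodes
        self.leaf = leaf

class BTree:
    def __init__(self, t=2):
        self.t = t          # minimum degree; a node is full at 2t - 1 keys
        self.root = Node()

    def search(self, key, node=None):
        """Return (node, index) holding key, or None if absent."""
        if node is None:
            node = self.root
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if node.leaf:
            return None
        return self.search(key, node.children[i])

    def _split_child(self, parent, i):
        """Split the full child at parent.children[i]; promote its median."""
        t = self.t
        full = parent.children[i]
        new = Node(leaf=full.leaf)
        parent.keys.insert(i, full.keys[t - 1])  # median moves up
        parent.children.insert(i + 1, new)
        new.keys = full.keys[t:]                 # upper half to the new node
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            new.children = full.children[t:]
            full.children = full.children[:t]

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:
            # Full root: grow the tree by one level (the only way it gets taller)
            new_root = Node(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)
        else:
            i = bisect.bisect_right(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:  # promoted median may redirect us
                    i += 1
            self._insert_nonfull(node.children[i], key)

# Usage: insert the keys from the diagram above and look a few up
bt = BTree(t=2)
for k in [5, 15, 25, 35, 45, 55, 65, 75, 10, 20, 40, 50, 70, 30, 60]:
    bt.insert(k)
print(bt.search(45) is not None, bt.search(99))  # True None
```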

Real-World Applications

  • Database indexing: most relational databases (PostgreSQL, MySQL/InnoDB, Oracle) use B-Tree variants for their primary indexes
  • File systems: NTFS, HFS+, ext4 (via htree), and Btrfs all use B-Trees or B+ Trees to organize directory entries and file metadata
  • Key-value stores: embedded databases like SQLite and BerkeleyDB use B-Trees as their core storage structure
  • Operating systems: the Linux kernel uses B-Trees in several subsystems for managing memory mappings and device data
  • Copy-on-write storage: Btrfs and ZFS use copy-on-write B-Trees for snapshot and transaction support

Key Takeaways

  • B-Trees are designed to minimize disk I/O by packing many keys into each node, creating wide, shallow trees
  • A B-Tree of order $m$ has height $O(\log_m n)$, so increasing $m$ dramatically reduces the number of disk reads
  • All leaves sit at the same depth, guaranteeing worst-case $O(\log_m n)$ for search, insert, and delete
  • Insertions may trigger splits that propagate upward; deletions may trigger merges or borrows from siblings
  • B-Trees are the foundation for B+ Trees (used in most production databases), which add leaf-level linking for efficient range queries