B-Trees: Balanced Multi-Way Trees for Disk-Optimized Search
How B-Trees keep data sorted and balanced across multiple children per node, minimizing disk I/O for databases and file systems.
Terminology
- Node: a container that holds one or more keys and pointers to child nodes
- Key: a value stored in a node, kept in sorted order within that node
- Order (m): the maximum number of children a node can have; a node holds at most $m - 1$ keys
- Leaf: a node with no children, sitting at the bottom level of the tree
- Internal node: any node that is not a leaf; it has at least one child
- Root: the topmost node in the tree; the entry point for all operations
- Balanced: all leaf nodes sit at the same depth, so no path from root to leaf is longer than another
- Disk I/O: a read or write operation to secondary storage (hard drive, SSD); orders of magnitude slower than RAM access
- Page: a fixed-size block of data (commonly 4 KB or 8 KB) that the OS reads from or writes to disk in one operation
- Branching factor: the number of children per node; higher branching factor means a shallower tree
- Splitting: the process of dividing an overfull node into two nodes and promoting the median key to the parent
- Merging: combining two underfull sibling nodes into one during deletion
What & Why
A B-Tree is a self-balancing search tree designed for systems that read and write large blocks of data, like databases and file systems. Unlike a binary search tree where each node has at most 2 children, a B-Tree node can have hundreds or thousands of children. This wide, shallow shape is the whole point: it minimizes the number of disk reads needed to find a key.
Why not just use a binary search tree? On disk, every node access is a separate I/O operation. A balanced BST with $n$ keys has height $O(\log_2 n)$, so searching 1 billion keys requires about 30 disk reads. A B-Tree of order 1000 stores the same billion keys in a tree of height 3, requiring only 3 disk reads. That difference is enormous when each disk read takes milliseconds.
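The height comparison above is easy to verify. This is a minimal sketch that counts how many levels a balanced tree with a given branching factor needs to cover $n$ keys, using integer arithmetic to avoid floating-point rounding:

```python
def tree_height(n_keys: int, branching: int) -> int:
    """Smallest height h such that branching**h >= n_keys,
    i.e. the number of levels a search must visit."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching
        height += 1
    return height

n = 10**9
print(tree_height(n, 2))     # balanced BST: 30 levels
print(tree_height(n, 1000))  # B-Tree of order 1000: 3 levels
```

With one disk read per level, the order-1000 tree needs a tenth of the I/O of the binary tree for the same billion keys.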
Key properties of a B-Tree of order m:
- Every node has at most $m$ children and $m - 1$ keys
- Every non-root internal node has at least $\lceil m/2 \rceil$ children
- The root has at least 2 children (unless it is a leaf)
- All leaves appear at the same depth
- Keys within each node are sorted
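The occupancy rules above can be summarized per node. The following is an illustrative helper (the function name `node_bounds` is my own, not from the source) that derives the key and child limits for a given order:

```python
import math

def node_bounds(m: int) -> dict:
    """Key/child count limits for a non-root internal node
    in a B-Tree of order m."""
    min_children = math.ceil(m / 2)
    return {
        "max_children": m,
        "max_keys": m - 1,
        "min_children": min_children,
        "min_keys": min_children - 1,
    }

print(node_bounds(1000))
```

For order 1000, every non-root internal node is at least half full: between 499 and 999 keys.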
How It Works
Structure
Each internal node stores keys that act as dividers. If a node contains keys $k_1, k_2, \ldots, k_j$, it has $j + 1$ child pointers. All keys in the first child are less than $k_1$, all keys in the second child are between $k_1$ and $k_2$, and so on.
As a small example, consider a B-Tree of order 3 (each node holds at most 2 keys and has at most 3 children) whose root contains keys 30 and 60, splitting the key space into three ranges: keys below 30, keys between 30 and 60, and keys above 60.
Search
Searching a B-Tree is a top-down traversal. At each node, scan the keys to find which child pointer to follow. Because keys within a node are sorted, you can use binary search within the node for large $m$.
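The per-node step can be sketched with a standard-library binary search. This is a hedged illustration (the name `locate` is mine): it decides whether the key is in this node or which child pointer to follow next.

```python
from bisect import bisect_left

def locate(keys: list, key) -> tuple:
    """Within one node with sorted keys: return ('found', i) if
    key == keys[i], otherwise ('child', i) for the child to descend into."""
    i = bisect_left(keys, key)  # binary search over the node's keys
    if i < len(keys) and keys[i] == key:
        return ("found", i)
    return ("child", i)

# Node with keys 30 and 60: three child ranges
print(locate([30, 60], 45))  # ('child', 1): middle subtree
print(locate([30, 60], 60))  # ('found', 1)
```

For large $m$ this binary scan costs $O(\log_2 m)$ comparisons per node instead of $O(m)$ for a linear scan, though both are dwarfed by the disk read.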
Insertion
Insert always targets a leaf. Walk down from the root to find the correct leaf, then insert the key in sorted position. If the leaf now has $m$ keys (overfull), split it: take the median key, push it up to the parent, and divide the remaining keys into two new nodes. If the parent also overflows, split it too. This can cascade up to the root, which is the only way the tree grows taller.
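The split step for a leaf is simple to show in isolation. This sketch (the helper name `split_leaf` is mine) takes an overfull key list and returns the two halves plus the median destined for the parent:

```python
def split_leaf(keys: list) -> tuple:
    """Split an overfull sorted leaf: return (left_keys, median, right_keys).
    The median key is promoted to the parent node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Order-3 leaf overflows at 3 keys: 20 moves up, 10 and 30 become siblings
print(split_leaf([10, 20, 30]))  # ([10], 20, [30])
```

If the promotion overfills the parent, the same split applies one level up, which is how a cascade reaches the root.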
Deletion
Deletion is more involved. If the key is in a leaf, remove it directly. If it is in an internal node, replace it with its in-order predecessor or successor (which is always in a leaf), then delete from the leaf. After removal, if a node has fewer than $\lceil m/2 \rceil - 1$ keys, rebalance by borrowing a key from a sibling or merging with a sibling.
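The rebalancing decision after a removal can be expressed compactly. This is an illustrative sketch of the decision only (not a full deletion routine; the name `rebalance_action` is mine):

```python
import math

def rebalance_action(node_keys: int, sibling_keys: int, m: int) -> str:
    """After a deletion in a B-Tree of order m, decide how to fix
    an underfull node given an adjacent sibling's key count."""
    min_keys = math.ceil(m / 2) - 1
    if node_keys >= min_keys:
        return "ok"        # no underflow; nothing to do
    if sibling_keys > min_keys:
        return "borrow"    # sibling can spare a key via the parent
    return "merge"         # both at minimum: combine into one node

# Order 5: minimum is 2 keys per non-root node
print(rebalance_action(1, 3, 5))  # 'borrow'
print(rebalance_action(1, 2, 5))  # 'merge'
```

A merge removes a separator key from the parent, so merges, like splits, can cascade toward the root; the tree only shrinks in height when the root's last key is absorbed.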
Complexity Analysis
For a B-Tree of order $m$ containing $n$ keys, the height $h$ satisfies $h \le \log_{\lceil m/2 \rceil}\!\left(\frac{n+1}{2}\right)$, which gives the following costs:
| Operation | Disk I/O | CPU Time | Notes |
|---|---|---|---|
| Search | $O(\log_m n)$ | $O(m \log_m n)$ | One disk read per level; linear or binary scan within node |
| Insert | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible splits up |
| Delete | $O(\log_m n)$ | $O(m \log_m n)$ | Walk down + possible merges up |
| Space | - | $O(n)$ | Each key stored exactly once |
The critical insight: disk I/O is $O(\log_m n)$, not $O(\log_2 n)$. With $m = 1000$ and $n = 10^9$, that is $\log_{1000}(10^9) = 3$ disk reads versus $\log_2(10^9) \approx 30$ for a BST.
Implementation
The pseudocode below follows the common textbook convention of a minimum degree $t$: every node holds between $t - 1$ and $2t - 1$ keys, and a full node has exactly $2t - 1$. In terms of order, this corresponds to $m = 2t$ children.
ALGORITHM BTreeSearch(node, key)
INPUT: node: a B-Tree node, key: the value to find
OUTPUT: (foundNode, index) or NOT_FOUND
BEGIN
i <- 0
WHILE i < node.numKeys AND key > node.keys[i] DO
i <- i + 1
END WHILE
IF i < node.numKeys AND key = node.keys[i] THEN
RETURN (node, i)
END IF
IF node.isLeaf THEN
RETURN NOT_FOUND
END IF
DiskRead(node.children[i])
RETURN BTreeSearch(node.children[i], key)
END
ALGORITHM BTreeInsert(tree, key)
INPUT: tree: a B-Tree, key: the value to insert
OUTPUT: tree with key inserted
BEGIN
root <- tree.root
IF root.numKeys = 2 * t - 1 THEN
newRoot <- AllocateNode()
newRoot.isLeaf <- FALSE
newRoot.children[0] <- root
SplitChild(newRoot, 0)
tree.root <- newRoot
InsertNonFull(newRoot, key)
ELSE
InsertNonFull(root, key)
END IF
END
ALGORITHM InsertNonFull(node, key)
BEGIN
i <- node.numKeys - 1
IF node.isLeaf THEN
WHILE i >= 0 AND key < node.keys[i] DO
node.keys[i + 1] <- node.keys[i]
i <- i - 1
END WHILE
node.keys[i + 1] <- key
node.numKeys <- node.numKeys + 1
DiskWrite(node)
ELSE
WHILE i >= 0 AND key < node.keys[i] DO
i <- i - 1
END WHILE
i <- i + 1
DiskRead(node.children[i])
IF node.children[i].numKeys = 2 * t - 1 THEN
SplitChild(node, i)
IF key > node.keys[i] THEN
i <- i + 1
END IF
END IF
InsertNonFull(node.children[i], key)
END IF
END
ALGORITHM SplitChild(parent, i)
INPUT: parent node, index i of the full child to split
BEGIN
fullChild <- parent.children[i]
newChild <- AllocateNode()
newChild.isLeaf <- fullChild.isLeaf
newChild.numKeys <- t - 1
FOR j <- 0 TO t - 2 DO
newChild.keys[j] <- fullChild.keys[j + t]
END FOR
IF NOT fullChild.isLeaf THEN
FOR j <- 0 TO t - 1 DO
newChild.children[j] <- fullChild.children[j + t]
END FOR
END IF
fullChild.numKeys <- t - 1
FOR j <- parent.numKeys DOWNTO i + 1 DO
parent.children[j + 1] <- parent.children[j]
END FOR
parent.children[i + 1] <- newChild
FOR j <- parent.numKeys - 1 DOWNTO i DO
parent.keys[j + 1] <- parent.keys[j]
END FOR
parent.keys[i] <- fullChild.keys[t - 1]
parent.numKeys <- parent.numKeys + 1
DiskWrite(fullChild)
DiskWrite(newChild)
DiskWrite(parent)
END
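The pseudocode above can be sketched as runnable Python. This is a minimal in-memory version (no `DiskRead`/`DiskWrite`, insert and search only) that mirrors the same structure, using minimum degree `t`:

```python
class Node:
    def __init__(self, leaf: bool = True):
        self.keys = []       # sorted keys in this node
        self.children = []   # len(children) == len(keys) + 1 when internal
        self.leaf = leaf

class BTree:
    """B-Tree with minimum degree t: nodes hold t-1 to 2t-1 keys."""
    def __init__(self, t: int = 2):
        self.t = t
        self.root = Node(leaf=True)

    def search(self, key, node=None):
        """Return (node, index) holding key, or None if absent."""
        if node is None:
            node = self.root
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if node.leaf:
            return None
        return self.search(key, node.children[i])

    def insert(self, key):
        root = self.root
        if len(root.keys) == 2 * self.t - 1:   # full root: grow taller
            new_root = Node(leaf=False)
            new_root.children.append(root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _insert_nonfull(self, node, key):
        i = 0
        while i < len(node.keys) and node.keys[i] < key:
            i += 1
        if node.leaf:
            node.keys.insert(i, key)
        else:
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)     # split before descending
                if key > node.keys[i]:
                    i += 1
            self._insert_nonfull(node.children[i], key)

    def _split_child(self, parent, i):
        t = self.t
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        right.keys = full.keys[t:]             # upper t-1 keys
        median = full.keys[t - 1]              # promoted to parent
        full.keys = full.keys[:t - 1]          # lower t-1 keys
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.children.insert(i + 1, right)
        parent.keys.insert(i, median)
```

Production implementations add node serialization to fixed-size pages and the deletion/rebalancing logic described earlier; this sketch only demonstrates the split-before-descend insertion strategy.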
Real-World Applications
- Database indexing: most relational databases (PostgreSQL, MySQL/InnoDB, Oracle) use B-Tree variants for their primary indexes
- File systems: NTFS, HFS+, ext4 (via htree), and Btrfs all use B-Trees or B+ Trees to organize directory entries and file metadata
- Key-value stores: embedded databases like SQLite and BerkeleyDB use B-Trees as their core storage structure
- Operating systems: the Linux kernel uses B-Tree variants in several subsystems, including the maple tree (a B-Tree derivative) for tracking memory mappings
- Copy-on-write storage: Btrfs and ZFS use copy-on-write B-Trees for snapshot and transaction support
Key Takeaways
- B-Trees are designed to minimize disk I/O by packing many keys into each node, creating wide, shallow trees
- A B-Tree of order $m$ has height $O(\log_m n)$, so increasing $m$ dramatically reduces the number of disk reads
- All leaves sit at the same depth, guaranteeing worst-case $O(\log_m n)$ for search, insert, and delete
- Insertions may trigger splits that propagate upward; deletions may trigger merges or borrows from siblings
- B-Trees are the foundation for B+ Trees (used in most production databases), which add leaf-level linking for efficient range queries