B-Tree Indexes Deep Dive: Why Your WHERE Clause Is Fast
How B-trees store sorted data on disk, how page splits work, what fill factor controls, and the difference between clustered and non-clustered indexes.
Terminology
| Term | Definition | Trade-off / Gotcha |
|---|---|---|
| B-Tree | A self-balancing tree where each node holds many keys and child pointers, sized to match the disk page so that each level costs one I/O. | Not the same as a binary tree: database B-trees have branching factors in the hundreds, not two. |
| Page | A fixed-size block (typically 4 KB, 8 KB, or 16 KB) that the database reads from and writes to disk in a single I/O operation. | Larger pages mean fewer I/Os but more wasted space if rows are small or access is random. |
| Fill Factor | The percentage of each page that the database fills with data during index creation or rebuild, leaving the rest as free space for future inserts. | Too high causes frequent page splits on write-heavy tables; too low wastes disk and slows reads. |
| Page Split | When an insert targets a full page, the database divides it into two pages and promotes a separator key to the parent node. | Splits are expensive (3+ page writes) and can cascade to the root, briefly stalling concurrent readers. |
| Clustered Index | An index whose leaf pages contain the actual table rows, stored in sorted order by the index key. A table can have at most one. | Only one per table, so choose the key wisely: changing it later requires rewriting the entire table. |
| Non-Clustered Index | An index whose leaf pages contain index keys plus pointers (row IDs) back to the heap or clustered index pages that hold the actual row data. | Each lookup may require a bookmark lookup (extra random I/O) unless the index covers all queried columns. |
| Slot Array | A small directory at the beginning or end of a page that stores offsets pointing to where each record (cell) begins within the page. | Adds a small per-record overhead (typically 2 bytes), but avoids physically moving cells on insert. |
| Branching Factor | The number of child pointers per internal node, typically in the hundreds for database B-trees because nodes are sized to fill an entire disk page. | Wider keys (e.g., long strings) reduce the branching factor, making the tree taller and lookups slower. |
| Heap File | An unordered collection of table pages where rows are inserted wherever free space exists, with no guaranteed sort order. | Fast for inserts (append anywhere), but without an index the only access path is a full table scan. |
| Row ID (RID) | A physical address (page number + slot number) that uniquely locates a row within a heap file or clustered index. | Physical RIDs become invalid after a table reorganization or page compaction, requiring index rebuilds. |
What & Why
Every database query that uses a WHERE clause on an indexed column owes its speed to a B-tree. When you write SELECT * FROM orders WHERE customer_id = 42, the database does not scan every row in the table. It walks down a B-tree index, reading 3 or 4 disk pages to find exactly the rows it needs out of billions.
B-trees are the default index structure in PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite. They dominate because they handle both point lookups ($O(\log_B n)$, where $B$ is the branching factor) and range scans efficiently, they stay balanced automatically through page splits, and their wide, shallow shape maps perfectly onto how disks and SSDs transfer data in fixed-size pages.
This post goes deeper than the data structure itself. We will look at how a database lays out B-tree nodes on disk pages, what happens physically during a page split, how fill factor trades write performance for read density, and the critical difference between clustered and non-clustered indexes. Understanding these internals is the difference between an index that makes your query fast and one that causes unnecessary I/O.
How It Works
Disk Page Layout
A database B-tree node is not an abstract object in memory. It is a physical disk page, typically 8 KB or 16 KB, with a precise binary layout. The page has a header, a slot array, key/pointer cells, and free space.
The slot array is the key design trick. Slots are fixed-size entries (typically 2 bytes each) that store the byte offset of each cell within the page. When the database inserts a new key, it appends the cell to the end of the cell area and inserts a new slot in sorted position. This avoids physically moving existing cells around on the page, keeping inserts fast. The slot array grows from the front of the page, cells grow from the back, and free space sits in the middle.
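The slot-array trick can be sketched in a few lines of Python. This is an illustrative in-memory model, not a real on-disk format: cells are appended in arrival order, while only the tiny slot array is kept in key order.

```python
import bisect

class SlottedPage:
    """Toy model of a slotted page: cells never move; slots stay sorted."""

    def __init__(self):
        self.cells = []   # (key, payload) pairs, in arrival order
        self.slots = []   # indices into self.cells, kept sorted by key

    def insert(self, key, payload):
        # Append the cell at the end of the cell area -- no existing cell moves.
        self.cells.append((key, payload))
        cell_index = len(self.cells) - 1
        # Insert the slot at its sorted position (a 2-byte entry on disk).
        keys_in_slot_order = [self.cells[s][0] for s in self.slots]
        pos = bisect.bisect_left(keys_in_slot_order, key)
        self.slots.insert(pos, cell_index)

    def sorted_keys(self):
        # Reading via the slot array yields the keys in sorted order.
        return [self.cells[s][0] for s in self.slots]

page = SlottedPage()
for k in [50, 10, 30]:
    page.insert(k, f"row-{k}")
print(page.sorted_keys())            # [10, 30, 50] -- logical (slot) order
print([c[0] for c in page.cells])    # [50, 10, 30] -- physical (arrival) order
```

A real engine does the same thing with byte offsets into an 8 KB buffer; the point is that sorted order lives only in the slot directory.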
B-Tree Structure: Root to Leaf
In a database B-tree index, internal nodes contain separator keys and child page pointers. Leaf nodes contain the actual indexed keys plus either the row data itself (clustered index) or a pointer to where the row lives (non-clustered index). All leaves are linked together in a doubly-linked list so range scans can walk from one leaf to the next without going back up the tree.
The Page Split Process
When you insert a key into a leaf page that is already full, the database must split it. This is the most expensive write operation in a B-tree index, because it involves allocating a new page, redistributing keys, and updating the parent.
The split process works in five steps: (1) allocate a new page, (2) sort all keys including the new one, (3) find the median key, (4) move the lower half to the original page and the upper half to the new page, (5) promote the median key to the parent as a separator. If the parent is also full, it splits recursively. In the worst case, splits cascade all the way to the root, increasing the tree height by one.
Clustered vs Non-Clustered Indexes
The distinction between clustered and non-clustered indexes is one of the most important concepts in database performance tuning.
In a clustered index, the leaf level of the B-tree IS the table data. When you scan a range of keys, you read consecutive leaf pages, which is sequential I/O. In a non-clustered index, each leaf entry contains a pointer (Row ID) back to the heap file where the actual row lives. A range scan on a non-clustered index may require a separate random I/O for each row to fetch the actual data, unless the index "covers" the query by including all requested columns.
Fill Factor and Its Impact on Performance
Fill factor controls how full each page is when the index is built or rebuilt. A fill factor of 90% means each page is 90% full, leaving 10% free space for future inserts. This is a direct trade-off between read performance and write performance.
At 50% fill factor, pages have plenty of room for new keys, so inserts rarely cause page splits. But reads are slower because the data is spread across nearly twice as many pages, meaning more I/O to scan the same number of rows. At 100% fill factor, every page is packed tight, so reads touch fewer pages, but almost every insert into the middle of the index triggers a page split.
The sweet spot depends on your workload. For read-heavy tables that rarely change (lookup tables, historical data), use 90-100%. For write-heavy tables with random inserts (UUIDs as primary keys, event logs), use 70-80% to absorb inserts without constant splitting.
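The read-side cost of a low fill factor is simple arithmetic. A rough sketch (assuming 8 KB pages, a 100-byte average row, and ignoring page header and slot overhead):

```python
import math

PAGE_SIZE = 8192     # assumed 8 KB pages
ROW_SIZE = 100       # assumed average row size in bytes
N_ROWS = 1_000_000

def pages_needed(fill_factor):
    # Usable bytes per page at this fill factor (header overhead ignored).
    usable = PAGE_SIZE * fill_factor
    rows_per_page = int(usable // ROW_SIZE)
    return math.ceil(N_ROWS / rows_per_page)

for ff in (0.5, 0.7, 0.9, 1.0):
    print(f"fill factor {ff:.0%}: {pages_needed(ff):,} pages")
# fill factor 50%: 25,000 pages
# fill factor 100%: 12,346 pages
```

Halving the fill factor roughly doubles the pages a full scan must read; the free space buys you fewer splits on the write path.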
B-Tree Height vs Number of Records
The height of a B-tree determines the maximum number of disk I/O operations for a point lookup. With a branching factor of $B$ (keys per internal node), a tree of height $h$ can hold up to $B^h$ leaf entries. Equivalently, the height for $n$ records is:
$h = \lceil \log_B(n) \rceil$
Because database pages are large (8-16 KB) and keys are small (4-8 bytes for integers), the branching factor is typically in the hundreds. This means even billions of rows require only 3-4 levels.
With a branching factor of 500 (realistic for 8 KB pages with 16-byte keys), 10 billion records fit in a tree of height 4. That means any row in a 10-billion-row table can be found with at most 4 disk reads. The root page is almost always cached in the buffer pool, so in practice it is 2-3 disk reads.
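The height formula is easy to check numerically. A quick sketch (branching factor 500, as in the example above):

```python
import math

def btree_height(n_records, branching_factor):
    # Smallest h such that branching_factor ** h >= n_records.
    return math.ceil(math.log(n_records) / math.log(branching_factor))

for n in (10**6, 10**9, 10**10):
    print(f"{n:>14,} records -> height {btree_height(n, 500)}")
# 1,000,000      -> height 3
# 1,000,000,000  -> height 4
# 10,000,000,000 -> height 4
```

Going from a million to ten billion records adds just one level, which is the whole reason B-trees scale so well for disk-resident data.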
Leaf Page Layout: Clustered vs Non-Clustered
The physical content of a leaf page differs dramatically between clustered and non-clustered indexes.
Because non-clustered index leaves only store keys and Row IDs (not full rows), they are much smaller per entry. This means more entries fit per page, the tree is shallower, and index-only scans (where all needed columns are in the index) are very fast. But any query that needs columns not in the index requires a "bookmark lookup": following the Row ID to the heap page to fetch the full row.
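The density difference is easy to quantify with a back-of-the-envelope calculation (assumed sizes: 8-byte keys, 8-byte RIDs, 200-byte rows, 8 KB pages, header and slot overhead ignored):

```python
PAGE = 8192
KEY, RID, ROW = 8, 8, 200   # assumed sizes in bytes

clustered_entries = PAGE // (KEY + ROW)      # full rows in the leaf
nonclustered_entries = PAGE // (KEY + RID)   # key + pointer only

print(f"clustered leaf:     {clustered_entries} entries/page")      # 39
print(f"non-clustered leaf: {nonclustered_entries} entries/page")   # 512
```

Roughly 13x more entries per leaf page means a shallower tree and far fewer pages for index-only scans, at the price of a bookmark lookup whenever a non-covered column is needed.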
Index Type Comparison
B-trees are not the only index structure. Different access patterns call for different index types.
B-trees dominate because they handle the widest range of query patterns: equality, range, prefix matching, and ORDER BY. Hash indexes are faster for exact lookups but cannot do range scans at all. GiST and GIN are specialized for spatial and full-text queries. BRIN indexes are tiny (storing only min/max per block range) but only work well when the physical order of rows on disk correlates with the index key.
Complexity Analysis
All complexities below assume a B-tree of order $B$ (branching factor) with $n$ total records and tree height $h = \lceil \log_B(n) \rceil$. Unlike a binary search tree, a B-tree cannot degenerate into a linked list: splits and merges keep every leaf at the same depth, so the $O(\log_B n)$ bound holds regardless of insertion order.
| Operation | Time Complexity | Disk I/Os | Notes |
|---|---|---|---|
| Point lookup | $O(\log_B n)$ | $h$ (typically 3-4) | One page read per tree level |
| Range scan ($k$ rows) | $O(\log_B n + k/B)$ | $h + \lceil k/F \rceil$ | $F$ = records per leaf page; fragmentation adds random I/O if leaf pages are not physically contiguous |
| Insert (no split) | $O(\log_B n)$ | $h$ reads + 1 write | Traverse to leaf, append key |
| Insert (with split) | $O(\log_B n)$ | $h$ reads + 3 writes | Split leaf + write 2 pages + update parent |
| Delete | $O(\log_B n)$ | $h$ reads + 1 write | May trigger merge if page underfull |
| Index build (bulk) | $O(n \log_B n)$ | $O(n/B)$ | Sort-then-build avoids random I/O |
The key insight is that the branching factor $B$ is large (100-500), so $\log_B(n)$ grows extremely slowly. For $n = 10^9$ and $B = 500$:
$h = \lceil \log_{500}(10^9) \rceil = \lceil \frac{9}{\log_{10}(500)} \rceil = \lceil \frac{9}{2.699} \rceil = \lceil 3.33 \rceil = 4$
Four disk reads to find any row among a billion. With the root and first internal level cached in the buffer pool (common for hot indexes), it drops to 2 disk reads.
The cost of a page split is significant: 3 page writes (original page, new page, parent page) plus the WAL entries. If splits cascade to the parent, each additional level adds 2 more writes. This is why fill factor matters: leaving free space in pages absorbs inserts without splitting.
$\text{Split cost} = 2h + 1 \text{ writes (worst case, cascade to root)}$
For range scans, the cost after finding the first key is purely sequential I/O: walk the leaf-level linked list, reading one page at a time. If each leaf page holds $F$ records, scanning $k$ rows requires $\lceil k/F \rceil$ additional page reads. This is why clustered indexes are so fast for range queries: the leaf pages ARE the data, so there is no bookmark lookup overhead.
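The range-scan cost model reduces to one line of arithmetic. A sketch with illustrative numbers (height-4 tree, 100 records per leaf page):

```python
import math

def range_scan_ios(height, k_rows, records_per_leaf):
    # height page reads to descend to the first leaf,
    # then one sequential read per additional leaf page.
    return height + math.ceil(k_rows / records_per_leaf)

# Scanning 10,000 rows: 4 reads to find the start, 100 leaf reads after.
print(range_scan_ios(4, 10_000, 100))   # 104
```

104 mostly-sequential page reads for 10,000 rows is why clustered range scans are so cheap; the same scan through a non-covering non-clustered index could add up to 10,000 random bookmark lookups on top.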
Implementation
Pseudocode: B-Tree Point Lookup
ALGORITHM BTreeLookup(root, search_key)
INPUT: root: page ID of the B-tree root
search_key: the key to find
OUTPUT: the record matching search_key, or NOT_FOUND
BEGIN
    current_page = BufferPool.FetchPage(root)
    WHILE current_page.level > 0 DO
        // Internal node: binary search for the correct child pointer
        child_index = BinarySearch(current_page.keys, search_key)
        child_page_id = current_page.children[child_index]
        current_page = BufferPool.FetchPage(child_page_id)
    END WHILE
    // Leaf node: binary search for the exact key (with bounds check)
    slot = BinarySearch(current_page.keys, search_key)
    IF slot < current_page.key_count AND current_page.keys[slot] = search_key THEN
        RETURN current_page.records[slot]
    ELSE
        RETURN NOT_FOUND
    END IF
END
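The lookup above can be turned into a few lines of runnable Python, with dicts standing in for buffer-pool pages (the layout is illustrative, not a real page format):

```python
import bisect

def lookup(pages, root_id, search_key):
    """Descend from the root to a leaf, then probe the leaf for the key."""
    node = pages[root_id]
    while node["level"] > 0:
        # bisect_right picks the child whose key range contains search_key:
        # separator keys equal to search_key route to the right child.
        child = bisect.bisect_right(node["keys"], search_key)
        node = pages[node["children"][child]]
    slot = bisect.bisect_left(node["keys"], search_key)
    if slot < len(node["keys"]) and node["keys"][slot] == search_key:
        return node["records"][slot]
    return None   # NOT_FOUND

pages = {
    "root":  {"level": 1, "keys": [100], "children": ["leaf0", "leaf1"]},
    "leaf0": {"level": 0, "keys": [10, 42],    "records": ["r10", "r42"]},
    "leaf1": {"level": 0, "keys": [100, 250],  "records": ["r100", "r250"]},
}
print(lookup(pages, "root", 42))    # r42
print(lookup(pages, "root", 99))    # None
```

Note the bounds check before comparing `keys[slot]`: a binary search for a key larger than everything in the leaf returns an out-of-range slot.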
Pseudocode: B-Tree Insert with Page Split
ALGORITHM BTreeInsert(root, key, record)
INPUT: root: page ID of the B-tree root
key: the index key to insert
record: the data or row pointer to store
OUTPUT: new root page ID (may change if root splits)
BEGIN
    // Phase 1: Traverse to the correct leaf
    path = empty stack
    current = BufferPool.FetchPage(root)
    WHILE current.level > 0 DO
        path.Push(current)
        child_index = BinarySearch(current.keys, key)
        current = BufferPool.FetchPage(current.children[child_index])
    END WHILE
    // Phase 2: Insert into the leaf
    InsertIntoPage(current, key, record)
    // Phase 3: Split if necessary, propagate up
    WHILE current.IsFull() DO
        median_key, new_page = SplitPage(current)
        WAL.LogSplit(current.page_id, new_page.page_id, median_key)
        IF path.IsEmpty() THEN
            // Root split: create a new root
            new_root = AllocatePage(level = current.level + 1)
            new_root.keys = [median_key]
            new_root.children = [current.page_id, new_page.page_id]
            WAL.LogNewRoot(new_root.page_id)
            RETURN new_root.page_id
        END IF
        parent = path.Pop()
        InsertIntoPage(parent, median_key, new_page.page_id)
        current = parent
    END WHILE
    RETURN root
END
Pseudocode: Page Split
ALGORITHM SplitPage(page)
INPUT: page: a full B-tree page
OUTPUT: (median_key, new_page) tuple
BEGIN
    all_keys = page.keys // already sorted
    mid = LENGTH(all_keys) / 2 // integer division
    median_key = all_keys[mid]
    new_page = AllocatePage(level = page.level)
    IF page.level = 0 THEN
        // Leaf: the median key is COPIED up; its record stays in the new page
        new_page.keys = all_keys[mid .. end]
        new_page.records = page.records[mid .. end]
        page.keys = all_keys[0 .. mid - 1]
        page.records = page.records[0 .. mid - 1]
        // Update sibling pointers for range scans
        new_page.next_page = page.next_page
        new_page.prev_page = page.page_id
        page.next_page = new_page.page_id
        IF new_page.next_page IS NOT NULL THEN
            right_neighbor = BufferPool.FetchPage(new_page.next_page)
            right_neighbor.prev_page = new_page.page_id
        END IF
    ELSE
        // Internal: the median key MOVES up; keys and children split around it
        new_page.keys = all_keys[mid + 1 .. end]
        new_page.children = page.children[mid + 1 .. end]
        page.keys = all_keys[0 .. mid - 1]
        page.children = page.children[0 .. mid]
    END IF
    BufferPool.MarkDirty(page)
    BufferPool.MarkDirty(new_page)
    RETURN (median_key, new_page)
END
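The leaf case can be sketched in Python, assuming the B+-tree "copy-up" convention: the median key becomes the parent's separator, but its record stays in the new right page (internal splits, by contrast, move the median up and drop it from both halves).

```python
def split_leaf(keys, records):
    """Split a full leaf; the median key is copied up, not removed."""
    mid = len(keys) // 2
    left = (keys[:mid], records[:mid])       # lower half stays in place
    right = (keys[mid:], records[mid:])      # upper half, including the median
    separator = keys[mid]                    # copied to the parent
    return separator, left, right

sep, left, right = split_leaf([10, 20, 30, 40], ["a", "b", "c", "d"])
print(sep)     # 30
print(left)    # ([10, 20], ['a', 'b'])
print(right)   # ([30, 40], ['c', 'd'])
```

No record is lost: every (key, record) pair ends up in exactly one of the two leaves, and the separator merely duplicates a key that still lives in the right page.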
Pseudocode: Range Scan on Clustered Index
ALGORITHM ClusteredRangeScan(root, low_key, high_key)
INPUT: root: page ID of the clustered index root
low_key: start of range (inclusive)
high_key: end of range (inclusive)
OUTPUT: list of matching records
BEGIN
    results = empty list
    // Step 1: Find the leaf containing low_key
    leaf = BTreeLookup_LeafPage(root, low_key)
    start_slot = FirstSlotGreaterOrEqual(leaf, low_key)
    // Step 2: Scan forward through linked leaf pages
    current = leaf
    slot = start_slot
    WHILE current IS NOT NULL DO
        WHILE slot < current.key_count DO
            IF current.keys[slot] > high_key THEN
                RETURN results // Past the range, done
            END IF
            results.Append(current.records[slot])
            slot = slot + 1
        END WHILE
        // Move to next leaf page (sequential I/O)
        IF current.next_page IS NULL THEN
            RETURN results // Last leaf reached
        END IF
        current = BufferPool.FetchPage(current.next_page)
        slot = 0
    END WHILE
    RETURN results
END
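The leaf-walk phase can be demonstrated in Python. For brevity this sketch starts at the first leaf and filters on `low` instead of descending the tree to find the start slot; dicts stand in for disk pages:

```python
def range_scan(pages, first_leaf_id, low, high):
    """Walk the linked leaf list, collecting records with low <= key <= high."""
    results = []
    leaf_id = first_leaf_id
    while leaf_id is not None:
        leaf = pages[leaf_id]
        for key, rec in zip(leaf["keys"], leaf["records"]):
            if key > high:
                return results          # past the range, done
            if key >= low:
                results.append(rec)
        leaf_id = leaf["next"]          # sequential hop, no re-descent
    return results

pages = {
    "L0": {"keys": [10, 20], "records": ["r10", "r20"], "next": "L1"},
    "L1": {"keys": [30, 40], "records": ["r30", "r40"], "next": None},
}
print(range_scan(pages, "L0", 15, 35))   # ['r20', 'r30']
```

The scan never goes back up the tree: once positioned, every additional leaf costs one sequential page read, which is what makes clustered range queries cheap.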
Real-World Applications
PostgreSQL uses B-tree indexes as the default index type for all primary keys and any column with a CREATE INDEX statement. The internal implementation uses Lehman-Yao B-trees, a variant that adds right-link pointers between sibling nodes at every level (not just leaves) to allow concurrent access without holding locks on parent nodes during splits.
MySQL/InnoDB clusters every table on its primary key. If a table has no primary key, InnoDB clusters on the first UNIQUE NOT NULL index instead, and if there is none of those either, it generates a hidden 6-byte row ID to cluster on. The leaf pages of the clustered index contain the actual row data. All secondary (non-clustered) indexes store the primary key value as their row pointer instead of a physical Row ID, so a secondary index lookup requires two B-tree traversals: one on the secondary index to find the primary key, then one on the clustered index to find the row.
SQLite stores the entire database in a single file, with each table and index as a separate B-tree. Tables use B+ trees (data only in leaves), while indexes use regular B-trees. The page size defaults to 4096 bytes and can be configured at database creation.
SQL Server offers both clustered and non-clustered indexes with explicit fill factor control. The FILLFACTOR option in CREATE INDEX lets you set the percentage (1-100) of each leaf page to fill during index creation. SQL Server also supports "included columns" in non-clustered indexes, which store additional non-key columns in the leaf pages to create covering indexes that avoid bookmark lookups.
MongoDB uses B-tree indexes (specifically, WiredTiger's B-tree implementation) for its _id field and any user-created indexes. The WiredTiger storage engine uses a combination of B-trees for indexes and an optional LSM tree mode for write-heavy workloads.
Key Takeaways
- A B-tree index node is a disk page with a header, slot array, and cell area. The slot array allows sorted access without moving cells during inserts.
- Tree height determines lookup cost: with branching factors of 100-500, billions of rows require only 3-4 page reads, and the root and upper levels are usually cached in the buffer pool, reducing actual disk reads further.
- Page splits are costly write operations. They allocate a new page, redistribute keys, and update the parent. Fill factor controls split frequency by leaving free space in pages.
- Clustered indexes store actual row data in leaf pages, enabling sequential I/O for range scans. Non-clustered indexes store key + Row ID pointers, needing a bookmark lookup for non-covered columns.
- Fill factor balances read and write efficiency: a high fill factor (90-100%) packs more data per page for faster reads, while a low fill factor (50-70%) leaves space for inserts to minimize splits.
- B-trees are preferred for database indexing because they efficiently handle equality, range, prefix, and ORDER BY queries, unlike hash indexes (which only support equality) or BRIN indexes (which only work with naturally ordered data).