B-Tree Indexes Deep Dive: Why Your WHERE Clause Is Fast
How B-trees store sorted data on disk, how page splits work, what fill factor controls, and the difference between clustered and non-clustered indexes.
Terminology
| Term | Definition | Trade-off / Gotcha |
|---|---|---|
| B-Tree | A self-balancing tree data structure that maintains sorted data and allows for efficient insertion, deletion, and search operations. Each node can hold multiple keys and child pointers, designed to minimize disk I/O by matching node size to disk page size. For example, think of it like a library where each shelf can hold many books, allowing for efficient searching without having to check every single book. | Not the same as a binary tree: database B-trees have branching factors in the hundreds, not two. |
| Page | A fixed-size block (typically 4 KB, 8 KB, or 16 KB) that the database uses to read from and write to disk in a single I/O operation. You can think of a page like a box that holds a certain number of items; when the box is full, you have to deal with it all at once. | Larger pages mean fewer I/Os but more wasted space if rows are small or access is random. |
| Fill Factor | The percentage of each page that the database fills with data during index creation or rebuild, leaving the rest as free space for future inserts. Imagine packing a suitcase: if you pack too tightly, you can't add more items later, but if you leave too much empty space, you're wasting room. | A fill factor that is too high can cause frequent page splits on write-heavy tables; too low wastes disk space and can slow down reads. |
| Page Split | When an insert targets a full page, the database divides it into two pages, moving roughly half the entries to a new sibling to make room for the new data. This is similar to splitting a crowded table at a restaurant when more guests arrive. | Splits cost extra writes and can scatter logically adjacent pages across the disk, degrading range scans until the index is rebuilt. |
| Index | A data structure that improves the speed of data retrieval operations on a database table at the cost of additional space and maintenance overhead. Think of it like an index in a book that helps you quickly find the information you need without reading every page. | While indexes speed up read operations, they can slow down write operations due to the need to update the index. |
| Node | A fundamental part of a B-tree that contains keys and pointers to child nodes. Each node can have multiple keys, which helps in organizing and searching the data efficiently. | A node that is too full can lead to page splits, increasing complexity and overhead in managing the tree. |
| Leaf Node | A node at the bottom level of a B-tree, with no children. Leaf nodes contain the actual data or pointers to the data, so every search ends at one; in most engines the leaves are also chained together so range scans can walk them sequentially. | Leaf pages left sparse by many deletes waste I/O, since each read pulls in mostly empty space until the index is rebuilt. |
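The page-split behavior described in the table can be sketched in a few lines. This is a toy model, not any engine's actual code: the capacity is set to four keys so a split is easy to trigger (real pages hold hundreds), and the class and method names are illustrative.

```python
import bisect

PAGE_CAPACITY = 4  # toy limit; a real 8 KB page holds hundreds of keys

class LeafPage:
    def __init__(self, keys=None):
        self.keys = keys or []

    def insert(self, key):
        """Insert key in sorted order; return a new right sibling on overflow."""
        bisect.insort(self.keys, key)
        if len(self.keys) <= PAGE_CAPACITY:
            return None                     # still fits, no split
        mid = len(self.keys) // 2
        right = LeafPage(self.keys[mid:])   # new page takes the upper half
        self.keys = self.keys[:mid]         # original keeps the lower half
        return right

page = LeafPage()
for k in [10, 20, 30, 40]:
    page.insert(k)
sibling = page.insert(25)       # fifth key overflows the page
print(page.keys, sibling.keys)  # [10, 20] [25, 30, 40]
```

Note that after the split both pages are only about half full, which is exactly the free space a fill factor below 100% tries to reserve up front.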
What & Why
Every database query that uses a WHERE clause on an indexed column owes its speed to a B-tree. When you write SELECT * FROM orders WHERE customer_id = 42, the database does not scan every row in the table. Instead, it walks down a B-tree index, reading 3 or 4 disk pages to find exactly the rows it needs out of billions.
B-trees are the default index structure in PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite. They are so dominant because they handle both point lookups ($O(\log_B n)$ where $B$ is the branching factor) and range scans efficiently, they stay balanced automatically through page splits, and their wide, shallow shape maps perfectly onto how disks and SSDs transfer data in fixed-size pages.
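The $O(\log_B n)$ claim is easy to sanity-check. The sketch below computes the height a tree needs at a given fan-out; the branching factor of 300 is an illustrative assumption, in line with the "hundreds" typical of real indexes.

```python
import math

def btree_height(n_rows, branching_factor):
    """Number of levels needed to index n_rows keys at the given fan-out."""
    return math.ceil(math.log(n_rows, branching_factor))

for n in (1_000_000, 1_000_000_000):
    print(f"{n:>13,} rows -> {btree_height(n, 300)} levels")
```

At a fan-out of 300, a million rows fit in 3 levels and a billion in 4, which is why the lookup above touches only 3 or 4 pages regardless of table size.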
However, B-trees are not without trade-offs. A B-tree always stays balanced regardless of the data distribution, but it cannot make a bad index good: on a column with only a handful of distinct values, every lookup still matches a large fraction of the table, and the optimizer will often ignore the index in favor of a sequential scan. B-trees also struggle with write-heavy workloads, because every insert risks a page split and every update must maintain the sorted structure. For equality-only lookups a hash index can be cheaper, and for append-heavy workloads log-structured alternatives avoid in-place page maintenance altogether.
This post goes deeper than the data structure itself. We will look at how a database lays out B-tree nodes on disk pages, what happens physically during a page split, how fill factor trades write performance for read density, and the critical difference between clustered and non-clustered indexes. Understanding these internals is the difference between an index that makes your query fast and one that causes unnecessary I/O.
How It Works
Disk Page Layout
A database B-tree node lives inside a disk page: a fixed-size block, typically 8 KB or 16 KB, with a well-defined internal layout. Think of it as a box where each compartment has a role in storing, accessing, and managing data; understanding that layout explains how a single page can be searched, updated, and split efficiently.
The page has a header, a slot array, key/pointer cells, and free space.
- Page Header: This contains essential metadata such as the page ID, level, key count, free space offset, and checksum. It helps the database engine manage the page effectively.
- Slot Array: This is where pointers to the actual data entries are stored. Each slot corresponds to a specific key/pointer cell, allowing quick access to the data.
- Key/Pointer Cells: These hold the actual keys and their corresponding pointers to data. They are critical for searching and retrieving data efficiently.
- Free Space: This area is reserved for future inserts and updates, ensuring that the page can grow as needed without requiring a complete rewrite.
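The four regions above can be sketched as a byte buffer: the header sits at offset 0, the slot array grows forward from it, cells grow backward from the end of the page, and free space is whatever remains in the middle. This mirrors the general scheme (PostgreSQL, for instance, grows line pointers from the front and tuple data from the back), but the field names and sizes here are illustrative, not any engine's on-disk format.

```python
import struct

PAGE_SIZE = 8192
HEADER_FMT = "<IHHH"                    # page_id, level, key_count, free_end
HEADER_SIZE = struct.calcsize(HEADER_FMT)

class Page:
    def __init__(self, page_id, level=0):
        self.buf = bytearray(PAGE_SIZE)
        self.page_id, self.level = page_id, level
        self.slots = []                 # cell offsets; kept in a list for brevity
        self.free_end = PAGE_SIZE       # cells grow down from the page end
        self._write_header()

    def add_cell(self, key, child_ptr):
        cell = struct.pack("<QQ", key, child_ptr)  # fixed 16-byte key/pointer cell
        self.free_end -= len(cell)
        self.buf[self.free_end:self.free_end + len(cell)] = cell
        self.slots.append(self.free_end)           # slot points at the new cell
        self._write_header()

    def _write_header(self):
        struct.pack_into(HEADER_FMT, self.buf, 0,
                         self.page_id, self.level, len(self.slots), self.free_end)

    def free_space(self):
        # free space lies between the slot array (2 bytes/slot) and the cells
        return self.free_end - (HEADER_SIZE + 2 * len(self.slots))

p = Page(page_id=7)
p.add_cell(key=42, child_ptr=1234)
print(p.free_space())
```

The indirection through the slot array is what lets the engine keep cells physically unsorted while presenting them in key order: reordering slots is cheap, moving cells is not.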