
B-Tree Indexes Deep Dive: Why Your WHERE Clause Is Fast

How B-trees store sorted data on disk, how page splits work, what fill factor controls, and the difference between clustered and non-clustered indexes.

2025-10-02
Tags: Database Internals · B-Tree · Indexes · Databases

Terminology

  • B-Tree: A self-balancing tree that keeps its keys sorted and supports efficient search, insertion, and deletion. Each node holds many keys and child pointers, and node size is matched to the disk page size to minimize disk I/O. Gotcha: not the same as a binary tree; database B-trees have branching factors in the hundreds, not two.
  • Page: A fixed-size block (typically 4 KB, 8 KB, or 16 KB) that the database reads from and writes to disk in a single I/O operation. Gotcha: larger pages mean fewer I/Os but more wasted space when rows are small or access is random.
  • Fill Factor: The percentage of each page filled with data during index creation or rebuild, leaving the rest as free space for future inserts. Like packing a suitcase: pack too tightly and you cannot add anything later; leave too much room and you waste space. Gotcha: too high a fill factor causes frequent page splits on write-heavy tables; too low wastes disk space and slows reads.
  • Page Split: When an insert targets a full page, the database divides its contents between the old page and a newly allocated one to make room, and posts a new separator key to the parent. Gotcha: splits add write overhead and leave both pages partially empty until they fill back up.
  • Index: A data structure that speeds up retrieval from a table at the cost of extra space and maintenance overhead. Like the index of a book, it lets you find what you need without reading every page. Gotcha: indexes accelerate reads but slow writes, since every insert, update, and delete must also maintain the index.
  • Node: The building block of a B-tree, containing keys and pointers to child nodes. Holding multiple keys per node is what keeps the tree wide and shallow. Gotcha: a full node forces a page split on the next insert that lands in it.
  • Leaf Node: A node with no children. Leaf nodes hold the actual data (or pointers to it) and are the endpoint of every search. Gotcha: leaf density directly determines range-scan cost, since a scan reads leaves sequentially.
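To make the page and fill-factor numbers concrete, here is a rough capacity estimate in Python. The page, header, slot, and cell sizes below are illustrative assumptions, not any particular engine's actual layout:

```python
# Rough capacity estimate for a slotted page (illustrative sizes only).
PAGE_SIZE = 8192   # assumed 8 KB page
HEADER = 24        # assumed fixed page header size in bytes
SLOT = 4           # assumed bytes per slot-array entry
CELL = 16          # assumed bytes per key/pointer cell

def entries_per_page(fill_factor: float) -> int:
    """Entries that fit when only fill_factor of the page is used at build time."""
    usable = (PAGE_SIZE - HEADER) * fill_factor
    return int(usable // (SLOT + CELL))

print(entries_per_page(1.0))   # fully packed page
print(entries_per_page(0.7))   # 30% left free to absorb future inserts
```

Even with these toy numbers, the trade-off is visible: dropping the fill factor from 100% to 70% gives up roughly a third of each page's capacity in exchange for splits happening far less often.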

What & Why

Every database query that uses a WHERE clause on an indexed column owes its speed to a B-tree. When you write SELECT * FROM orders WHERE customer_id = 42, the database does not scan every row in the table. Instead, it walks down a B-tree index, reading 3 or 4 disk pages to find exactly the rows it needs out of billions.

B-trees are the default index structure in PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite. They are so dominant because they handle both point lookups ($O(\log_B n)$ where $B$ is the branching factor) and range scans efficiently, they stay balanced automatically through page splits, and their wide, shallow shape maps perfectly onto how disks and SSDs transfer data in fixed-size pages.

However, B-trees are not without drawbacks. On low-cardinality columns (a few distinct values repeated many times), the index is rarely selective enough to beat a sequential scan, so lookups gain little. Write-heavy workloads pay for the tree's self-balancing: inserts trigger page splits, and random key distributions such as UUIDs scatter those splits across the entire index. Heavy update and delete traffic likewise leaves pages fragmented and underfull until the index is rebuilt. For write-dominated workloads, log-structured alternatives such as LSM-trees are often a better fit, and a hash index can win when only equality lookups are needed.

This post goes deeper than the data structure itself. We will look at how a database lays out B-tree nodes on disk pages, what happens physically during a page split, how fill factor trades write performance for read density, and the critical difference between clustered and non-clustered indexes. Understanding these internals is the difference between an index that makes your query fast and one that causes unnecessary I/O.

How It Works

Disk Page Layout

Think of a database B-tree node like a physical box where data is stored. This box, known as a disk page, typically measures 8 KB or 16 KB and has a specific layout that organizes the information inside. Understanding how this box is structured is key to grasping how databases work efficiently. Each part of the box has a role in storing, accessing, and managing data.

The page has a header, a slot array, key/pointer cells, and free space.

  • Page Header: This contains essential metadata such as the page ID, level, key count, free space offset, and checksum. It helps the database engine manage the page effectively.
  • Slot Array: This is where pointers to the actual data entries are stored. Each slot corresponds to a specific key/pointer cell, allowing quick access to the data.
  • Key/Pointer Cells: These hold the actual keys and their corresponding pointers to data. They are critical for searching and retrieving data efficiently.
  • Free Space: This area is reserved for future inserts and updates, ensuring that the page can grow as needed without requiring a complete rewrite.
[Figure: an 8 KB disk page (internal node). The page header carries the page ID, level, key count, free space offset, and checksum; slots 0-3 of the slot array follow.]
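The layout above can be sketched as a small data structure. This is a simplified teaching model with illustrative field names, not any engine's on-disk format:

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Page:
    """Simplified slotted page: header fields plus a sorted slot array."""
    page_id: int
    level: int                                  # 0 = leaf, >0 = internal node
    slots: list = field(default_factory=list)   # (key, pointer) cells, sorted by key

    @property
    def key_count(self) -> int:
        # Header metadata; here derived directly from the slot array.
        return len(self.slots)

    def search(self, key):
        """Binary-search the slot array; return the pointer whose key range covers key."""
        keys = [k for k, _ in self.slots]
        i = bisect_right(keys, key) - 1
        return self.slots[i][1] if i >= 0 else None

page = Page(page_id=7, level=1,
            slots=[(10, "child_a"), (20, "child_b"), (30, "child_c")])
print(page.search(25))   # 25 falls in [20, 30) -> "child_b"
```

The slot array is the key design point: because only the small slot entries need to stay sorted, an insert can append its cell anywhere in the page's free space and just shift slot pointers, rather than physically moving existing cells.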

Complexity Analysis

All complexities below assume a B-tree of order $B$ (branching factor) with $n$ total records and tree height $h = \lceil \log_B(n) \rceil$. Note that, unlike a binary search tree, a B-tree cannot degenerate: page splits and merges keep every leaf at the same depth, so the $O(\log_B n)$ bound for point lookups holds regardless of insertion order. What does degrade in practice is page density. Heavy insert and delete churn leaves pages fragmented and underfull, which raises the constant factor (more pages read for the same data) even though the height stays logarithmic.

Similarly, the $O(\log_B n + k/B)$ bound for range scans counts only index-page I/O. With a non-clustered index, each matching row may require an extra fetch from the table (a bookmark lookup), and when those rows are not physically contiguous the fetches become random I/O: up to $k$ additional page reads rather than $k/B$.

Operation Time Complexity Disk I/Os Notes
Point lookup $O(\log_B n)$ $h$ (typically 3-4) One page read per tree level
Range scan ($k$ rows) $O(\log_B n + k/B)$ $h + \lceil k/F \rceil$ $F$ = records per leaf page
Insert (no split) $O(\log_B n)$ $h$ reads + 1 write Traverse to leaf, append key
Insert (with split) $O(\log_B n)$ $h$ reads + 3 writes Write both halves of the split leaf + update parent; splits can cascade upward
Delete $O(\log_B n)$ $h$ reads + 1 write May trigger merge if page underfull
Index build (bulk) $O(n \log_B n)$ $O(n/B)$ Sort-then-build avoids random I/O

The key insight is that the branching factor $B$ is large (100-500), so $\log_B(n)$ grows extremely slowly. For $n = 10^9$ and $B = 500$:

$h = \lceil \log_{500}(10^9) \rceil = \lceil \frac{9}{\log_{10}(500)} \rceil = \lceil \frac{9}{2.699} \rceil = \lceil 3.33 \rceil = 4$

Four disk reads to find any row among a billion. With the root and first level or two almost always resident in the buffer pool, the real cost is often just one or two physical reads.
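The height formula above is easy to check in code, using nothing but Python's standard library:

```python
import math

def btree_height(n: int, branching: int) -> int:
    """Tree height = number of page reads for a root-to-leaf point lookup."""
    return math.ceil(math.log(n, branching))

print(btree_height(10**9, 500))   # 4 page reads for a billion keys
print(btree_height(10**9, 2))     # vs. 30 levels for a binary tree
```

The comparison with a branching factor of 2 shows exactly why databases use wide nodes: the same billion keys that fit in a 4-level B-tree would need a 30-level binary tree, and every extra level is another disk read.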

Implementation

ALGORITHM BTreeInsert(root, key, record)
INPUT: root: page ID of the B-tree root
       key: the index key to insert
       record: the data or row pointer to store
OUTPUT: new root page ID (may change if root splits)

BEGIN
  // Phase 1: Traverse to the correct leaf
  path = empty stack
  current = BufferPool.FetchPage(root)

  WHILE current.level > 0 DO
    path.Push(current)
    child_index = BinarySearch(current.keys, key)
    current = BufferPool.FetchPage(current.children[child_index])
  END WHILE

  // Phase 2: Insert into the leaf
  InsertIntoPage(current, key, record)

  // Phase 3: Split if necessary, propagate up
  WHILE current.IsFull() DO
    median = SplitPage(current)  // moves the upper half of the keys to a new page
    IF path.IsEmpty() THEN
      // Splitting the root: create a new root with both halves as children
      new_root = CreateNewRoot(median.key, current.page_id, median.new_page_id)
      RETURN new_root
    ELSE
      parent = path.Pop()
      InsertIntoPage(parent, median.key, median.new_page_id)
      current = parent
    END IF
  END WHILE

  RETURN root  // root unchanged when no root split occurred
END
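As a sanity check, the pseudocode above translates almost line for line into a runnable sketch. In-memory nodes stand in for buffer-pool pages, and MAX_KEYS is artificially small so splits actually happen:

```python
from bisect import bisect_right

MAX_KEYS = 4  # tiny branching factor so splits are easy to observe

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys in this node
        self.children = []   # child nodes; internal nodes hold len(keys) + 1

def insert(root, key):
    """Insert key and return the (possibly new) root, mirroring the three phases."""
    # Phase 1: traverse to the correct leaf, remembering the path.
    path, node = [], root
    while not node.leaf:
        path.append(node)
        node = node.children[bisect_right(node.keys, key)]
    # Phase 2: insert into the leaf at its sorted position.
    node.keys.insert(bisect_right(node.keys, key), key)
    # Phase 3: split overfull nodes, pushing the median up; may create a new root.
    while len(node.keys) > MAX_KEYS:
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(leaf=node.leaf)
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        if not node.leaf:
            right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
        if not path:                      # split reached the root: tree grows taller
            new_root = Node(leaf=False)
            new_root.keys = [median]
            new_root.children = [node, right]
            return new_root
        parent = path.pop()
        i = bisect_right(parent.keys, median)
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)
        node = parent
    return root

def search(node, key):
    """Walk root-to-leaf, checking keys at each level (classic B-tree, not B+)."""
    while True:
        i = bisect_right(node.keys, key)
        if i > 0 and node.keys[i - 1] == key:
            return True
        if node.leaf:
            return False
        node = node.children[i]

root = Node()
for k in range(1, 101):
    root = insert(root, k)
print(all(search(root, k) for k in range(1, 101)))   # True
print(search(root, 500))                             # False
```

Note that this models a classic B-tree (the median moves up and out of the split node); B+ tree variants used by most engines instead copy the separator up and keep all records in the leaves.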

Real-World Applications

PostgreSQL uses B-tree indexes as the default index type for all primary keys and any column in a plain CREATE INDEX statement. Its implementation is a variant of the Lehman-Yao B-link tree: every page carries a pointer to its right sibling, which lets readers and writers traverse the tree concurrently without holding locks on parent nodes during splits.

MySQL/InnoDB uses a clustered index on the primary key for every table. The leaf pages of the primary key index contain the actual row data. All secondary (non-clustered) indexes store the primary key value as their row pointer instead of a physical Row ID, so a secondary index lookup requires two B-tree traversals: one on the secondary index to find the primary key, then one on the clustered index to find the row.
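The two-traversal lookup can be modeled with two plain maps standing in for the two B-trees. This is a toy illustration; the table contents and column names are invented:

```python
# Toy model of InnoDB's layout:
# the clustered index maps primary key -> full row,
# a secondary index maps the indexed column -> primary key.
clustered = {
    1: {"id": 1, "email": "a@example.com", "country": "DE"},
    2: {"id": 2, "email": "b@example.com", "country": "US"},
}
secondary_by_email = {"a@example.com": 1, "b@example.com": 2}

def lookup_by_email(email):
    """Traversal 1: secondary index -> PK. Traversal 2: clustered index -> row."""
    pk = secondary_by_email.get(email)
    return clustered.get(pk) if pk is not None else None

print(lookup_by_email("b@example.com")["country"])   # two lookups, one row
```

The design choice behind storing the primary key instead of a physical row ID is stability: when a page split moves a row, only the clustered index changes, and no secondary index needs rewriting.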

SQLite stores the entire database in a single file, with each table and index as a separate B-tree. Tables use B+ trees (data only in leaves), while indexes use regular B-trees. The page size defaults to 4096 bytes and can be configured at database creation.
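You can observe the page size directly from Python's built-in sqlite3 module (the value is commonly 4096 on current builds, but it depends on the SQLite version and build options):

```python
import sqlite3

# Every SQLite database, file-backed or in-memory, is a collection of
# fixed-size pages; each table and each index is its own B-tree of pages.
conn = sqlite3.connect(":memory:")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
print(page_size)
conn.close()
```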

SQL Server offers both clustered and non-clustered indexes with explicit fill factor control. The FILLFACTOR option in CREATE INDEX lets you set the percentage (1-100) of each leaf page to fill during index creation. SQL Server also supports "included columns" in non-clustered indexes, which store additional non-key columns in the leaf pages to create covering indexes that avoid bookmark lookups.

MongoDB uses B-tree indexes (specifically, WiredTiger's B-tree implementation) for its standard secondary indexes.

Key Takeaways

  • A B-tree index node is a physical disk page with a header, slot array, and cell area. The slot array enables sorted access without physically moving cells during inserts.
  • Tree height determines lookup cost: with branching factors of 100-500, even billions of rows need only 3-5 disk reads. The root page is almost always cached, reducing this further.
  • Page splits are the most expensive write operation. They allocate a new page, redistribute keys, and update the parent. Fill factor controls how often splits happen by leaving free space in pages.
  • Clustered indexes store actual row data in the leaf pages, making range scans sequential I/O. Non-clustered indexes store key + Row ID pointers, requiring a bookmark lookup for non-covered columns.
  • Fill factor is a read vs write trade-off: a high fill factor (90-100%) packs more data per page for faster reads, while a low fill factor (50-70%) leaves room for inserts to avoid splits.
  • B-trees are the default for database indexing because a single structure handles equality, range, prefix, and ORDER BY queries effectively. Hash indexes support only equality lookups, and BRIN indexes only pay off when the data's physical order correlates with the indexed key.