
Query Execution and Optimization: How Databases Turn SQL into Fast Plans

How the query optimizer transforms SQL into execution plans, how cost-based optimization picks the cheapest path, and why nested loop, hash join, and merge join exist for different workloads.

2025-10-04 · AI-Synthesized from Personal Notes
Database Internals · Query Optimization · Join Algorithms · Databases

Terminology

| Term | Definition | Trade-off / Gotcha |
| --- | --- | --- |
| Query Plan | A tree of physical operators (scan, join, sort, filter) that the database executes to produce the result of a SQL statement. | The same SQL can produce dozens of valid plans; the optimizer's job is to pick the cheapest one, but it sometimes gets it wrong. |
| Cost-Based Optimizer | A component that estimates the cost (disk I/O, CPU, memory) of each candidate plan using table statistics and picks the plan with the lowest estimated total cost. | Stale statistics lead to bad plans; always run ANALYZE after bulk loads or schema changes. |
| Cardinality Estimate | The optimizer's prediction of how many rows each operator will produce, used to size hash tables, choose join order, and estimate I/O. | Estimates compound multiplicatively through the plan tree, so a 2x error at the bottom can become 100x at the top. |
| Selectivity | The fraction of rows that pass a filter predicate, ranging from 0 (no rows) to 1 (all rows). | Correlated columns violate the independence assumption most optimizers use, causing severe underestimates. |
| Nested Loop Join | A join that iterates over the outer table and, for each row, scans the inner table (or its index) for matches. | Simple and efficient for small inner tables or indexed lookups, but $O(n \cdot m)$ without an index is catastrophic on large tables. |
| Hash Join | A join that builds a hash table from the smaller input (build side) and probes it with each row from the larger input (probe side). | Requires enough memory to hold the build-side hash table; if it spills to disk, performance drops dramatically. |
| Sort-Merge Join | A join that sorts both inputs on the join key and merges them in a single pass, advancing pointers through both sorted streams. | Excellent when inputs are already sorted (e.g., from an index scan), but the upfront sort cost is $O(n \log n)$ if not pre-sorted. |
| Sequential Scan | Reading every page of a table from first to last, processing all rows regardless of whether they match the query predicate. | Faster than an index scan when selectivity is above roughly 5-10%, because sequential I/O is much cheaper than random I/O. |
| Index Scan | Traversing a B-tree index to find matching keys, then fetching the corresponding rows from the heap or clustered index. | Each row fetch may be a random I/O; for low-selectivity queries this is fast, but for high-selectivity queries a sequential scan wins. |
| EXPLAIN ANALYZE | A diagnostic command that runs the query and reports the actual row counts, execution times, and memory usage at each operator node. | It actually executes the query (including writes for INSERT/UPDATE), so use it carefully on production data. |

What & Why

The previous posts covered how databases store data on disk: B-tree indexes for read-heavy workloads, LSM trees for write-heavy ones. But storage is only half the story. When you write SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.country = 'US' AND o.total > 100, the database must decide how to actually execute that query. Which table to scan first? Use an index or read the whole table? Which join algorithm? In what order?

This is the job of the query optimizer, and it is arguably the most complex component in any relational database. A good optimizer can make a query run in milliseconds; a bad plan for the same query can take hours. The difference is not the data or the hardware, it is the execution strategy.

The optimizer works in stages. First, the parser turns SQL text into an abstract syntax tree. Then the planner generates a logical plan (what operations to perform). The optimizer transforms this into a physical plan (how to perform each operation), evaluating potentially thousands of candidate plans and picking the one with the lowest estimated cost. Finally, the executor runs the chosen plan, streaming rows through the operator tree.

Understanding this pipeline is essential for anyone who writes SQL, because EXPLAIN ANALYZE output is the window into what the database actually did. When a query is slow, the answer is almost always in the plan: a bad cardinality estimate, a missing index, or the wrong join algorithm.
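The pipeline is easiest to see by asking a database for its plan. SQLite's planner is far simpler than PostgreSQL's, but its EXPLAIN QUERY PLAN output shows the same access-method decisions. A minimal sketch using Python's built-in sqlite3 module (the schema is invented to mirror the example query):

```python
import sqlite3

# In-memory database with a tiny schema mirroring the example query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX idx_customers_country ON customers(country);
""")

# EXPLAIN QUERY PLAN is SQLite's (much simpler) analogue of EXPLAIN:
# one row per table access, with the chosen method in the detail column.
rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT o.id, c.name FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.country = 'US' AND o.total > 100
""").fetchall()
for row in rows:
    print(row)
```

Each output row names a table and whether the planner chose a scan or an index search for it, which is exactly the access-method decision described above, minus the cost numbers PostgreSQL would show.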

How It Works

The Query Execution Pipeline

Every SQL statement passes through the same pipeline, from raw text to result rows. Each stage transforms the query into a progressively more concrete representation.

The parser validates syntax and produces an abstract syntax tree. The logical plan expresses the query as relational algebra: scan, filter, project, join. The optimizer is where the real work happens: it considers different join orders, access methods (index scan vs sequential scan), and join algorithms, estimating the cost of each combination using table statistics (row counts, column histograms, index selectivity). The physical plan is the winner, a concrete tree of operators with specific algorithms assigned. The executor walks this tree, pulling rows from the leaves (table scans) up through joins and filters to produce the final result.

Join Algorithms: The Core Decision

The choice of join algorithm is often the single biggest factor in query performance. Databases implement three fundamental join algorithms, each optimal for different data sizes and access patterns.

Hash Join Memory Layout

The hash join is the workhorse for large analytical queries. Understanding its two-phase execution explains both its speed and its memory sensitivity.

During the build phase, the database reads every row from the smaller table (the build side), hashes the join key, and inserts the row into an in-memory hash table organized as an array of buckets. Each bucket is a linked list of entries that hash to the same slot. During the probe phase, the database streams every row from the larger table (the probe side), hashes its join key, looks up the matching bucket, and walks the chain to find all matching rows.

[Diagram: hash join build and probe phases. Build side: customers (1K rows), hashed on customer_id into an in-memory array of buckets, each bucket a chain of entries; memory is O(build side rows), and if the build side exceeds work_mem, partitions spill to disk. Probe side: orders (10M rows), streamed one row at a time; each row's key is hashed to a bucket and the chain is walked for matches. Probe cost is O(probe side rows), each probe O(1) expected.]

The key insight: the build side must fit in memory (or at least in work_mem). If it does not, the database partitions both inputs by hash value, writes partitions to temporary files on disk, and processes each partition pair independently. This "grace hash join" is slower but still $O(n + m)$ in total work.
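The partition step can be sketched in Python; lists stand in for the temporary files a real grace hash join writes to disk, and `num_partitions` is an illustrative parameter:

```python
from collections import defaultdict

def grace_hash_join(build_rows, probe_rows, build_key, probe_key, num_partitions=4):
    """Partition both inputs by hash of the join key, then hash-join each
    partition pair independently. In a real database the partitions would
    be temporary files on disk, not in-memory lists."""
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build_rows:
        build_parts[hash(row[build_key]) % num_partitions].append(row)
    for row in probe_rows:
        probe_parts[hash(row[probe_key]) % num_partitions].append(row)

    results = []
    for p in range(num_partitions):
        # Classic in-memory hash join on one partition pair.
        table = defaultdict(list)
        for row in build_parts[p]:
            table[row[build_key]].append(row)
        for row in probe_parts[p]:
            for match in table[row[probe_key]]:
                results.append((match, row))
    return results
```

Because rows with equal keys hash to the same partition on both sides, each partition pair can be joined independently, and every row is still hashed once and probed once: total work stays $O(n + m)$, just with extra I/O.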

Join Performance at Scale

The performance difference between join algorithms becomes dramatic as table sizes grow. Nested loop joins degrade quadratically, while hash joins and merge joins scale linearly.

At 1K rows, all three algorithms finish in single-digit milliseconds and the choice barely matters. At 100K rows, nested loop without an index is already 70x slower than hash join. At 10M rows, nested loop takes over 90 seconds while hash join finishes in under 200 ms. This is why the optimizer's join algorithm choice is so critical for analytical queries on large tables.

Note: nested loop with an index on the inner table performs much better (closer to hash join for selective lookups), which is why the optimizer often picks nested loop for OLTP queries that join a small result set to a large indexed table.

Query Plan Tree: Reading EXPLAIN Output

When you run EXPLAIN ANALYZE, the database shows you the physical plan as a tree of operator nodes. Each node reports its estimated and actual row counts, startup cost, total cost, and execution time. Reading this tree from bottom to top tells you exactly what the database did.

Reading this plan bottom-up: the database scans the customers table using an index (filtering to country = 'US', yielding an estimated 800 rows), builds a hash table from those 800 rows, then sequentially scans the orders table (filtering to total > 100, yielding an estimated 50K rows), probes the hash table for each order row, sorts the 42 matching results, and returns them. The optimizer chose hash join because the build side (800 customer rows) fits easily in memory, and the probe side (50K order rows) is too large for nested loop to be efficient.

Sort-Merge Join: Two Sorted Streams

The sort-merge join works by ensuring both inputs are sorted on the join key, then merging them in a single linear pass. Two pointers advance through the sorted streams, matching rows where the keys are equal.

[Diagram: sort-merge join merging two sorted streams. Left: orders pages sorted by cust_id (e.g., [1, 1, 2, 3, 3, 3], [4, 5, 5, 6, 7, 7], ...); right: customers pages sorted by id ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], ...). Each pointer advances while its key is smaller than the other side's; both streams are read in page order, one sequential pass each, until the pointers meet on a matching key.]

When both pointers point to equal keys (both at cust_id = 5), the join emits all combinations of matching rows. Then both pointers advance past the matched key. Because both streams are sorted, each page is read exactly once, making the merge phase purely sequential I/O. If the inputs come from index scans that already produce sorted output, the sort step is free and the total cost is just $O(n + m)$.

Join Algorithm Feature Matrix

[Interactive chart not reproduced; the complexity table below summarizes the same trade-offs.]

Cost-Based Optimization: How the Optimizer Chooses

The optimizer does not try every possible plan exhaustively (for a 10-table join, there are over 17 billion possible join trees once bushy shapes are counted). Instead, it uses dynamic programming to build optimal sub-plans bottom-up, pruning dominated alternatives early.

For each operator, the optimizer estimates cost using three inputs:

  1. Table statistics: row counts, column cardinality (number of distinct values), histograms of value distribution, correlation between columns
  2. Selectivity estimates: what fraction of rows pass each filter predicate
  3. Cost model: I/O cost per page read (sequential vs random), CPU cost per row comparison, memory cost for hash tables and sort buffers

The total cost of a plan is the sum of all operator costs. The optimizer picks the plan with the lowest total estimated cost. When statistics are accurate, this works remarkably well. When statistics are stale or the optimizer's independence assumptions are wrong (correlated columns), the chosen plan can be orders of magnitude slower than optimal.

$\text{Cost}(plan) = \sum_{op \in plan} \left( \text{IO}_{seq}(op) \cdot C_{seq} + \text{IO}_{rand}(op) \cdot C_{rand} + \text{rows}(op) \cdot C_{cpu} \right)$

Where $C_{seq}$ is the cost of a sequential page read (cheap), $C_{rand}$ is the cost of a random page read (expensive, roughly 4x $C_{seq}$ on SSD, 100x on HDD), and $C_{cpu}$ is the per-row processing cost.
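The formula translates directly into a toy cost function. The constants below are loosely modeled on PostgreSQL's defaults (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01); the page and row counts in the example plan are invented:

```python
# Illustrative cost constants, loosely modeled on PostgreSQL's defaults.
C_SEQ, C_RAND, C_CPU = 1.0, 4.0, 0.01

def operator_cost(seq_pages, rand_pages, rows):
    """Cost of one operator, per the formula above."""
    return seq_pages * C_SEQ + rand_pages * C_RAND + rows * C_CPU

def plan_cost(operators):
    """Total plan cost is the sum of operator costs."""
    return sum(operator_cost(*op) for op in operators)

# A hypothetical plan: (seq_pages, rand_pages, rows) per operator.
plan = [
    (1000, 0, 50_000),  # seq scan on orders: 1000 pages, 50K rows pass filter
    (0, 40, 800),       # index scan on customers: 40 random pages, 800 rows
    (0, 0, 50_800),     # hash join: CPU only, processes all incoming rows
]
print(plan_cost(plan))  # → 2176.0
```

Note how the random-page multiplier makes the 40-page index scan cost more per page than the sequential scan; this asymmetry is exactly why the optimizer switches to sequential scans at higher selectivities.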

Complexity Analysis

All complexities assume two input tables of size $n$ (outer/left) and $m$ (inner/right), with $B$ being the branching factor of any B-tree index used.

| Operation | Time Complexity | Memory | Disk I/O Pattern | Notes |
| --- | --- | --- | --- | --- |
| Nested Loop (no index) | $O(n \cdot m)$ | $O(1)$ | Random | Reads inner table once per outer row |
| Nested Loop (indexed) | $O(n \cdot \log_B m)$ | $O(1)$ | Random (index) | One index traversal per outer row |
| Hash Join | $O(n + m)$ | $O(\min(n, m))$ | Sequential | Build on smaller, probe with larger |
| Hash Join (spill to disk) | $O(n + m)$ | $O(\text{work\_mem})$ | Mixed | Partitions spill, then process per-partition |
| Sort-Merge (unsorted) | $O(n \log n + m \log m)$ | $O(n + m)$ | Sequential | Dominated by sort cost |
| Sort-Merge (pre-sorted) | $O(n + m)$ | $O(1)$ | Sequential | Just the merge pass |
| Sequential Scan | $O(n)$ | $O(1)$ | Sequential | Reads every page in table order |
| Index Scan ($k$ matches) | $O(\log_B n + k)$ | $O(1)$ | Random | One traversal + leaf walk |

The nested loop join's $O(n \cdot m)$ cost is the reason it is catastrophic for large tables. For $n = m = 10{,}000$:

$\text{Nested Loop} = 10{,}000 \times 10{,}000 = 10^8 \text{ comparisons}$

$\text{Hash Join} = 10{,}000 + 10{,}000 = 20{,}000 \text{ operations}$

That is a 5,000x difference. With an index on the inner table, nested loop improves to $O(n \cdot \log_B m)$, which for $B = 200$ and $m = 10{,}000$ is roughly $10{,}000 \times 2 = 20{,}000$ index lookups, comparable to hash join. This is why the optimizer picks nested loop for indexed joins on small result sets but switches to hash join for large unindexed joins.
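A quick sanity check of this arithmetic in Python ($B = 200$ is the branching factor assumed in the text):

```python
import math

n = m = 10_000
nested_loop = n * m                          # no index: every pair compared
hash_join = n + m                            # build + probe, one pass each
B = 200
indexed_nl = n * math.ceil(math.log(m, B))   # one B-tree descent per outer row

print(nested_loop // hash_join)  # → 5000
print(indexed_nl)                # → 20000
```

The indexed nested loop lands at the same 20,000 operations as the hash join, which is why the optimizer treats the two as competitive for this shape of query and lets the cost model break the tie.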

The hash join's memory requirement of $O(\min(n, m))$ is the critical constraint. If the build side exceeds work_mem (default 4 MB in PostgreSQL), the database must partition and spill to disk. Each spill roughly doubles the I/O cost. The optimizer estimates the build side size and may choose sort-merge join instead if it predicts a spill.

Sort-merge join's advantage appears when inputs are already sorted. An index scan on a B-tree index produces rows in sorted order by the index key. If both sides of the join come from index scans on the join key, the sort step is free and the merge is $O(n + m)$ with purely sequential I/O. The optimizer recognizes this "interesting order" property and may choose a more expensive access method (index scan instead of sequential scan) specifically because it eliminates a downstream sort.

Implementation

Pseudocode: Nested Loop Join

ALGORITHM NestedLoopJoin(outer, inner, predicate)
INPUT: outer: table scan or index scan iterator
       inner: table scan or index scan iterator
       predicate: join condition function
OUTPUT: stream of joined row pairs

BEGIN
  FOR EACH row_o IN outer DO
    inner.Reset()
    FOR EACH row_i IN inner DO
      IF predicate(row_o, row_i) THEN
        EMIT (row_o, row_i)
      END IF
    END FOR
  END FOR
END
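A direct Python transcription of the pseudocode; a generator preserves the streaming, row-at-a-time behavior, and re-iterating a list plays the role of inner.Reset():

```python
def nested_loop_join(outer, inner, predicate):
    """For each outer row, rescan the entire inner input: O(n * m)."""
    for row_o in outer:
        for row_i in inner:  # restarting the loop == inner.Reset()
            if predicate(row_o, row_i):
                yield (row_o, row_i)

# Invented sample data for illustration.
orders = [{"customer_id": 1, "total": 150}, {"customer_id": 2, "total": 80}]
customers = [{"id": 1, "name": "Ada"}, {"id": 3, "name": "Bob"}]
pairs = list(nested_loop_join(orders, customers,
                              lambda o, c: o["customer_id"] == c["id"]))
print(pairs)  # one match: the customer-1 order joined to Ada
```

Note that the predicate is an arbitrary function: nested loop is the only one of the three algorithms that handles non-equality join conditions (e.g., range joins) without modification.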

Pseudocode: Hash Join

ALGORITHM HashJoin(build_input, probe_input, join_key)
INPUT: build_input: the smaller table (iterator)
       probe_input: the larger table (iterator)
       join_key: column(s) to join on
OUTPUT: stream of joined row pairs

BEGIN
  // Build phase: load smaller table into hash table
  hash_table = new HashTable()

  FOR EACH row IN build_input DO
    key = row[join_key]
    bucket = Hash(key) MOD num_buckets
    hash_table[bucket].Append(row)
  END FOR

  // Probe phase: stream larger table through hash table
  FOR EACH row_p IN probe_input DO
    key = row_p[join_key]
    bucket = Hash(key) MOD num_buckets

    FOR EACH row_b IN hash_table[bucket] DO
      IF row_b[join_key] = row_p[join_key] THEN
        EMIT (row_b, row_p)
      END IF
    END FOR
  END FOR
END
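The same algorithm in Python; a dict of lists plays the role of the bucket array, so the explicit Hash(key) MOD num_buckets step disappears. Unlike the pseudocode, separate key names are taken for each side, since real joins usually match differently named columns:

```python
from collections import defaultdict

def hash_join(build_input, probe_input, build_key, probe_key):
    """Build a hash table on the smaller input, then stream the larger one."""
    buckets = defaultdict(list)
    for row in build_input:                    # build phase: O(min(n, m)) memory
        buckets[row[build_key]].append(row)
    for row_p in probe_input:                  # probe phase: O(1) expected per row
        for row_b in buckets.get(row_p[probe_key], []):
            yield (row_b, row_p)
```

The dict lookup hides the bucket-chain walk: Python resolves hash collisions internally, which is exactly the role of the Append/compare loop in the pseudocode.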

Pseudocode: Sort-Merge Join

ALGORITHM SortMergeJoin(left, right, join_key)
INPUT: left: sorted iterator on join_key
       right: sorted iterator on join_key
OUTPUT: stream of joined row pairs

BEGIN
  row_l = left.Next()
  row_r = right.Next()

  WHILE row_l IS NOT NULL AND row_r IS NOT NULL DO
    IF row_l[join_key] < row_r[join_key] THEN
      row_l = left.Next()
    ELSE IF row_l[join_key] > row_r[join_key] THEN
      row_r = right.Next()
    ELSE
      // Keys match: for every left row with this key, emit all
      // combinations with the matching key group on the right
      match_key = row_l[join_key]
      mark = right.Mark()

      WHILE row_l IS NOT NULL AND row_l[join_key] = match_key DO
        right.RestoreToMark(mark)
        row_r = right.Current()

        WHILE row_r IS NOT NULL AND row_r[join_key] = match_key DO
          EMIT (row_l, row_r)
          row_r = right.Next()
        END WHILE

        row_l = left.Next()
      END WHILE
      // row_r now points just past the matched key group on the right
    END IF
  END WHILE
END
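Over two sorted Python lists, the Mark/RestoreToMark machinery reduces to remembering the index where the current right-side key group starts:

```python
def sort_merge_join(left, right, key):
    """Join two lists of dicts that are already sorted on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = left[i][key]
            mark = j                       # Mark(): start of right key group
            while i < len(left) and left[i][key] == k:
                j = mark                   # RestoreToMark()
                while j < len(right) and right[j][key] == k:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
            # j now points just past the matched group; keep merging
    return out
```

The restore step matters only when the left side has duplicate keys: each duplicate left row must revisit the whole right-side group, so the merge is linear except within matching groups.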

Pseudocode: Cost-Based Plan Selection

ALGORITHM ChooseBestPlan(logical_plan, statistics)
INPUT: logical_plan: relational algebra tree
       statistics: table row counts, histograms, index info
OUTPUT: physical plan with lowest estimated cost

BEGIN
  candidates = empty list

  // Generate candidate physical plans
  FOR EACH join_order IN EnumerateJoinOrders(logical_plan) DO
    FOR EACH access_method IN [SeqScan, IndexScan] DO
      FOR EACH join_algo IN [NestedLoop, HashJoin, MergeJoin] DO
        plan = BuildPhysicalPlan(join_order, access_method, join_algo)
        plan.cost = EstimateCost(plan, statistics)
        candidates.Append(plan)
      END FOR
    END FOR
  END FOR

  // Pick the plan with lowest estimated cost
  best = candidates[0]
  FOR EACH plan IN candidates DO
    IF plan.cost < best.cost THEN
      best = plan
    END IF
  END FOR

  RETURN best
END
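In the common case the cost comparison collapses to a few size-based rules. A heavily simplified decision sketch (the thresholds are invented for illustration; no real optimizer is this crude):

```python
def choose_join_algorithm(outer_rows, inner_rows, inner_indexed, work_mem_rows):
    """Toy version of the optimizer's join-algorithm choice.
    Thresholds are illustrative, not any real database's rules."""
    if inner_indexed and outer_rows < 1_000:
        return "nested_loop"   # few index probes beat building a hash table
    if min(outer_rows, inner_rows) <= work_mem_rows:
        return "hash_join"     # build side fits in memory
    return "sort_merge"        # avoid a guaranteed hash-join spill
```

Even this toy version captures the three regimes from the complexity table: small driving side with an index, build side that fits in memory, and everything else.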

Real-World Applications

  • PostgreSQL's optimizer switches to a genetic algorithm (GEQO) for queries joining 12 or more tables (the geqo_threshold setting), because the dynamic programming approach becomes too expensive with that many join orderings.
  • MySQL 8.0 introduced a hash join executor, replacing the old "block nested loop" that was the only non-indexed join algorithm for decades. Before 8.0, joining two large unindexed tables was painfully slow.
  • Data warehouses like Snowflake, BigQuery, and Redshift rely almost exclusively on hash joins for their analytical workloads, because the tables are large, unsorted, and stored in columnar format.
  • EXPLAIN ANALYZE in PostgreSQL and EXPLAIN FORMAT=TREE in MySQL are the primary tools for diagnosing slow queries. The gap between estimated rows and actual rows at any node is the first thing to check.
  • Adaptive query execution in Spark 3.0+ re-optimizes the plan at runtime based on actual row counts from completed stages, fixing bad cardinality estimates mid-query.
  • Covering indexes (including all queried columns in the index) eliminate bookmark lookups entirely, turning what would be a nested loop with random I/O into a pure index-only scan.
  • Join order matters enormously: joining a 10-row filtered result to a billion-row table (small drives large) is fast with nested loop + index, but joining the billion-row table first (large drives small) is catastrophic.

Key Takeaways

  • The query optimizer transforms SQL into a physical execution plan by estimating costs using table statistics; stale statistics are the most common cause of bad plans.
  • Three join algorithms cover all cases: nested loop for small/indexed joins, hash join for large equality joins, sort-merge join for pre-sorted inputs.
  • Hash join is $O(n + m)$ but requires $O(\min(n, m))$ memory; if the build side exceeds work_mem, performance degrades from disk spills.
  • Nested loop without an index is $O(n \cdot m)$ and should never be used on large tables; with an index it becomes $O(n \cdot \log m)$ and is the best choice for selective OLTP queries.
  • EXPLAIN ANALYZE shows you the actual execution plan with real row counts and timings; the gap between estimated and actual rows reveals where the optimizer went wrong.
  • Join order is as important as join algorithm: always let the optimizer put the smaller, more selective table on the driving side of the join.