CPU Architecture: Pipelines, Caches, and Branch Prediction
How modern processors execute billions of instructions per second using pipelining, cache hierarchies, branch prediction, and instruction-level parallelism.
Terminology
- CPU (Central Processing Unit): the primary processor in a computer that executes instructions from a program by performing arithmetic, logic, control, and input/output operations
- Instruction: a single operation that the CPU can execute, such as adding two numbers, loading data from memory, or jumping to a different part of the program
- Pipeline: a technique where multiple instructions are overlapped in execution by dividing the processor into stages, each handling a different phase of an instruction simultaneously
- Pipeline stage: one step in the pipeline sequence; classic stages include fetch, decode, execute, memory access, and write-back
- Clock cycle: the basic unit of time for the CPU, determined by the clock frequency; one tick of the processor's internal clock
- Throughput: the number of instructions completed per unit of time, often measured in instructions per cycle (IPC)
- Latency: the total time from when an instruction enters the pipeline to when it completes
- Hazard: a situation that prevents the next instruction from executing in the next clock cycle, causing a pipeline stall
- Data hazard: occurs when an instruction depends on the result of a previous instruction that has not yet completed
- Control hazard: occurs when the pipeline must make a decision based on the result of a branch instruction that has not yet resolved
- Structural hazard: occurs when two instructions need the same hardware resource at the same time
- Branch prediction: a technique where the CPU guesses the outcome of a conditional branch before it is resolved, allowing the pipeline to continue fetching instructions speculatively
- Cache: a small, fast memory located close to the CPU that stores copies of frequently accessed data to reduce the time needed to access main memory
- Cache hit: when the requested data is found in the cache
- Cache miss: when the requested data is not in the cache and must be fetched from a slower level of memory
- ILP (Instruction-Level Parallelism): the ability to execute multiple instructions simultaneously by exploiting independence between instructions
- Superscalar: a CPU design that can issue and execute more than one instruction per clock cycle using multiple execution units
- Out-of-order execution: a technique where the CPU executes instructions in an order determined by data dependencies rather than the original program order, to maximize utilization of execution units
- Register: a very small, very fast storage location inside the CPU used to hold data currently being processed
- ALU (Arithmetic Logic Unit): the component of the CPU that performs arithmetic and logical operations
- Forwarding (bypassing): a hardware optimization that routes the result of one pipeline stage directly to an earlier stage that needs it, avoiding a stall
- Speculative execution: executing instructions before it is certain they are needed, based on predictions; results are discarded if the prediction was wrong
What & Why
A modern CPU executes billions of instructions per second, yet each individual instruction takes multiple clock cycles to complete. The key insight behind modern processor design is that you do not need to finish one instruction before starting the next. By breaking instruction execution into stages and overlapping them, a processor can start a new instruction every cycle even though each instruction takes several cycles to finish. This is pipelining, and it is the foundation of high-performance CPU design.
But pipelining alone is not enough. Memory is slow compared to the CPU, so caches store frequently used data close to the processor. Branch instructions create uncertainty about which instruction comes next, so branch predictors guess the outcome to keep the pipeline full. And because programs contain many independent operations, superscalar processors and out-of-order execution exploit instruction-level parallelism to do multiple things at once.
Understanding CPU architecture matters for software engineers because the performance of your code depends on how well it interacts with these hardware mechanisms. Cache-friendly data access patterns, predictable branch behavior, and code that exposes parallelism all run dramatically faster than code that fights the hardware.
How It Works
The Classic 5-Stage Pipeline
The simplest pipeline divides instruction execution into five stages. Each stage takes one clock cycle, and the processor works on five different instructions simultaneously, one in each stage.
Fetch (IF): The CPU reads the next instruction from memory using the program counter (PC). The PC is incremented to point to the following instruction.
Decode (ID): The instruction is decoded to determine what operation to perform and which registers to read. Operand values are read from the register file.
Execute (EX): The ALU performs the computation (addition, comparison, address calculation, etc.).
Memory (MEM): If the instruction is a load or store, data is read from or written to memory. Other instructions pass through this stage without action.
Write-Back (WB): The result is written back to the destination register.
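The payoff of overlapping stages can be seen in a tiny timing model (a sketch assuming one instruction enters the pipeline per cycle and no hazards):

```python
# Minimal pipeline timing model: with k stages and no stalls,
# n instructions finish in k + (n - 1) cycles instead of n * k.

def pipelined_cycles(n: int, k: int = 5) -> int:
    """Cycles to run n instructions through a k-stage pipeline (no hazards)."""
    if n == 0:
        return 0
    # The first instruction takes k cycles; each later one finishes 1 cycle after its predecessor.
    return k + (n - 1)

def unpipelined_cycles(n: int, k: int = 5) -> int:
    """Cycles if each instruction must fully complete before the next starts."""
    return n * k

# 1000 instructions: 5000 cycles unpipelined vs 1004 pipelined (~5x speedup)
print(unpipelined_cycles(1000), pipelined_cycles(1000))
```

The speedup approaches the stage count k as n grows, which is exactly the ideal-speedup result derived in the Complexity Analysis section below.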
Pipeline Hazards and Solutions
Hazards disrupt the smooth flow of the pipeline. There are three types:
Data hazards occur when an instruction needs a value that a previous instruction has not yet produced. For example, if instruction 1 writes to register R1 and instruction 2 reads R1, instruction 2 cannot proceed until instruction 1 finishes. The solution is forwarding (bypassing): the hardware routes the result from the execute stage directly to the instruction that needs it, without waiting for write-back.
Control hazards occur at branch instructions. The CPU does not know which instruction to fetch next until the branch condition is evaluated. The solution is branch prediction: the CPU guesses the branch outcome and speculatively fetches instructions along the predicted path. If the guess is wrong, the speculative instructions are flushed and the correct path is fetched.
Structural hazards occur when two instructions need the same hardware unit. For example, if the pipeline has a single memory port, a fetch and a memory-stage load cannot happen simultaneously. The solution is to duplicate hardware resources (separate instruction and data caches, multiple ALUs).
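To make the cost of data hazards concrete, here is a small sketch (hypothetical instruction tuples; a simplified model in which a read-after-write dependency on the immediately preceding instruction costs 2 stall cycles without forwarding and 0 with it):

```python
# Count stall cycles for a sequence of (dest, sources) instructions in a
# classic 5-stage pipeline. Simplified model: an instruction that reads a
# register written by the instruction directly before it must wait 2 cycles,
# unless forwarding routes the EX-stage result straight to the next EX.

def stall_cycles(instrs, forwarding: bool) -> int:
    stalls = 0
    for prev, curr in zip(instrs, instrs[1:]):
        dest, _ = prev
        _, sources = curr
        if dest in sources:                 # read-after-write dependency
            stalls += 0 if forwarding else 2
    return stalls

# r1 = r2 + r3; r4 = r1 + r5; r6 = r4 + r7  -> two back-to-back RAW dependencies
prog = [("r1", {"r2", "r3"}), ("r4", {"r1", "r5"}), ("r6", {"r4", "r7"})]
print(stall_cycles(prog, forwarding=False))  # 4
print(stall_cycles(prog, forwarding=True))   # 0
```

Real pipelines also have hazards spanning more than one instruction (e.g. load-use delays that forwarding cannot fully hide), which this sketch omits.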
Branch Prediction
Modern CPUs use sophisticated branch predictors that achieve over 95% accuracy. The simplest predictor is a 1-bit predictor that remembers whether the branch was taken last time. A 2-bit saturating counter is more robust: it requires two consecutive mispredictions before changing its prediction. Modern processors use correlated predictors and tournament predictors that combine multiple prediction strategies.
When a branch is mispredicted, all speculatively executed instructions must be flushed from the pipeline. For a pipeline in which the branch resolves at stage $b$, a misprediction wastes roughly $b - 1$ cycles, so deeper pipelines pay a larger penalty. This is why predictable branch patterns (like loop conditions) perform much better than random branches.
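The 2-bit saturating counter described above is easy to simulate (a sketch mirroring the predictor pseudocode in the Implementation section; the initial state and outcome pattern are illustrative):

```python
# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
# Two consecutive mispredictions are needed to flip the prediction.

def run_predictor(outcomes, state: int = 2):
    """Count correct predictions for a sequence of outcomes (True = taken).
    Starts in 'weakly taken' (state 2) by default."""
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Saturating update toward the actual outcome
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# A loop branch: taken 9 times, then falls through once, repeated 10 times.
pattern = ([True] * 9 + [False]) * 10
print(run_predictor(pattern), "/", len(pattern))  # 90 / 100
```

The counter mispredicts only the loop-exit branch each time through (90% accuracy here); a 1-bit predictor would also mispredict the first iteration after every exit, costing two mispredictions per loop.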
Cache Hierarchy
The CPU cache bridges the speed gap between the processor and main memory. Modern processors use a multi-level cache hierarchy:
L1 Cache: Smallest (32-64 KB), fastest (1-2 cycles), split into separate instruction cache (L1i) and data cache (L1d). Located on the CPU core itself.
L2 Cache: Larger (256 KB - 1 MB), slower (4-10 cycles), unified (holds both instructions and data). Usually per-core.
L3 Cache: Largest (4-64 MB), slowest cache level (20-40 cycles), shared across all cores.
Main Memory (RAM): Much larger (GBs), much slower (100-300 cycles).
When the CPU needs data, it checks L1 first, then L2, then L3, then RAM. Each level is larger but slower. The principle of locality makes this effective: programs tend to access the same data repeatedly (temporal locality) and access data near recently accessed data (spatial locality).
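Spatial locality can be demonstrated with a toy direct-mapped cache (a sketch; the block and set sizes are illustrative, not those of any real CPU):

```python
# Toy direct-mapped cache: 64-byte blocks, 64 sets. Sequential access reuses
# each fetched block for several accesses (spatial locality); a page-sized
# stride maps every access to the same set with a different tag and misses.

BLOCK_SIZE = 64
NUM_SETS = 64

def hit_rate(addresses):
    tags = [None] * NUM_SETS              # one tag per set (direct-mapped)
    hits = 0
    for addr in addresses:
        block = addr // BLOCK_SIZE
        idx = block % NUM_SETS
        tag = block // NUM_SETS
        if tags[idx] == tag:
            hits += 1
        else:
            tags[idx] = tag               # miss: fill the line
    return hits / len(addresses)

sequential = [i * 8 for i in range(1024)]   # 8-byte elements, in order
strided = [i * 4096 for i in range(1024)]   # jump a page on every access
print(hit_rate(sequential))  # 0.875 (7 of every 8 accesses hit)
print(hit_rate(strided))     # 0.0
```

The same access count produces an 87.5% hit rate or a 0% hit rate purely depending on layout, which is why data-access patterns dominate memory performance.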
Instruction-Level Parallelism
Superscalar processors have multiple execution units and can issue several instructions per cycle. Out-of-order execution reorders instructions to fill execution units even when some instructions are stalled. The processor maintains the illusion of sequential execution by committing results in program order.
For example, if instruction A depends on a slow memory load but instructions B and C are independent, an out-of-order processor executes B and C while waiting for A's data. This keeps the execution units busy and improves throughput.
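A minimal sketch of this scheduling idea (hypothetical latencies; a real scheduler tracks register dependencies and has a limited number of execution units, both simplified away here):

```python
# Toy out-of-order schedule: each instruction has a latency and a set of
# instructions it depends on. An instruction may start once its dependencies
# have finished; independent work fills the gap left by the slow load.

def finish_times(instrs):
    """instrs: dict name -> (latency, set of dependency names).
    Returns dict name -> cycle at which the instruction finishes,
    assuming unlimited execution units."""
    done = {}
    while len(done) < len(instrs):
        for name, (latency, deps) in instrs.items():
            if name not in done and deps <= done.keys():
                start = max((done[d] for d in deps), default=0)
                done[name] = start + latency
    return done

program = {
    "A": (10, set()),   # slow memory load
    "B": (1, set()),    # independent: runs while A is in flight
    "C": (1, set()),    # independent
    "D": (1, {"A"}),    # must wait for the load's result
}
times = finish_times(program)
print(times)  # B and C finish at cycle 1; D finishes at cycle 11
```

In strict program order, B and C would sit behind A for 10 cycles; out of order, they complete immediately while the load is outstanding.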
Complexity Analysis
| Metric | Without Pipelining | With Pipelining |
|---|---|---|
| Instruction latency | $k$ cycles | $k$ cycles (unchanged) |
| Throughput (ideal) | $\frac{1}{k}$ instructions/cycle | $1$ instruction/cycle |
| Time for $n$ instructions | $n \times k$ cycles | $k + (n - 1)$ cycles |
| Speedup | baseline | $\frac{n \times k}{k + (n-1)} \approx k$ for large $n$ |
For a $k$-stage pipeline executing $n$ instructions, the ideal speedup approaches $k$ as $n$ grows large:

$$\lim_{n \to \infty} \frac{n \times k}{k + (n - 1)} = k$$

Branch misprediction penalty for a $k$-stage pipeline where the branch resolves at stage $b$:

$$\text{penalty} \approx b - 1 \text{ cycles}$$

Effective CPI (Cycles Per Instruction) with branch mispredictions:

$$\text{CPI}_{\text{effective}} = 1 + f_b \times m \times p$$

Where $f_b$ is the fraction of branch instructions, $m$ is the misprediction rate, and $p$ is the misprediction penalty in cycles.

Cache performance impact on effective memory access time:

$$t_{\text{avg}} = h_1 t_1 + (1 - h_1)\big(h_2 t_2 + (1 - h_2)\big(h_3 t_3 + (1 - h_3)\, t_{\text{RAM}}\big)\big)$$

Where $h_i$ is the hit rate and $t_i$ is the access time for cache level $i$.
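Plugging representative numbers into these formulas (illustrative values, not measurements from any specific CPU):

```python
# Effective CPI with branch mispredictions: CPI = 1 + f_b * m * p
f_b = 0.20   # 20% of instructions are branches
m = 0.05     # 5% misprediction rate
p = 15       # 15-cycle misprediction penalty (deep pipeline)
cpi = 1 + f_b * m * p
print(cpi)   # 1.15 -- a 15% slowdown from mispredictions alone

# Effective memory access time across a 3-level cache hierarchy
h1, t1 = 0.95, 2      # L1: 95% hit rate, 2-cycle access
h2, t2 = 0.80, 8      # L2 (hit rate among L1 misses)
h3, t3 = 0.70, 30     # L3 (hit rate among L2 misses)
t_ram = 200           # main memory latency in cycles
t_avg = h1 * t1 + (1 - h1) * (h2 * t2 + (1 - h2) * (h3 * t3 + (1 - h3) * t_ram))
print(round(t_avg, 2))  # 3.03 cycles on average, despite 200-cycle RAM
```

The second result shows why caches work: with high hit rates, the average access cost stays close to the L1 latency even though a RAM access is two orders of magnitude slower.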
Implementation
ALGORITHM SimulatePipeline(instructions)
INPUT: instructions: list of instruction objects
OUTPUT: total cycles to execute all instructions
BEGIN
k <- 5 // number of pipeline stages
stages <- array of k empty slots
cycle <- 0
completed <- 0
pc <- 0 // program counter (index into instructions)
WHILE completed < LENGTH(instructions) DO
cycle <- cycle + 1
// Write-Back stage: commit result
IF stages[4] is not empty THEN
WRITE_REGISTER(stages[4].dest, stages[4].result)
completed <- completed + 1
END IF
// Shift instructions through pipeline (from back to front)
stages[4] <- stages[3]
stages[3] <- stages[2]
stages[2] <- stages[1]
stages[1] <- stages[0]
// Fetch stage: load next instruction
IF pc < LENGTH(instructions) THEN
stages[0] <- instructions[pc]
pc <- pc + 1
ELSE
stages[0] <- empty
END IF
// Check for data hazard (simplified)
IF stages[1] is not empty AND stages[2] is not empty THEN
IF stages[1].source OVERLAPS stages[2].dest THEN
IF FORWARDING_POSSIBLE(stages[2], stages[1]) THEN
FORWARD(stages[2].result, stages[1])
ELSE
INSERT_BUBBLE(stages, 1) // stall: let a no-op advance instead of the dependent instruction
pc <- pc - 1             // back up so the displaced fetch is re-issued next cycle
END IF
END IF
END IF
END WHILE
RETURN cycle
END
ALGORITHM TwoBitBranchPredictor(branchHistory, branchAddress)
INPUT: branchHistory: map of address to 2-bit counter, branchAddress: address of branch
OUTPUT: prediction (TAKEN or NOT_TAKEN)
BEGIN
IF branchAddress NOT IN branchHistory THEN
branchHistory[branchAddress] <- 2 // weakly taken (states: 0,1,2,3)
END IF
counter <- branchHistory[branchAddress]
IF counter >= 2 THEN
RETURN TAKEN
ELSE
RETURN NOT_TAKEN
END IF
END
ALGORITHM UpdateBranchPredictor(branchHistory, branchAddress, actualOutcome)
INPUT: branchHistory: map, branchAddress: address, actualOutcome: TAKEN or NOT_TAKEN
OUTPUT: updated branchHistory
BEGIN
counter <- branchHistory[branchAddress]
IF actualOutcome = TAKEN THEN
counter <- MIN(counter + 1, 3)
ELSE
counter <- MAX(counter - 1, 0)
END IF
branchHistory[branchAddress] <- counter
END
ALGORITHM CacheLookup(address, cache, nextLevel)
INPUT: address: memory address, cache: cache structure, nextLevel: next cache level or RAM
OUTPUT: data at address
BEGIN
setIndex <- (address / BLOCK_SIZE) MOD NUM_SETS
tag <- address / (BLOCK_SIZE * NUM_SETS)
set <- cache.sets[setIndex]
// Search for tag in the set (associative lookup)
FOR EACH line IN set.lines DO
IF line.valid AND line.tag = tag THEN
// Cache hit
UPDATE_LRU(set, line)
RETURN line.data[address MOD BLOCK_SIZE]
END IF
END FOR
// Cache miss: fetch from next level
block <- FETCH(nextLevel, ALIGN(address, BLOCK_SIZE))
// Evict LRU line and insert new block
victim <- GET_LRU_LINE(set)
IF victim.dirty THEN
WRITE_BACK(nextLevel, victim)
END IF
victim.tag <- tag
victim.data <- block
victim.valid <- true
victim.dirty <- false
UPDATE_LRU(set, victim)
RETURN block[address MOD BLOCK_SIZE]
END
ALGORITHM OutOfOrderExecute(instructions, executionUnits)
INPUT: instructions: list of decoded instructions, executionUnits: available ALUs/units
OUTPUT: committed results in program order
BEGIN
issueQueue <- empty priority queue
reorderBuffer <- empty circular buffer
commitPointer <- 0
FOR EACH instr IN instructions DO
entry <- CREATE_ROB_ENTRY(instr)
APPEND entry TO reorderBuffer
IF ALL_OPERANDS_READY(instr) THEN
ADD instr TO issueQueue
ELSE
MARK_WAITING(instr)
END IF
END FOR
WHILE reorderBuffer is not fully committed DO
// Issue ready instructions to available units
FOR EACH unit IN executionUnits DO
IF unit is idle AND issueQueue is not empty THEN
instr <- DEQUEUE(issueQueue)
DISPATCH(instr, unit)
END IF
END FOR
// Complete execution, broadcast results
FOR EACH unit IN executionUnits DO
IF unit.finished THEN
result <- unit.result
MARK_COMPLETE(reorderBuffer, result.instrID, result.value)
// Wake up dependent instructions
FOR EACH waiting IN reorderBuffer DO
IF DEPENDS_ON(waiting, result.instrID) THEN
FORWARD_OPERAND(waiting, result.value)
IF ALL_OPERANDS_READY(waiting) THEN
ADD waiting TO issueQueue
END IF
END IF
END FOR
END IF
END FOR
// Commit in program order
WHILE commitPointer < LENGTH(reorderBuffer) AND reorderBuffer[commitPointer].complete DO
COMMIT(reorderBuffer[commitPointer])
commitPointer <- commitPointer + 1
END WHILE
END WHILE
END
Real-World Applications
- Compiler optimizations: compilers reorder instructions, unroll loops, and schedule operations to maximize pipeline utilization and minimize stalls; understanding the pipeline helps developers write code that compilers can optimize effectively
- Database query engines: high-performance databases like ClickHouse and DuckDB design their data processing to be cache-friendly, processing columns in blocks that fit in L1/L2 cache rather than row-at-a-time
- Game engines: game developers structure entity data using data-oriented design (arrays of components rather than arrays of objects) to exploit spatial locality and keep the cache hot during frame updates
- High-frequency trading: trading systems minimize branch mispredictions by using branchless code patterns and ensure critical data structures fit in L1 cache to achieve sub-microsecond latencies
- Operating system schedulers: OS kernels are designed to minimize cache pollution during context switches, and modern CPUs tag cache lines with process IDs to avoid flushing caches on every switch
- Cryptography: constant-time cryptographic implementations avoid data-dependent branches to prevent timing side-channel attacks that exploit branch prediction behavior
- Machine learning inference: ML frameworks like TensorFlow and PyTorch optimize matrix operations to match cache line sizes and exploit SIMD (Single Instruction, Multiple Data) execution units in modern CPUs
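The branchless patterns mentioned under high-frequency trading replace data-dependent branches with arithmetic that the predictor never has to guess. A sketch of the idea (in Python for consistency with the other examples; production branchless code is written in C/C++ or assembly, where the compiler emits conditional moves instead of jumps):

```python
# Branchless clamp: select among values with bool-to-int arithmetic instead
# of an if/else chain the branch predictor could get wrong on random data.

def clamp_branchy(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_branchless(x, lo, hi):
    # Comparisons yield 0 or 1; multiplying by them selects exactly one term.
    below, above = x < lo, x > hi
    return lo * below + hi * above + x * (not below and not above)

print(clamp_branchless(-3, 0, 10), clamp_branchless(5, 0, 10), clamp_branchless(15, 0, 10))
# 0 5 10
```

Both functions compute the same result; the branchless form trades two cheap multiplies for the risk of a pipeline flush on an unpredictable comparison.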
Key Takeaways
- Pipelining overlaps instruction execution across stages, achieving up to $k\times$ speedup for a $k$-stage pipeline by completing one instruction per cycle in the ideal case
- Pipeline hazards (data, control, structural) reduce throughput; forwarding, branch prediction, and resource duplication are the primary solutions
- Branch prediction accuracy exceeds 95% on modern CPUs; mispredictions flush the pipeline and waste cycles proportional to the pipeline depth
- The cache hierarchy (L1, L2, L3) bridges the speed gap between the CPU and main memory; programs that exploit temporal and spatial locality run dramatically faster
- Superscalar and out-of-order execution exploit instruction-level parallelism to issue multiple independent instructions per cycle, further increasing throughput
- Software performance depends heavily on hardware interaction: cache-friendly access patterns, predictable branches, and exposing parallelism are key to writing fast code