
CPU Architecture: Pipelines, Caches, and Branch Prediction

How modern processors execute billions of instructions per second using pipelining, cache hierarchies, branch prediction, and instruction-level parallelism.

2021-03-05
Computer Engineering · cpu · architecture

Terminology

  • CPU (Central Processing Unit): the primary processor in a computer that executes instructions from a program by performing arithmetic, logic, control, and input/output operations
  • Instruction: a single operation that the CPU can execute, such as adding two numbers, loading data from memory, or jumping to a different part of the program
  • Pipeline: a technique where multiple instructions are overlapped in execution by dividing the processor into stages, each handling a different phase of an instruction simultaneously
  • Pipeline stage: one step in the pipeline sequence; classic stages include fetch, decode, execute, memory access, and write-back
  • Clock cycle: the basic unit of time for the CPU, determined by the clock frequency; one tick of the processor's internal clock
  • Throughput: the number of instructions completed per unit of time, often measured in instructions per cycle (IPC)
  • Latency: the total time from when an instruction enters the pipeline to when it completes
  • Hazard: a situation that prevents the next instruction from executing in the next clock cycle, causing a pipeline stall
  • Data hazard: occurs when an instruction depends on the result of a previous instruction that has not yet completed
  • Control hazard: occurs when the pipeline must make a decision based on the result of a branch instruction that has not yet resolved
  • Structural hazard: occurs when two instructions need the same hardware resource at the same time
  • Branch prediction: a technique where the CPU guesses the outcome of a conditional branch before it is resolved, allowing the pipeline to continue fetching instructions speculatively
  • Cache: a small, fast memory located close to the CPU that stores copies of frequently accessed data to reduce the time needed to access main memory
  • Cache hit: when the requested data is found in the cache
  • Cache miss: when the requested data is not in the cache and must be fetched from a slower level of memory
  • ILP (Instruction-Level Parallelism): the ability to execute multiple instructions simultaneously by exploiting independence between instructions
  • Superscalar: a CPU design that can issue and execute more than one instruction per clock cycle using multiple execution units
  • Out-of-order execution: a technique where the CPU executes instructions in an order determined by data dependencies rather than the original program order, to maximize utilization of execution units
  • Register: a very small, very fast storage location inside the CPU used to hold data currently being processed
  • ALU (Arithmetic Logic Unit): the component of the CPU that performs arithmetic and logical operations
  • Forwarding (bypassing): a hardware optimization that routes the result of one pipeline stage directly to an earlier stage that needs it, avoiding a stall
  • Speculative execution: executing instructions before it is certain they are needed, based on predictions; results are discarded if the prediction was wrong

What & Why

A modern CPU executes billions of instructions per second, yet each individual instruction takes multiple clock cycles to complete. The key insight behind modern processor design is that you do not need to finish one instruction before starting the next. By breaking instruction execution into stages and overlapping them, a processor can start a new instruction every cycle even though each instruction takes several cycles to finish. This is pipelining, and it is the foundation of high-performance CPU design.

But pipelining alone is not enough. Memory is slow compared to the CPU, so caches store frequently used data close to the processor. Branch instructions create uncertainty about which instruction comes next, so branch predictors guess the outcome to keep the pipeline full. And because programs contain many independent operations, superscalar processors and out-of-order execution exploit instruction-level parallelism to do multiple things at once.

Understanding CPU architecture matters for software engineers because the performance of your code depends on how well it interacts with these hardware mechanisms. Cache-friendly data access patterns, predictable branch behavior, and code that exposes parallelism all run dramatically faster than code that fights the hardware.
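To make the cache-friendliness point concrete, here is a minimal illustrative sketch (not from the article) contrasting row-major and column-major traversal of a flat array. In a compiled language or NumPy, the row-major version is typically dramatically faster because it consumes each cache line fully before moving on; CPython's interpreter overhead masks much of the effect, so treat this as a pattern illustration rather than a benchmark.

```python
def sum_row_major(a, n):
    # Consecutive addresses: each cache line is fetched once, then fully used
    total = 0
    for i in range(n):
        for j in range(n):
            total += a[i * n + j]
    return total

def sum_column_major(a, n):
    # Stride-n accesses: for large n, nearly every access touches a new cache line
    total = 0
    for j in range(n):
        for i in range(n):
            total += a[i * n + j]
    return total

n = 256
a = list(range(n * n))
assert sum_row_major(a, n) == sum_column_major(a, n) == sum(a)
```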

How It Works

The Classic 5-Stage Pipeline

The simplest pipeline divides instruction execution into five stages. Each stage takes one clock cycle, and the processor works on five different instructions simultaneously, one in each stage.

[Pipeline diagram: Fetch (IF) → Decode (ID) → Execute (EX) → Memory (MEM) → Write-Back (WB). Each stage processes a different instruction every cycle; ideal throughput is 1 instruction completed per cycle.]

Fetch (IF): The CPU reads the next instruction from memory using the program counter (PC). The PC is incremented to point to the following instruction.

Decode (ID): The instruction is decoded to determine what operation to perform and which registers to read. Operand values are read from the register file.

Execute (EX): The ALU performs the computation (addition, comparison, address calculation, etc.).

Memory (MEM): If the instruction is a load or store, data is read from or written to memory. Other instructions pass through this stage without action.

Write-Back (WB): The result is written back to the destination register.

Pipeline Hazards and Solutions

Hazards disrupt the smooth flow of the pipeline. There are three types:

Data hazards occur when an instruction needs a value that a previous instruction has not yet produced. For example, if instruction 1 writes to register R1 and instruction 2 reads R1, instruction 2 cannot proceed until instruction 1 finishes. The solution is forwarding (bypassing): the hardware routes the result from the execute stage directly to the instruction that needs it, without waiting for write-back.
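The cost of data hazards can be sketched with a first-order stall-count model. The numbers below are illustrative assumptions for a classic MIPS-style 5-stage pipeline (split-cycle register file, EX→EX forwarding for ALU results), ignoring loads and interactions between stalls:

```python
def stall_cycles(instrs, forwarding):
    """First-order stall estimate for a classic 5-stage pipeline.

    instrs: list of (dest, srcs) register tuples. Without forwarding, assume
    a value written in WB can be read in ID the same cycle, so a consumer
    1 instruction behind the producer stalls 2 cycles and a consumer 2 behind
    stalls 1. With EX->EX forwarding, ALU results incur no stall.
    """
    stalls = 0
    for i, (dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + 3, len(instrs))):
            if dest in instrs[j][1]:
                stalls += 0 if forwarding else max(0, 3 - (j - i))
                break
    return stalls

# r2 depends on r1, r3 depends on r2: two back-to-back dependencies
program = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",))]
assert stall_cycles(program, forwarding=False) == 4
assert stall_cycles(program, forwarding=True) == 0
```

Forwarding eliminates both stalls here, which is why virtually every real pipeline implements it.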

Control hazards occur at branch instructions. The CPU does not know which instruction to fetch next until the branch condition is evaluated. The solution is branch prediction: the CPU guesses the branch outcome and speculatively fetches instructions along the predicted path. If the guess is wrong, the speculative instructions are flushed and the correct path is fetched.

Structural hazards occur when two instructions need the same hardware unit. For example, if the pipeline has a single memory port, a fetch and a memory-stage load cannot happen simultaneously. The solution is to duplicate hardware resources (separate instruction and data caches, multiple ALUs).

Branch Prediction

Modern CPUs use sophisticated branch predictors that achieve over 95% accuracy. The simplest predictor is a 1-bit predictor that remembers whether the branch was taken last time. A 2-bit saturating counter is more robust: it requires two consecutive mispredictions before changing its prediction. Modern processors use correlated predictors and tournament predictors that combine multiple prediction strategies.

When a branch is mispredicted, all speculatively executed instructions must be flushed from the pipeline. For a pipeline with k stages, a misprediction wastes approximately k cycles. This is why predictable branch patterns (like loop conditions) perform much better than random branches.

Cache Hierarchy

The CPU cache bridges the speed gap between the processor and main memory. Modern processors use a multi-level cache hierarchy:

L1 Cache: Smallest (32-64 KB), fastest (1-2 cycles), split into separate instruction cache (L1i) and data cache (L1d). Located on the CPU core itself.

L2 Cache: Larger (256 KB - 1 MB), slower (4-10 cycles), unified (holds both instructions and data). Usually per-core.

L3 Cache: Largest (4-64 MB), slowest cache level (20-40 cycles), shared across all cores.

Main Memory (RAM): Much larger (GBs), much slower (100-300 cycles).

When the CPU needs data, it checks L1 first, then L2, then L3, then RAM. Each level is larger but slower. The principle of locality makes this effective: programs tend to access the same data repeatedly (temporal locality) and access data near recently accessed data (spatial locality).

Instruction-Level Parallelism

Superscalar processors have multiple execution units and can issue several instructions per cycle. Out-of-order execution reorders instructions to fill execution units even when some instructions are stalled. The processor maintains the illusion of sequential execution by committing results in program order.

For example, if instruction A depends on a slow memory load but instructions B and C are independent, an out-of-order processor executes B and C while waiting for A's data. This keeps the execution units busy and improves throughput.
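This scenario can be sketched with a toy single-issue scheduler (an illustrative model, not how real hardware is organized): each instruction has a name, a latency, and a set of dependencies, and the out-of-order variant may issue any ready instruction while the in-order variant may only issue the oldest one.

```python
def run(instrs, out_of_order):
    """instrs: list of (name, latency, deps). At most one instruction issues
    per cycle; a result becomes ready at the cycle recorded in done_at."""
    done_at = {}
    pending = list(instrs)
    cycle = 0
    while pending:
        cycle += 1
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            name, latency, deps = ins
            if all(done_at.get(d, float("inf")) <= cycle for d in deps):
                done_at[name] = cycle + latency  # result ready after latency
                pending.remove(ins)
                break                            # single issue per cycle
    return max(done_at.values())

# A is a slow load; D depends on A; B and C are independent
program = [("A", 3, ()), ("D", 1, ("A",)), ("B", 1, ()), ("C", 1, ())]
assert run(program, out_of_order=False) == 7  # stalls behind the load
assert run(program, out_of_order=True) == 5   # B and C fill the stall cycles
```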

Complexity Analysis

| Metric | Without Pipelining | With Pipelining |
| --- | --- | --- |
| Instruction latency | $k$ cycles | $k$ cycles (unchanged) |
| Throughput (ideal) | $\frac{1}{k}$ instructions/cycle | $1$ instruction/cycle |
| Time for $n$ instructions | $n \times k$ cycles | $k + (n - 1)$ cycles |
| Speedup | $1$ (baseline) | $\frac{n \times k}{k + (n-1)} \approx k$ for large $n$ |

For a $k$-stage pipeline executing $n$ instructions, the ideal speedup approaches $k$ as $n$ grows large:

$\text{Speedup} = \frac{n \times k}{k + (n - 1)}$

As $n \to \infty$:

$\lim_{n \to \infty} \frac{n \times k}{k + n - 1} = k$
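Plugging in concrete numbers makes the limit tangible; this small sketch evaluates the speedup formula for a 5-stage pipeline:

```python
def pipeline_speedup(k, n):
    """Ideal speedup of a k-stage pipeline over unpipelined execution
    for n instructions: (n * k) / (k + n - 1)."""
    return (n * k) / (k + n - 1)

assert pipeline_speedup(5, 1) == 1.0   # a single instruction gains nothing
assert pipeline_speedup(5, 10) > 3.5   # 50 / 14 ≈ 3.57
assert pipeline_speedup(5, 1000) > 4.9 # approaches k = 5 for large n
```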

Branch misprediction penalty for a $k$-stage pipeline where the branch resolves at stage $b$:

$\text{Misprediction penalty} = b - 1 \text{ cycles}$

Effective CPI (Cycles Per Instruction) with branch mispredictions:

$\text{CPI}_{\text{eff}} = \text{CPI}_{\text{ideal}} + f_b \times m \times p$

Where $f_b$ is the fraction of branch instructions, $m$ is the misprediction rate, and $p$ is the misprediction penalty in cycles.
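As a worked example with assumed (not article-sourced) numbers: if 20% of instructions are branches, 5% of them mispredict, and the penalty is 15 cycles, the effective CPI rises from 1.0 to 1.15, a 15% slowdown from branches alone.

```python
def effective_cpi(cpi_ideal, branch_fraction, mispredict_rate, penalty):
    """CPI_eff = CPI_ideal + f_b * m * p."""
    return cpi_ideal + branch_fraction * mispredict_rate * penalty

cpi = effective_cpi(cpi_ideal=1.0, branch_fraction=0.20,
                    mispredict_rate=0.05, penalty=15)
assert abs(cpi - 1.15) < 1e-9
```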

Cache performance impact on effective memory access time:

$T_{\text{eff}} = h_1 \times t_1 + (1 - h_1)(h_2 \times t_2 + (1 - h_2)(h_3 \times t_3 + (1 - h_3) \times t_{\text{mem}}))$

Where $h_i$ is the hit rate and $t_i$ is the access time for cache level $i$.
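The nested formula unrolls naturally by working from the innermost level outward. The hit rates and latencies below are illustrative assumptions in line with the ranges given earlier:

```python
def effective_access_time(levels, t_mem):
    """levels: list of (hit_rate, access_time) pairs for L1, L2, L3.
    Evaluates the nested expression from the innermost level outward."""
    t = t_mem
    for hit_rate, access_time in reversed(levels):
        t = hit_rate * access_time + (1 - hit_rate) * t
    return t

# Assumed: L1 95% @ 2 cycles, L2 90% @ 10, L3 80% @ 30, RAM @ 200
t_eff = effective_access_time([(0.95, 2), (0.90, 10), (0.80, 30)], t_mem=200)
assert abs(t_eff - 2.67) < 1e-9  # average access costs ~2.7 cycles, not 200
```

Even modest hit rates at each level collapse a 200-cycle memory latency into a few cycles on average, which is the whole point of the hierarchy.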

Implementation

ALGORITHM SimulatePipeline(instructions)
INPUT: instructions: list of instruction objects
OUTPUT: total cycles to execute all instructions
BEGIN
  k <- 5  // number of pipeline stages
  stages <- array of k empty slots
  cycle <- 0
  completed <- 0
  pc <- 0  // program counter (index into instructions)

  WHILE completed < LENGTH(instructions) DO
    cycle <- cycle + 1

    // Check for data hazard between ID (stages[1]) and EX (stages[2])
    stall <- false
    IF stages[1] is not empty AND stages[2] is not empty THEN
      IF stages[1].source OVERLAPS stages[2].dest THEN
        IF FORWARDING_POSSIBLE(stages[2], stages[1]) THEN
          FORWARD(stages[2].result, stages[1])
        ELSE
          stall <- true
        END IF
      END IF
    END IF

    // Advance the pipeline (from back to front)
    stages[4] <- stages[3]
    stages[3] <- stages[2]
    IF stall THEN
      stages[2] <- empty  // insert a bubble; IF and ID hold their instructions
    ELSE
      stages[2] <- stages[1]
      stages[1] <- stages[0]

      // Fetch stage: load next instruction
      IF pc < LENGTH(instructions) THEN
        stages[0] <- instructions[pc]
        pc <- pc + 1
      ELSE
        stages[0] <- empty
      END IF
    END IF

    // Write-Back stage: commit the instruction leaving the pipeline
    IF stages[4] is not empty THEN
      WRITE_REGISTER(stages[4].dest, stages[4].result)
      completed <- completed + 1
    END IF
  END WHILE

  RETURN cycle
END
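A runnable counterpart of the simulator, restricted to the hazard-free case, confirms the $k + (n - 1)$ cycle count from the complexity analysis (a sketch for counting cycles only; registers and hazards are omitted):

```python
def simulate_pipeline(instructions, k=5):
    """Cycle count for an ideal k-stage pipeline with no hazards.
    Returns k + n - 1 for n instructions."""
    stages = [None] * k  # stages[0] = fetch ... stages[k-1] = write-back
    cycle = 0
    completed = 0
    pc = 0
    while completed < len(instructions):
        cycle += 1
        # Advance the pipeline from back to front
        for s in range(k - 1, 1 - 1, -1):
            stages[s] = stages[s - 1]
        # Fetch the next instruction, if any remain
        if pc < len(instructions):
            stages[0] = instructions[pc]
            pc += 1
        else:
            stages[0] = None
        # Write-back: the instruction in the last stage completes this cycle
        if stages[k - 1] is not None:
            completed += 1
    return cycle

assert simulate_pipeline([0]) == 5             # one instruction: k cycles
assert simulate_pipeline(list(range(10))) == 14   # k + n - 1 = 5 + 9
assert simulate_pipeline(list(range(100))) == 104
```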

ALGORITHM TwoBitBranchPredictor(branchHistory, branchAddress)
INPUT: branchHistory: map of address to 2-bit counter, branchAddress: address of branch
OUTPUT: prediction (TAKEN or NOT_TAKEN)
BEGIN
  IF branchAddress NOT IN branchHistory THEN
    branchHistory[branchAddress] <- 2  // weakly taken (states: 0,1,2,3)
  END IF

  counter <- branchHistory[branchAddress]

  IF counter >= 2 THEN
    RETURN TAKEN
  ELSE
    RETURN NOT_TAKEN
  END IF
END

ALGORITHM UpdateBranchPredictor(branchHistory, branchAddress, actualOutcome)
INPUT: branchHistory: map, branchAddress: address, actualOutcome: TAKEN or NOT_TAKEN
OUTPUT: updated branchHistory
BEGIN
  counter <- branchHistory[branchAddress]

  IF actualOutcome = TAKEN THEN
    counter <- MIN(counter + 1, 3)
  ELSE
    counter <- MAX(counter - 1, 0)
  END IF

  branchHistory[branchAddress] <- counter
END
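The two algorithms above translate directly into a small Python class. Running it on a loop-like pattern (taken three times, then not taken, repeated) shows the characteristic behavior: only the loop exit mispredicts, giving 75% accuracy on this pattern.

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch address.
    States 0-1 predict not-taken, 2-3 predict taken; starts weakly taken (2)."""

    def __init__(self):
        self.counters = {}

    def predict(self, addr):
        return self.counters.get(addr, 2) >= 2  # True means TAKEN

    def update(self, addr, taken):
        c = self.counters.get(addr, 2)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# Loop branch: taken 3 times, then falls through, repeated 100 times
pred = TwoBitPredictor()
outcomes = [True, True, True, False] * 100
correct = sum(
    pred.predict(0x40) == taken or pred.update(0x40, taken) or False
    for taken in outcomes
    if (lambda ok: (pred.update(0x40, taken), ok)[1])(pred.predict(0x40) == taken)
    is not None
)
```

The one-liner above is too clever; written plainly:

```python
pred = TwoBitPredictor()
outcomes = [True, True, True, False] * 100
correct = 0
for taken in outcomes:
    if pred.predict(0x40) == taken:
        correct += 1
    pred.update(0x40, taken)
assert correct / len(outcomes) == 0.75  # only the loop exit mispredicts
```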

ALGORITHM CacheLookup(address, cache, nextLevel)
INPUT: address: memory address, cache: cache structure, nextLevel: next cache level or RAM
OUTPUT: data at address
BEGIN
  setIndex <- (address / BLOCK_SIZE) MOD NUM_SETS
  tag <- address / (BLOCK_SIZE * NUM_SETS)
  set <- cache.sets[setIndex]

  // Search for tag in the set (associative lookup)
  FOR EACH line IN set.lines DO
    IF line.valid AND line.tag = tag THEN
      // Cache hit
      UPDATE_LRU(set, line)
      RETURN line.data[address MOD BLOCK_SIZE]
    END IF
  END FOR

  // Cache miss: fetch from next level
  block <- FETCH(nextLevel, ALIGN(address, BLOCK_SIZE))

  // Evict LRU line and insert new block
  victim <- GET_LRU_LINE(set)
  IF victim.dirty THEN
    WRITE_BACK(nextLevel, victim)
  END IF
  victim.tag <- tag
  victim.data <- block
  victim.valid <- true
  victim.dirty <- false
  UPDATE_LRU(set, victim)

  RETURN block[address MOD BLOCK_SIZE]
END
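The set-associative lookup with LRU replacement can be sketched compactly in Python using an `OrderedDict` per set to track recency (a simplified model that counts hits and misses but stores no data; parameters are illustrative, not from the article):

```python
from collections import OrderedDict

class Cache:
    def __init__(self, block_size=16, num_sets=4, ways=2):
        self.block_size = block_size
        self.num_sets = num_sets
        self.ways = ways
        # Each set maps tag -> None, ordered least- to most-recently used
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        set_index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[set_index]
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)        # mark as most recently used
        else:
            self.misses += 1
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the LRU line
            lines[tag] = None              # install the fetched block

cache = Cache()
for addr in range(0, 64, 4):   # sequential pass: 1 miss + 3 hits per block
    cache.access(addr)
assert (cache.misses, cache.hits) == (4, 12)
for addr in range(0, 64, 4):   # second pass: everything is resident
    cache.access(addr)
assert (cache.misses, cache.hits) == (4, 28)
```

The sequential pass illustrates spatial locality (one miss fetches a block that serves the next three accesses); the second pass illustrates temporal locality.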

ALGORITHM OutOfOrderExecute(instructions, executionUnits)
INPUT: instructions: list of decoded instructions, executionUnits: available ALUs/units
OUTPUT: committed results in program order
BEGIN
  issueQueue <- empty priority queue
  reorderBuffer <- empty circular buffer
  commitPointer <- 0

  FOR EACH instr IN instructions DO
    entry <- CREATE_ROB_ENTRY(instr)
    APPEND entry TO reorderBuffer

    IF ALL_OPERANDS_READY(instr) THEN
      ADD instr TO issueQueue
    ELSE
      MARK_WAITING(instr)
    END IF
  END FOR

  WHILE reorderBuffer is not fully committed DO
    // Issue ready instructions to available units
    FOR EACH unit IN executionUnits DO
      IF unit is idle AND issueQueue is not empty THEN
        instr <- DEQUEUE(issueQueue)
        DISPATCH(instr, unit)
      END IF
    END FOR

    // Complete execution, broadcast results
    FOR EACH unit IN executionUnits DO
      IF unit.finished THEN
        result <- unit.result
        MARK_COMPLETE(reorderBuffer, result.instrID, result.value)
        // Wake up dependent instructions
        FOR EACH waiting IN reorderBuffer DO
          IF DEPENDS_ON(waiting, result.instrID) THEN
            FORWARD_OPERAND(waiting, result.value)
            IF ALL_OPERANDS_READY(waiting) THEN
              ADD waiting TO issueQueue
            END IF
          END IF
        END FOR
      END IF
    END FOR

    // Commit in program order
    WHILE reorderBuffer[commitPointer].complete DO
      COMMIT(reorderBuffer[commitPointer])
      commitPointer <- commitPointer + 1
    END WHILE
  END WHILE
END

Real-World Applications

  • Compiler optimizations: compilers reorder instructions, unroll loops, and schedule operations to maximize pipeline utilization and minimize stalls; understanding the pipeline helps developers write code that compilers can optimize effectively
  • Database query engines: high-performance databases like ClickHouse and DuckDB design their data processing to be cache-friendly, processing columns in blocks that fit in L1/L2 cache rather than row-at-a-time
  • Game engines: game developers structure entity data using data-oriented design (arrays of components rather than arrays of objects) to exploit spatial locality and keep the cache hot during frame updates
  • High-frequency trading: trading systems minimize branch mispredictions by using branchless code patterns and ensure critical data structures fit in L1 cache to achieve sub-microsecond latencies
  • Operating system schedulers: OS kernels are designed to minimize cache pollution during context switches, and modern CPUs tag TLB entries with address-space identifiers (e.g., PCIDs on x86) so translations do not have to be flushed on every switch
  • Cryptography: constant-time cryptographic implementations avoid data-dependent branches to prevent timing side-channel attacks that exploit branch prediction behavior
  • Machine learning inference: ML frameworks like TensorFlow and PyTorch optimize matrix operations to match cache line sizes and exploit SIMD (Single Instruction, Multiple Data) execution units in modern CPUs

Key Takeaways

  • Pipelining overlaps instruction execution across stages, achieving up to $k\times$ speedup for a $k$-stage pipeline by completing one instruction per cycle in the ideal case
  • Pipeline hazards (data, control, structural) reduce throughput; forwarding, branch prediction, and resource duplication are the primary solutions
  • Branch prediction accuracy exceeds 95% on modern CPUs; mispredictions flush the pipeline and waste cycles proportional to the pipeline depth
  • The cache hierarchy (L1, L2, L3) bridges the speed gap between the CPU and main memory; programs that exploit temporal and spatial locality run dramatically faster
  • Superscalar and out-of-order execution exploit instruction-level parallelism to issue multiple independent instructions per cycle, further increasing throughput
  • Software performance depends heavily on hardware interaction: cache-friendly access patterns, predictable branches, and exposing parallelism are key to writing fast code