# GPU Architecture: Streaming Multiprocessors, Warps, and the SIMT Model
How GPUs achieve massive parallelism through thousands of lightweight cores organized into streaming multiprocessors, executing warps of threads in lockstep under the SIMT execution model.
## Terminology
| Term | Definition |
|---|---|
| GPU (Graphics Processing Unit) | A processor designed for massively parallel computation, containing thousands of small cores optimized for executing the same operation across many data elements simultaneously |
| Streaming Multiprocessor (SM) | The fundamental processing block on a GPU, containing a set of CUDA cores, warp schedulers, register files, and shared memory that together execute groups of threads |
| CUDA core | A single arithmetic execution unit within an SM capable of performing one floating-point or integer operation per clock cycle |
| Warp | A group of 32 threads that execute the same instruction simultaneously on an SM; the basic scheduling unit of GPU execution |
| SIMT (Single Instruction, Multiple Threads) | The GPU execution model where a single instruction is broadcast to all threads in a warp, each operating on different data; similar to SIMD but with per-thread branching capability |
| Warp divergence | A performance penalty that occurs when threads within the same warp take different execution paths at a branch, forcing the hardware to serialize the divergent paths |
| Occupancy | The ratio of active warps on an SM to the maximum number of warps the SM can support; higher occupancy helps hide memory latency |
| Global memory | The largest but slowest memory on the GPU (several GB of DRAM), accessible by all threads across all SMs, with latency of 400-800 cycles |
| Shared memory | A fast, programmer-managed on-chip memory (typically 48-164 KB per SM) shared among all threads in a thread block, with latency comparable to L1 cache |
| Latency hiding | The GPU technique of switching to another warp's instructions while the current warp waits for a long-latency operation (like a memory fetch) to complete, keeping execution units busy |
## What & Why
A CPU is designed to minimize the latency of a single thread of execution. It dedicates most of its transistor budget to large caches, branch predictors, and out-of-order execution logic so that one thread runs as fast as possible. A GPU takes the opposite approach: it maximizes the throughput of thousands of threads running in parallel, trading single-thread performance for massive aggregate bandwidth.
This design makes GPUs ideal for workloads where the same operation must be applied to millions of data elements independently. Matrix multiplication, image processing, physics simulations, and neural network training all fit this pattern. A modern GPU can have over 10,000 CUDA cores organized into dozens of streaming multiprocessors, achieving teraflops of compute throughput.
Understanding GPU architecture matters because writing efficient GPU code requires thinking about hardware constraints that do not exist on CPUs. Warp divergence, memory coalescing, shared memory bank conflicts, and occupancy all directly affect performance. Code that ignores these constraints can run 10-100x slower than code that respects them.
## How It Works
### The GPU Hierarchy: From Chip to Thread
A GPU is organized as a hierarchy of processing units. At the top level, the chip contains an array of streaming multiprocessors (SMs). Each SM is an independent processor with its own cores, registers, shared memory, and warp schedulers. Within each SM, threads are grouped into warps of 32 threads that execute in lockstep.
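This hierarchy can be made concrete with a short Python sketch. The 32-thread warp size matches NVIDIA hardware; the function name and the 256-thread block size are illustrative choices, not a real API:

```python
WARP_SIZE = 32

def locate_thread(global_tid, threads_per_block):
    """Map a flat global thread index to its (block, warp, lane) position."""
    block = global_tid // threads_per_block        # which thread block
    tid_in_block = global_tid % threads_per_block  # index within the block
    warp = tid_in_block // WARP_SIZE               # which warp within the block
    lane = tid_in_block % WARP_SIZE                # which lane within the warp
    return block, warp, lane

# Thread 300 in a launch with 256-thread blocks:
# 300 = 1*256 + 44, and 44 = 1*32 + 12 -> block 1, warp 1, lane 12
print(locate_thread(300, 256))
```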
### Streaming Multiprocessors (SMs)
Each SM is a self-contained processing engine. A typical modern SM contains:
- 64-128 CUDA cores (FP32 units) for single-precision floating-point and integer arithmetic
- Warp schedulers (2-4 per SM) that select which warps to execute each cycle
- A large register file (typically 64K 32-bit registers) partitioned among active threads
- Shared memory (48-164 KB) that acts as a programmer-controlled scratchpad
- L1 cache and texture cache for automatic caching of global memory accesses
- Special function units (SFUs) for transcendental operations like sine, cosine, and reciprocal square root
- Tensor cores (on newer architectures) for accelerated matrix multiply-accumulate operations
The SM can have many more threads assigned to it than it has cores. While some warps wait for memory, others execute. This is the core of the GPU's latency-hiding strategy.
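A rough illustration of this oversubscription, using typical but hypothetical per-SM numbers (128 cores, a 2048-resident-thread limit):

```python
cuda_cores = 128           # execution units per SM (example value)
max_threads_per_sm = 2048  # resident-thread limit per SM (example value)
warp_size = 32

resident_warps = max_threads_per_sm // warp_size  # warps parked on the SM: 64
warps_issuing = cuda_cores // warp_size           # warps' worth of cores per cycle: 4

# 16x oversubscription: while a few warps execute, dozens of
# others can be waiting on memory without idling the SM
print(resident_warps, warps_issuing, resident_warps // warps_issuing)
```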
### Warps and SIMT Execution
When a kernel launches, the GPU assigns thread blocks to SMs. Each thread block is divided into warps of 32 threads. All 32 threads in a warp execute the same instruction at the same time, but each thread operates on different data (its own registers, its own portion of the input array).
This is the SIMT (Single Instruction, Multiple Threads) model. It resembles SIMD (Single Instruction, Multiple Data) on CPUs, but with an important difference: each thread has its own program counter and can branch independently. When threads in a warp diverge at a conditional branch, the hardware serializes the divergent paths. Threads not on the active path are masked (disabled), and the warp executes each path in turn. Once the paths reconverge, all threads resume executing together.
Warp divergence is a major performance concern. If half the threads in a warp take one branch and half take the other, the warp effectively runs at half throughput because it must execute both paths sequentially.
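A toy cost model for this serialization (a sketch of the idea, not real hardware behavior):

```python
WARP_SIZE = 32

def divergence_cost(path_lengths, threads_per_path):
    """Cycles spent vs. useful lane-work for a divergent branch.

    path_lengths[i]: instructions on path i
    threads_per_path[i]: number of lanes taking path i
    The warp executes every path serially with inactive lanes masked off.
    """
    total_cycles = sum(path_lengths)  # each path runs in turn
    useful = sum(n * t for n, t in zip(path_lengths, threads_per_path))
    efficiency = useful / (total_cycles * WARP_SIZE)
    return total_cycles, efficiency

# 16 threads take each side of a branch, 10 instructions per side:
# the warp spends 20 cycles but achieves only 50% lane utilization
print(divergence_cost([10, 10], [16, 16]))
```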
### GPU Memory Hierarchy
The GPU memory system is organized in layers, from fast-but-small to slow-but-large:
Registers: Each thread has private registers. The register file is the fastest storage on the GPU (zero extra latency). A typical SM has 65,536 32-bit registers shared among all active threads.
Shared memory: On-chip memory shared by all threads in a thread block. Access latency is roughly 20-30 cycles, comparable to L1 cache. Programmers explicitly load data into shared memory and synchronize threads to coordinate access.
L1/L2 cache: L1 is per-SM and caches global memory accesses automatically. L2 is shared across all SMs. These caches reduce the effective latency of global memory accesses for data with spatial or temporal locality.
Global memory: Off-chip DRAM (HBM2 or GDDR6), several gigabytes in size. Access latency is 400-800 cycles, but bandwidth is very high (hundreds of GB/s to over 1 TB/s). Efficient access requires coalesced memory patterns where consecutive threads access consecutive addresses.
Constant memory: A read-only region of global memory with a dedicated cache. Efficient when all threads in a warp read the same address.
Texture memory: Read-only memory with a specialized cache optimized for 2D spatial locality, used primarily in graphics and image processing workloads.
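Coalescing can be illustrated by counting how many 128-byte segments a warp's load addresses touch. This is a simplified model of the real hardware coalescer, which has additional rules:

```python
WARP_SIZE = 32
SEGMENT = 128  # bytes per memory transaction (simplified)

def transactions(addresses):
    """Number of distinct 128-byte segments touched by one warp's loads."""
    return len({addr // SEGMENT for addr in addresses})

# Coalesced: lane i reads the i-th consecutive 4-byte float
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Strided: lane i reads element 32*i, scattering the accesses
strided = [4 * 32 * lane for lane in range(WARP_SIZE)]

# One transaction serves the whole warp vs. one per lane
print(transactions(coalesced), transactions(strided))
```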
### Latency Hiding Through Warp Switching
A CPU hides memory latency with caches and out-of-order execution. A GPU hides latency by switching between warps. When a warp issues a memory request and stalls, the warp scheduler immediately switches to another warp that has its data ready. This context switch is essentially free because each warp's state (registers, program counter) is always resident on the SM.
For this to work, the SM needs enough active warps to keep the execution units busy while others wait. This is why occupancy matters: if too few warps are active (because each thread uses too many registers or too much shared memory), the SM cannot hide memory latency effectively.
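A back-of-the-envelope estimate of how many warps that takes, with assumed (but plausible) numbers:

```python
memory_latency = 600  # cycles for a global load (assumed, within the 400-800 range)
compute_cycles = 20   # independent work each warp has between loads (assumed)

# While one warp stalls for memory_latency cycles, the scheduler needs
# enough other warps, each contributing compute_cycles of work, to keep
# the execution units busy the whole time.
warps_needed = -(-memory_latency // compute_cycles)  # ceiling division

max_warps_per_sm = 64  # typical hardware limit
min_occupancy = warps_needed / max_warps_per_sm
print(warps_needed, min_occupancy)
```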
## Complexity Analysis
| Metric | CPU (Multicore) | GPU |
|---|---|---|
| Core count | 4-64 complex cores | Thousands of simple cores |
| Single-thread latency | Low (optimized) | High (not optimized) |
| Aggregate throughput | Moderate | Very high |
| Memory bandwidth | 50-100 GB/s | 500-3000 GB/s |
| Latency hiding | Caches, OoO execution | Warp switching (thousands of threads) |
Peak theoretical throughput for a GPU with N_{SM} streaming multiprocessors, each containing C CUDA cores running at clock frequency f:

\text{FLOPS}_{\text{peak}} = 2 \cdot N_{SM} \cdot C \cdot f

The factor of 2 accounts for fused multiply-add (FMA) operations that perform a multiply and an add in a single cycle.
Effective throughput with warp divergence, where p is the fraction of threads taking the active branch:

\text{Throughput}_{\text{eff}} = p \cdot \text{Throughput}_{\text{peak}}

In the worst case, where exactly one thread in a 32-thread warp diverges, only that thread's lane is active on its path, so p = 1/32:

\text{Throughput}_{\text{eff}} = \frac{\text{Throughput}_{\text{peak}}}{32}
Occupancy is defined as:

\text{Occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}}

The maximum warps per SM is constrained by three resources. If each thread uses R registers and the SM has R_{\text{total}} registers, the register-limited warp count is:

W_{\text{reg}} = \left\lfloor \frac{R_{\text{total}}}{32 \cdot R} \right\rfloor

Similarly, if each block uses S bytes of shared memory and the SM has S_{\text{total}} bytes, the shared-memory-limited block count is:

B_{\text{smem}} = \left\lfloor \frac{S_{\text{total}}}{S} \right\rfloor

The actual occupancy is the minimum of these resource limits and the hardware maximum.
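Plugging representative numbers into these formulas makes them concrete. All hardware values here are hypothetical (80 SMs, 128 cores each, 1.5 GHz, 64K registers per SM):

```python
# Peak throughput: FLOPS_peak = 2 * N_SM * C * f (factor of 2 from FMA)
n_sm, cores_per_sm, clock_hz = 80, 128, 1.5e9
flops_peak = 2 * n_sm * cores_per_sm * clock_hz
print(f"{flops_peak / 1e12:.1f} TFLOPS")  # 30.7 TFLOPS

# Register-limited warp count: each warp of 32 threads needs 32*R registers
r_total, r_per_thread = 65536, 64
warps_by_regs = r_total // (32 * r_per_thread)
print(warps_by_regs)  # 32 warps fit before registers run out
```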
## Implementation
ALGORITHM SimulateWarpScheduler(warps, numCycles)
INPUT: warps: list of warp objects (each with instruction stream and state),
numCycles: number of clock cycles to simulate
OUTPUT: total instructions executed across all warps
BEGIN
    totalExecuted <- 0
    FOR cycle <- 1 TO numCycles DO
        // Find all warps that are ready (not stalled on memory)
        readyWarps <- FILTER warps WHERE warp.state = READY
        IF readyWarps is NOT empty THEN
            // Select the highest-priority ready warp (round-robin or oldest-first)
            selectedWarp <- SELECT_HIGHEST_PRIORITY(readyWarps)
            // Fetch the next instruction for this warp
            instruction <- selectedWarp.instructionStream[selectedWarp.pc]
            IF instruction.type = MEMORY_LOAD THEN
                // Issue memory request, mark warp as stalled
                ISSUE_MEMORY_REQUEST(selectedWarp, instruction.address)
                selectedWarp.state <- STALLED
                selectedWarp.stallCyclesRemaining <- GLOBAL_MEMORY_LATENCY
                selectedWarp.pc <- selectedWarp.pc + 1  // resume here once the load returns
            ELSE IF instruction.type = BRANCH THEN
                // Evaluate branch for all 32 threads
                activeMask <- EVALUATE_BRANCH(selectedWarp.threads, instruction.condition)
                IF ALL_SAME(activeMask) THEN
                    // No divergence: all threads go the same way
                    UPDATE_PC(selectedWarp, instruction, activeMask[0])
                ELSE
                    // Divergence: push reconvergence point, take first path
                    PUSH_DIVERGENCE_STACK(selectedWarp, instruction.reconvergePC)
                    selectedWarp.activeMask <- activeMask
                    selectedWarp.pc <- instruction.takenTarget
                END IF
                // Branches set the pc themselves; no increment here
            ELSE
                // Arithmetic or other instruction: execute on all active threads
                FOR EACH thread IN selectedWarp.threads DO
                    IF thread.active THEN
                        EXECUTE(thread, instruction)
                    END IF
                END FOR
                selectedWarp.pc <- selectedWarp.pc + 1
            END IF
            totalExecuted <- totalExecuted + 1
        END IF
        // The cycle elapses even when every warp is stalled, so always tick
        // the stall counters down (otherwise no warp would ever wake up)
        FOR EACH warp IN warps DO
            IF warp.state = STALLED THEN
                warp.stallCyclesRemaining <- warp.stallCyclesRemaining - 1
                IF warp.stallCyclesRemaining = 0 THEN
                    warp.state <- READY
                END IF
            END IF
        END FOR
    END FOR
RETURN totalExecuted
END
ALGORITHM ComputeOccupancy(threadsPerBlock, registersPerThread, sharedMemPerBlock,
                           maxThreadsPerSM, maxRegistersPerSM, maxSharedMemPerSM, maxBlocksPerSM)
INPUT: resource usage per block and SM hardware limits
OUTPUT: occupancy as a fraction between 0 and 1
BEGIN
    warpsPerBlock <- CEIL(threadsPerBlock / 32)
    maxWarpsPerSM <- maxThreadsPerSM / 32
    // Constraint 1: block limit
    blocksByBlockLimit <- maxBlocksPerSM
    // Constraint 2: register limit
    regsNeededPerBlock <- registersPerThread * threadsPerBlock
    IF regsNeededPerBlock > 0 THEN
        blocksByRegs <- FLOOR(maxRegistersPerSM / regsNeededPerBlock)
    ELSE
        blocksByRegs <- maxBlocksPerSM
    END IF
    // Constraint 3: shared memory limit
    IF sharedMemPerBlock > 0 THEN
        blocksBySmem <- FLOOR(maxSharedMemPerSM / sharedMemPerBlock)
    ELSE
        blocksBySmem <- maxBlocksPerSM
    END IF
    // Actual blocks limited by the tightest constraint
    actualBlocks <- MIN(blocksByBlockLimit, blocksByRegs, blocksBySmem)
    activeWarps <- actualBlocks * warpsPerBlock
    // Occupancy is active warps divided by hardware max
    occupancy <- activeWarps / maxWarpsPerSM
    RETURN MIN(occupancy, 1.0)
END
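The occupancy pseudocode translates directly into runnable Python. The default hardware limits below are example values for a hypothetical SM, not any specific GPU:

```python
import math

def compute_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                      max_threads_per_sm=2048, max_regs_per_sm=65536,
                      max_smem_per_sm=98304, max_blocks_per_sm=32):
    """Fraction of the SM's warp slots that can be filled, per the pseudocode."""
    warps_per_block = math.ceil(threads_per_block / 32)
    max_warps_per_sm = max_threads_per_sm // 32

    # Register and shared-memory constraints on resident blocks
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = (max_regs_per_sm // regs_per_block
                      if regs_per_block > 0 else max_blocks_per_sm)
    blocks_by_smem = (max_smem_per_sm // smem_per_block
                      if smem_per_block > 0 else max_blocks_per_sm)

    # The tightest constraint wins
    actual_blocks = min(max_blocks_per_sm, blocks_by_regs, blocks_by_smem)
    active_warps = actual_blocks * warps_per_block
    return min(active_warps / max_warps_per_sm, 1.0)

# 256-thread blocks, 32 registers/thread, 16 KB shared memory per block:
# shared memory is the binding limit (6 blocks), giving 48/64 warp slots
print(compute_occupancy(256, 32, 16384))  # 0.75
```

Here the shared-memory constraint (98304 // 16384 = 6 blocks) is tighter than the register constraint (8 blocks) or the block limit (32), which is exactly the kind of trade-off the formula section describes.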
## Real-World Applications
<ul>
<li><strong>Deep learning training</strong>: neural network training involves billions of matrix multiply-accumulate operations per batch; GPUs with thousands of cores and tensor units can process these in parallel, reducing training time from weeks on CPUs to hours on GPUs</li>
<li><strong>Scientific simulation</strong>: molecular dynamics, fluid dynamics (CFD), and climate modeling all involve applying the same physics equations to millions of particles or grid cells, a pattern that maps naturally to the GPU's SIMT execution model</li>
<li><strong>Real-time ray tracing</strong>: modern GPUs include dedicated RT cores that accelerate ray-triangle intersection tests, enabling photorealistic lighting in games and film production at interactive frame rates</li>
<li><strong>Cryptocurrency mining</strong>: proof-of-work algorithms require computing billions of hash operations; the GPU's massive parallelism makes it far more efficient than CPUs for this brute-force search</li>
<li><strong>Video encoding and transcoding</strong>: encoding a video frame involves applying transforms, quantization, and motion estimation to thousands of pixel blocks independently, which maps well to GPU parallel execution</li>
<li><strong>Large language model inference</strong>: serving transformer models requires large matrix-vector multiplications for each token generated; GPUs with high memory bandwidth (HBM2e/HBM3) can feed data to thousands of cores fast enough to achieve low-latency inference</li>
</ul>
## Key Takeaways
<ul>
<li>GPUs achieve massive parallelism by packing thousands of simple cores into streaming multiprocessors (SMs), trading single-thread performance for aggregate throughput across many threads</li>
<li>The warp is the fundamental execution unit: 32 threads executing the same instruction in lockstep under the SIMT model; warp divergence at branches serializes execution paths and reduces throughput</li>
<li>The GPU memory hierarchy spans registers (fastest), shared memory (programmer-managed scratchpad), L1/L2 caches, and global memory (high bandwidth but high latency DRAM)</li>
<li>Latency hiding is the GPU's primary strategy for dealing with slow memory: while one warp waits for data, the scheduler switches to another ready warp at near-zero cost</li>
<li>Occupancy, the ratio of active warps to maximum warps per SM, determines how effectively the GPU can hide memory latency; it is constrained by register usage, shared memory usage, and block size</li>
</ul>