# GPU Architecture: Streaming Multiprocessors, Warps, and the SIMT Model
How GPUs achieve massive parallelism through thousands of lightweight cores organized into streaming multiprocessors, executing warps of threads in lockstep under the SIMT execution model.
## Terminology
| Term | Definition |
|---|---|
| GPU (Graphics Processing Unit) | A processor designed for massively parallel computation, containing thousands of small cores optimized for executing the same operation across many data elements simultaneously |
| Streaming Multiprocessor (SM) | The fundamental processing block on a GPU, containing a set of CUDA cores, warp schedulers, register files, and shared memory that together execute groups of threads |
| CUDA core | A single arithmetic execution unit within an SM capable of performing one floating-point or integer operation per clock cycle |
| Warp | A group of 32 threads that execute the same instruction simultaneously on an SM; the basic scheduling unit of GPU execution |
| SIMT (Single Instruction, Multiple Threads) | The GPU execution model where a single instruction is broadcast to all threads in a warp, each operating on different data; similar to SIMD but with per-thread branching capability |
| Warp divergence | A performance penalty that occurs when threads within the same warp take different execution paths at a branch, forcing the hardware to serialize the divergent paths |
| Occupancy | The ratio of active warps on an SM to the maximum number of warps the SM can support; higher occupancy helps hide memory latency |
| Global memory | The largest but slowest memory on the GPU (several GB of DRAM), accessible by all threads across all SMs, with latency of 400-800 cycles |
| Shared memory | A fast, programmer-managed on-chip memory (typically 48-164 KB per SM) shared among all threads in a thread block, with latency comparable to L1 cache |
| Latency hiding | The GPU technique of switching to another warp's instructions while the current warp waits for a long-latency operation (like a memory fetch) to complete, keeping execution units busy |
## What & Why
A CPU is designed to minimize the latency of a single thread of execution. It dedicates most of its transistor budget to large caches, branch predictors, and out-of-order execution logic so that one thread runs as fast as possible. A GPU takes the opposite approach: it maximizes the throughput of thousands of threads running in parallel, trading single-thread performance for massive aggregate bandwidth.
This design makes GPUs ideal for workloads where the same operation must be applied to millions of data elements independently. Matrix multiplication, image processing, physics simulations, and neural network training all fit this pattern. A modern GPU can have over 10,000 CUDA cores organized into dozens of streaming multiprocessors, achieving teraflops of compute throughput.
Understanding GPU architecture matters because writing efficient GPU code requires thinking about hardware constraints that do not exist on CPUs. Warp divergence, memory coalescing, shared memory bank conflicts, and occupancy all directly affect performance. Code that ignores these constraints can run 10-100x slower than code that respects them.
## How It Works
### The GPU Hierarchy: From Chip to Thread
A GPU is organized as a hierarchy of processing units. At the top level, the chip contains an array of streaming multiprocessors (SMs). Each SM is an independent processor with its own cores, registers, shared memory, and warp schedulers. Within each SM, threads are grouped into warps of 32 threads that execute in lockstep.
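This hierarchy can be made concrete with a short Python sketch. The 32-thread warp size matches NVIDIA hardware; the function name and the 256-thread block size are illustrative choices, not a real API:

```python
WARP_SIZE = 32

def locate_thread(global_tid, threads_per_block):
    """Map a flat global thread index to its (block, warp, lane) position."""
    block = global_tid // threads_per_block        # which thread block
    tid_in_block = global_tid % threads_per_block  # index within the block
    warp = tid_in_block // WARP_SIZE               # which warp within the block
    lane = tid_in_block % WARP_SIZE                # which lane within the warp
    return block, warp, lane

# Thread 300 in a launch with 256-thread blocks:
# 300 = 1*256 + 44, and 44 = 1*32 + 12 -> block 1, warp 1, lane 12
print(locate_thread(300, 256))
```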
### Streaming Multiprocessors (SMs)
Each SM is a self-contained processing engine. A typical modern SM contains:
- 64-128 CUDA cores (FP32 units) for single-precision floating-point and integer arithmetic
- Warp schedulers (2-4 per SM) that select which warps to execute each cycle
- A large register file (typically 64K 32-bit registers) partitioned among active threads
- Shared memory (48-164 KB) that acts as a programmer-controlled scratchpad
- L1 cache and texture cache for automatic caching of global memory accesses
- Special function units (SFUs) for transcendental operations like sine, cosine, and reciprocal square root
- Tensor cores (on newer architectures) for accelerated matrix multiply-accumulate operations
The SM can have many more threads assigned to it than it has cores. While some warps wait for memory, others execute. This is the core of the GPU's latency-hiding strategy.
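A rough illustration of this oversubscription, using typical but hypothetical per-SM numbers (128 cores, a 2048-resident-thread limit):

```python
cuda_cores = 128           # execution units per SM (example value)
max_threads_per_sm = 2048  # resident-thread limit per SM (example value)
warp_size = 32

resident_warps = max_threads_per_sm // warp_size  # warps parked on the SM: 64
warps_issuing = cuda_cores // warp_size           # warps' worth of cores per cycle: 4

# 16x oversubscription: while a few warps execute, dozens of
# others can be waiting on memory without idling the SM
print(resident_warps, warps_issuing, resident_warps // warps_issuing)
```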
### Warps and SIMT Execution
When a kernel launches, the GPU assigns thread blocks to SMs. Each thread block is divided into warps of 32 threads. All 32 threads in a warp execute the same instruction at the same time, but each thread operates on different data (its own registers, its own portion of the input array).
This is the SIMT (Single Instruction, Multiple Threads) model. It resembles SIMD (Single Instruction, Multiple Data) on CPUs, but with an important difference: each thread has its own program counter and can branch independently. When threads in a warp diverge at a conditional branch, the hardware serializes the divergent paths. Threads not on the active path are masked (disabled), and the warp executes each path in turn. Once the paths reconverge, all threads resume executing together.
Warp divergence is a major performance concern. If half the threads in a warp take one branch and half take the other, the warp effectively runs at half throughput because it must execute both paths sequentially.
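A toy cost model for this serialization (a sketch of the idea, not real hardware behavior):

```python
WARP_SIZE = 32

def divergence_cost(path_lengths, threads_per_path):
    """Cycles spent vs. useful lane-work for a divergent branch.

    path_lengths[i]: instructions on path i
    threads_per_path[i]: number of lanes taking path i
    The warp executes every path serially with inactive lanes masked off.
    """
    total_cycles = sum(path_lengths)  # each path runs in turn
    useful = sum(n * t for n, t in zip(path_lengths, threads_per_path))
    efficiency = useful / (total_cycles * WARP_SIZE)
    return total_cycles, efficiency

# 16 threads take each side of a branch, 10 instructions per side:
# the warp spends 20 cycles but achieves only 50% lane utilization
print(divergence_cost([10, 10], [16, 16]))
```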
### GPU Memory Hierarchy
The GPU memory system is organized in layers, from fast-but-small to slow-but-large:
Registers: Each thread has private registers. The register file is the fastest storage on the GPU (zero extra latency). A typical SM has 65,536 32-bit registers shared among all active threads.
Shared memory: On-chip memory shared by all threads in a thread block. Access latency is roughly 20-30 cycles, comparable to L1 cache. Programmers explicitly load data into shared memory and synchronize threads to coordinate access.
L1/L2 cache: L1 is per-SM and caches global memory accesses automatically. L2 is shared across all SMs. These caches reduce the effective latency of global memory accesses for data with spatial or temporal locality.
Global memory: Off-chip DRAM (HBM2 or GDDR6), several gigabytes in size. Access latency is 400-800 cycles, but bandwidth is very high (hundreds of GB/s to over 1 TB/s). Efficient access requires coalesced memory patterns where consecutive threads access consecutive addresses.
Constant memory: A read-only region of global memory with a dedicated cache. Efficient when all threads in a warp read the same address.
Texture memory: Read-only memory with a specialized cache optimized for 2D spatial locality, used primarily in graphics and image processing workloads.
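Coalescing can be illustrated by counting how many 128-byte segments a warp's load addresses touch. This is a simplified model of the real hardware coalescer, which has additional rules:

```python
WARP_SIZE = 32
SEGMENT = 128  # bytes per memory transaction (simplified)

def transactions(addresses):
    """Number of distinct 128-byte segments touched by one warp's loads."""
    return len({addr // SEGMENT for addr in addresses})

# Coalesced: lane i reads the i-th consecutive 4-byte float
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Strided: lane i reads element 32*i, scattering the accesses
strided = [4 * 32 * lane for lane in range(WARP_SIZE)]

# One transaction serves the whole warp vs. one per lane
print(transactions(coalesced), transactions(strided))
```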
### Latency Hiding Through Warp Switching
A CPU hides memory latency with caches and out-of-order execution. A GPU hides latency by switching between warps. When a warp issues a memory request and stalls, the warp scheduler immediately switches to another warp that has its data ready. This context switch is essentially free because each warp's state (registers, program counter) is always resident on the SM.
For this to work, the SM needs enough active warps to keep the execution units busy while others wait. This is why occupancy matters: if too few warps are active (because each thread uses too many registers or too much shared memory), the SM cannot hide memory latency effectively.
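A back-of-the-envelope estimate of how many warps that takes, with assumed (but plausible) numbers:

```python
memory_latency = 600  # cycles for a global load (assumed, within the 400-800 range)
compute_cycles = 20   # independent work each warp has between loads (assumed)

# While one warp stalls for memory_latency cycles, the scheduler needs
# enough other warps, each contributing compute_cycles of work, to keep
# the execution units busy the whole time.
warps_needed = -(-memory_latency // compute_cycles)  # ceiling division

max_warps_per_sm = 64  # typical hardware limit
min_occupancy = warps_needed / max_warps_per_sm
print(warps_needed, min_occupancy)
```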
## Complexity Analysis
| Metric | CPU (Multicore) | GPU |
|---|---|---|
| Core count | 4-64 complex cores | Thousands of simple cores |
| Single-thread latency | Low (optimized) | High (not optimized) |
| Aggregate throughput | Moderate | Very high |
| Memory bandwidth | 50-100 GB/s | 500-3000 GB/s |
| Latency hiding | Caches, OoO execution | Warp switching (thousands of threads) |
Peak theoretical throughput for a GPU with N_{SM} streaming multiprocessors, each containing C CUDA cores running at clock frequency f:

\text{FLOPS}_{\text{peak}} = 2 \cdot N_{SM} \cdot C \cdot f

The factor of 2 accounts for fused multiply-add (FMA) operations that perform a multiply and an add in a single cycle.
Effective throughput with warp divergence, where p is the fraction of threads taking the active branch:

\text{Throughput}_{\text{eff}} = p \cdot \text{Throughput}_{\text{peak}}

In the worst case, where exactly one thread in a 32-thread warp diverges, only that thread's lane is active on its path, so p = 1/32:

\text{Throughput}_{\text{eff}} = \frac{\text{Throughput}_{\text{peak}}}{32}
Occupancy is defined as:

\text{Occupancy} = \frac{\text{active warps per SM}}{\text{maximum warps per SM}}

The maximum warps per SM is constrained by three resources. If each thread uses R registers and the SM has R_{\text{total}} registers, the register-limited warp count is:

W_{\text{reg}} = \left\lfloor \frac{R_{\text{total}}}{32 \cdot R} \right\rfloor

Similarly, if each block uses S bytes of shared memory and the SM has S_{\text{total}} bytes, the shared-memory-limited block count is:

B_{\text{smem}} = \left\lfloor \frac{S_{\text{total}}}{S} \right\rfloor

The actual occupancy is the minimum of these resource limits and the hardware maximum.
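Plugging representative numbers into these formulas makes them concrete. All hardware values here are hypothetical (80 SMs, 128 cores each, 1.5 GHz, 64K registers per SM):

```python
# Peak throughput: FLOPS_peak = 2 * N_SM * C * f (factor of 2 from FMA)
n_sm, cores_per_sm, clock_hz = 80, 128, 1.5e9
flops_peak = 2 * n_sm * cores_per_sm * clock_hz
print(f"{flops_peak / 1e12:.1f} TFLOPS")  # 30.7 TFLOPS

# Register-limited warp count: each warp of 32 threads needs 32*R registers
r_total, r_per_thread = 65536, 64
warps_by_regs = r_total // (32 * r_per_thread)
print(warps_by_regs)  # 32 warps fit before registers run out
```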
## Implementation
ALGORITHM SimulateWarpScheduler(warps, numCycles)
INPUT: warps: list of warp objects (each with instruction stream and state),
numCycles: number of clock cycles to simulate
OUTPUT: total instructions executed across all warps
BEGIN
    totalExecuted <- 0
    FOR cycle <- 1 TO numCycles DO
        // Find all warps that are ready (not stalled on memory)
        readyWarps <- FILTER warps WHERE warp.state = READY
        IF readyWarps is NOT empty THEN
            // Select the highest-priority ready warp (round-robin or oldest-first)
            selectedWarp <- SELECT_HIGHEST_PRIORITY(readyWarps)
            // Fetch the next instruction for this warp
            instruction <- selectedWarp.instructionStream[selectedWarp.pc]
            IF instruction.type = MEMORY_LOAD THEN
                // Issue memory request, mark warp as stalled
                ISSUE_MEMORY_REQUEST(selectedWarp, instruction.address)
                selectedWarp.state <- STALLED
                selectedWarp.stallCyclesRemaining <- GLOBAL_MEMORY_LATENCY
                selectedWarp.pc <- selectedWarp.pc + 1  // resume here once the load returns
            ELSE IF instruction.type = BRANCH THEN
                // Evaluate branch for all 32 threads
                activeMask <- EVALUATE_BRANCH(selectedWarp.threads, instruction.condition)
                IF ALL_SAME(activeMask) THEN
                    // No divergence: all threads go the same way
                    UPDATE_PC(selectedWarp, instruction, activeMask[0])
                ELSE
                    // Divergence: push reconvergence point, take first path
                    PUSH_DIVERGENCE_STACK(selectedWarp, instruction.reconvergePC)
                    selectedWarp.activeMask <- activeMask
                    selectedWarp.pc <- instruction.takenTarget
                END IF
                // Branches set the pc themselves; no increment here
            ELSE
                // Arithmetic or other instruction: execute on all active threads
                FOR EACH thread IN selectedWarp.threads DO
                    IF thread.active THEN
                        EXECUTE(thread, instruction)
                    END IF
                END FOR
                selectedWarp.pc <- selectedWarp.pc + 1
            END IF
            totalExecuted <- totalExecuted + 1
        END IF
        // The cycle elapses even when every warp is stalled, so always tick
        // the stall counters down (otherwise no warp would ever wake up)
        FOR EACH warp IN warps DO
            IF warp.state = STALLED THEN
                warp.stallCyclesRemaining <- warp.stallCyclesRemaining - 1
                IF warp.stallCyclesRemaining = 0 THEN
                    warp.state <- READY
                END IF
            END IF
        END FOR
    END FOR
RETURN totalExecuted
END
ALGORITHM ComputeOccupancy(threadsPerBlock, registersPerThread, sharedMemPerBlock,
                           maxThreadsPerSM, maxRegistersPerSM, maxSharedMemPerSM, maxBlocksPerSM)
INPUT: resource usage per block and SM hardware limits
OUTPUT: occupancy as a fraction between 0 and 1
BEGIN
    warpsPerBlock <- CEIL(threadsPerBlock / 32)
    maxWarpsPerSM <- maxThreadsPerSM / 32
    // Constraint 1: block limit
    blocksByBlockLimit <- maxBlocksPerSM
    // Constraint 2: register limit
    regsNeededPerBlock <- registersPerThread * threadsPerBlock
    IF regsNeededPerBlock > 0 THEN
        blocksByRegs <- FLOOR(maxRegistersPerSM / regsNeededPerBlock)
    ELSE
        blocksByRegs <- maxBlocksPerSM
    END IF
    // Constraint 3: shared memory limit
    IF sharedMemPerBlock > 0 THEN
        blocksBySmem <- FLOOR(maxSharedMemPerSM / sharedMemPerBlock)
    ELSE
        blocksBySmem <- maxBlocksPerSM
    END IF
    // Actual blocks limited by the tightest constraint
    actualBlocks <- MIN(blocksByBlockLimit, blocksByRegs, blocksBySmem)
    activeWarps <- actualBlocks * warpsPerBlock
    // Occupancy is active warps divided by hardware max
    occupancy <- activeWarps / maxWarpsPerSM
    RETURN MIN(occupancy, 1.0)
END
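The occupancy pseudocode translates directly into runnable Python. The default hardware limits below are example values for a hypothetical SM, not any specific GPU:

```python
import math

def compute_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                      max_threads_per_sm=2048, max_regs_per_sm=65536,
                      max_smem_per_sm=98304, max_blocks_per_sm=32):
    """Fraction of the SM's warp slots that can be filled, per the pseudocode."""
    warps_per_block = math.ceil(threads_per_block / 32)
    max_warps_per_sm = max_threads_per_sm // 32

    # Register and shared-memory constraints on resident blocks
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = (max_regs_per_sm // regs_per_block
                      if regs_per_block > 0 else max_blocks_per_sm)
    blocks_by_smem = (max_smem_per_sm // smem_per_block
                      if smem_per_block > 0 else max_blocks_per_sm)

    # The tightest constraint wins
    actual_blocks = min(max_blocks_per_sm, blocks_by_regs, blocks_by_smem)
    active_warps = actual_blocks * warps_per_block
    return min(active_warps / max_warps_per_sm, 1.0)

# 256-thread blocks, 32 registers/thread, 16 KB shared memory per block:
# shared memory is the binding limit (6 blocks), giving 48/64 warp slots
print(compute_occupancy(256, 32, 16384))  # 0.75
```

Here the shared-memory constraint (98304 // 16384 = 6 blocks) is tighter than the register constraint (8 blocks) or the block limit (32), which is exactly the kind of trade-off the formula section describes.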
## Real-World Applications
<ul>
<li><strong>Deep learning training</strong>: neural network training involves billions of matrix multiply-accumulate operations per batch; GPUs with thousands of cores and tensor units can process these in parallel, reducing training time from weeks on CPUs to hours on GPUs</li>
<li><strong>Scientific simulation</strong>: molecular dynamics, fluid dynamics (CFD), and climate modeling all involve applying the same physics equations to millions of particles or grid cells, a pattern that maps naturally to the GPU's SIMT execution model</li>
<li><strong>Real-time ray tracing</strong>: modern GPUs include dedicated RT cores that accelerate ray-triangle intersection tests, enabling photorealistic lighting in games and film production at interactive frame rates</li>
<li><strong>Cryptocurrency mining</strong>: proof-of-work algorithms require computing billions of hash operations; the GPU's massive parallelism makes it far more efficient than CPUs for this brute-force search</li>
<li><strong>Video encoding and transcoding</strong>: encoding a video frame involves applying transforms, quantization, and motion estimation to thousands of pixel blocks independently, which maps well to GPU parallel execution</li>
<li><strong>Large language model inference</strong>: serving transformer models requires large matrix-vector multiplications for each token generated; GPUs with high memory bandwidth (HBM2e/HBM3) can feed data to thousands of cores fast enough to achieve low-latency inference</li>
</ul>
## Key Takeaways
<ul>
<li>GPUs achieve massive parallelism by packing thousands of simple cores into streaming multiprocessors (SMs), trading single-thread performance for aggregate throughput across many threads</li>
<li>The warp is the fundamental execution unit: 32 threads executing the same instruction in lockstep under the SIMT model; warp divergence at branches serializes execution paths and reduces throughput</li>
<li>The GPU memory hierarchy spans registers (fastest), shared memory (programmer-managed scratchpad), L1/L2 caches, and global memory (high bandwidth but high latency DRAM)</li>
<li>Latency hiding is the GPU's primary strategy for dealing with slow memory: while one warp waits for data, the scheduler switches to another ready warp at near-zero cost</li>
<li>Occupancy, the ratio of active warps to maximum warps per SM, determines how effectively the GPU can hide memory latency; it is constrained by register usage, shared memory usage, and block size</li>
</ul>