
GPU Memory Optimization: Shared Memory, Coalescing, Bank Conflicts, and Occupancy

How to write GPU code that makes efficient use of the memory hierarchy through coalesced global memory access, shared memory tiling, bank conflict avoidance, and occupancy tuning.

2025-08-23
Computer Hardware · gpus · memory-optimization · parallel-computing

Terminology

| Term | Definition |
| --- | --- |
| Memory coalescing | The hardware's ability to combine multiple memory requests from threads in a warp into a single wide memory transaction when the threads access consecutive addresses |
| Shared memory | Fast on-chip memory (typically 48-164 KB per SM) shared among all threads in a block, with latency of roughly 20-30 cycles compared to 400-800 cycles for global memory |
| Bank conflict | A performance penalty that occurs when two or more threads in a warp access different addresses that map to the same shared memory bank, forcing the accesses to be serialized |
| Memory bank | Shared memory is divided into 32 equally-sized banks (one per warp lane); consecutive 4-byte words map to consecutive banks, allowing 32 simultaneous accesses if each targets a different bank |
| Occupancy | The ratio of active warps to the maximum warps an SM can support; higher occupancy gives the scheduler more warps to switch between, improving latency hiding |
| Memory transaction | A single read or write operation between the GPU and global memory, typically 32 or 128 bytes wide; fewer transactions for the same data means higher effective bandwidth |
| Stride-1 access | A memory access pattern where consecutive threads access consecutive memory addresses (stride of 1 element); the ideal pattern for coalescing |
| Register pressure | When a kernel uses many registers per thread, fewer threads can be active on the SM simultaneously because the register file is a fixed shared resource |
| Tiling | Loading data from global memory into shared memory in chunks (tiles), processing each tile locally, then loading the next; reduces global memory traffic by reusing data in fast shared memory |
| Arithmetic intensity | The ratio of compute operations to memory operations (FLOPs per byte transferred); kernels with high arithmetic intensity are compute-bound, while low-intensity kernels are memory-bound |

What & Why

The single biggest bottleneck in GPU programming is memory. A modern GPU can perform trillions of floating-point operations per second, but it can only move data from global memory at a few hundred gigabytes to a few terabytes per second. If your kernel spends most of its time waiting for data, all those CUDA cores sit idle.

Memory optimization is about closing this gap. The three main techniques are:

  1. Coalescing: arranging global memory accesses so the hardware can combine them into fewer, wider transactions
  2. Shared memory tiling: loading data into fast on-chip shared memory and reusing it multiple times before going back to global memory
  3. Occupancy tuning: balancing resource usage (registers, shared memory) so enough warps are active to hide memory latency through warp switching

Getting memory access patterns right can easily yield 10-50x speedups over naive implementations. This is not micro-optimization; it is the difference between code that uses 5% of the GPU's capability and code that uses 80%.

How It Works

Memory Coalescing

When a warp of 32 threads issues a memory request, the hardware inspects the addresses and tries to combine them into as few transactions as possible. A 128-byte cache line can serve 32 consecutive 4-byte floats in a single transaction.

Coalesced access (ideal): Thread $i$ reads address $\text{base} + i \times 4$. All 32 threads hit the same 128-byte cache line. One transaction serves the entire warp.

Strided access (bad): Thread $i$ reads address $\text{base} + i \times \text{stride} \times 4$ where $\text{stride} > 1$. Threads access addresses spread across multiple cache lines, requiring multiple transactions. A stride of 32 is the worst case: every thread hits a different cache line, requiring 32 separate transactions.

Random access (worst): Each thread reads from an unrelated address. Every access may require its own transaction.

[Figure: coalesced vs. strided memory access. Coalesced: threads T0-T31 read consecutive addresses within one 128-byte cache line, so 1 transaction serves the warp at full bandwidth. Strided: each thread hits a different cache line, requiring up to 32 transactions per warp, i.e., 1/32 of peak bandwidth.]

A common scenario where coalescing matters is matrix access. If a matrix is stored in row-major order and threads in a warp process consecutive columns of the same row, the access is coalesced. If threads process consecutive rows of the same column, the access is strided by the row width, which is terrible for coalescing. The fix is often to transpose the matrix or use shared memory tiling.
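The transaction-counting rule above can be sanity-checked with a small model. This is plain Python, not CUDA: it simply counts how many distinct cache lines a warp's addresses touch, assuming the 4-byte elements and 128-byte line granularity described in the text.

```python
# Rough model of coalescing: count the 128-byte transactions needed
# for one warp of 32 threads, each loading a 4-byte element at
# base + i * stride. Line size and element size are the assumptions
# stated in the text, not queried from real hardware.

def transactions_per_warp(stride_elems, elem_bytes=4, line_bytes=128, warp=32):
    """Number of distinct 128-byte lines touched by threads 0..31."""
    lines = {(i * stride_elems * elem_bytes) // line_bytes for i in range(warp)}
    return len(lines)

print(transactions_per_warp(1))    # stride-1: 1 transaction (fully coalesced)
print(transactions_per_warp(2))    # stride-2: 2 transactions
print(transactions_per_warp(32))   # stride-32: 32 transactions (worst case)
```

The stride-32 result matches the matrix example: reading down a column of a 32-wide row-major matrix puts every thread on its own cache line.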

Shared Memory and Tiling

Shared memory is the programmer's tool for converting uncoalesced global memory patterns into efficient access. The strategy is called tiling:

  1. A block of threads cooperatively loads a tile of data from global memory into shared memory (using coalesced reads)
  2. Threads process the tile from shared memory (fast, no global memory traffic)
  3. Repeat for the next tile

The classic example is tiled matrix multiplication. A naive matrix multiply has each thread compute one element of the output by reading an entire row of A and an entire column of B from global memory. With $N \times N$ matrices, each element of A and B is read $N$ times across all threads.

With tiling, threads in a block load small tiles (e.g., 16x16) of A and B into shared memory, compute partial products, then load the next tiles. Each element is read from global memory only once per tile load, then reused 16 times from shared memory. This reduces global memory traffic by a factor equal to the tile size.
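The traffic reduction is easy to check numerically. A minimal sketch of the read counts for the naive and tiled versions, using the formulas from this section:

```python
# Global-memory read counts for naive vs. tiled N x N matrix multiply:
# naive reads 2N^3 elements in total; tiling divides that by the tile size.

def naive_reads(n):
    return 2 * n**3           # n^2 threads, each reading 2n elements

def tiled_reads(n, tile):
    return 2 * n**3 // tile   # each loaded element is reused `tile` times

n, t = 1024, 16
print(naive_reads(n) // tiled_reads(n, t))  # -> 16: reduction equals the tile size
```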

Bank Conflicts

Shared memory is divided into 32 banks. Consecutive 4-byte words map to consecutive banks (word $i$ maps to bank $i \bmod 32$). When all 32 threads in a warp access different banks, all accesses happen simultaneously. When two or more threads access different addresses in the same bank, the accesses are serialized.

No conflict: Thread i accesses word i (stride-1 pattern). Each thread hits a different bank.

2-way conflict: Thread i accesses word 2i. Even-numbered banks get two accesses each, odd banks get none. Throughput is halved.

32-way conflict: All threads access the same bank (e.g., stride-32 pattern). Accesses are fully serialized, reducing throughput to 1/32.

Exception: If all threads access the exact same address, the hardware broadcasts the value in a single transaction (no conflict).

The fix for bank conflicts is usually padding. For example, when staging 32-element rows in shared memory, declaring the array as shared[32][32 + 1] adds one unused element per row, so consecutive rows start in different banks and a column-wise (stride-32) access no longer sends every thread to the same bank.
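The effect of padding can be verified with the bank mapping from this section (word $i$ goes to bank $i \bmod 32$). This sketch models a warp reading one column of a tile, i.e., thread $i$ reads tile[i][col], and reports the worst per-bank collision count:

```python
# Worst-case k for a k-way bank conflict when a warp reads column `col`
# of a row-major tile: thread i touches word i * row_width + col,
# which lands in bank (i * row_width + col) mod 32. 4-byte words and
# 32 banks are the assumptions from the text.

from collections import Counter

def max_conflict(row_width, col=0, warp=32, banks=32):
    bank_hits = Counter((i * row_width + col) % banks for i in range(warp))
    return max(bank_hits.values())

print(max_conflict(row_width=32))  # unpadded 32-wide rows: 32-way conflict
print(max_conflict(row_width=33))  # padded rows (32 + 1): conflict-free (1)
```

With the padded row width of 33, thread $i$ hits bank $33i \bmod 32 = i \bmod 32$, so every lane lands in its own bank.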

Occupancy and Resource Balancing

Occupancy is the ratio of active warps to the SM's maximum warp capacity. Higher occupancy means more warps available for the scheduler to switch between when a warp stalls on memory.

Three resources limit occupancy:

Registers: Each SM has a fixed register file (e.g., 65,536 registers). If a kernel uses 64 registers per thread and a block has 256 threads, one block needs 16,384 registers. The SM can fit 4 such blocks (65,536 / 16,384), giving 32 warps (4 blocks x 8 warps/block).

Shared memory: If a kernel uses 48 KB of shared memory per block and the SM has 96 KB, only 2 blocks can run concurrently.

Block size: The SM has a maximum block count (e.g., 16 or 32 blocks). Very small blocks may hit this limit before exhausting registers or shared memory.

The optimal occupancy is not always 100%. Sometimes using more registers per thread (lower occupancy) allows the compiler to keep more values in registers instead of spilling to slow local memory, resulting in faster execution despite fewer active warps.

Complexity Analysis

| Access Pattern | Transactions per Warp | Effective Bandwidth |
| --- | --- | --- |
| Coalesced (stride-1) | 1 | 100% of peak |
| Stride-2 | 2 | 50% of peak |
| Stride-$s$ (general) | $\min(s, 32)$ | $\frac{1}{\min(s, 32)}$ of peak |
| Random | Up to 32 | $\frac{1}{32}$ of peak (worst case) |

Tiling reduces global memory traffic for matrix multiplication. For $N \times N$ matrices with tile size $T$:

Without tiling, each thread reads 2N elements from global memory. Total global memory reads:

$\text{Reads}_{\text{naive}} = 2N^3$

With tiling, each tile load brings $T^2$ elements that are reused $T$ times. Total global memory reads:

$\text{Reads}_{\text{tiled}} = \frac{2N^3}{T}$

The reduction factor is T, so a 16x16 tile reduces global memory traffic by 16x.

Shared memory bank conflict penalty. For a k-way bank conflict (where k threads hit the same bank):

$T_{\text{access}} = k \times T_{\text{single}}$

Occupancy calculation. Given $R$ registers per thread, block size $B$, shared memory per block $S$, and SM limits $R_{\text{max}}$, $S_{\text{max}}$, $W_{\text{max}}$, and maximum resident blocks $\text{Blocks}_{\text{max}}$:

$\text{Blocks}_{\text{reg}} = \left\lfloor \frac{R_{\text{max}}}{R \times B} \right\rfloor$
$\text{Blocks}_{\text{smem}} = \left\lfloor \frac{S_{\text{max}}}{S} \right\rfloor$
$\text{Active warps} = \min(\text{Blocks}_{\text{reg}}, \text{Blocks}_{\text{smem}}, \text{Blocks}_{\text{max}}) \times \frac{B}{32}$
$\text{Occupancy} = \frac{\text{Active warps}}{W_{\text{max}}}$

Implementation

ALGORITHM CoalescedMatrixTranspose(input, output, N)
INPUT: input: N x N matrix in row-major order
OUTPUT: output: transposed N x N matrix
NOTE: uses shared memory to convert column reads into coalesced writes

BEGIN
  SHARED tile[TILE_DIM][TILE_DIM + 1]  // +1 padding to avoid bank conflicts

  // Compute input coordinates
  xIn <- blockIdx.x * TILE_DIM + threadIdx.x
  yIn <- blockIdx.y * TILE_DIM + threadIdx.y

  // Load tile from global memory (coalesced read: consecutive threads read consecutive columns)
  IF xIn < N AND yIn < N THEN
    tile[threadIdx.y][threadIdx.x] <- input[yIn * N + xIn]
  END IF

  BARRIER_SYNC()

  // Compute output coordinates (swapped block indices)
  xOut <- blockIdx.y * TILE_DIM + threadIdx.x
  yOut <- blockIdx.x * TILE_DIM + threadIdx.y

  // Write transposed tile to global memory (coalesced write: consecutive threads write consecutive columns)
  IF xOut < N AND yOut < N THEN
    output[yOut * N + xOut] <- tile[threadIdx.x][threadIdx.y]
  END IF
END
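To check that the indexing in CoalescedMatrixTranspose is right, the kernel can be replayed sequentially on the CPU. This is a pure-Python simulation, not GPU code: each (block, thread) pair runs in a loop, with the load phase completed before the store phase to mimic the barrier, and the padded tile column included.

```python
# CPU replay of the transpose pseudocode above. TILE is kept small for
# the sketch; real kernels typically use 16 or 32. Matrices are flat
# row-major lists, matching input[y * N + x] in the pseudocode.

TILE = 4

def transpose_sim(inp, n):
    out = [0] * (n * n)
    nblocks = (n + TILE - 1) // TILE
    for by in range(nblocks):
        for bx in range(nblocks):
            tile = [[0] * (TILE + 1) for _ in range(TILE)]  # +1 padding column
            # phase 1: every thread loads; the barrier is implicit because
            # all loads finish before any store below
            for ty in range(TILE):
                for tx in range(TILE):
                    x, y = bx * TILE + tx, by * TILE + ty
                    if x < n and y < n:
                        tile[ty][tx] = inp[y * n + x]
            # phase 2: write with swapped block indices and swapped tile indices
            for ty in range(TILE):
                for tx in range(TILE):
                    x, y = by * TILE + tx, bx * TILE + ty
                    if x < n and y < n:
                        out[y * n + x] = tile[tx][ty]
    return out

n = 6  # deliberately not a multiple of TILE, to exercise the bounds checks
a = list(range(n * n))
t = transpose_sim(a, n)
assert all(t[c * n + r] == a[r * n + c] for r in range(n) for c in range(n))
```

The simulation cannot show the performance effects, of course; it only confirms that the swapped block indices plus swapped tile indices produce a correct transpose, including at the matrix edges.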


ALGORITHM TiledMatMulOptimized(A, B, C, N)
INPUT: A, B: N x N matrices in global memory
OUTPUT: C: result matrix
NOTE: uses shared memory tiling with bank conflict avoidance

BEGIN
  SHARED tileA[TILE_SIZE][TILE_SIZE + 1]  // +1 padding avoids bank conflicts
  SHARED tileB[TILE_SIZE][TILE_SIZE + 1]

  row <- blockIdx.y * TILE_SIZE + threadIdx.y
  col <- blockIdx.x * TILE_SIZE + threadIdx.x
  sum <- 0.0

  numTiles <- CEIL(N / TILE_SIZE)

  FOR t <- 0 TO numTiles - 1 DO
    // Load tile of A: coalesced (consecutive threads read consecutive columns)
    aCol <- t * TILE_SIZE + threadIdx.x
    IF row < N AND aCol < N THEN
      tileA[threadIdx.y][threadIdx.x] <- A[row * N + aCol]
    ELSE
      tileA[threadIdx.y][threadIdx.x] <- 0.0
    END IF

    // Load tile of B: coalesced
    bRow <- t * TILE_SIZE + threadIdx.y
    IF bRow < N AND col < N THEN
      tileB[threadIdx.y][threadIdx.x] <- B[bRow * N + col]
    ELSE
      tileB[threadIdx.y][threadIdx.x] <- 0.0
    END IF

    BARRIER_SYNC()

    // Compute partial dot product from shared memory (fast, no global traffic)
    FOR k <- 0 TO TILE_SIZE - 1 DO
      sum <- sum + tileA[threadIdx.y][k] * tileB[k][threadIdx.x]
    END FOR

    BARRIER_SYNC()
  END FOR

  IF row < N AND col < N THEN
    C[row * N + col] <- sum
  END IF
END
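The tiled matmul can be checked the same way. Again a pure-Python CPU replay, not GPU code: one accumulator per thread, a cooperative tile load per iteration of the outer loop (with out-of-range elements zeroed, as in the pseudocode), then the partial dot products from the staged tiles.

```python
# CPU replay of TiledMatMulOptimized above. Flat row-major lists,
# small TILE for the sketch; the +1 padding column is kept even though
# it has no performance meaning on the CPU.

TILE = 4

def matmul_sim(a, b, n):
    c = [0.0] * (n * n)
    nblocks = (n + TILE - 1) // TILE
    for by in range(nblocks):
        for bx in range(nblocks):
            acc = [[0.0] * TILE for _ in range(TILE)]  # one `sum` per thread
            for t in range(nblocks):
                # phase 1: cooperative load; out-of-range elements become 0
                ta = [[0.0] * (TILE + 1) for _ in range(TILE)]
                tb = [[0.0] * (TILE + 1) for _ in range(TILE)]
                for ty in range(TILE):
                    for tx in range(TILE):
                        row, col = by * TILE + ty, bx * TILE + tx
                        ac, br = t * TILE + tx, t * TILE + ty
                        ta[ty][tx] = a[row * n + ac] if row < n and ac < n else 0.0
                        tb[ty][tx] = b[br * n + col] if br < n and col < n else 0.0
                # phase 2: partial dot products from the staged tiles
                for ty in range(TILE):
                    for tx in range(TILE):
                        for k in range(TILE):
                            acc[ty][tx] += ta[ty][k] * tb[k][tx]
            for ty in range(TILE):
                for tx in range(TILE):
                    row, col = by * TILE + ty, bx * TILE + tx
                    if row < n and col < n:
                        c[row * n + col] = acc[ty][tx]
    return c

n = 6  # not a multiple of TILE, to exercise the zero-padding path
A = [float(i % 7) for i in range(n * n)]
B = [float(i % 5) for i in range(n * n)]
C = matmul_sim(A, B, n)
ref = [sum(A[r * n + k] * B[k * n + c] for k in range(n))
       for r in range(n) for c in range(n)]
assert C == ref
```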


ALGORITHM AnalyzeOccupancy(registersPerThread, blockSize, sharedMemPerBlock,
                            smMaxRegisters, smMaxSharedMem, smMaxWarps, smMaxBlocks)
INPUT: kernel resource usage and SM hardware limits
OUTPUT: occupancy percentage and the limiting resource

BEGIN
  warpsPerBlock <- CEIL(blockSize / 32)

  // Register limit
  regsPerBlock <- registersPerThread * blockSize
  IF regsPerBlock > 0 THEN
    blocksByRegs <- FLOOR(smMaxRegisters / regsPerBlock)
  ELSE
    blocksByRegs <- smMaxBlocks
  END IF

  // Shared memory limit
  IF sharedMemPerBlock > 0 THEN
    blocksBySmem <- FLOOR(smMaxSharedMem / sharedMemPerBlock)
  ELSE
    blocksBySmem <- smMaxBlocks
  END IF

  // Block count limit
  blocksByLimit <- smMaxBlocks

  // Find the tightest constraint
  actualBlocks <- MIN(blocksByRegs, blocksBySmem, blocksByLimit)
  activeWarps <- actualBlocks * warpsPerBlock
  occupancy <- activeWarps / smMaxWarps

  // Identify limiting factor
  IF actualBlocks = blocksByRegs THEN
    limiter <- "registers"
  ELSE IF actualBlocks = blocksBySmem THEN
    limiter <- "shared memory"
  ELSE
    limiter <- "block limit"
  END IF

  RETURN (MIN(occupancy, 1.0), limiter)
END
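AnalyzeOccupancy translates directly to runnable Python. Below it is applied to the register example from this section (64 registers/thread, 256-thread blocks, 65,536 registers per SM); the 96 KB shared memory, 64-warp, and 32-block SM limits are illustrative values, not a specific GPU's datasheet.

```python
# Python port of the AnalyzeOccupancy pseudocode above.

def analyze_occupancy(regs_per_thread, block_size, smem_per_block,
                      sm_regs, sm_smem, sm_max_warps, sm_max_blocks):
    warps_per_block = -(-block_size // 32)  # ceiling division
    # resource limits; a zero usage means the resource imposes no limit
    by_regs = (sm_regs // (regs_per_thread * block_size)
               if regs_per_thread else sm_max_blocks)
    by_smem = sm_smem // smem_per_block if smem_per_block else sm_max_blocks
    blocks = min(by_regs, by_smem, sm_max_blocks)  # tightest constraint
    occupancy = min(blocks * warps_per_block / sm_max_warps, 1.0)
    limiter = ("registers" if blocks == by_regs
               else "shared memory" if blocks == by_smem
               else "block limit")
    return occupancy, limiter

occ, lim = analyze_occupancy(64, 256, 0, 65536, 98304, 64, 32)
print(occ, lim)  # -> 0.5 registers  (4 blocks x 8 warps = 32 of 64 warps)
```

With these inputs the register file caps the SM at 4 blocks, reproducing the 32-active-warp figure worked out earlier in the section.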

Real-World Applications

  • Deep learning frameworks: cuBLAS and cuDNN use heavily optimized tiled matrix multiplication kernels with shared memory, achieving over 90% of peak GPU throughput for the GEMM operations that dominate neural network training
  • Image and video processing: convolution filters load image tiles into shared memory so each pixel's neighborhood is accessed from fast on-chip storage rather than global memory, critical for real-time 4K video processing
  • Molecular dynamics: particle simulations use shared memory to cache neighbor lists and force calculations within spatial tiles, reducing redundant global memory reads when computing pairwise interactions
  • Database query engines: GPU-accelerated databases like BlazingSQL and RAPIDS cuDF optimize memory access patterns for columnar data scans, joins, and aggregations to maximize memory bandwidth utilization
  • Cryptographic hashing: mining and password cracking kernels carefully manage register usage and occupancy to maximize the number of concurrent hash computations per SM
  • Computational fluid dynamics: stencil computations on 3D grids use shared memory tiling to cache the halo regions needed by each block, reducing global memory reads from 7 per cell (3D stencil) to roughly 1 per cell

Key Takeaways

  • Memory coalescing is the most important GPU optimization: consecutive threads must access consecutive memory addresses so the hardware can combine requests into single wide transactions; strided or random access can reduce effective bandwidth by up to 32x
  • Shared memory tiling loads data from slow global memory once, then reuses it many times from fast on-chip storage; for matrix multiplication, tiling reduces global memory traffic by a factor equal to the tile dimension
  • Bank conflicts in shared memory occur when multiple threads access different addresses in the same bank; padding arrays by one element (e.g., declaring shared[N][N+1] instead of shared[N][N]) is the standard fix
  • Occupancy is determined by the tightest resource constraint among registers, shared memory, and block count; higher occupancy helps hide memory latency, but maximum occupancy is not always optimal if it forces register spilling
  • The roofline model helps identify whether a kernel is compute-bound or memory-bound: kernels with low arithmetic intensity benefit most from memory optimizations, while high-intensity kernels benefit from maximizing compute throughput