
CUDA Basics: Kernels, Threads, Blocks, and Grids

How the CUDA programming model maps parallel work onto GPU hardware through a hierarchy of threads, blocks, and grids, and how kernels launch thousands of threads to execute in parallel.

2025-08-21
Tags: Computer Hardware, gpus, cuda, parallel-computing

Terminology

Kernel: A function written to run on the GPU that is executed by many threads in parallel; each thread runs the same kernel code but operates on different data identified by its unique thread index
Thread: The smallest unit of execution on the GPU; each thread has its own registers, program counter, and thread ID that distinguishes it from other threads
Thread block (block): A group of threads (up to 1024) that execute on the same SM, can share data through shared memory, and can synchronize with each other using barriers
Grid: The collection of all thread blocks launched by a single kernel invocation; blocks in a grid execute independently and may run on different SMs
blockIdx: A built-in variable that gives the index of the current block within the grid; can be 1D, 2D, or 3D
threadIdx: A built-in variable that gives the index of the current thread within its block; can be 1D, 2D, or 3D
blockDim: The dimensions (number of threads) of each block, specified at kernel launch time
Warp: A group of 32 consecutive threads within a block that execute in lockstep on the hardware; the basic scheduling unit
Barrier synchronization: A point where all threads in a block must arrive before any can proceed, ensuring shared memory writes are visible to all threads
Global thread ID: A unique index for each thread across the entire grid, typically computed as $\text{blockIdx} \times \text{blockDim} + \text{threadIdx}$

What & Why

CUDA (Compute Unified Device Architecture) is the programming model that lets you write code for NVIDIA GPUs. The core idea is simple: you write a function (a kernel) that describes what a single thread should do, then you launch that kernel with thousands or millions of threads. Each thread figures out which piece of data it should work on using its unique index.

This model works because GPUs are built to run thousands of threads simultaneously. Instead of writing explicit loops over data, you express parallelism by mapping one thread to one (or a few) data elements. The GPU hardware handles scheduling, memory access, and execution across its streaming multiprocessors.

Understanding the CUDA thread hierarchy is essential because the way you organize threads into blocks and grids directly affects performance. Block size determines how many warps run on each SM, which affects occupancy and latency hiding. Grid size determines how work is distributed across the entire GPU. Getting these dimensions right is the difference between code that saturates the GPU and code that leaves most of it idle.

How It Works

The Thread Hierarchy

CUDA organizes parallel execution into three levels:

Thread: The smallest unit. Each thread executes the kernel function independently with its own registers and thread ID.

Block: A group of threads (up to 1024) that run on the same SM. Threads within a block can cooperate through shared memory and synchronize using barriers. Blocks can be 1D, 2D, or 3D, which is convenient for mapping to 1D arrays, 2D images, or 3D volumes.

Grid: The collection of all blocks for a kernel launch. Blocks in a grid are independent and can execute in any order on any available SM. Grids can also be 1D, 2D, or 3D.

[Figure: the grid for one kernel launch, drawn as a 2x2 arrangement of blocks (0,0), (1,0), (0,1), and (1,1); each block contains a 4x2 arrangement of threads T(0,0) through T(3,1) plus its own shared memory.]
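The hierarchy in the figure can be enumerated on the CPU; a minimal Python sketch (a sequential stand-in for what the GPU launches in parallel), using the figure's 2x2 grid of blocks with 4x2 threads each:

```python
# Enumerate every (blockIdx, threadIdx) pair in a small 2D launch.
# Purely illustrative: a real GPU runs these threads concurrently.

grid_dim = (2, 2)    # blocks in x and y
block_dim = (4, 2)   # threads per block in x and y

threads = [
    ((bx, by), (tx, ty))
    for by in range(grid_dim[1]) for bx in range(grid_dim[0])
    for ty in range(block_dim[1]) for tx in range(block_dim[0])
]

total = grid_dim[0] * grid_dim[1] * block_dim[0] * block_dim[1]
print(total)        # 32 threads in the whole grid
print(threads[0])   # ((0, 0), (0, 0)): thread (0,0) of block (0,0)
```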

Kernel Launch and Thread Indexing

A kernel launch specifies two things: the grid dimensions (how many blocks) and the block dimensions (how many threads per block). Each thread computes its global index from its block index and thread index within the block.

For a 1D launch processing an array of N elements:

  • Block size: typically 256 threads
  • Grid size: $\lceil N / 256 \rceil$ blocks
  • Global thread ID: $\text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$

Each thread checks whether its global ID is within bounds (less than N) before accessing data. This handles the case where N is not a multiple of the block size.
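This index arithmetic can be sanity-checked sequentially; a minimal Python sketch of the 1D launch recipe (the value N = 1000 is just an illustrative example):

```python
def global_ids(N, block_size=256):
    """Simulate a 1D launch: ceil(N / block_size) blocks of block_size threads.
    Returns the grid size and the global IDs that pass the bounds check."""
    grid_size = (N + block_size - 1) // block_size   # integer ceiling division
    ids = []
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            gid = block_idx * block_size + thread_idx
            if gid < N:          # bounds check: extra threads do nothing
                ids.append(gid)
    return grid_size, ids

grid_size, ids = global_ids(1000)
print(grid_size)   # 4 blocks (4 * 256 = 1024 threads, 24 of them idle)
print(len(ids))    # 1000 threads do useful work
```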

For 2D data like images, you use 2D blocks (e.g., 16x16 threads) and 2D grids. Each thread computes its row and column from its 2D block and thread indices.
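The 2D case follows the same pattern in each dimension; a small Python sketch with an assumed 40x30 "image" and 16x16 blocks, checking that every pixel is covered exactly once:

```python
# Simulate 2D indexing for a W x H image with 16x16 thread blocks.
W, H = 40, 30
bdim = (16, 16)
gdim = ((W + bdim[0] - 1) // bdim[0], (H + bdim[1] - 1) // bdim[1])

covered = set()
for by in range(gdim[1]):
    for bx in range(gdim[0]):
        for ty in range(bdim[1]):
            for tx in range(bdim[0]):
                col = bx * bdim[0] + tx   # blockIdx.x * blockDim.x + threadIdx.x
                row = by * bdim[1] + ty   # blockIdx.y * blockDim.y + threadIdx.y
                if col < W and row < H:   # bounds check in both dimensions
                    covered.add((row, col))

print(gdim)           # (3, 2): 3 blocks across, 2 blocks down
print(len(covered))   # 1200 = 40 * 30, every pixel covered
```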

Execution Model

When a kernel launches, the GPU's block scheduler assigns blocks to SMs. Each SM can run multiple blocks concurrently, limited by the SM's resources (registers, shared memory, maximum thread count). Within each block, threads are grouped into warps of 32 that execute in lockstep.

Blocks are independent: they cannot communicate with each other during execution (except through global memory with no ordering guarantees). This independence is what allows the GPU to scale: a GPU with more SMs simply runs more blocks in parallel.

Within a block, threads can cooperate:

  • They share data through shared memory (fast on-chip scratchpad)
  • They synchronize using barrier operations that ensure all threads reach the same point before any proceed
  • They can perform atomic operations on shared or global memory

Memory Model for Kernels

Each thread has access to several memory spaces:

Per-thread: Registers (fastest, private to each thread) and local memory (spills to global memory if registers run out).

Per-block: Shared memory, visible to all threads in the block. Must be explicitly managed by the programmer.

Per-grid: Global memory, visible to all threads across all blocks. Persistent across kernel launches. Highest latency.

Read-only: Constant memory (cached, broadcast to all threads reading the same address) and texture memory (cached with 2D spatial locality optimization).

Choosing Block and Grid Dimensions

Block size affects performance in several ways:

  • Should be a multiple of 32 (warp size) to avoid partially filled warps
  • Larger blocks allow more threads to share data through shared memory
  • Smaller blocks may allow more blocks per SM, increasing occupancy
  • Common choices: 128, 256, or 512 threads per block

Grid size is determined by the problem size divided by the block size. The grid must be large enough to cover all data elements.

Complexity Analysis

Operation | Sequential (CPU) | Parallel (GPU)
Element-wise operation on $N$ elements | $O(N)$ | $O(N / P)$ with $P$ threads
Matrix multiply ($N \times N$) | $O(N^3)$ | $O(N^3 / P)$, tiled for cache
Kernel launch overhead | N/A | $\sim$5-50 $\mu\text{s}$ per launch
Host-device memory transfer | N/A | $O(N)$ over PCIe bus

Total GPU execution time for a kernel processing N elements with P threads:

$T_{\text{total}} = T_{\text{transfer}} + T_{\text{launch}} + \frac{N}{P} \times T_{\text{op}}$

where $T_{\text{transfer}}$ is the host-to-device and device-to-host data transfer time, $T_{\text{launch}}$ is the kernel launch overhead, and $T_{\text{op}}$ is the time per operation.

For GPU computation to be worthwhile, the parallel speedup must exceed the transfer overhead:

$\frac{N \times T_{\text{op}}}{T_{\text{transfer}} + T_{\text{launch}} + \frac{N}{P} \times T_{\text{op}}} > 1$

The number of blocks needed for a 1D grid:

$\text{gridDim} = \left\lceil \frac{N}{\text{blockDim}} \right\rceil$

Total threads launched (some may be idle if N is not a multiple of blockDim):

$\text{totalThreads} = \text{gridDim} \times \text{blockDim} \geq N$
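Plugging numbers into the timing model makes the transfer-overhead point concrete; every constant in this Python sketch (N, P, the three times) is an assumed illustrative figure, not a measurement:

```python
def gpu_time(N, P, t_op, t_transfer, t_launch):
    """Total time per the model: T_transfer + T_launch + (N / P) * T_op."""
    return t_transfer + t_launch + (N / P) * t_op

def cpu_time(N, t_op):
    return N * t_op

# Assumed figures purely for illustration
N = 10_000_000       # elements
P = 100_000          # threads doing useful work concurrently
t_op = 1e-9          # 1 ns per element operation
t_launch = 10e-6     # 10 us kernel launch overhead
t_transfer = 5e-3    # 5 ms round-trip over PCIe

speedup = cpu_time(N, t_op) / gpu_time(N, P, t_op, t_transfer, t_launch)
print(f"{speedup:.1f}x")   # ~2x overall: the 5 ms transfer dominates,
                           # even though the compute itself is vastly faster
```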

Implementation

ALGORITHM VectorAdd_Kernel(A, B, C, N)
INPUT: A, B: input arrays of length N in global memory
       C: output array of length N in global memory
       N: number of elements
CONTEXT: executed by each thread in parallel

BEGIN
  // Compute this thread's global index
  globalId <- blockIdx.x * blockDim.x + threadIdx.x

  // Bounds check: grid may have more threads than data elements
  IF globalId < N THEN
    C[globalId] <- A[globalId] + B[globalId]
  END IF
END


ALGORITHM LaunchVectorAdd(A_host, B_host, N)
INPUT: A_host, B_host: arrays on the CPU, N: number of elements
OUTPUT: C_host: result array on the CPU

BEGIN
  // Step 1: Allocate GPU memory
  A_device <- GPU_ALLOCATE(N * sizeof(float))
  B_device <- GPU_ALLOCATE(N * sizeof(float))
  C_device <- GPU_ALLOCATE(N * sizeof(float))

  // Step 2: Copy input data from CPU to GPU
  COPY_HOST_TO_DEVICE(A_device, A_host, N * sizeof(float))
  COPY_HOST_TO_DEVICE(B_device, B_host, N * sizeof(float))

  // Step 3: Configure launch dimensions
  blockSize <- 256
  gridSize <- CEIL(N / blockSize)

  // Step 4: Launch kernel
  LAUNCH_KERNEL(VectorAdd_Kernel, gridSize, blockSize, A_device, B_device, C_device, N)

  // Step 5: Copy result back to CPU
  C_host <- ALLOCATE(N * sizeof(float))
  COPY_DEVICE_TO_HOST(C_host, C_device, N * sizeof(float))

  // Step 6: Free GPU memory
  GPU_FREE(A_device)
  GPU_FREE(B_device)
  GPU_FREE(C_device)

  RETURN C_host
END
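The two algorithms above can be exercised end to end with a sequential Python stand-in (a real implementation would use CUDA C++ or a GPU library; the nested loops here replace the parallel launch):

```python
def vector_add_kernel(A, B, C, N, block_idx, thread_idx, block_dim):
    """One simulated thread of VectorAdd_Kernel."""
    gid = block_idx * block_dim + thread_idx
    if gid < N:                  # bounds check from the pseudocode
        C[gid] = A[gid] + B[gid]

def launch_vector_add(A_host, B_host):
    """Sequential stand-in for LaunchVectorAdd (no device memory to manage)."""
    N = len(A_host)
    C = [0.0] * N
    block_size = 256
    grid_size = (N + block_size - 1) // block_size
    # The GPU would run all (block, thread) pairs in parallel; we loop instead.
    for b in range(grid_size):
        for t in range(block_size):
            vector_add_kernel(A_host, B_host, C, N, b, t, block_size)
    return C

C = launch_vector_add([1.0] * 1000, [2.0] * 1000)
print(C[0], C[-1])   # 3.0 3.0
```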


ALGORITHM MatrixMultiply_Kernel(A, B, C, N)
INPUT: A, B: N x N matrices in global memory
       C: output N x N matrix in global memory
       N: matrix dimension
CONTEXT: each thread computes one element of C

BEGIN
  row <- blockIdx.y * blockDim.y + threadIdx.y
  col <- blockIdx.x * blockDim.x + threadIdx.x

  IF row < N AND col < N THEN
    sum <- 0.0
    FOR k <- 0 TO N - 1 DO
      sum <- sum + A[row * N + k] * B[k * N + col]
    END FOR
    C[row * N + col] <- sum
  END IF
END
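A sequential Python sketch of this kernel, with a tiny assumed 3x3 example so the arithmetic is easy to check (2x2 blocks, so some simulated threads fail the bounds check and stay idle):

```python
def matmul_kernel(A, B, C, N, row, col):
    """One simulated thread: computes one element of C = A x B."""
    if row < N and col < N:
        s = 0.0
        for k in range(N):
            s += A[row * N + k] * B[k * N + col]
        C[row * N + col] = s

N = 3
A = [1.0] * (N * N)                    # all-ones matrix
B = [float(i) for i in range(N * N)]   # entries 0..8, row-major
C = [0.0] * (N * N)

bdim = 2                               # 2x2 blocks
gdim = (N + bdim - 1) // bdim          # 2x2 grid covers 3x3 with idle threads
for by in range(gdim):
    for bx in range(gdim):
        for ty in range(bdim):
            for tx in range(bdim):
                matmul_kernel(A, B, C, N, by * bdim + ty, bx * bdim + tx)

# Each row of (ones x B) is the column sums of B: [9, 12, 15]
print(C[:3])   # [9.0, 12.0, 15.0]
```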


ALGORITHM TiledMatrixMultiply_Kernel(A, B, C, N)
INPUT: A, B: N x N matrices, C: output matrix, N: dimension
CONTEXT: each thread computes one element using shared memory tiles
NOTE: TILE_SIZE matches blockDim (e.g., 16x16)

BEGIN
  row <- blockIdx.y * TILE_SIZE + threadIdx.y
  col <- blockIdx.x * TILE_SIZE + threadIdx.x

  // Declare shared memory tiles
  SHARED tileA[TILE_SIZE][TILE_SIZE]
  SHARED tileB[TILE_SIZE][TILE_SIZE]

  sum <- 0.0
  numTiles <- CEIL(N / TILE_SIZE)

  FOR t <- 0 TO numTiles - 1 DO
    // Collaboratively load one tile of A and one tile of B into shared memory
    aCol <- t * TILE_SIZE + threadIdx.x
    bRow <- t * TILE_SIZE + threadIdx.y

    IF row < N AND aCol < N THEN
      tileA[threadIdx.y][threadIdx.x] <- A[row * N + aCol]
    ELSE
      tileA[threadIdx.y][threadIdx.x] <- 0.0
    END IF

    IF bRow < N AND col < N THEN
      tileB[threadIdx.y][threadIdx.x] <- B[bRow * N + col]
    ELSE
      tileB[threadIdx.y][threadIdx.x] <- 0.0
    END IF

    // Wait for all threads to finish loading
    BARRIER_SYNC()

    // Compute partial dot product from this tile
    FOR k <- 0 TO TILE_SIZE - 1 DO
      sum <- sum + tileA[threadIdx.y][k] * tileB[k][threadIdx.x]
    END FOR

    // Wait before loading next tile
    BARRIER_SYNC()
  END FOR

  IF row < N AND col < N THEN
    C[row * N + col] <- sum
  END IF
END
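The tiled kernel can also be simulated sequentially; in this Python sketch the two BARRIER_SYNC points become two separate loops over the block's threads (everyone loads, then everyone computes), and the tile size of 2 is an illustrative choice:

```python
TILE = 2  # assumed tile size; matches blockDim in the pseudocode

def tiled_matmul(A, B, N):
    """Sequential stand-in for TiledMatrixMultiply_Kernel over the whole grid."""
    C = [0.0] * (N * N)
    num_tiles = (N + TILE - 1) // TILE
    for by in range(num_tiles):            # grid of blocks
        for bx in range(num_tiles):
            sums = [[0.0] * TILE for _ in range(TILE)]  # per-thread accumulators
            for t in range(num_tiles):
                tileA = [[0.0] * TILE for _ in range(TILE)]
                tileB = [[0.0] * TILE for _ in range(TILE)]
                # Phase 1: each thread loads one element of each tile (zero-padded)
                for ty in range(TILE):
                    for tx in range(TILE):
                        row, col = by * TILE + ty, bx * TILE + tx
                        a_col, b_row = t * TILE + tx, t * TILE + ty
                        tileA[ty][tx] = A[row * N + a_col] if row < N and a_col < N else 0.0
                        tileB[ty][tx] = B[b_row * N + col] if b_row < N and col < N else 0.0
                # (BARRIER_SYNC) Phase 2: partial dot products from this tile
                for ty in range(TILE):
                    for tx in range(TILE):
                        for k in range(TILE):
                            sums[ty][tx] += tileA[ty][k] * tileB[k][tx]
                # (BARRIER_SYNC before the next tile is loaded)
            for ty in range(TILE):
                for tx in range(TILE):
                    row, col = by * TILE + ty, bx * TILE + tx
                    if row < N and col < N:
                        C[row * N + col] = sums[ty][tx]
    return C

N = 3
A = [float(i) for i in range(N * N)]
I = [1.0 if r == c else 0.0 for r in range(N) for c in range(N)]
print(tiled_matmul(A, I, N) == A)   # True: A x identity = A
```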

Real-World Applications

  • Image processing: filters like blur, sharpen, and edge detection apply the same convolution kernel to every pixel; each GPU thread processes one pixel, making a 4K image (8 million pixels) trivially parallel
  • Deep learning training: forward and backward passes through neural networks consist of matrix multiplications and element-wise activations; frameworks like PyTorch and TensorFlow launch CUDA kernels for each layer
  • Physics simulation: particle systems, fluid dynamics, and cloth simulation update millions of particles each frame; each thread computes forces and positions for one particle
  • Financial modeling: Monte Carlo simulations for option pricing run millions of independent random paths; each thread simulates one path, and results are aggregated at the end
  • Genomics: sequence alignment algorithms like Smith-Waterman compare DNA sequences against large databases; GPU implementations process thousands of alignments in parallel
  • Signal processing: FFT (Fast Fourier Transform) implementations on GPUs process audio, radar, and communications signals with throughput orders of magnitude higher than CPU implementations

Key Takeaways

  • CUDA organizes parallel work into a three-level hierarchy: threads grouped into blocks, blocks grouped into a grid; each thread computes its unique global index from blockIdx, blockDim, and threadIdx
  • A kernel is a function that runs on the GPU, executed by every thread in the grid; each thread typically processes one data element identified by its global thread ID
  • Thread blocks run on a single SM and can cooperate through shared memory and barrier synchronization; blocks within a grid are independent and cannot synchronize with each other
  • Block size should be a multiple of 32 (warp size), commonly 128 or 256; grid size is determined by the problem size divided by the block size
  • GPU computation involves data transfer overhead (host to device and back), so the parallel work must be large enough to amortize this cost; small problems may run faster on the CPU