
Parallel Computing Patterns: Map, Reduce, Scan, and Gather/Scatter

The fundamental parallel algorithmic patterns that form the building blocks of GPU programming, from embarrassingly parallel map operations to the surprisingly tricky parallel prefix sum.

2025-08-22
Computer Hardware · gpus · parallel-patterns · parallel-computing

Terminology

| Term | Definition |
| --- | --- |
| Map | A parallel pattern where the same function is applied independently to every element of an input collection, producing an output collection of the same size |
| Reduce (fold) | A pattern that combines all elements of a collection into a single value using an associative binary operator (e.g., sum, max, min) |
| Scan (prefix sum) | A pattern that computes a running accumulation of elements using an associative operator; element $i$ of the output is the reduction of elements $0$ through $i$ (inclusive scan) or $0$ through $i-1$ (exclusive scan) |
| Gather | A pattern where each output element reads from an arbitrary position in the input, determined by an index array; output[i] = input[indices[i]] |
| Scatter | A pattern where each input element is written to an arbitrary position in the output, determined by an index array; output[indices[i]] = input[i] |
| Associative operator | A binary operator $\oplus$ where $(a \oplus b) \oplus c = a \oplus (b \oplus c)$; associativity allows elements to be combined in any grouping order, which is essential for parallel reduction and scan |
| Work efficiency | A parallel algorithm is work-efficient if it performs the same total number of operations as the best sequential algorithm; $O(N)$ work for reduce and scan |
| Span (depth) | The longest chain of sequential dependencies in a parallel algorithm, determining the minimum possible execution time with unlimited processors; $O(\log N)$ for reduce and scan |
| Coalesced access | A memory access pattern where consecutive threads read or write consecutive memory addresses, allowing the hardware to combine them into a single wide memory transaction |
| Write conflict | When multiple threads attempt to write to the same memory location simultaneously; scatter operations must handle conflicts using atomic operations or by ensuring unique write targets |

What & Why

Almost every parallel GPU algorithm can be decomposed into a small set of fundamental patterns. These patterns are the parallel equivalents of the loops and accumulations you write in sequential code. Understanding them is essential because they determine how efficiently your algorithm maps onto GPU hardware.

Map is the simplest: apply a function to every element independently. No communication between threads is needed, making it embarrassingly parallel and perfectly suited to GPUs.

Reduce combines all elements into one value (a sum, a maximum, a count). Sequentially this is a simple loop, but in parallel it requires careful coordination because threads must combine partial results across a tree of operations.

Scan (prefix sum) is the most powerful and surprising of the group. It computes a running total across an array in $O(\log N)$ parallel steps. Scan is the key building block for parallel sorting, stream compaction, histogram computation, and many other algorithms that seem inherently sequential at first glance.

Gather and scatter handle irregular data access patterns where threads read from or write to non-contiguous locations. These patterns arise in sparse matrix operations, graph algorithms, and any computation with indirect addressing.

How It Works

Map: Embarrassingly Parallel

Map applies a function f to every element independently: thread i reads input[i], computes f(input[i]), and writes the result to output[i]. There are no dependencies between threads.

[Diagram: each input element A[i] passes through f(x) to produce the corresponding output element B[i].]

Map is the most common GPU pattern. Examples include: applying a color filter to every pixel, computing the activation function for every neuron in a layer, or scaling every element of a vector.
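As a sketch, the map pattern is just an independent per-element transform; here it is simulated in Python with a thread pool standing in for GPU threads (names are illustrative, not from any GPU API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(f, xs):
    # Each "thread" handles one element independently; because there are
    # no cross-element dependencies, no synchronization is needed.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(f, xs))

squared = parallel_map(lambda x: x * x, [1, 2, 3, 4])  # [1, 4, 9, 16]
```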

Reduce: Tree-Based Combination

Parallel reduction combines N elements into one using an associative operator. The key insight is that associativity lets us combine elements in any order. Instead of a sequential left-to-right fold, we build a binary tree of partial results.

In each step, half the active threads combine pairs of values. After $\lceil \log_2 N \rceil$ steps, the final result is in position 0.

On a GPU, reduction is typically done in two phases:

  1. Each block reduces its portion of the data to a single value using shared memory
  2. A second kernel reduces the per-block results to the final answer
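The tree-based combination and the two-phase structure can be sketched in Python (a serial simulation of the parallel steps; `op` must be associative):

```python
def tree_reduce(values, op):
    """Simulate parallel tree reduction: each step combines pairs of
    values, halving the count, so the tree has ceil(log2(N)) levels."""
    vals = list(values)
    while len(vals) > 1:
        # One parallel step: every pair is combined "simultaneously".
        vals = [op(vals[i], vals[i + 1]) if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

total = tree_reduce([3, 1, 7, 0, 4, 1, 6, 3], lambda a, b: a + b)  # 25

# Two-phase GPU-style reduction: per-block partials, then a final reduce.
blocks = [[3, 1, 7, 0], [4, 1, 6, 3]]
partials = [tree_reduce(b, lambda a, b: a + b) for b in blocks]  # [11, 14]
final = tree_reduce(partials, lambda a, b: a + b)                # 25
```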

Scan (Prefix Sum): The Parallel Workhorse

Scan computes a running accumulation. Given input $[a_0, a_1, a_2, \ldots, a_{N-1}]$ and an associative operator $\oplus$:

Exclusive scan: $\text{output}[i] = a_0 \oplus a_1 \oplus \ldots \oplus a_{i-1}$ ($\text{output}[0]$ = identity element)

Inclusive scan: $\text{output}[i] = a_0 \oplus a_1 \oplus \ldots \oplus a_i$

For addition with input [3, 1, 7, 0, 4, 1, 6, 3]:

  • Exclusive scan: [0, 3, 4, 11, 11, 15, 16, 22]
  • Inclusive scan: [3, 4, 11, 11, 15, 16, 22, 25]
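A sequential reference implementation reproduces the example above and defines the contract any parallel scan must match:

```python
def inclusive_scan(xs, op, identity):
    # Running accumulation: element i is the reduction of xs[0..i].
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(xs, op, identity):
    # Exclusive scan is the inclusive scan shifted right by one,
    # with the identity element in position 0.
    inc = inclusive_scan(xs, op, identity)
    return [identity] + inc[:-1]

data = [3, 1, 7, 0, 4, 1, 6, 3]
exc = exclusive_scan(data, lambda a, b: a + b, 0)  # [0, 3, 4, 11, 11, 15, 16, 22]
inc = inclusive_scan(data, lambda a, b: a + b, 0)  # [3, 4, 11, 11, 15, 16, 22, 25]
```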

The Blelloch algorithm computes an exclusive scan in two sweeps:

Up-sweep (reduce phase): Build partial sums bottom-up in a binary tree. At each level $d$, thread $i$ computes: $x[2^{d+1} \cdot i + 2^{d+1} - 1] \mathrel{+}= x[2^{d+1} \cdot i + 2^d - 1]$

Down-sweep: Propagate partial sums top-down. Set the root to the identity element, then at each level, each node passes its value to its left child and the sum of its value and its left child to its right child.
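The two sweeps can be traced in Python for the addition case (a serial simulation of the parallel steps; assumes the length is a power of two):

```python
def blelloch_exclusive_scan(a):
    """Blelloch exclusive scan for addition; len(a) must be a power of 2."""
    x = list(a)
    n = len(x)
    # Up-sweep: build partial sums bottom-up in a binary tree.
    step = 1
    while step < n:
        for i in range(step * 2 - 1, n, step * 2):
            x[i] += x[i - step]
        step *= 2
    # Down-sweep: set the root to the identity, then push values down:
    # left child gets the parent's value; right child gets parent + old left.
    x[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(step * 2 - 1, n, step * 2):
            left = x[i - step]
            x[i - step] = x[i]
            x[i] = x[i] + left
        step //= 2
    return x

result = blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
# [0, 3, 4, 11, 11, 15, 16, 22] — matches the exclusive-scan example above
```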

Gather and Scatter

Gather reads from arbitrary input locations: each thread reads from a different, potentially non-contiguous position. This is common in sparse matrix operations and lookup tables. Gather is safe (no write conflicts) but may have poor memory coalescing if the access pattern is irregular.

Scatter writes to arbitrary output locations: each thread writes its result to a position determined by an index array. Scatter can cause write conflicts when multiple threads target the same location. Solutions include atomic operations (correct but slow) or designing the algorithm to guarantee unique write targets.
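A minimal sketch of both patterns in Python (indices assumed in range; the scatter assumes unique targets, matching the conflict-free case described above):

```python
def gather(src, indices):
    # output[i] = input[indices[i]] — arbitrary reads, no write conflicts.
    return [src[j] for j in indices]

def scatter(src, indices, out_len):
    # output[indices[i]] = input[i] — assumes all indices are distinct,
    # so no two writes target the same location.
    out = [0] * out_len
    for i, j in enumerate(indices):
        out[j] = src[i]
    return out

g = gather([10, 20, 30, 40], [3, 0, 2])   # [40, 10, 30]
s = scatter([10, 20, 30], [2, 0, 1], 3)   # [20, 30, 10]
```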

Complexity Analysis

| Pattern | Sequential | Parallel Work | Parallel Span |
| --- | --- | --- | --- |
| Map | $O(N)$ | $O(N)$ | $O(1)$ |
| Reduce | $O(N)$ | $O(N)$ | $O(\log N)$ |
| Scan (Blelloch) | $O(N)$ | $O(N)$ | $O(\log N)$ |
| Gather | $O(N)$ | $O(N)$ | $O(1)$ |
| Scatter | $O(N)$ | $O(N)$ | $O(1)$ |

Parallel reduce with $N$ elements and $P$ processors:

$T_{\text{reduce}} = \frac{N}{P} + O(\log P)$

The first term is the local reduction each processor performs on its $N/P$ elements. The second term is the tree-based combination of partial results.

Blelloch scan performs $2(N - 1)$ operations in total (work-efficient), with a span of $2 \log_2 N$ steps:

$W_{\text{scan}} = 2(N - 1) = O(N)$
$D_{\text{scan}} = 2 \log_2 N = O(\log N)$

Speedup for a parallel pattern with work $W$ and span $D$ using $P$ processors (Brent's theorem):

$T_P \leq \frac{W}{P} + D$

For reduce: $T_P \leq \frac{N}{P} + \log_2 N$. With $P = N$ processors, $T_P = O(\log N)$.
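Plugging in concrete numbers makes the bound tangible (a quick check, assuming $N = 2^{20}$ elements and $P = 1024$ processors; the values are illustrative):

```python
import math

N, P = 2**20, 1024
# Brent's bound for reduce: T_P <= W/P + D = N/P + log2(N)
bound = N / P + math.log2(N)  # 1024 + 20 = 1044 steps
```

The $N/P$ local-reduction term dominates: with far fewer processors than elements, throughput, not tree depth, sets the runtime.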

Implementation

ALGORITHM ParallelMap(input, f, N)
INPUT: input: array of N elements, f: function to apply
OUTPUT: output: array of N transformed elements
CONTEXT: each thread executes this in parallel

BEGIN
  id <- blockIdx.x * blockDim.x + threadIdx.x
  IF id < N THEN
    output[id] <- f(input[id])
  END IF
END


ALGORITHM ParallelReduce(input, N, op)
INPUT: input: array of N elements, op: associative binary operator
OUTPUT: single reduced value
NOTE: executed within one thread block using shared memory

BEGIN
  tid <- threadIdx.x
  SHARED sdata[BLOCK_SIZE]

  // Each thread loads one element into shared memory
  globalId <- blockIdx.x * blockDim.x + threadIdx.x
  IF globalId < N THEN
    sdata[tid] <- input[globalId]
  ELSE
    sdata[tid] <- IDENTITY(op)  // identity element for the operator
  END IF
  BARRIER_SYNC()

  // Tree-based reduction in shared memory
  stride <- BLOCK_SIZE / 2
  WHILE stride > 0 DO
    IF tid < stride THEN
      sdata[tid] <- op(sdata[tid], sdata[tid + stride])
    END IF
    BARRIER_SYNC()
    stride <- stride / 2
  END WHILE

  // Thread 0 writes the block's result
  IF tid = 0 THEN
    blockResults[blockIdx.x] <- sdata[0]
  END IF
END


ALGORITHM BlellochScan(input, N)
INPUT: input: array of N elements (assume N is a power of 2)
OUTPUT: output: exclusive prefix sum array
NOTE: executed within one thread block using shared memory

BEGIN
  tid <- threadIdx.x
  SHARED temp[N]

  // Load input into shared memory
  temp[2 * tid] <- input[2 * tid]
  temp[2 * tid + 1] <- input[2 * tid + 1]

  // === UP-SWEEP (reduce phase) ===
  offset <- 1
  FOR d <- log2(N) - 1 DOWNTO 0 DO
    BARRIER_SYNC()
    IF tid < N / (2 * offset) THEN
      ai <- offset * (2 * tid + 1) - 1
      bi <- offset * (2 * tid + 2) - 1
      temp[bi] <- temp[ai] + temp[bi]
    END IF
    offset <- offset * 2
  END FOR

  // Set root to identity (0 for addition)
  IF tid = 0 THEN
    temp[N - 1] <- 0
  END IF

  // === DOWN-SWEEP ===
  FOR d <- 0 TO log2(N) - 1 DO
    offset <- offset / 2
    BARRIER_SYNC()
    IF tid < N / (2 * offset) THEN
      ai <- offset * (2 * tid + 1) - 1
      bi <- offset * (2 * tid + 2) - 1
      t <- temp[ai]
      temp[ai] <- temp[bi]
      temp[bi] <- temp[bi] + t
    END IF
  END FOR

  BARRIER_SYNC()

  // Write results
  output[2 * tid] <- temp[2 * tid]
  output[2 * tid + 1] <- temp[2 * tid + 1]
END


ALGORITHM ParallelGather(input, indices, output, N)
INPUT: input: source array, indices: index array of length N
OUTPUT: output: gathered array of length N

BEGIN
  id <- blockIdx.x * blockDim.x + threadIdx.x
  IF id < N THEN
    output[id] <- input[indices[id]]
  END IF
END


ALGORITHM ParallelScatter(input, indices, output, N)
INPUT: input: source array of length N, indices: target index array
OUTPUT: output: scattered array
NOTE: assumes no two indices are the same (no write conflicts)

BEGIN
  id <- blockIdx.x * blockDim.x + threadIdx.x
  IF id < N THEN
    output[indices[id]] <- input[id]
  END IF
END

Real-World Applications

  • Stream compaction: given an array and a predicate, produce a dense array of only the elements that satisfy the predicate; implemented as a map (evaluate predicate), scan (compute output positions), and scatter (write to compacted positions)
  • Parallel sorting (radix sort): radix sort on GPUs uses scan to compute the destination index for each element at each digit pass, achieving $O(N \cdot k / P)$ time for $k$-bit keys with $P$ processors
  • Histogram computation: counting occurrences of each value uses scatter with atomic additions; optimized versions use shared memory histograms per block, then reduce across blocks
  • Sparse matrix-vector multiply: the CSR (Compressed Sparse Row) format uses gather to fetch vector elements at column indices, then reduce within each row to produce the output vector
  • Neural network layers: forward passes use map for activation functions, reduce for loss computation, and scan for cumulative operations like running batch statistics
  • Image processing pipelines: tone mapping uses map, computing an image histogram uses reduce, and building a cumulative distribution function for histogram equalization uses scan
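The map-scan-scatter pipeline behind stream compaction (the first bullet above) can be composed directly; this sketch uses sequential stand-ins for each pattern, with illustrative names:

```python
def stream_compact(xs, pred):
    # Map: evaluate the predicate for every element (1 = keep, 0 = drop).
    flags = [1 if pred(x) else 0 for x in xs]
    # Exclusive scan of the flags gives each kept element's output position.
    positions, acc = [], 0
    for f in flags:
        positions.append(acc)
        acc += f
    # Scatter: each kept element writes itself to its computed position.
    # Positions are unique by construction, so there are no write conflicts.
    out = [None] * acc
    for x, f, p in zip(xs, flags, positions):
        if f:
            out[p] = x
    return out

kept = stream_compact([3, 1, 7, 0, 4, 1, 6, 3], lambda x: x > 2)
# [3, 7, 4, 6, 3]
```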

Key Takeaways

  • Map, reduce, scan, gather, and scatter are the fundamental building blocks of parallel GPU algorithms; most complex parallel algorithms decompose into combinations of these patterns
  • Map is embarrassingly parallel with $O(1)$ span, making it the ideal GPU workload; reduce and scan have $O(\log N)$ span, requiring tree-based coordination between threads
  • Scan (prefix sum) is the most versatile pattern: it converts seemingly sequential problems (like computing output positions for stream compaction) into parallel ones
  • The Blelloch scan algorithm is work-efficient ($O(N)$ operations) with $O(\log N)$ span, using an up-sweep to build partial sums and a down-sweep to distribute them
  • Gather is safe but may have poor memory coalescing; scatter risks write conflicts and often requires atomic operations or careful index design to avoid them