
Cluster Architecture: Nodes, Interconnects, and the Anatomy of a Supercomputing Cluster

Understanding how compute clusters are built from nodes, interconnects, and memory architectures, from Beowulf clusters to modern blade servers.

2025-07-20
HPC & Clusters · cluster-architecture · interconnects

Terminology

Node: An independent computer within a cluster, containing its own CPU(s), memory, and often local storage
Interconnect: The networking fabric that links nodes together, determining communication bandwidth and latency between them
Shared memory: A memory architecture where all processors can access a single global address space directly
Distributed memory: A memory architecture where each node has its own private memory, and data must be explicitly sent between nodes via messages
InfiniBand: A high-bandwidth, low-latency interconnect technology commonly used in HPC clusters, offering speeds from 100 Gbps to 400+ Gbps
Beowulf cluster: A cluster built from commodity off-the-shelf hardware running open-source software, pioneered in 1994 at NASA
Blade server: A compact, modular server unit that slides into a shared chassis providing power, cooling, and networking
RDMA: Remote Direct Memory Access; a technique that allows one node to read or write another node's memory without involving the remote CPU or operating system
Bisection bandwidth: The minimum total bandwidth available when the cluster is divided into two equal halves, measuring worst-case communication capacity
NUMA: Non-Uniform Memory Access; a shared-memory architecture where memory access time depends on the memory location relative to the processor

What & Why

A compute cluster is a collection of independent computers (nodes) connected by a high-speed network (interconnect) that work together to solve problems too large for any single machine. Clusters are the backbone of high-performance computing (HPC), powering everything from weather forecasting to molecular dynamics simulations.

The fundamental challenge in cluster design is communication. A single CPU core can access its local cache in under a nanosecond, but sending data to another node across a network takes microseconds at best. This gap of three to four orders of magnitude means that cluster architecture is really about minimizing the cost of moving data between nodes while maximizing the compute power available at each node.

Understanding cluster architecture matters because the hardware topology directly shapes how software must be written. A program that runs efficiently on a shared-memory machine may perform terribly on a distributed-memory cluster if it assumes fast access to all data. The choice of interconnect, memory model, and node configuration determines what algorithms are practical and what performance is achievable.

How It Works

Node Anatomy

Each node in a cluster is a complete computer. A typical HPC node contains one or two multi-core CPUs (often 64 to 128 cores total), 256 GB to 2 TB of RAM, a network interface card (NIC) for the cluster interconnect, and optionally local NVMe storage or GPUs. The node runs its own operating system (usually Linux) and can function independently.

Nodes are categorized by their role. Compute nodes run the actual workloads. Login nodes provide user access for submitting jobs. Management nodes run the job scheduler and monitoring services. Storage nodes serve the parallel file system.

Shared vs. Distributed Memory

[Diagram: a shared-memory system, where CPUs access a single global memory over a shared memory bus, versus a distributed-memory system, where each CPU has its own private memory and communicates over a network interconnect. Latency comparison: local memory ~100 ns; network message ~1-5 us (10-50x slower).]

In a shared-memory system, all processors access a single global memory space. Any CPU can read or write any memory address directly. This makes programming simpler (threads share data naturally) but limits scalability because the memory bus becomes a bottleneck as more processors compete for access. Modern multi-socket servers use NUMA, where each CPU has "local" memory that is fast to access and "remote" memory on other sockets that is slower.

In a distributed-memory system, each node has its own private memory. Node 0 cannot directly read memory on Node 1. To share data, nodes must explicitly send messages over the interconnect. This scales to thousands of nodes (no shared bus bottleneck) but requires the programmer to manage all data movement explicitly using libraries like MPI.

Most modern HPC clusters use a hybrid approach: shared memory within each node (multi-core CPUs sharing RAM) and distributed memory between nodes (message passing over the interconnect).

Interconnect Technologies

The interconnect is the nervous system of a cluster. Its bandwidth and latency determine how fast nodes can exchange data, which directly limits the scalability of parallel applications.

InfiniBand is the dominant HPC interconnect. HDR InfiniBand delivers 200 Gbps per port with latencies under 1 microsecond. NDR (next generation) pushes to 400 Gbps. InfiniBand supports RDMA, allowing one node to read or write another node's memory without involving the remote CPU, dramatically reducing latency for small messages.

Ethernet (25/100/400 GbE) is cheaper and ubiquitous but has higher latency (5-10 microseconds typical) and lacks native RDMA support (though RoCE, RDMA over Converged Ethernet, bridges this gap). Ethernet clusters are common for loosely coupled workloads like web serving and big data analytics.

Proprietary interconnects like Cray's Slingshot or NVIDIA's NVLink/NVSwitch offer specialized performance for specific architectures, particularly for GPU-to-GPU communication.

Beowulf Clusters

The Beowulf model, pioneered at NASA in 1994 by Thomas Sterling and Donald Becker, proved that commodity PCs connected by standard Ethernet could deliver supercomputer-class performance at a fraction of the cost. A Beowulf cluster uses off-the-shelf hardware, runs Linux, and relies on open-source software (MPI, job schedulers). This democratized HPC: any university lab could build a cluster from desktop PCs.

Blade Servers

Blade servers pack compute density by sharing infrastructure. Multiple thin server "blades" slide into a chassis that provides shared power supplies, cooling fans, and network switches. This reduces cabling, saves rack space, and simplifies management. A single 10U chassis might hold 16 blade servers, each with two CPUs and 512 GB of RAM.

Complexity Analysis

Communication cost dominates cluster performance. The key metrics are bandwidth (bytes per second) and latency (time for the first byte to arrive).

For a message of size $m$ bytes sent over a link with bandwidth $B$ bytes/second and latency $L$ seconds:

$T_{\text{msg}} = L + \frac{m}{B}$

For small messages, latency dominates. For large messages, bandwidth dominates. The crossover point is:

$m^* = L \times B$

Below $m^*$, optimizing for latency matters more. Above $m^*$, maximizing bandwidth matters more.

Interconnect Bandwidth Latency Crossover $m^*$
InfiniBand HDR 25 GB/s ~0.6 us ~15 KB
100 GbE 12.5 GB/s ~5 us ~62 KB
1 GbE (Beowulf) 125 MB/s ~50 us ~6 KB
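The crossover sizes in the table follow directly from $m^* = L \times B$; a quick Python check reproduces them (latency and bandwidth figures are the approximate values from the table):

```python
# Compute the latency/bandwidth crossover message size m* = L * B.
# Below m*, latency dominates transfer time; above it, bandwidth dominates.

def crossover_bytes(latency_s, bandwidth_bps):
    """Message size (bytes) at which latency cost equals transmission cost."""
    return latency_s * bandwidth_bps

interconnects = {
    "InfiniBand HDR": (0.6e-6, 25e9),   # ~0.6 us, 25 GB/s
    "100 GbE":        (5e-6, 12.5e9),   # ~5 us, 12.5 GB/s
    "1 GbE":          (50e-6, 125e6),   # ~50 us, 125 MB/s
}

for name, (lat, bw) in interconnects.items():
    print(f"{name}: m* = {crossover_bytes(lat, bw) / 1e3:.1f} KB")
```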

For a parallel application running on $p$ nodes, the speedup is bounded by Amdahl's Law. If fraction $f$ of the work is parallelizable:

$S(p) = \frac{1}{(1 - f) + \frac{f}{p}}$

Communication overhead further reduces this. If each parallel step requires an all-to-all exchange of $m$ bytes:

$T_{\text{parallel}} = \frac{W \cdot f}{p} + W(1 - f) + p \cdot \left(L + \frac{m}{B}\right)$

where $W$ is total work. The $p \cdot (L + m/B)$ term shows why communication cost grows with node count, eventually overwhelming the benefit of adding more nodes.
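To see that trade-off concretely, here is a small Python model of the formula above. The parameter values ($W$, $f$, and the per-step message size) are made up for illustration:

```python
# Model of T_parallel = W*f/p + W*(1-f) + p*(L + m/B) from the text.
# Adding nodes shrinks the compute term but grows the communication term,
# so speedup peaks at a finite node count and then declines.

def speedup(p, W=1000.0, f=0.95, L=1e-6, B=25e9, m=1e9):
    """Speedup on p nodes; W = serial runtime (s), m = bytes per exchange."""
    t_parallel = W * f / p + W * (1 - f) + p * (L + m / B)
    return W / t_parallel

for p in (16, 64, 256, 1024):
    print(f"p={p:5d}  speedup={speedup(p):6.2f}")
```

With these illustrative numbers the speedup rises from 16 to 256 nodes but is lower again at 1024, where the $p \cdot (L + m/B)$ term dominates.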

Implementation

ALGORITHM EstimateMessageTime(messageSize, bandwidth, latency)
INPUT: messageSize: bytes to send, bandwidth: bytes/sec, latency: seconds
OUTPUT: estimated transfer time in seconds
BEGIN
  RETURN latency + (messageSize / bandwidth)
END

ALGORITHM SelectInterconnect(workloadType, messagePattern, budget)
INPUT: workloadType: "tightly-coupled" or "loosely-coupled",
       messagePattern: "small-frequent" or "large-bulk",
       budget: "high" or "low"
OUTPUT: recommended interconnect type
BEGIN
  IF workloadType = "loosely-coupled" AND budget = "low" THEN
    RETURN "Ethernet (1-25 GbE)"
  END IF

  IF messagePattern = "small-frequent" THEN
    // Latency-sensitive: need RDMA
    RETURN "InfiniBand (HDR/NDR)"
  END IF

  IF workloadType = "tightly-coupled" AND budget = "high" THEN
    RETURN "InfiniBand (HDR/NDR)"
  ELSE
    RETURN "100 GbE with RoCE"
  END IF
END

ALGORITHM ComputeSpeedupWithOverhead(totalWork, parallelFraction, numNodes,
                                      latency, bandwidth, messageSize)
INPUT: totalWork: total computation time on 1 node,
       parallelFraction: fraction f that is parallelizable,
       numNodes: number of compute nodes p,
       latency: network latency per message,
       bandwidth: network bandwidth,
       messageSize: bytes exchanged per synchronization step
OUTPUT: estimated speedup factor
BEGIN
  serialTime <- totalWork * (1 - parallelFraction)
  parallelTime <- (totalWork * parallelFraction) / numNodes
  commTime <- numNodes * (latency + messageSize / bandwidth)

  totalParallelTime <- serialTime + parallelTime + commTime

  speedup <- totalWork / totalParallelTime
  RETURN speedup
END

ALGORITHM DesignCluster(targetFlops, flopsPerNode, memPerNode,
                         appMemoryNeed, interconnectType)
INPUT: targetFlops: required peak performance,
       flopsPerNode: FLOPS per compute node,
       memPerNode: RAM per node,
       appMemoryNeed: total memory required by application,
       interconnectType: "infiniband" or "ethernet"
OUTPUT: cluster specification
BEGIN
  nodesForCompute <- CEIL(targetFlops / flopsPerNode)
  nodesForMemory <- CEIL(appMemoryNeed / memPerNode)
  totalNodes <- MAX(nodesForCompute, nodesForMemory)

  // Add 5% overhead for management and login nodes
  managementNodes <- MAX(2, CEIL(totalNodes * 0.05))

  spec <- {
    computeNodes: totalNodes,
    managementNodes: managementNodes,
    totalNodes: totalNodes + managementNodes,
    interconnect: interconnectType,
    peakFlops: totalNodes * flopsPerNode,
    totalMemory: totalNodes * memPerNode
  }

  RETURN spec
END

Real-World Applications

  • Weather and climate modeling: national weather services run atmospheric simulations on clusters with thousands of nodes, dividing the atmosphere into a 3D grid where each node computes a region and exchanges boundary data with neighbors every time step
  • Molecular dynamics: pharmaceutical companies simulate protein folding and drug interactions on HPC clusters, where each node computes forces on a subset of atoms and communicates with nodes holding neighboring atoms
  • Computational fluid dynamics: aerospace engineers simulate airflow over aircraft designs, partitioning the mesh across hundreds of nodes with InfiniBand interconnects to handle the tight coupling between adjacent mesh cells
  • Genomics and bioinformatics: genome sequencing pipelines process terabytes of raw data across Beowulf-style clusters, where the workload is embarrassingly parallel (each node processes independent sequence reads)
  • AI training: large language model training distributes model parameters and data across GPU clusters connected by NVLink within nodes and InfiniBand between nodes, requiring all-reduce operations after every training step
  • Financial risk analysis: banks run Monte Carlo simulations across thousands of nodes to estimate portfolio risk, with each node computing independent scenarios and results aggregated at the end

Key Takeaways

  • A cluster is a collection of independent nodes connected by a high-speed interconnect, combining their compute power to solve problems too large for a single machine
  • Shared-memory systems are easier to program but limited in scalability; distributed-memory systems scale to thousands of nodes but require explicit message passing
  • The interconnect is the critical bottleneck: InfiniBand offers sub-microsecond latency and RDMA for tightly coupled workloads, while Ethernet is cheaper for loosely coupled tasks
  • Message transfer time is $T = L + m/B$, where latency dominates for small messages and bandwidth dominates for large ones
  • Beowulf clusters democratized HPC by proving commodity hardware could deliver supercomputer performance at low cost
  • Modern clusters use a hybrid memory model: shared memory within each node (multi-core CPUs) and distributed memory between nodes (MPI over the interconnect)