
Parallel File Systems: Lustre, GPFS, and High-Speed I/O for Supercomputers

How parallel file systems like Lustre and GPFS stripe data across hundreds of storage servers to deliver the bandwidth that thousands of compute nodes demand.

2025-07-23
Tags: HPC & Clusters, parallel-filesystems, lustre, io

Terminology

Parallel file system: A file system that distributes data across multiple storage servers, allowing many clients to read and write simultaneously at aggregate bandwidths far exceeding what any single server can deliver.
Lustre: An open-source parallel file system widely used in HPC, serving over 60% of TOP100 supercomputers, designed for high throughput on large files.
GPFS (Spectrum Scale): IBM's enterprise parallel file system, now called Spectrum Scale, known for strong POSIX compliance and metadata performance.
Striping: The technique of splitting a file into fixed-size chunks (stripes) and distributing them across multiple storage targets to parallelize I/O.
Stripe size: The size of each chunk when a file is striped across storage targets, typically 1-4 MB in Lustre.
OST (Object Storage Target): A storage device (or RAID group) in Lustre that holds file data; a Lustre file system may have hundreds of OSTs.
MDS (Metadata Server): A server in Lustre that manages file metadata (names, permissions, directory structure, stripe layout) but does not store file data.
I/O bottleneck: A condition where the storage system cannot deliver data fast enough to keep compute nodes busy, causing them to wait on I/O.
MPI-IO: An extension of MPI that provides parallel I/O operations, allowing multiple processes to read and write a shared file concurrently with coordinated access patterns.
POSIX I/O: The standard Unix file I/O interface (open, read, write, close) that provides strong consistency guarantees but can limit parallel performance due to locking overhead.

What & Why

A supercomputer with 10,000 nodes can produce terabytes of output per simulation and needs to read terabytes of input data. A single storage server with a few NVMe drives might deliver 10-20 GB/s, but the cluster needs 100+ GB/s of aggregate I/O bandwidth. The solution is a parallel file system that stripes data across hundreds of storage servers, so that thousands of compute nodes can read and write simultaneously.

The I/O bottleneck is one of the most common performance killers in HPC. A simulation that scales perfectly to 4,096 cores might spend 30% of its time waiting for file I/O if the storage system cannot keep up. Parallel file systems exist to close this gap by making storage bandwidth scale with the number of storage servers, just as compute power scales with the number of compute nodes.

Lustre and GPFS are the two dominant parallel file systems in HPC. Lustre is open-source and powers the majority of the world's largest supercomputers. GPFS (now IBM Spectrum Scale) is a commercial product known for strong metadata performance and enterprise features. Both follow the same core principle: separate metadata from data, stripe data across many servers, and let clients talk directly to the storage servers holding their data.

How It Works

Lustre Architecture

Figure: Lustre file system architecture. Clients 0 through N connect over a high-speed network (InfiniBand / Ethernet) to the MDS (metadata server) and to a row of OSS servers, each managing a pair of OSTs (OSS 0: OST 0-1, OSS 1: OST 2-3, ..., OSS K: OST M-1, M). Clients contact the MDS for metadata, then read/write data directly from the OSTs.

Lustre separates metadata from data into distinct server roles:

Metadata Servers (MDS): handle file names, directory structure, permissions, and stripe layout information. When a client opens a file, it contacts the MDS to learn which OSTs hold the file's stripes. The MDS does not touch file data.

Object Storage Servers (OSS): each OSS manages one or more Object Storage Targets (OSTs), which are the actual storage devices (typically RAID arrays or JBOD shelves). File data lives on OSTs. A large Lustre deployment might have 200+ OSTs.

Clients: compute nodes mount the Lustre file system and see a standard POSIX directory tree. When reading or writing, the client contacts the MDS once to get the stripe layout, then communicates directly with the relevant OSTs for data transfer. This direct client-to-OST path is key to scalability: the MDS is not a bottleneck for data I/O.

Striping

When a file is created, Lustre assigns it a stripe pattern: the stripe size (chunk size, typically 1 MB) and the stripe count (number of OSTs to spread across). A file with stripe count 4 and stripe size 1 MB distributes its first 1 MB to OST 0, the next 1 MB to OST 1, then OST 2, then OST 3, then back to OST 0 in round-robin fashion.

Wider striping (more OSTs) increases aggregate bandwidth for large sequential reads/writes because multiple OSTs serve data in parallel. Narrow striping (fewer OSTs) reduces metadata overhead and is better for many small files. Choosing the right stripe parameters is a critical tuning decision.
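The round-robin layout is simple enough to sketch in a few lines of Python. The helper name below is hypothetical (real Lustre clients do this mapping inside the kernel), but the arithmetic is exactly the stripe pattern described above:

```python
def locate_stripe(offset, stripe_size, stripe_count):
    """Map a file byte offset to (OST index, byte offset within that OST)
    under round-robin striping."""
    stripe_index = offset // stripe_size      # which chunk of the file this is
    ost_index = stripe_index % stripe_count   # round-robin over the OSTs
    # Bytes from earlier full stripes already on this OST, plus the
    # position inside the current stripe:
    offset_in_ost = (stripe_index // stripe_count) * stripe_size \
                    + offset % stripe_size
    return ost_index, offset_in_ost

# A file striped over 4 OSTs with 1 MB stripes:
MB = 1 << 20
print(locate_stripe(0, MB, 4))            # (0, 0): first MB lands on OST 0
print(locate_stripe(5 * MB + 10, MB, 4))  # sixth MB lands on OST 1
```

The sixth megabyte (stripe index 5) maps to OST 1 because 5 mod 4 = 1, and it sits 1 MB into that OST because OST 1 already holds one earlier full stripe.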

GPFS / Spectrum Scale

GPFS takes a different approach to metadata. Instead of dedicated metadata servers, GPFS distributes metadata across all storage nodes using a shared-disk architecture with distributed locking. Any node can serve metadata for any file, which improves metadata throughput for workloads with many small files. GPFS also supports data tiering (moving cold data to tape or object storage) and native encryption.

I/O Patterns and Bottlenecks

HPC applications exhibit distinct I/O patterns that stress parallel file systems differently:

Checkpoint/restart: all processes dump their state to files simultaneously, creating a burst of writes. This is the most common I/O pattern and benefits from wide striping.

N-to-1: all processes write to a single shared file. Without coordination, this creates lock contention at the file system level. MPI-IO with collective writes solves this by aggregating small writes into large sequential ones.

N-to-N: each process writes its own file. This avoids contention but creates millions of small files, stressing the metadata server.

Read-heavy analytics: post-processing reads large datasets. Benefits from wide striping and read-ahead caching.

Complexity Analysis

Aggregate I/O bandwidth scales with the number of OSTs (storage targets). For $S$ OSTs, each with bandwidth $B_s$:

$B_{\text{aggregate}} = \min(S \cdot B_s, \, B_{\text{network}})$

where $B_{\text{network}}$ is the total network bandwidth available for I/O traffic. The file system bandwidth is capped by whichever is smaller: storage throughput or network capacity.

For a file of size $F$ striped across $k$ OSTs with stripe size $s$:

$T_{\text{read}} = L_{\text{metadata}} + \frac{F}{\min(k, S) \cdot B_s}$

The metadata lookup $L_{\text{metadata}}$ is a one-time cost per file open. For large files, the data transfer dominates.
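Plugging illustrative numbers into the two formulas above (these figures are made up for the example, not taken from any specific system):

```python
# Hypothetical deployment: 200 OSTs at 2 GB/s each, 300 GB/s of usable network.
S, B_s = 200, 2.0            # OST count, per-OST bandwidth in GB/s
B_net = 300.0                # total network bandwidth for I/O, GB/s
B_agg = min(S * B_s, B_net)  # capped by the network: 300 GB/s, not 400

# Reading a 1 TB file striped across k = 64 OSTs, with a 5 ms metadata lookup:
F, k, L_meta = 1000.0, 64, 0.005         # GB, stripe count, seconds
T_read = L_meta + F / (min(k, S) * B_s)  # ~7.82 s; the metadata cost is noise
print(B_agg, T_read)
```

Note that even though the OSTs could collectively sustain 400 GB/s, the network cap makes 300 GB/s the effective ceiling, and the one-time metadata lookup contributes well under 0.1% of the read time for a file this large.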

I/O Pattern | Bottleneck | Mitigation
N-to-1 (shared file) | Lock contention, $O(p)$ serialization | MPI-IO collective writes
N-to-N (file per process) | MDS overload, $O(p)$ metadata ops | Subdirectory sharding, DNE (distributed namespace)
Checkpoint burst | OST saturation | Burst buffers, wide striping
Small random I/O | Per-operation latency $L$ | Aggregation, local SSD cache

For $p$ processes each writing $d$ bytes in an N-to-1 pattern without coordination:

$T_{\text{uncoordinated}} = p \cdot (L_{\text{lock}} + d / B_s)$

With MPI-IO collective writes that aggregate into large sequential I/O:

$T_{\text{collective}} = L_{\text{coord}} + \frac{p \cdot d}{\min(k, S) \cdot B_s}$

The speedup from collective I/O can be orders of magnitude for small per-process writes.
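Comparing the two formulas numerically makes the gap concrete. The latency and coordination costs below are hypothetical but of realistic order:

```python
# Illustrative comparison of uncoordinated vs collective N-to-1 writes.
p = 4096          # processes
d = 64 * 1024     # 64 KB written per process (bytes)
B_s = 2e9         # 2 GB/s per OST
k = 64            # stripe count (assume k <= S)
L_lock = 1e-3     # assumed 1 ms lock acquisition per uncoordinated write
L_coord = 50e-3   # assumed 50 ms for the collective coordination phase

T_uncoord = p * (L_lock + d / B_s)               # serialized small writes
T_collective = L_coord + (p * d) / (k * B_s)     # one large striped write
print(T_uncoord, T_collective, T_uncoord / T_collective)
```

With these numbers the uncoordinated pattern takes seconds while the collective one finishes in tens of milliseconds, a speedup of roughly 80x; shrinking the per-process write size widens the gap further, since the lock latency term dominates.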

Implementation

ALGORITHM StripeFileAcrossOSTs(fileData, stripeSize, stripeCount, osts)
INPUT: fileData: byte array of the file,
       stripeSize: bytes per stripe chunk,
       stripeCount: number of OSTs to use,
       osts: list of available OSTs
OUTPUT: stripe layout mapping
BEGIN
  selectedOSTs <- SELECT_ROUND_ROBIN(osts, stripeCount)
  layout <- empty map

  offset <- 0
  stripeIndex <- 0

  WHILE offset < LENGTH(fileData) DO
    chunkEnd <- MIN(offset + stripeSize, LENGTH(fileData))
    chunk <- fileData[offset .. chunkEnd]
    targetOST <- selectedOSTs[stripeIndex MOD stripeCount]

    APPEND chunk TO layout[targetOST]

    offset <- chunkEnd
    stripeIndex <- stripeIndex + 1
  END WHILE

  RETURN layout
END
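The pseudocode above translates directly to Python. This is a toy model: a real file system records a layout and writes stripes to devices, rather than holding chunk copies in a dictionary, and SELECT_ROUND_ROBIN is approximated here by taking the first stripeCount OSTs:

```python
def stripe_file_across_osts(file_data: bytes, stripe_size: int,
                            stripe_count: int, osts: list) -> dict:
    """Split file_data into stripe_size chunks and assign them
    round-robin to the first stripe_count OSTs."""
    selected = osts[:stripe_count]        # stand-in for SELECT_ROUND_ROBIN
    layout = {ost: [] for ost in selected}
    offset, stripe_index = 0, 0
    while offset < len(file_data):
        chunk_end = min(offset + stripe_size, len(file_data))
        target = selected[stripe_index % stripe_count]
        layout[target].append(file_data[offset:chunk_end])
        offset = chunk_end
        stripe_index += 1
    return layout

layout = stripe_file_across_osts(b"x" * 10, stripe_size=3,
                                 stripe_count=2,
                                 osts=["ost0", "ost1", "ost2"])
# ost0 holds stripes 0 and 2 (6 bytes); ost1 holds stripes 1 and 3 (4 bytes)
print({ost: len(b"".join(chunks)) for ost, chunks in layout.items()})
```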

ALGORITHM ParallelRead(fileName, mds, clientRank, totalClients)
INPUT: fileName: path to file, mds: metadata server,
       clientRank: this client's rank, totalClients: total readers
OUTPUT: this client's portion of the file
BEGIN
  // Step 1: Contact MDS for stripe layout (one-time metadata lookup)
  stripeLayout <- MDS_LOOKUP(mds, fileName)
  stripeSize <- stripeLayout.stripeSize
  stripeCount <- stripeLayout.stripeCount
  fileSize <- stripeLayout.fileSize
  ostList <- stripeLayout.osts

  // Step 2: Compute this client's byte range
  chunkSize <- CEIL(fileSize / totalClients)
  startByte <- clientRank * chunkSize
  endByte <- MIN(startByte + chunkSize, fileSize)

  // Step 3: Determine which OSTs hold the needed stripes
  buffer <- empty byte array
  offset <- startByte

  WHILE offset < endByte DO
    stripeIndex <- FLOOR(offset / stripeSize)
    ostIndex <- stripeIndex MOD stripeCount
    targetOST <- ostList[ostIndex]
    offsetInOST <- (FLOOR(stripeIndex / stripeCount)) * stripeSize
                   + (offset MOD stripeSize)

    bytesToRead <- MIN(stripeSize - (offset MOD stripeSize), endByte - offset)

    // Read directly from the OST (bypasses MDS)
    data <- OST_READ(targetOST, offsetInOST, bytesToRead)
    APPEND data TO buffer

    offset <- offset + bytesToRead
  END WHILE

  RETURN buffer
END
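The same offset arithmetic can be exercised against an in-memory mock of the OSTs. Everything here is a simulation (no MDS, no network): we stripe a test file into Python byte arrays standing in for OSTs, then let each "client" pull its byte range back directly, exactly as the pseudocode does:

```python
def parallel_read(rank, total_clients, file_size, stripe_size,
                  stripe_count, ost_store):
    """Compute this client's byte range, then read each piece
    directly from the mock OST that holds it."""
    chunk = -(-file_size // total_clients)            # CEIL division
    start = rank * chunk
    end = min(start + chunk, file_size)
    out, offset = bytearray(), start
    while offset < end:
        stripe_index = offset // stripe_size
        ost = stripe_index % stripe_count
        in_ost = (stripe_index // stripe_count) * stripe_size \
                 + offset % stripe_size
        n = min(stripe_size - offset % stripe_size, end - offset)
        out += ost_store[ost][in_ost:in_ost + n]      # direct "OST read"
        offset += n
    return bytes(out)

# Build mock OSTs by striping a 100-byte file, then read it back with 3 clients.
data = bytes(range(100))
ss, sc = 7, 4
store = [bytearray() for _ in range(sc)]
for i in range(0, len(data), ss):
    store[(i // ss) % sc] += data[i:i + ss]
rebuilt = b"".join(parallel_read(r, 3, len(data), ss, sc, store)
                   for r in range(3))
assert rebuilt == data   # the three clients' ranges reassemble the file
```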

ALGORITHM CollectiveWrite(localData, sharedFileName, comm, stripeLayout)
INPUT: localData: each process's data to write,
       sharedFileName: target file path,
       comm: MPI communicator,
       stripeLayout: file stripe configuration
OUTPUT: data written to parallel file system
BEGIN
  rank <- MPI_COMM_RANK(comm)
  size <- MPI_COMM_SIZE(comm)

  // Phase 1: Gather write sizes to compute offsets
  localSize <- LENGTH(localData)
  allSizes <- MPI_ALLGATHER(localSize, comm)

  myOffset <- SUM(allSizes[0 .. rank - 1])

  // Phase 2: Determine aggregator processes
  // Use one aggregator per OST to minimize lock contention
  numAggregators <- stripeLayout.stripeCount
  aggregatorRanks <- EVENLY_SPACED(0, size, numAggregators)

  // Phase 3: Route data to aggregators based on stripe mapping
  myAggregator <- FIND_AGGREGATOR(myOffset, stripeLayout, aggregatorRanks)
  MPI_SEND(localData, TO myAggregator, comm)

  // Phase 4: Aggregators write large sequential chunks
  IF rank IN aggregatorRanks THEN
    collectedData <- RECEIVE_ALL_FROM_ASSIGNED_PROCESSES(comm)
    SORT collectedData BY file offset
    targetOST <- DETERMINE_OST(rank, stripeLayout)
    OST_WRITE(targetOST, collectedData)
  END IF

  MPI_BARRIER(comm)
END
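This two-phase pattern (shuffle to aggregators, then large ordered writes) is what MPI-IO implementations such as ROMIO perform under collective buffering. A serial Python sketch of just the data movement follows; there is no real MPI here, and, like the pseudocode's FIND_AGGREGATOR, it assumes each process's data lands entirely within one aggregator's region:

```python
def collective_write(per_process_data, num_aggregators):
    """Model two-phase I/O: derive global offsets from all-gathered sizes,
    route each piece to an aggregator by file region, and have aggregators
    emit large ordered chunks. Returns the (offset, bytes) writes issued."""
    sizes = [len(d) for d in per_process_data]            # MPI_ALLGATHER
    offsets = [sum(sizes[:r]) for r in range(len(sizes))]
    region = -(-sum(sizes) // num_aggregators)            # bytes per aggregator
    inbox = [[] for _ in range(num_aggregators)]
    for off, data in zip(offsets, per_process_data):      # "send" phase
        inbox[min(off // region, num_aggregators - 1)].append((off, data))
    writes = []
    for pieces in inbox:                                  # aggregators write
        pieces.sort()                                     # order by file offset
        if pieces:
            writes.append((pieces[0][0], b"".join(d for _, d in pieces)))
    return writes

# 8 "processes" writing 4 bytes each, funneled through 2 aggregators:
writes = collective_write([bytes([i]) * 4 for i in range(8)], 2)
print(len(writes))   # 2 large sequential writes instead of 8 small ones
```

Eight 4-byte writes collapse into two contiguous 16-byte writes, which is precisely the aggregation that turns lock-contended small I/O into large sequential transfers.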

Real-World Applications

  • Climate simulation checkpointing: models like CESM write checkpoint files every few simulated hours, producing 10-50 TB per run; Lustre with wide striping across 200+ OSTs delivers the 100+ GB/s needed to keep checkpoint time under a few minutes
  • Genomics pipelines: whole-genome sequencing produces millions of small files (FASTQ reads, BAM alignments); GPFS's distributed metadata handles the metadata-heavy workload better than Lustre's centralized MDS
  • AI training data loading: training large models requires reading millions of image or text samples per epoch; parallel file systems with client-side caching and read-ahead reduce I/O wait time
  • Seismic data processing: oil and gas companies store petabytes of seismic survey data on Lustre, with hundreds of compute nodes reading overlapping 3D sub-volumes in parallel
  • Particle physics (CERN): the LHC experiments generate petabytes of collision data annually, stored on distributed file systems and accessed by thousands of analysis jobs worldwide
  • Movie rendering: visual effects studios use parallel file systems to serve texture maps and geometry data to hundreds of render nodes simultaneously

Key Takeaways

  • Parallel file systems stripe data across many storage servers (OSTs) so that aggregate I/O bandwidth scales with the number of servers: $B_{\text{agg}} = \min(S \cdot B_s, B_{\text{net}})$
  • Lustre separates metadata (MDS) from data (OSS/OST), allowing clients to read and write data directly from storage targets without bottlenecking on a central server
  • Striping parameters (stripe size and count) must be tuned to the workload: wide striping for large files, narrow striping for many small files
  • The N-to-1 I/O pattern (many processes writing one shared file) causes severe lock contention; MPI-IO collective writes solve this by aggregating small writes into large sequential ones
  • GPFS distributes metadata across all nodes (no dedicated MDS), giving it better metadata throughput for workloads with millions of small files
  • I/O is often the hidden bottleneck in HPC: a perfectly scaling computation can still waste 30%+ of its time waiting on storage if the file system is misconfigured