Job Scheduling and Resource Management: SLURM, PBS, and the Art of Sharing a Supercomputer
How job schedulers like SLURM and PBS allocate compute resources across hundreds of users, balancing fairness, throughput, and utilization on shared HPC clusters.
Terminology
| Term | Definition |
|---|---|
| Job scheduler | Software that decides when and where submitted jobs run on a cluster, managing the queue of pending work and allocating resources |
| SLURM | Simple Linux Utility for Resource Management: the most widely used open-source HPC job scheduler, managing over 60% of TOP500 systems |
| PBS | Portable Batch System: a family of job schedulers (OpenPBS, PBS Pro, Torque) used in HPC environments |
| Job queue | An ordered list of submitted jobs waiting to be executed, organized by priority and submission time |
| Fair-share scheduling | A scheduling policy that adjusts job priority based on historical resource usage, ensuring all users or groups receive their allocated share over time |
| Backfill scheduling | An optimization that allows smaller, shorter jobs to run ahead of larger queued jobs if they will finish before the larger job's resources become available |
| Partition | A logical grouping of nodes in SLURM (called a queue in PBS) that defines a set of resources with specific access policies and limits |
| Walltime | The maximum real-world (clock) time a job is allowed to run before the scheduler terminates it |
| Resource allocation | The process of assigning specific nodes, CPUs, memory, and GPUs to a job for its exclusive or shared use |
| Utilization | The fraction of total cluster resources (CPU hours, GPU hours) that are actively being used by running jobs, typically expressed as a percentage |
What & Why
An HPC cluster may have thousands of nodes and hundreds of users, each submitting jobs that need different amounts of CPUs, memory, GPUs, and time. Without a scheduler, users would fight over resources, jobs would collide, and expensive hardware would sit idle. The job scheduler is the traffic controller that makes shared clusters usable.
Job scheduling solves three competing goals simultaneously. Throughput: maximize the total work completed per unit time. Fairness: ensure every user or research group gets their allocated share of resources. Utilization: keep as many nodes busy as possible, minimizing idle hardware. These goals conflict. Maximizing throughput might starve small users. Strict fairness might leave nodes idle when one group has no pending work. The scheduler's algorithms balance these tensions.
SLURM and PBS are the two dominant scheduler families in HPC. SLURM manages over 60% of the world's top supercomputers. Both follow the same basic model: users submit batch jobs describing their resource needs, the scheduler queues them, and a scheduling algorithm decides the execution order based on priority, resource availability, and policy constraints.
How It Works
The Job Lifecycle
A job moves through a well-defined lifecycle. The user writes a job script specifying the resources needed (nodes, CPUs per node, memory, walltime, partition) and the commands to execute. They submit this script to the scheduler. The scheduler validates the request, assigns a priority, and places the job in a queue. When sufficient resources become available and the job reaches the front of the queue, the scheduler allocates nodes and launches the job. The job runs until it completes, exceeds its walltime, or is cancelled. After completion, the scheduler releases the resources and records accounting data.
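The lifecycle begins with a job script. A minimal SLURM batch script might look like the following sketch; the partition name, executable, and input file are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=md_sim          # name shown in squeue
#SBATCH --partition=compute        # hypothetical partition name
#SBATCH --nodes=4                  # number of nodes requested
#SBATCH --ntasks-per-node=32       # MPI ranks per node
#SBATCH --mem=64G                  # memory per node
#SBATCH --time=02:00:00            # walltime limit (HH:MM:SS)
#SBATCH --output=md_sim_%j.out     # stdout file, %j expands to the job ID

# Commands below run on the allocated nodes once the scheduler starts the job
srun ./md_simulation --input system.inp
```

The script is submitted with `sbatch script.sh`, monitored with `squeue`, and cancelled with `scancel <jobid>`. PBS uses the analogous `#PBS` directives with `qsub`, `qstat`, and `qdel`.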
Priority Calculation
Job priority determines queue ordering. SLURM computes priority as a weighted sum of multiple factors:

$$\text{Priority} = w_{\text{age}} \cdot F_{\text{age}} + w_{\text{fairshare}} \cdot F_{\text{fairshare}} + w_{\text{size}} \cdot F_{\text{size}} + w_{\text{qos}} \cdot F_{\text{qos}}$$
The age factor increases over time so that no job waits forever. The fair-share factor rewards users who have consumed less than their allocation and penalizes heavy users. The job size factor can prioritize larger jobs (to improve utilization) or smaller jobs (to improve turnaround). QoS (Quality of Service) allows administrators to create priority tiers.
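The weighted sum can be illustrated with a small numeric sketch. The weight values below are invented for illustration; real SLURM weights are site-configured:

```python
def job_priority(age, fairshare, size, qos, w):
    """Weighted sum of normalized (0-1) priority factors."""
    return (w["age"] * age + w["fairshare"] * fairshare
            + w["size"] * size + w["qos"] * qos)

# Hypothetical site weights: fair-share dominates, age breaks ties
weights = {"age": 1000, "fairshare": 5000, "size": 500, "qos": 2000}

# A long-waiting job from a light user outranks a fresh job from a heavy user
print(job_priority(0.9, 1.00, 0.1, 0.0, weights))  # 5950.0
print(job_priority(0.1, 0.25, 0.1, 0.0, weights))  # 1400.0
```

Because the factors are all normalized to the same 0-1 range, the weights alone express site policy: raising `w["fairshare"]` relative to `w["age"]` makes historical usage matter more than queue wait time.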
Fair-Share Scheduling
Fair-share scheduling tracks each user's (or group's) historical resource consumption relative to their allocated share. The fair-share factor is computed as:

$$F_{\text{fairshare}} = 2^{-\text{usage}/\text{allocation}}$$
If a user has consumed exactly their allocation, $F_{\text{fairshare}} = 0.5$. If they have used nothing, $F_{\text{fairshare}} = 1.0$ (maximum priority boost). If they have used double their allocation, $F_{\text{fairshare}} = 0.25$ (priority penalty). This creates a self-correcting system: heavy users see their priority drop, allowing underserved users to catch up.
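These values follow directly from the formula, as a few lines of Python confirm (the function name is illustrative):

```python
def fairshare_factor(usage: float, allocation: float) -> float:
    """Fair-share priority factor: 1.0 for an untouched allocation,
    0.5 at exactly the allocation, approaching 0 with heavy overuse."""
    return 2 ** -(usage / allocation)

print(fairshare_factor(0, 100))    # unused allocation  -> 1.0
print(fairshare_factor(100, 100))  # exactly on target  -> 0.5
print(fairshare_factor(200, 100))  # double allocation  -> 0.25
```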
Backfill Scheduling
Without backfill, a large job waiting for 128 nodes might block the entire queue while 127 nodes sit idle. Backfill scheduling looks at smaller jobs further back in the queue and asks: "Can this job start now and finish before the large job's reserved start time?" If yes, the small job runs immediately, filling the gap.
This dramatically improves utilization. Studies show backfill can increase cluster utilization by 10-20% compared to strict priority ordering. The trade-off is that users must provide accurate walltime estimates. If a backfilled job exceeds its walltime, it gets killed to make room for the larger job.
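The core feasibility test is simple: a queued job may jump ahead only if it is guaranteed to finish before the blocked job's reserved start time. A minimal sketch, with all times in seconds:

```python
def can_backfill(now: float, walltime: float, reservation_start: float) -> bool:
    """Backfill test: the candidate job must complete (per its declared
    walltime) before the reserved start of the higher-priority job."""
    return now + walltime <= reservation_start

# A large job holds a reservation starting at t = 3600 s.
print(can_backfill(0, 1800, 3600))  # 30-minute job fits the gap: True
print(can_backfill(0, 7200, 3600))  # 2-hour job would overrun:   False
```

This is why walltime estimates matter: a padded estimate makes a job look too long to fit gaps it would actually fit, while an optimistic one risks the job being killed mid-run.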
Resource Matching
When the scheduler decides to run a job, it must find nodes that satisfy all constraints: enough free CPUs, sufficient memory, required GPUs, correct partition, and network topology preferences. This is a variant of the bin-packing problem, which is NP-hard in general. Schedulers use heuristics like first-fit, best-fit, or topology-aware placement to find good solutions quickly.
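A first-fit matcher over per-node free CPUs and memory can be sketched as follows. The data layout is a deliberate simplification; real schedulers track far more per-node state (GPUs, topology, features):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int

def first_fit(nodes, n_nodes, cpus_per_node, mem_per_node):
    """Return the first n_nodes nodes satisfying the CPU and memory
    request, or an empty list if the job cannot start now."""
    chosen = []
    for node in nodes:
        if node.free_cpus >= cpus_per_node and node.free_mem_gb >= mem_per_node:
            chosen.append(node)
            if len(chosen) == n_nodes:
                return chosen
    return []  # not enough free nodes: the job stays queued

cluster = [Node("n01", 64, 256), Node("n02", 8, 32), Node("n03", 64, 256)]
print([n.name for n in first_fit(cluster, 2, 32, 128)])  # ['n01', 'n03']
```

First-fit is $O(N)$ per job; best-fit instead scans all candidates and picks the node leaving the least leftover capacity, trading a little scheduling time for tighter packing.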
Complexity Analysis
The scheduling problem has well-studied computational complexity.
| Operation | Time Complexity | Notes |
|---|---|---|
| Job submission | $O(1)$ | Enqueue and validate |
| Priority calculation | $O(1)$ per job | Weighted sum of factors |
| Queue sort by priority | $O(J \log J)$ | $J$ = number of pending jobs |
| Resource matching (first-fit) | $O(N)$ | $N$ = number of nodes |
| Optimal bin packing | NP-hard | Heuristics used in practice |
| Backfill scan | $O(J \times N)$ | Check each pending job against available resources |
The scheduling cycle runs periodically (every few seconds in SLURM). Each cycle sorts the queue by priority and then scans pending jobs against available nodes, for a total cost of $O(J \log J + J \times N)$ per cycle.
For a cluster with $N = 10{,}000$ nodes and $J = 50{,}000$ pending jobs, this is tractable because the constant factors are small (simple arithmetic per job-node pair).
Fair-share decay uses an exponential half-life to weight recent usage more heavily:

$$U_{\text{effective}} = \sum_i u_i \cdot 2^{-(t - t_i)/h}$$

where $u_i$ is the usage recorded at time $t_i$, $t$ is the current time, and $h$ is the half-life (typically 7-14 days). This ensures that a user who consumed heavily last week but has been idle this week sees their priority recover.
Implementation
ALGORITHM CalculateJobPriority(job, fairshareDB, currentTime, weights)
INPUT: job: submitted job, fairshareDB: historical usage database,
currentTime: current timestamp, weights: priority factor weights
OUTPUT: numeric priority value
BEGIN
// Age factor: time waiting in queue (normalized to 0-1)
ageFactor <- MIN(1.0, (currentTime - job.submitTime) / MAX_WAIT)
// Fair-share factor: penalize heavy users
userUsage <- LOOKUP(fairshareDB, job.user)
userAllocation <- LOOKUP(fairshareDB, job.user, "allocation")
fairshareFactor <- POWER(2, -(userUsage / userAllocation))
// Job size factor: optionally favor larger jobs
sizeFactor <- job.requestedNodes / MAX_NODES
// QoS factor: administrative priority tier
qosFactor <- job.qos.priorityBoost
priority <- weights.age * ageFactor
+ weights.fairshare * fairshareFactor
+ weights.size * sizeFactor
+ weights.qos * qosFactor
RETURN priority
END
ALGORITHM ScheduleWithBackfill(queue, nodes, reservations)
INPUT: queue: list of pending jobs sorted by priority,
nodes: list of cluster nodes with current state,
reservations: future resource reservations for top-priority jobs
OUTPUT: list of jobs to start now
BEGIN
toStart <- empty list
// Phase 1: Schedule top-priority jobs in order
FOR EACH job IN queue BY DESCENDING priority DO
available <- FIND_NODES(nodes, job.requestedNodes,
job.cpusPerNode, job.memPerNode)
IF available is not empty THEN
ALLOCATE(available, job)
APPEND job TO toStart
ELSE
// Reserve earliest start time for this job
reserveTime <- ESTIMATE_EARLIEST_START(nodes, reservations, job)
ADD_RESERVATION(reservations, job, reserveTime)
BREAK // Only reserve for the highest-priority blocked job
END IF
END FOR
// Phase 2: Backfill smaller jobs that fit in the gap
FOR EACH job IN queue WHERE job NOT IN toStart DO
available <- FIND_NODES(nodes, job.requestedNodes,
job.cpusPerNode, job.memPerNode)
IF available is not empty THEN
// Check: will this job finish before any reservation starts?
            estimatedEnd <- CURRENT_TIME() + job.walltime
IF estimatedEnd <= EARLIEST_RESERVATION_TIME(reservations, available) THEN
ALLOCATE(available, job)
APPEND job TO toStart
END IF
END IF
END FOR
RETURN toStart
END
ALGORITHM UpdateFairShare(fairshareDB, completedJob, halfLife)
INPUT: fairshareDB: usage database, completedJob: just-finished job,
halfLife: decay half-life in seconds
OUTPUT: updated fairshareDB
BEGIN
user <- completedJob.user
cpuHours <- completedJob.nodesUsed * completedJob.cpusPerNode
* completedJob.actualRuntime / 3600
// Apply decay to all existing usage records
currentTime <- CURRENT_TIME()
FOR EACH record IN fairshareDB[user].usageHistory DO
age <- currentTime - record.timestamp
record.weight <- POWER(2, -(age / halfLife))
END FOR
// Add new usage record
APPEND {cpuHours: cpuHours, timestamp: currentTime, weight: 1.0}
TO fairshareDB[user].usageHistory
// Recompute effective usage
fairshareDB[user].effectiveUsage <-
SUM(record.cpuHours * record.weight
FOR record IN fairshareDB[user].usageHistory)
RETURN fairshareDB
END
Real-World Applications
- National labs and supercomputing centers: facilities like NERSC, ORNL, and TACC use SLURM to manage clusters with 10,000+ nodes shared by thousands of researchers across dozens of scientific disciplines
- University research clusters: academic HPC centers use fair-share scheduling to divide resources among departments, ensuring the physics group cannot monopolize nodes that the biology group also needs
- Cloud HPC (AWS ParallelCluster, Azure CycleCloud): cloud-based HPC platforms integrate SLURM with auto-scaling, dynamically provisioning and terminating nodes based on queue depth
- AI training clusters: organizations training large models use SLURM to schedule multi-node GPU jobs, with partitions dedicated to different GPU types (A100, H100) and QoS tiers for urgent experiments
- Weather forecasting: national weather services run time-critical forecast jobs with high QoS priority, preempting lower-priority research jobs to meet broadcast deadlines
- Pharmaceutical research: drug discovery pipelines submit thousands of short molecular docking jobs that benefit heavily from backfill scheduling, filling gaps between larger simulation jobs
Key Takeaways
- Job schedulers like SLURM and PBS are essential for sharing HPC clusters among many users, balancing throughput, fairness, and utilization
- Job priority is a weighted combination of queue wait time, fair-share standing, job size, and administrative QoS tier
- Fair-share scheduling uses exponential decay of historical usage ($F = 2^{-\text{usage}/\text{allocation}}$) to self-correct: heavy users lose priority, idle users gain it
- Backfill scheduling improves utilization by 10-20% by letting small jobs run in gaps before large reserved jobs, but requires accurate walltime estimates
- Resource matching is a bin-packing variant (NP-hard in general), solved with heuristics like first-fit and topology-aware placement
- The scheduling cycle runs in $O(J \log J + J \times N)$ time, which is tractable even for clusters with tens of thousands of nodes and jobs