Job Scheduling and Resource Management: SLURM, PBS, and the Art of Sharing a Supercomputer
How job schedulers like SLURM and PBS allocate compute resources across hundreds of users, balancing fairness, throughput, and utilization on shared HPC clusters.
Terminology
| Term | Definition |
|---|---|
| Job scheduler | Software that decides when and where submitted jobs run on a cluster, managing the queue of pending work and allocating resources |
| SLURM | Simple Linux Utility for Resource Management: the most widely used open-source HPC job scheduler, managing over 60% of TOP500 systems |
| PBS | Portable Batch System: a family of job schedulers (OpenPBS, PBS Pro, Torque) used in HPC environments |
| Job queue | An ordered list of submitted jobs waiting to be executed, organized by priority and submission time |
| Fair-share scheduling | A scheduling policy that adjusts job priority based on historical resource usage, ensuring all users or groups receive their allocated share over time |
| Backfill scheduling | An optimization that allows smaller, shorter jobs to run ahead of larger queued jobs if they will finish before the larger job's resources become available |
| Partition | A logical grouping of nodes in SLURM (called a queue in PBS) that defines a set of resources with specific access policies and limits |
| Walltime | The maximum real-world (clock) time a job is allowed to run before the scheduler terminates it |
| Resource allocation | The process of assigning specific nodes, CPUs, memory, and GPUs to a job for its exclusive or shared use |
| Utilization | The fraction of total cluster resources (CPU hours, GPU hours) that are actively being used by running jobs, typically expressed as a percentage |
What & Why
An HPC cluster may have thousands of nodes and hundreds of users, each submitting jobs that need different amounts of CPUs, memory, GPUs, and time. Without a scheduler, users would fight over resources, jobs would collide, and expensive hardware would sit idle. The job scheduler is the traffic controller that makes shared clusters usable.
Job scheduling solves three competing goals simultaneously. Throughput: maximize the total work completed per unit time. Fairness: ensure every user or research group gets their allocated share of resources. Utilization: keep as many nodes busy as possible, minimizing idle hardware. These goals conflict. Maximizing throughput might starve small users. Strict fairness might leave nodes idle when one group has no pending work. The scheduler's algorithms balance these tensions.
SLURM and PBS are the two dominant scheduler families in HPC. SLURM manages over 60% of the world's top supercomputers. Both follow the same basic model: users submit batch jobs describing their resource needs, the scheduler queues them, and a scheduling algorithm decides the execution order based on priority, resource availability, and policy constraints.
How It Works
The Job Lifecycle
A job moves through a well-defined lifecycle. The user writes a job script specifying the resources needed (nodes, CPUs per node, memory, walltime, partition) and the commands to execute. They submit this script to the scheduler. The scheduler validates the request, assigns a priority, and places the job in a queue. When sufficient resources become available and the job reaches the front of the queue, the scheduler allocates nodes and launches the job. The job runs until it completes, exceeds its walltime, or is cancelled. After completion, the scheduler releases the resources and records accounting data.
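The lifecycle begins with a job script. A minimal SLURM batch script might look like the following sketch; the partition name, executable, and input file are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=md_sim          # name shown in squeue
#SBATCH --partition=compute        # hypothetical partition name
#SBATCH --nodes=4                  # number of nodes requested
#SBATCH --ntasks-per-node=32       # MPI ranks per node
#SBATCH --mem=64G                  # memory per node
#SBATCH --time=02:00:00            # walltime limit (HH:MM:SS)
#SBATCH --output=md_sim_%j.out     # stdout file, %j expands to the job ID

# Commands below run on the allocated nodes once the scheduler starts the job
srun ./md_simulation --input system.inp
```

The script is submitted with `sbatch script.sh`, monitored with `squeue`, and cancelled with `scancel <jobid>`. PBS uses the analogous `#PBS` directives with `qsub`, `qstat`, and `qdel`.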
Priority Calculation
Job priority determines queue ordering. SLURM computes priority as a weighted sum of multiple factors:

$$\text{Priority} = w_{\text{age}} \cdot F_{\text{age}} + w_{\text{fairshare}} \cdot F_{\text{fairshare}} + w_{\text{size}} \cdot F_{\text{size}} + w_{\text{qos}} \cdot F_{\text{qos}}$$
The age factor increases over time so that no job waits forever. The fair-share factor rewards users who have consumed less than their allocation and penalizes heavy users. The job size factor can prioritize larger jobs (to improve utilization) or smaller jobs (to improve turnaround). QoS (Quality of Service) allows administrators to create priority tiers.
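The weighted sum can be illustrated with a small numeric sketch. The weight values below are invented for illustration; real SLURM weights are site-configured:

```python
def job_priority(age, fairshare, size, qos, w):
    """Weighted sum of normalized (0-1) priority factors."""
    return (w["age"] * age + w["fairshare"] * fairshare
            + w["size"] * size + w["qos"] * qos)

# Hypothetical site weights: fair-share dominates, age breaks ties
weights = {"age": 1000, "fairshare": 5000, "size": 500, "qos": 2000}

# A long-waiting job from a light user outranks a fresh job from a heavy user
print(job_priority(0.9, 1.00, 0.1, 0.0, weights))  # 5950.0
print(job_priority(0.1, 0.25, 0.1, 0.0, weights))  # 1400.0
```

Because the factors are all normalized to the same 0-1 range, the weights alone express site policy: raising `w["fairshare"]` relative to `w["age"]` makes historical usage matter more than queue wait time.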
Fair-Share Scheduling
Fair-share scheduling tracks each user's (or group's) historical resource consumption relative to their allocated share. The fair-share factor is computed as:

$$F_{\text{fairshare}} = 2^{-\text{usage}/\text{allocation}}$$
If a user has consumed exactly their allocation, $F_{\text{fairshare}} = 0.5$. If they have used nothing, $F_{\text{fairshare}} = 1.0$ (maximum priority boost). If they have used double their allocation, $F_{\text{fairshare}} = 0.25$ (priority penalty). This creates a self-correcting system: heavy users see their priority drop, allowing underserved users to catch up.
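These values follow directly from the formula, as a few lines of Python confirm (the function name is illustrative):

```python
def fairshare_factor(usage: float, allocation: float) -> float:
    """Fair-share priority factor: 1.0 for an untouched allocation,
    0.5 at exactly the allocation, approaching 0 with heavy overuse."""
    return 2 ** -(usage / allocation)

print(fairshare_factor(0, 100))    # unused allocation  -> 1.0
print(fairshare_factor(100, 100))  # exactly on target  -> 0.5
print(fairshare_factor(200, 100))  # double allocation  -> 0.25
```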
Backfill Scheduling
Without backfill, a large job waiting for 128 nodes might block the entire queue while 127 nodes sit idle. Backfill scheduling looks at smaller jobs further back in the queue and asks: "Can this job start now and finish before the large job's reserved start time?" If yes, the small job runs immediately, filling the gap.
This dramatically improves utilization. Studies show backfill can increase cluster utilization by 10-20% compared to strict priority ordering. The trade-off is that users must provide accurate walltime estimates. If a backfilled job exceeds its walltime, it gets killed to make room for the larger job.
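The core feasibility test is simple: a queued job may jump ahead only if it is guaranteed to finish before the blocked job's reserved start time. A minimal sketch, with all times in seconds:

```python
def can_backfill(now: float, walltime: float, reservation_start: float) -> bool:
    """Backfill test: the candidate job must complete (per its declared
    walltime) before the reserved start of the higher-priority job."""
    return now + walltime <= reservation_start

# A large job holds a reservation starting at t = 3600 s.
print(can_backfill(0, 1800, 3600))  # 30-minute job fits the gap: True
print(can_backfill(0, 7200, 3600))  # 2-hour job would overrun:   False
```

This is why walltime estimates matter: a padded estimate makes a job look too long to fit gaps it would actually fit, while an optimistic one risks the job being killed mid-run.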
Resource Matching
When the scheduler decides to run a job, it must find nodes that satisfy all constraints: enough free CPUs, sufficient memory, required GPUs, correct partition, and network topology preferences. This is a variant of the bin-packing problem, which is NP-hard in general. Schedulers use heuristics like first-fit, best-fit, or topology-aware placement to find good solutions quickly.
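A first-fit matcher over per-node free CPUs and memory can be sketched as follows. The data layout is a deliberate simplification; real schedulers track far more per-node state (GPUs, topology, features):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int

def first_fit(nodes, n_nodes, cpus_per_node, mem_per_node):
    """Return the first n_nodes nodes satisfying the CPU and memory
    request, or an empty list if the job cannot start now."""
    chosen = []
    for node in nodes:
        if node.free_cpus >= cpus_per_node and node.free_mem_gb >= mem_per_node:
            chosen.append(node)
            if len(chosen) == n_nodes:
                return chosen
    return []  # not enough free nodes: the job stays queued

cluster = [Node("n01", 64, 256), Node("n02", 8, 32), Node("n03", 64, 256)]
print([n.name for n in first_fit(cluster, 2, 32, 128)])  # ['n01', 'n03']
```

First-fit is $O(N)$ per job; best-fit instead scans all candidates and picks the node leaving the least leftover capacity, trading a little scheduling time for tighter packing.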
Complexity Analysis
The scheduling problem has well-studied computational complexity.
| Operation | Time Complexity | Notes |
|---|---|---|
| Job submission | $O(1)$ | Enqueue and validate |
| Priority calculation | $O(1)$ per job | Weighted sum of factors |
| Queue sort by priority | $O(J \log J)$ | $J$ = number of pending jobs |
| Resource matching (first-fit) | $O(N)$ | $N$ = number of nodes |
| Optimal bin packing | NP-hard | Heuristics used in practice |
| Backfill scan | $O(J \times N)$ | Check each pending job against available resources |
The scheduling cycle runs periodically (every few seconds in SLURM). Each cycle sorts the queue by priority and then scans pending jobs against available nodes, for a total cost of $O(J \log J + J \times N)$ per cycle.
For a cluster with $N = 10{,}000$ nodes and $J = 50{,}000$ pending jobs, this is tractable because the constant factors are small (simple arithmetic per job-node pair).
Fair-share decay uses an exponential half-life to weight recent usage more heavily:

$$U_{\text{effective}} = \sum_i u_i \cdot 2^{-(t - t_i)/h}$$

where $u_i$ is the usage recorded at time $t_i$, $t$ is the current time, and $h$ is the half-life (typically 7-14 days). This ensures that a user who consumed heavily last week but has been idle this week sees their priority recover.
Implementation
ALGORITHM CalculateJobPriority(job, fairshareDB, currentTime, weights)
INPUT: job: submitted job, fairshareDB: historical usage database,
currentTime: current timestamp, weights: priority factor weights
OUTPUT: numeric priority value
BEGIN
// Age factor: time waiting in queue (normalized to 0-1)
ageFactor <- MIN(1.0, (currentTime - job.submitTime) / MAX_WAIT)
// Fair-share factor: penalize heavy users
userUsage <- LOOKUP(fairshareDB, job.user)
userAllocation <- LOOKUP(fairshareDB, job.user, "allocation")
fairshareFactor <- POWER(2, -(userUsage / userAllocation))
// Job size factor: optionally favor larger jobs
sizeFactor <- job.requestedNodes / MAX_NODES
// QoS factor: administrative priority tier
qosFactor <- job.qos.priorityBoost
priority <- weights.age * ageFactor
+ weights.fairshare * fairshareFactor
+ weights.size * sizeFactor
+ weights.qos * qosFactor
RETURN priority
END
ALGORITHM ScheduleWithBackfill(queue, nodes, reservations)
INPUT: queue: list of pending jobs sorted by priority,
nodes: list of cluster nodes with current state,
reservations: future resource reservations for top-priority jobs
OUTPUT: list of jobs to start now
BEGIN
toStart <- empty list
// Phase 1: Schedule top-priority jobs in order
FOR EACH job IN queue BY DESCENDING priority DO
available <- FIND_NODES(nodes, job.requestedNodes,
job.cpusPerNode, job.memPerNode)
IF available is not empty THEN
ALLOCATE(available, job)
APPEND job TO toStart
ELSE
// Reserve earliest start time for this job
reserveTime <- ESTIMATE_EARLIEST_START(nodes, reservations, job)
ADD_RESERVATION(reservations, job, reserveTime)
BREAK // Only reserve for the highest-priority blocked job
END IF
END FOR
// Phase 2: Backfill smaller jobs that fit in the gap
FOR EACH job IN queue WHERE job NOT IN toStart DO
available <- FIND_NODES(nodes, job.requestedNodes,
job.cpusPerNode, job.memPerNode)
IF available is not empty THEN
// Check: will this job finish before any reservation starts?
            estimatedEnd <- CURRENT_TIME() + job.walltime
IF estimatedEnd <= EARLIEST_RESERVATION_TIME(reservations, available) THEN
ALLOCATE(available, job)
APPEND job TO toStart
END IF
END IF
END FOR
RETURN toStart
END
ALGORITHM UpdateFairShare(fairshareDB, completedJob, halfLife)
INPUT: fairshareDB: usage database, completedJob: just-finished job,
halfLife: decay half-life in seconds
OUTPUT: updated fairshareDB
BEGIN
user <- completedJob.user
cpuHours <- completedJob.nodesUsed * completedJob.cpusPerNode
* completedJob.actualRuntime / 3600
// Apply decay to all existing usage records
currentTime <- CURRENT_TIME()
FOR EACH record IN fairshareDB[user].usageHistory DO
age <- currentTime - record.timestamp
record.weight <- POWER(2, -(age / halfLife))
END FOR
// Add new usage record
APPEND {cpuHours: cpuHours, timestamp: currentTime, weight: 1.0}
TO fairshareDB[user].usageHistory
// Recompute effective usage
fairshareDB[user].effectiveUsage <-
SUM(record.cpuHours * record.weight
FOR record IN fairshareDB[user].usageHistory)
RETURN fairshareDB
END
Real-World Applications
- National labs and supercomputing centers: facilities like NERSC, ORNL, and TACC use SLURM to manage clusters with 10,000+ nodes shared by thousands of researchers across dozens of scientific disciplines
- University research clusters: academic HPC centers use fair-share scheduling to divide resources among departments, ensuring the physics group cannot monopolize nodes that the biology group also needs
- Cloud HPC (AWS ParallelCluster, Azure CycleCloud): cloud-based HPC platforms integrate SLURM with auto-scaling, dynamically provisioning and terminating nodes based on queue depth
- AI training clusters: organizations training large models use SLURM to schedule multi-node GPU jobs, with partitions dedicated to different GPU types (A100, H100) and QoS tiers for urgent experiments
- Weather forecasting: national weather services run time-critical forecast jobs with high QoS priority, preempting lower-priority research jobs to meet broadcast deadlines
- Pharmaceutical research: drug discovery pipelines submit thousands of short molecular docking jobs that benefit heavily from backfill scheduling, filling gaps between larger simulation jobs
Key Takeaways
- Job schedulers like SLURM and PBS are essential for sharing HPC clusters among many users, balancing throughput, fairness, and utilization
- Job priority is a weighted combination of queue wait time, fair-share standing, job size, and administrative QoS tier
- Fair-share scheduling uses exponential decay of historical usage ($F = 2^{-\text{usage}/\text{allocation}}$) to self-correct: heavy users lose priority, idle users gain it
- Backfill scheduling improves utilization by 10-20% by letting small jobs run in gaps before large reserved jobs, but requires accurate walltime estimates
- Resource matching is a bin-packing variant (NP-hard in general), solved with heuristics like first-fit and topology-aware placement
- The scheduling cycle runs in $O(J \log J + J \times N)$ time, which is tractable even for clusters with tens of thousands of nodes and jobs