SLURM job scheduler

Slurm workload manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is an open-source, fault-tolerant, and highly scalable resource manager and job scheduling system currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, Slurm is used extensively on many of the Top500 supercomputers around the globe.

Caveat concerning batch job submission

Please note that Slurm batch jobs will NOT run from your home directory on the Wilson cluster!

Your Wilson home directory, $HOME, is your lab-wide “nashome” directory. The /nashome filesystem is mounted with Kerberos authentication, and unfortunately, Kerberos is not compatible with Slurm. Start batch jobs from your area under either /work1 or /wclustre.

Common Slurm commands

command | brief description
squeue | reports the state of queued jobs
sbatch | submit a job script for later execution
scancel | cancel a pending or running job
scontrol | view or modify Slurm configuration and state
salloc | allocate resources and spawn a shell which is then used to execute srun commands
srun | submit a job for execution or initiate job steps in real time
sacctmgr | view and modify Slurm account information
sinfo | worker node and partition information

Table 1 – Common Slurm Commands

See the downloadable PDF cheatsheet for a summary of the commands. SchedMD has a quick start guide for Slurm users.

squeue examples: checking the status of batch jobs

The command below checks the status of your jobs in the queue.

$ squeue -u $USER

The following command will show jobs assigned to the wc_cpu (CPU) partition.

$ squeue -p wc_cpu

The squeue command with the options below will tell you the status of GPU batch jobs including which workers are in use, how many GPUs a job is using, the number of cores, maximum memory, how long the job has been running, and the time limit for the job.

$ squeue -p wc_gpu --Format=Account:.10,UserName:.10,NodeList:.10,tres-alloc:.64,State:.8,TimeUsed:.10,TimeLimit:.12

Details about your Slurm account(s)

Every Slurm user has a default account. You can find your default account with the command:

$ sacctmgr list user name=$USER
      User   Def Acct     Admin
---------- ---------- ---------
     smith    wc_test      None

Users on Wilson may have multiple Slurm accounts with different quality of service (QOS) levels. You can find all your associated accounts along with available QOS levels using the following command:

$ sacctmgr list user WithAssoc Format=User,Account,QOS,DefaultQOS name=$USER
      User    Account                  QOS   Def QOS
---------- ---------- -------------------- ---------
     smith    hpcsoft             opp,test       opp
     smith   scd_devs     opp,regular,test       opp
     smith  spack4hpc     opp,regular,test       opp
     smith    wc_test     opp,regular,test       opp

NOTE: If you do not specify an account name during your job submission (using --account), your default account will be used.
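
For example, to submit a job under a specific account rather than your default, pass --account explicitly at submission time (the account and script names below are placeholders):

$ sbatch --account=myAccount my_job.sh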

Quality of service (QoS) levels

Jobs submitted to Slurm are associated with an appropriate QoS (or Quality of Service) configuration. Admins assign parameters to a QoS that are used to manage dispatch priority and resource use limits.

Name | Description | Priority (higher is better) | Global Resource Constraints | Max Walltime | Per Account Constraints | Per User Constraints
admin | admin testing | 100 | None | None | None | None
test | for quick user testing | 75 | Max nodes = 5, Max GPUs = 4 | 04:00:00 | None | Max running jobs = 1, Max queued jobs = 3
regular | regular QoS for approved accounts | 25 | None | 1-00:00:00 | None | None
walltime7d | only available to certain approved accounts | 25 | Max nodes = 5 | 07-00:00:00 | None | None
opp | available to all accounts for opportunistic usage | 0 | None | 08:00:00 | None | Max queued jobs = 50

Table 2 – Wilson cluster QoS overview

A few notes about the resource constraints:

  • Global resource constraints are enforced across all accounts in the cluster. For example, the test QoS restricts access to 4 GPUs globally: if account A is using 4 GPUs, account B has to wait until the resources are free.
  • Per account constraints are enforced on an account basis. We currently do not have per account constraints on the Wilson cluster.
  • Per user constraints are enforced on a user basis. For example, the test QoS restricts the number of running jobs per user to 1. This means a single user, regardless of their account, cannot run more than a single job using this QoS.
  • Finally, these constraints may be relaxed or adjusted from time to time based on the job mix and to maximize cluster utilization.

A few notes about available QoS:

  • Users can select a QoS by using the --qos directive with their submit commands. The default QoS for all accounts is opp. Jobs running under this QoS have the lowest priority and will only start when there are no eligible regular QoS jobs waiting in the queue.
  • The test QoS is for users to run test jobs for fast turnaround debugging. These test jobs run at a relatively high priority so that they will start as soon as nodes are available. Any user can have no more than three jobs submitted and no more than one job running at any given time.
  • All Wilson users have opportunistic access to batch resources. Compute projects with specific scientific or engineering goals are able to request access to higher QoS level and more compute resources.

NOTE: If you do not specify a QoS during job submission (using --qos), the default opp will be used.
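
For example, a short debugging job can be submitted under the test QoS rather than the default opp (account and script names are placeholders):

$ sbatch --account=myAccount --qos=test my_test_job.sh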

Slurm partitions

A partition in Slurm is a way to categorize worker nodes by their unique features. On the Wilson cluster we distinguish workers meant for CPU computing from GPU-accelerated workers. There is a separate partition for the one IBM Power9 “Summit-like” worker, since the Power9 architecture is not binary compatible with the common AMD/Intel x86_64 architecture. There is also a test partition that sets aside CPU workers for rapid testing. Slurm allows setting job limits by partition.
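
For a quick command-line view of the partitions, their time limits, node counts, and node states, the sinfo command from Table 1 can be used with a custom output format, for example:

$ sinfo -o "%P %l %D %t"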

Name | Description | Total Nodes | Default Runtime | Exclusive Access
wc_cpu | CPU workers: 2.6 GHz Intel E5-2650v2 “Ivy Bridge”, 16 cores/node, 8GB/core memory, ~280GB local scratch disk, inter-node QDR (40Gbps) Infiniband | 90 | 08:00:00 | Y
wc_cpu_test | CPU workers: 2.6 GHz Intel E5-2650v2 “Ivy Bridge”, 16 cores/node, 8GB/core memory, ~280GB local scratch disk, inter-node QDR (40Gbps) Infiniband | 7 | 01:00:00 | Y
wc_gpu | GPU workers: several types of GPUs such as V100, P100, A100 (more details in Table 4 below) | 7 | 08:00:00 | N
wc_gpu_ppc | IBM Power9 with GPUs (V100) | 1 | 08:00:00 | N

Table 3 – Wilson cluster Slurm partition overview

The desired partition is selected by specifying the --partition flag in a batch submission. On Wilson, the CPU workers are scheduled for exclusive use by a single job, while by default the GPU workers permit shared use by multiple jobs and users. The --exclusive option can be used to request that a job be given exclusive use of its worker nodes.
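
For example, the following submissions target the CPU partition and the GPU partition respectively, with the second one also requesting exclusive use of its node (account and script names are placeholders):

$ sbatch --partition=wc_cpu --account=myAccount my_cpu_job.sh
$ sbatch --partition=wc_gpu --exclusive --account=myAccount my_gpu_job.sh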

Slurm job dispatch and priority

Slurm on the Wilson cluster primarily uses QoS to manage partition access and job priorities. All users submit their jobs to be run by Slurm on a particular resource within one of several partitions. We do not use any form of preemption on our cluster. The resource constraints we have in place ensure that multiple projects or accounts can be active on the cluster at any given time and that each receives a fair share of the available resources.

To see the list of jobs currently in the queue by partition, visit our cluster status page. Click on the “Start Time” column header to sort the table by start time. Don’t be alarmed if you see dates from 1969 for idle jobs. This just means Slurm hasn’t gotten to those jobs yet and monitoring is showing the default.

For running jobs, “Start Time” is the actual time that the jobs started. Following that are the pending jobs in the predicted order they may start. You can also click on the “Submit Time” column header to see which jobs have been waiting the longest. There are filters in the top right corner to select partitions and users.

From a command line, Slurm's squeue command lists the jobs that are queued. It includes running jobs as well as those waiting to be started, i.e., dispatched. By changing the format of the command's output, one can get additional information, such as:

  • Start time – actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Calculated dispatch real-time priority of the job

The following command should return a sorted (by priority) list of your jobs:

$ squeue --sort=P,-p --user=$USER

The following command uses the --Format option of squeue to provide even more details. You can tweak the decimal values to adjust column widths. The Slurm squeue manual page has details on the options for --Format.

$ squeue --sort=P,-p --Format=Account:.10,UserName:.10,JobID:.8,Name:.12,PriorityLong:.10,State:.5,QOS:.8,SubmitTime:.20,StartTime:.20,TimeLimit:.11,Reason:.15 --user=$USER

Specifying the number and type of GPUs in a job

All GPU worker nodes are in the wc_gpu partition except the IBM Power9 worker, which is in the wc_gpu_ppc partition. The WC has GPU worker nodes with different generations of NVIDIA GPUs. The “Slurm spec” in the table below is how you tell Slurm the GPU type you want. If you do not specify a type, your job will run on the first available GPU of any type.

nodes | GPU type | CUDA cores / GPU | device memory [GB] | Slurm spec | GPUs / node | CPU cores / GPU | Memory / GPU (GB)
2 | A100 | 6912 | 80 | a100 | 2 | 32 | 256
1 | A100 | 6912 | 80 | a100 | 4 | 16 | 126
4 | V100 | 5120 | 32 | v100 | 2 | 20 | 92
1 | P100 | 3584 | 16 | p100 | 8 | 2 | 92
1 | P100 | 3584 | 16 | p100nvlink | 2 | 14 | 500
1 | V100 | 5120 | 32 | v100nvlinkppc64 | 4 | 8 (32 threads) | 250

Table 4 – Wilson cluster GPU overview

A GPU job can request more than one GPU to enable parallel GPU use by your code. Slurm, however, will not permit mixing different Slurm specifications within a job; for example, a job cannot request eight V100 plus eight P100 devices. GPUs on a worker are partitioned among jobs up to the maximum number of GPUs in a node. Each job is assigned exclusive use of its GPUs; an individual GPU is never shared by different jobs.

Slurm manages GPUs as a generic resource via the --gres flag to the sbatch, salloc, or srun commands. The table below shows examples of how to choose GPUs in a batch submission.

Slurm options | Description
--gres=gpu:1 | one GPU per node of any type
--gres=gpu:p100:1 | one P100 GPU per node
--gres=gpu:v100:2 | two V100 GPUs per node (the max on the Intel V100 workers)
--gres=gpu:v100:2 --nodes=2 | total of four V100 GPUs, requires two worker nodes

Table 5 – Wilson cluster GPU access flags overview

For the latter two examples, the submission request should include the --exclusive option since the request asks for all GPUs on the worker nodes.

The batch system is configured to portion out CPU cores and RAM to each batch job. GPU jobs are assigned the default values --cpus-per-gpu=2 and --mem-per-gpu=30G. Jobs may request more than the defaults; users should override them with the suggested CPU cores / GPU and Memory / GPU values from the table above.
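
For example, a single-V100 batch job that overrides the defaults with the suggested values from Table 4 could be submitted as follows (account and script names are placeholders):

$ sbatch --partition=wc_gpu --gres=gpu:v100:1 --cpus-per-gpu=20 --mem-per-gpu=92G \
         --account=myAccount my_gpu_job.sh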

Interactive jobs

Interactive job on CPU-only workers

The command below starts an interactive batch job.

  • The job uses a single worker node of the wc_cpu (CPU-only) partition.
  • The job is run under the Slurm account myAccount at regular QoS.
  • The job requests a time limit of 50 minutes.
  • The job requests one task and 16 CPUs per task. Specifying CPUs per task is important for threaded applications. If the single task runs 16 threads, they will be able to use all 16 cores on this worker. Other combinations of nodes, tasks and cores per task are acceptable.
  • Since CPU-only workers are not shared among jobs, available compute resources are usually maximized by choosing ntasks * cpus-per-task = nodes * SLURM_CPUS_ON_NODE.
$ cd /work1_or_wclustre/your_projects_directory/
$ srun --unbuffered --pty --partition=wc_cpu --time=00:50:00 \
       --account=myAccount  --qos=regular \
       --nodes=1 --ntasks=1 --cpus-per-task=16 /bin/bash
$ hostname
wcwn001.fnal.gov
$ env | grep -e TASKS -e CPUS -e CPU_BIND_
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPUS_PER_TASK=16
SLURM_TASKS_PER_NODE=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=16
SLURM_CPUS_ON_NODE=16
SLURM_CPU_BIND_LIST=0xFFFF
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_STEP_NUM_TASKS=1
$ exit  # end the batch job
$ hostname
wc.fnal.gov

Interactive job on a GPU node

The command that follows requests a one-hour interactive job on a GPU worker. In this example, we specify a single GPU device of type V100 using the syntax from Table 5. The request also specifies the number of CPU cores and the amount of system RAM per GPU from Table 4 for the V100 worker nodes. Requesting no more than the maximum cores and memory per GPU listed in Table 4 fairly shares these resources among jobs that may share the worker node.

$ srun --unbuffered --pty  --partition=wc_gpu --time=01:00:00 \
       --account=myAccount --qos=regular \
       --nodes=1 --ntasks=1 \
       --gres=gpu:v100:1 --cpus-per-gpu=20 --mem-per-gpu=92G \
       /bin/bash
$ hostname
wcgpu04.fnal.gov
$ nvidia-smi --list-gpus
GPU 0: Tesla V100-PCIE-32GB
$ env | grep -e TASKS -e CPUS -e CPU_BIND_ -e GPU_
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASKS_PER_NODE=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NTASKS_PER_NODE=1
SLURM_CPUS_PER_GPU=20
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=20
SLURM_CPUS_ON_NODE=20
SLURM_CPU_BIND_LIST=0x003FF003FF
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_STEP_NUM_TASKS=1
GPU_DEVICE_ORDINAL=0

Batch jobs and scripts

Slurm can run a sequence of shell commands specified within a batch script. Typically, such scripts specify a set of parameters for Slurm at the top of the script file; these parameters are prefaced by #SBATCH at the beginning of each line.

Running a simple CPU batch script

In this example we request a single CPU-only worker node with four tasks and four CPU cores per task; the product equals the sixteen cores on these workers. This combination of tasks and cores per task is typical when running a total of four MPI ranks, where each MPI rank runs four threads.

$ cat batch_cpu.sh
#! /bin/bash
#SBATCH --account=myAccount
#SBATCH --qos=test
#SBATCH --time=00:15:00
#SBATCH --partition=wc_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --job-name=cpu_test
#SBATCH --mail-type=NONE
#SBATCH --output=job_%x.o%A
#SBATCH --no-requeue

hostname

env | grep -e TASKS -e CPUS -e CPU_BIND_

We submit the batch job with the command below.

$ sbatch batch_cpu.sh

This example produced a batch output file named job_cpu_test.o527837 in the directory where the job was submitted. The output is shown below.

$ cat job_cpu_test.o527837
wcwn026.fnal.gov
SLURM_CPUS_PER_TASK=4
SLURM_TASKS_PER_NODE=4
SLURM_NTASKS=4
SLURM_JOB_CPUS_PER_NODE=16
SLURM_CPUS_ON_NODE=16

Running a simple GPU batch script

In this example, we will run a batch script called batch_gpu.sh. The cat command that follows displays the contents of this file. Note that this job requests a single V100 GPU and a “fair share” of CPU cores and host memory per GPU on the shared GPU worker.

$ cat batch_gpu.sh
#! /bin/bash
#SBATCH --account=myAccount
#SBATCH --qos=test
#SBATCH --time=00:15:00
#SBATCH --partition=wc_gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-gpu=20
#SBATCH --mem-per-gpu=92G
#SBATCH --job-name=gpu_test
#SBATCH --mail-type=NONE
#SBATCH --output=job_%x.o%A
#SBATCH --no-requeue

hostname

nvidia-smi --list-gpus

env | grep -e TASKS -e CPUS -e CPU_BIND_ -e GPU_

We submit this script to Slurm:

$ sbatch batch_gpu.sh

The job will be placed in the batch queue. Upon completion, the directory where the job was submitted will contain a file named job_gpu_test.o527838. Below we display the output from this job.

$ cat job_gpu_test.o527838
wcgpu04.fnal.gov
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-dca4cd4b-03e3-1021-36d2-916a1e59be96)
SLURM_TASKS_PER_NODE=1
SLURM_CPUS_PER_GPU=20
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=20
SLURM_CPUS_ON_NODE=20
GPU_DEVICE_ORDINAL=0

Running MPI applications under Slurm

Please refer to the XXXXX for instructions on running MPI in batch jobs.

There is a good description of MPI process affinity binding and srun in the Slurm documentation on task binding and distribution.
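
As a minimal sketch only (the exact MPI launch options depend on how your MPI library was built and on the instructions referenced above; the application name is a placeholder), a four-rank MPI job with four cores per rank, bound to cores, might be launched inside a batch allocation as:

$ srun --ntasks=4 --cpus-per-task=4 --cpu-bind=cores ./my_mpi_app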

Developer access to GPU workers

Several of the A100-, V100-, and P100-equipped GPU workers are reserved for use by GPU developers during weekday business hours (09:00-17:00 M-F). This reservation is intended to give developers rapid access to GPUs for testing without having to wait for long batch jobs to finish. Both interactive and batch job access is permitted on the reserved nodes. Jobs must include the parameter --reservation=gpudevtest to access the reserved nodes. The reservation is only available to members of a special GPU developer project, and the batch request must include --account=scd_devs. Users wishing to be added to the developer group must complete the user request form asking to be added to project scd_devs. Please provide a detailed justification for your request in the reason field of the request form. Approval is subject to review of your justification.
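
For example, a member of the scd_devs project could request a short interactive session on a reserved GPU node during business hours like this (the GPU type and time limit are illustrative):

$ srun --unbuffered --pty --reservation=gpudevtest --account=scd_devs \
       --partition=wc_gpu --time=00:30:00 \
       --nodes=1 --ntasks=1 --gres=gpu:v100:1 /bin/bash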

Slurm environment variables

The table below lists some of the commonly used environment variables. A full list can be found in the Slurm documentation for sbatch.

Variable Name | Description | Example Value | PBS/Torque analog
$SLURM_JOB_ID | Job ID | 5741192 | $PBS_JOBID
$SLURM_JOB_NAME | Job Name | myjob | $PBS_JOBNAME
$SLURM_SUBMIT_DIR | Submit Directory | /work1/user | $PBS_O_WORKDIR
$SLURM_JOB_NODELIST | Nodes assigned to job | wcwn[001-005] | cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST | Host submitted from | wc.fnal.gov | $PBS_O_HOST
$SLURM_JOB_NUM_NODES | Number of nodes allocated to job | 2 | $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE | Number of cores/node | 8,3 | $PBS_NUM_PPN
$SLURM_NTASKS | Total number of cores for job | 11 | $PBS_NP
$SLURM_NODEID | Index of node running on, relative to nodes assigned to job | 0 | $PBS_O_NODENUM
$SLURM_LOCALID | Index of core running on within node | 4 | $PBS_O_VNODENUM
$SLURM_PROCID | Index of task relative to job | 0 | $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID | Job Array Index | 0 | $PBS_ARRAYID

Table 6 – Slurm environment variables
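
As a brief illustration of how these variables can be used inside a batch script (a sketch only; the usual #SBATCH options such as account, QoS, and partition are omitted for brevity, and the local scratch path is hypothetical), a job might create a per-job scratch directory and report where it ran:

#! /bin/bash
#SBATCH --job-name=env_demo
#SBATCH --output=job_%x.o%A

# Per-job scratch directory on the worker's local disk (path is illustrative)
SCRATCH=/scratch/${USER}/${SLURM_JOB_ID}
mkdir -p "$SCRATCH"

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) was submitted from $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"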

Slurm fixed reservation requests

Projects that require access to Wilson cluster resources within a fixed period of time, or before a fixed deadline, can request a reservation of a designated amount of compute resources in advance. Examples where projects may benefit from having a reservation in place include:

  • Making Wilson compute resources available to a workshop or a hackathon.
  • Having high-priority access to Wilson computing to meet a hard deadline, such as the preparation of a scientific paper or conference presentation.

Reservations should not be considered unless there is a well-defined need that cannot be met by the standard batch system queuing policies. Reservation requests are carefully reviewed on their scientific and engineering merit, and approval is not automatic. A reservation request must be made through a service desk request using this link. Requests should be made at least two business days in advance of the reservation start date. Your request must contain at least the following:

  • Name of your Wilson cluster computing project
  • Supervisor name(s)
  • Name of the event requiring a Slurm reservation
  • Type of event, e.g., workshop, presentation, paper publication
  • Date and time ranges of the Slurm reservation
  • Type (cpu or gpu) and number of workers to be reserved
  • Justification for a special batch reservation; in particular, why the normal batch policies do not meet your needs.

Additional useful information

  • A quick two page summary of SLURM Commands
  • Quick Start SLURM User Guide
  • Comparison between SLURM and other popular batch schedulers
  • Official SLURM documentation