The Slurm workload manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is an open-source, fault-tolerant, and highly scalable resource manager and job scheduling system currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, Slurm is used extensively on most Top 500 supercomputers around the globe.
Caveat concerning batch job submission
Please note that Slurm batch jobs will NOT run from your home directory on the Wilson cluster!
Your Wilson home directory, $HOME, is your lab-wide “nashome” directory. The /nashome filesystem is mounted with Kerberos authentication, and unfortunately Kerberos is not compatible with Slurm. Start batch jobs from your area under either /work1 or /wclustre.
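For example (myproject and myjob.sh below are placeholders for your own project directory and job script):
$ cd /work1/myproject
$ sbatch myjob.sh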
Common Slurm commands
Command | Brief description |
squeue | reports the state of queued jobs |
sbatch | submit a job script for later execution |
scancel | cancel a pending or running job |
scontrol | view or modify Slurm configuration and state |
salloc | allocate resources and spawn a shell which is then used to execute srun commands |
srun | submit a job for execution or initiate job steps in real time |
sacctmgr | view and modify Slurm account information |
sinfo | worker node and partition information |
See the downloadable PDF cheatsheet for a summary of the commands. SchedMD has a quick start guide for Slurm users.
squeue examples: checking the status of batch jobs
The command below checks the status of your jobs in the queue.
$ squeue -u $USER
The following command will show jobs assigned to the wc_cpu (CPU) partition.
$ squeue -p wc_cpu
The squeue command with the options below will tell you the status of GPU batch jobs, including which workers are in use, how many GPUs a job is using, the number of cores, maximum memory, how long the job has been running, and the time limit for the job.
$ squeue -p wc_gpu --Format=Account:.10,UserName:.10,NodeList:.10,tres-alloc:.64,State:.8,TimeUsed:.10,TimeLimit:.12
Details about your Slurm account(s)
Every Slurm user has a default account. You can find your default account with the command:
$ sacctmgr list user name=$USER
User Def Acct Admin
---------- ---------- ---------
smith wc_test None
Users on Wilson may have multiple Slurm accounts with different quality of service (QOS) levels. You can find all your associated accounts along with available QOS levels using the following command:
$ sacctmgr list user WithAssoc Format=User,Account,QOS,DefaultQOS name=$USER
User Account QOS Def QOS
---------- ---------- -------------------- ---------
smith hpcsoft opp,test opp
smith scd_devs opp,regular,test opp
smith spack4hpc opp,regular,test opp
smith wc_test opp,regular,test opp
NOTE: If you do not specify an account name during your job submission (using --account), your default account will be used.
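For example, to submit under a specific account instead of your default, pass --account at submission time (here using the wc_test account from the sample output above and the batch_cpu.sh script shown later on this page; substitute your own account and script):
$ sbatch --account=wc_test batch_cpu.sh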
Quality of service (QoS) levels
Jobs submitted to Slurm are associated with an appropriate quality of service (QoS) configuration. Admins assign parameters to each QoS that are used to manage dispatch priority and resource use limits.
Name | Description | Priority (higher is better) | Global Resource Constraints | Max Walltime | Per Account Constraints | Per User Constraints |
admin | admin testing | 100 | None | None | None | None |
test | for quick user testing | 75 | Max nodes = 5, Max GPUs = 4 | 04:00:00 | None | Max running jobs = 1, Max queued jobs = 3 |
regular | regular QoS for approved accounts | 25 | None | 1-00:00:00 | None | None |
walltime7d | only available to certain approved accounts | 25 | Max nodes = 50 | 7-00:00:00 | None | None |
opp | available to all accounts for opportunistic usage | 0 | None | 08:00:00 | None | Max queued jobs = 50 |
A few notes about the resource constraints:
- Global resource constraints are enforced across all the accounts in the cluster. For example, the test QoS restricts access to 4 GPUs globally. If accountA is using 4 GPUs, accountB has to wait until the resources are free.
- Per-account constraints are enforced on an account basis. We currently do not have per-account constraints on the Wilson cluster.
- Per-user constraints are enforced on a user basis. For example, the test QoS restricts the number of running jobs per user to 1. This means a single user, regardless of their account, cannot run more than a single job under that QoS.
- Finally, these constraints may be relaxed or adjusted from time to time based on the job mix and to maximize cluster utilization.
A few notes about available QoS:
- Users can select an appropriate QoS by using the --qos directive with their submit commands. The default QoS for all accounts is opp. Jobs running under this QoS have the lowest priority and will only start when there aren’t any eligible regular QoS jobs waiting in the queue.
- The test QoS is for users to run test jobs for fast turnaround debugging. These test jobs run at a relatively high priority so that they will start as soon as nodes are available. Any user can have no more than three jobs submitted and no more than one job running at any given time.
- All Wilson users have opportunistic access to batch resources. Compute projects with specific scientific or engineering goals are able to request access to higher QoS levels and more compute resources.
NOTE: If you do not specify a QoS during job submission (using --qos), the default opp will be used.
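For example, to run a short job under the test QoS rather than the default opp (batch_cpu.sh is the example script shown later on this page):
$ sbatch --qos=test --time=00:30:00 batch_cpu.sh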
Slurm partitions
A partition in Slurm is a way to categorize worker nodes by their unique features. On the Wilson cluster we distinguish workers meant for CPU computing from GPU-accelerated workers. There is a separate partition for the one IBM Power9 “Summit-like” worker since the Power9 architecture is not binary compatible with the common AMD/Intel x86_64 architecture. There is also a test partition that sets aside CPU workers for rapid testing. Slurm allows setting job limits by partition.
Name | Description | Total Nodes | Default Runtime | Exclusive Access |
wc_cpu | CPU workers 2.6 GHz Intel E5-2650v2 “Ivy Bridge”, 16 cores/node, 8GB/core memory, ~280GB local scratch disk, inter-node QDR (40Gbps) Infiniband | 90 | 08:00:00 | Y |
wc_cpu_test | CPU workers 2.6 GHz Intel E5-2650v2 “Ivy Bridge”, 16 cores/node, 8GB/core memory, ~280GB local scratch disk, inter-node QDR (40Gbps) Infiniband | 7 | 01:00:00 | Y |
wc_gpu | GPU workers Several types of GPUs such as V100, P100, A100. More details in Table 4 below | 7 | 08:00:00 | N |
wc_gpu_ppc | IBM Power9 with GPUs (V100) | 1 | 08:00:00 | N |
The desired partition is selected by specifying the --partition flag in a batch submission. On Wilson, the CPU workers are scheduled for exclusive use by a single job. By default, the GPU workers permit shared use by multiple jobs and users. The --exclusive option can be used to request that a job be given exclusive use of its worker nodes.
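For example, a short interactive session on the CPU test partition could be requested as follows (myAccount is a placeholder for your Slurm account):
$ srun --pty --partition=wc_cpu_test --account=myAccount --time=00:30:00 --nodes=1 /bin/bash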
Slurm job dispatch and priority
Slurm on the Wilson cluster primarily uses QoS to manage partition access and job priorities. All users submit their jobs to be run by Slurm on a particular resource within one of several partitions. We do not use any form of preemption on our cluster. The resource constraints we have in place ensure that multiple projects or accounts can be active on the cluster at any given time and that available resources are shared fairly.
To see the list of jobs currently in the queue by partition, visit our cluster status page. Click on the “Start Time” column header to sort the table by start time. Don’t be alarmed if you see dates from 1969 for idle jobs. This just means Slurm hasn’t gotten to those jobs yet and monitoring is showing the default.
For running jobs, “Start Time” is the actual time that the jobs started. Following that are the pending jobs in the predicted order they may start. You can also click on the “Submit Time” column header to see which jobs have been waiting the longest. There are filters in the top right corner to select partitions and users.
From a command line, Slurm’s squeue command lists the jobs that are queued. It includes running jobs as well as those waiting to be started (dispatched). By changing the format of the command’s output, one can get a lot of additional information, such as:
- Start time – actual or predicted
- QoS the job is running under
- Reason that the job is pending
- Calculated dispatch real-time priority of the job
The following command should return a sorted (by priority) list of your jobs:
$ squeue --sort=P,-p --user=$USER
The following command uses the --Format option of squeue to provide even more details. You can tweak the decimal point values to adjust column width. The Slurm squeue manual page here has details on the options for --Format.
$ squeue --sort=P,-p --Format=Account:.10,UserName:.10,JobID:.8,Name:.12,PriorityLong:.10,State:.5,QOS:.8,SubmitTime:.20,StartTime:.20,TimeLimit:.11,Reason:.15 --user=$USER
Specifying the number and type of GPUs in a job
All GPU worker nodes are in the wc_gpu partition except the IBM Power9 worker, which is in the wc_gpu_ppc partition. The WC has GPU worker nodes with different generations of NVIDIA GPUs. The “Slurm spec” in the table below is how you tell Slurm the GPU type you want. If you do not specify a type, your job will run on the first available GPU of any type.
nodes | GPU type | CUDA cores / GPU | device memory [GB] | Slurm spec | GPUs / node | CPU cores / GPU | Memory/ GPU (GB) |
2 | A100 | 6912 | 80 | a100 | 2 | 32 | 256 |
1 | A100 | 6912 | 80 | a100 | 4 | 16 | 126 |
4 | V100 | 5120 | 32 | v100 | 2 | 20 | 92 |
1 | P100 | 3584 | 16 | p100 | 8 | 2 | 92 |
1 | P100 | 3584 | 16 | p100nvlink | 2 | 14 | 500 |
1 | V100 | 5120 | 32 | v100nvlinkppc64 | 4 | 8 (32 threads) | 250 |
A GPU job can request more than one GPU to enable parallel GPU use by your code. Slurm, however, will not permit mixing different Slurm specifications within a job; e.g., a job cannot request eight V100 plus eight P100 devices. GPUs on a worker are partitioned among jobs up to the maximum number of GPUs in a node. Each job is assigned exclusive use of its GPUs; an individual GPU is never shared by different jobs.
Slurm manages GPUs as a generic resource via the --gres flag to the commands sbatch, salloc, or srun. The table below shows examples of how to choose GPUs in a batch submission.
Slurm options | Description |
--gres=gpu:1 | one GPU per node of any type |
--gres=gpu:p100:1 | one P100 GPU per node |
--gres=gpu:v100:2 | two V100 GPUs per node — the max on the Intel V100 workers |
--gres=gpu:v100:2 --nodes=2 | total of four V100 GPUs, requires two worker nodes |
For the latter two examples, the submission request should include the --exclusive option since the request asks for all GPUs on the worker nodes.
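For example, a request for two V100 GPUs on each of two workers, with exclusive use of those workers, might look like this (myAccount is a placeholder and batch_gpu.sh is the example script shown later on this page):
$ sbatch --partition=wc_gpu --account=myAccount --nodes=2 --gres=gpu:v100:2 --exclusive batch_gpu.sh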
The batch system is configured to portion out CPU and RAM to each batch job. GPU jobs are assigned the default values --cpus-per-gpu=2 and --mem-per-gpu=30G. Jobs may request more than the default values. Users should override the defaults using the suggested CPU cores / GPU and Memory / GPU values in the above table.
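For example, based on the two-GPU A100 row of Table 4 (32 cores and 256 GB per GPU), a single-A100 interactive request could override the defaults as follows (myAccount is a placeholder):
$ srun --unbuffered --pty --partition=wc_gpu --time=01:00:00 \
    --account=myAccount --nodes=1 --ntasks=1 \
    --gres=gpu:a100:1 --cpus-per-gpu=32 --mem-per-gpu=256G /bin/bash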
Interactive jobs
Interactive job on CPU-only workers
The command below starts an interactive batch job.
- The job uses a single worker node of the wc_cpu (CPU-only) partition.
- The job is run under the Slurm account myAccount at regular QoS.
- The job requests a time limit of 50 minutes.
- The job requests one task and 16 CPUs per task. Specifying CPUs per task is important for threaded applications. If the single task runs 16 threads, they will be able to use all 16 cores on this worker. Other combinations of nodes, tasks, and cores per task are acceptable.
- Since CPU-only workers are not shared among jobs, available compute resources are usually maximized by choosing ntasks * cpus-per-task = nodes * SLURM_CPUS_ON_NODE.
$ cd /work1_or_wclustre/your_projects_directory/
$ srun --unbuffered --pty --partition=wc_cpu --time=00:50:00 \
--account=myAccount --qos=regular \
--nodes=1 --ntasks=1 --cpus-per-task=16 /bin/bash
$ hostname
wcwn001.fnal.gov
$ env | grep -e TASKS -e CPUS -e CPU_BIND_
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPUS_PER_TASK=16
SLURM_TASKS_PER_NODE=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=16
SLURM_CPUS_ON_NODE=16
SLURM_CPU_BIND_LIST=0xFFFF
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_STEP_NUM_TASKS=1
$ exit # end the batch job
$ hostname
wc.fnal.gov
Interactive job on a GPU node
The command which follows requests a one-hour interactive job on a GPU worker. In this example, we specify a single GPU device of type V100 using the syntax from Table 5. The request also specifies the number of CPU cores and amount of system RAM per GPU from Table 4 for the V100 worker nodes. Specifying no more than the maximum cores and memory per GPU listed in Table 4 ensures these resources are shared fairly among jobs that may share the worker node.
$ srun --unbuffered --pty --partition=wc_gpu --time=01:00:00 \
--account=myAccount --qos=regular \
--nodes=1 --ntasks=1 \
--gres=gpu:v100:1 --cpus-per-gpu=20 --mem-per-gpu=92G \
/bin/bash
$ hostname
wcgpu04.fnal.gov
$ nvidia-smi --list-gpus
GPU 0: Tesla V100-PCIE-32GB
$ env | grep -e TASKS -e CPUS -e CPU_BIND_ -e GPU_
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASKS_PER_NODE=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NTASKS_PER_NODE=1
SLURM_CPUS_PER_GPU=20
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=20
SLURM_CPUS_ON_NODE=20
SLURM_CPU_BIND_LIST=0x003FF003FF
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_STEP_NUM_TASKS=1
GPU_DEVICE_ORDINAL=0
Batch jobs and scripts
Slurm can run a sequence of shell commands specified in a batch script. Typically such scripts specify a set of Slurm parameters at the top of the script file. These parameters appear on lines beginning with #SBATCH.
Running a simple CPU batch script
In this example we request a single CPU-only worker node with four tasks and four CPU cores per task; the product equals the sixteen cores on these workers. This combination of tasks and cores per task is typical when running a total of four MPI ranks, where each MPI rank runs four threads.
$ cat batch_cpu.sh
#! /bin/bash
#SBATCH --account=myAccount
#SBATCH --qos=test
#SBATCH --time=00:15:00
#SBATCH --partition=wc_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --job-name=cpu_test
#SBATCH --mail-type=NONE
#SBATCH --output=job_%x.o%A
#SBATCH --no-requeue
hostname
env | grep -e TASKS -e CPUS -e CPU_BIND_
We submit the batch job with the command below:
$ sbatch batch_cpu.sh
This example produced a batch output file named job_cpu_test.o527837 in the directory where the job was submitted. The output is shown below.
$ cat job_cpu_test.o527837
wcwn026.fnal.gov
SLURM_CPUS_PER_TASK=4
SLURM_TASKS_PER_NODE=4
SLURM_NTASKS=4
SLURM_JOB_CPUS_PER_NODE=16
SLURM_CPUS_ON_NODE=16
Running a simple GPU batch script
In this example, we will run a batch script called batch_gpu.sh. The cat command that follows displays the contents of this file. Note that this job requests a single V100 GPU and a “fair share” of CPU cores and host memory per GPU on the shared GPU worker.
$ cat batch_gpu.sh
#! /bin/bash
#SBATCH --account=myAccount
#SBATCH --qos=test
#SBATCH --time=00:15:00
#SBATCH --partition=wc_gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-gpu=20
#SBATCH --mem-per-gpu=92G
#SBATCH --job-name=gpu_test
#SBATCH --mail-type=NONE
#SBATCH --output=job_%x.o%A
#SBATCH --no-requeue
hostname
nvidia-smi --list-gpus
env | grep -e TASKS -e CPUS -e CPU_BIND_ -e GPU_
We submit this script to Slurm:
$ sbatch batch_gpu.sh
The job will be placed in the batch queue, and upon job completion the directory where the job was submitted contains a file named job_gpu_test.o527838. Below we display the output from this job.
$ cat job_gpu_test.o527838
wcgpu04.fnal.gov
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-dca4cd4b-03e3-1021-36d2-916a1e59be96)
SLURM_TASKS_PER_NODE=1
SLURM_CPUS_PER_GPU=20
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=20
SLURM_CPUS_ON_NODE=20
GPU_DEVICE_ORDINAL=0
Running MPI applications under Slurm
Please refer to the XXXXX for instructions on running MPI in batch jobs.
There’s a good description of MPI process affinity binding and srun here: task binding and distribution.
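As a minimal sketch only (not a substitute for the full MPI instructions referred to above), an MPI program can be launched with srun using explicit core binding; my_mpi_app is a placeholder for your own executable, and this assumes your MPI library supports launching via srun:
$ srun --ntasks=4 --cpus-per-task=4 --cpu-bind=cores ./my_mpi_app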
Developer access to GPU workers
Several of A100, V100, and P100 equipped GPU workers are reserved for use by GPU developers during weekday business hours (09:00-17:00 M-F). This reservation is intended to provide developers rapid access to GPUs for testing without having to wait for long batch jobs to finish. Both interactive and batch job access is permitted on the reserved nodes. Jobs must include the parameter --reservation=gpudevtest
to access the reserved nodes. The reservation is only available to members of a special GPU developer project and the batch request must include --project=scd_devs
. Users wishing to be added to the developer group must complete the user request form asking to be added to project scd_devs
. Please provide a detailed justification for you request in the reason field of the request form. Approval is subject to review of your justification.
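For example, a developer batch submission to the reserved nodes could combine the parameters described above (batch_gpu.sh is the example script shown earlier on this page):
$ sbatch --reservation=gpudevtest --project=scd_devs batch_gpu.sh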
Slurm Environment variables
The table below lists some commonly used environment variables. A full list is found in the Slurm documentation for sbatch.
Variable Name | Description | Example Value | PBS/Torque analog |
$SLURM_JOB_ID | Job ID | 5741192 | $PBS_JOBID |
$SLURM_JOB_NAME | Job Name | myjob | $PBS_JOBNAME |
$SLURM_SUBMIT_DIR | Submit Directory | /work1/user | $PBS_O_WORKDIR |
$SLURM_JOB_NODELIST | Nodes assigned to job | wcwn[001-005] | cat $PBS_NODEFILE |
$SLURM_SUBMIT_HOST | Host submitted from | wc.fnal.gov | $PBS_O_HOST |
$SLURM_JOB_NUM_NODES | Number of nodes allocated to job | 2 | $PBS_NUM_NODES |
$SLURM_CPUS_ON_NODE | Number of cores/node | 8,3 | $PBS_NUM_PPN |
$SLURM_NTASKS | Total number of cores for job | 11 | $PBS_NP |
$SLURM_NODEID | Index to node running on relative to nodes assigned to job | 0 | $PBS_O_NODENUM |
$SLURM_LOCALID | Index to core running on within node | 4 | $PBS_O_VNODENUM |
$SLURM_PROCID | Index to task relative to job | 0 | $PBS_O_TASKNUM – 1 |
$SLURM_ARRAY_TASK_ID | Job Array Index | 0 | $PBS_ARRAYID |
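For example, a batch script can use these variables to report where and how it is running; this is just a sketch:
#! /bin/bash
#SBATCH --job-name=env_demo
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) was submitted from $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"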
Slurm fixed reservation requests
Projects that require access to Wilson cluster resources within a fixed period of time, or before a fixed deadline, can request a reservation for a designated amount of compute resources in advance. Examples where projects may benefit from having a reservation in place include:
- Making Wilson compute resources available to a workshop or a hackathon.
- Having high-priority access to Wilson computing to meet a deadline, such as preparation of a scientific paper or conference presentation.
Reservations should not be considered unless there is a well-defined need that cannot be met by the standard batch system queuing policies. Reservation requests are carefully reviewed on their scientific and engineering merit, and approval is not automatic. A reservation request must be made through a service desk request using this link. Requests should be made at least two business days in advance of the reservation start date. Your request must contain at least the following:
- Name of your Wilson cluster computing project
- Supervisor name(s)
- Name of the event requiring a Slurm reservation
- Type of event, e.g., workshop, presentation, paper publication
- Date and time ranges of the Slurm reservation
- Type (cpu or gpu) and number of workers to be reserved
- Justification for a special batch reservation. In particular, explain why the normal batch policies do not meet your needs.