SLURM job scheduler

SLURM (Simple Linux Utility for Resource Management) is a powerful open-source, fault-tolerant, and highly scalable resource manager and job scheduling system, currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, SLURM is used extensively on many of the TOP500 supercomputers around the globe.

If you have questions about job dispatch priorities on the Wilson Cluster, please visit this page.

Slurm Commands
You must log in to the appropriate submit host (see Start Here in the graphics above) in order to run Slurm commands against the correct accounts and resources.

scontrol and squeue: Job control and monitoring.
sbatch: Submit batch jobs.
salloc: Request an interactive job allocation.
srun: Launch a job or job step.
sinfo: Show node information and cluster status.
sacct: Report accounting data for jobs and job steps.
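
For quick reference, a few typical invocations (the user name and job ID below are placeholders):

[@wc ~]$ sinfo                                   # partition and node status
[@wc ~]$ squeue -u johndoe                       # queued and running jobs for one user
[@wc ~]$ scontrol show job 46                    # detailed state of a single job
[@wc ~]$ sacct -j 46 --format=JobID,JobName,Elapsed,State   # accounting for a completed job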

Useful environment variables are $SLURM_NODELIST and $SLURM_JOBID.
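
A minimal sketch of how these appear inside a batch job (the partition and node count are only examples):

#!/bin/sh
#SBATCH --job-name=envcheck
#SBATCH --partition=cpu_gce
#SBATCH --nodes=2

# Record the job ID and the nodes this job received
echo "Job ID:    $SLURM_JOBID"
echo "Node list: $SLURM_NODELIST"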

Slurm User Accounts

To check your “default” SLURM account, use the following command:

[@wc ~]$ sacctmgr list user name=johndoe
      User   Def Acct       Admin  
----------  ---------- ----------
   johndoe    project       None 

To list all the SLURM accounts you are associated with, use the following command:

[@wc ~]$ sacctmgr list user name=johndoe withassoc       
User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------    
johndoe   projectx      None     wilson   projecta                    1                                                                            opp,regular,test       opp 

NOTE: If you do not specify an account name during your job submission (using --account), the “default” account will be used to track usage.
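
For example, to charge a job to one of your non-default accounts (the account name here is taken from the sample output above and is only illustrative):

[@wc ~]$ sbatch --account=projecta myscript.sh

The same can be set inside the script with the directive #SBATCH --account=projecta.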

Slurm Resource Types

SLURM Type (--constraint) | Resource Type | Description | GPU Type | Number of resources (--nodes) | Number of tasks per resource (--ntasks-per-node) | SLURM Partition (--partition) | Nodenames (nodelist) | Shared Resource ** (--exclusive)
intel2650 | CPU | 2.6 GHz Intel E5-2650v2 “Ivy Bridge” Eight Core | None | 90 | 16 | cpu_gce | wcwn[001-090] | No
intel2650 | CPU | Same as above | None | 10 | 16 | cpu_gce_test | wcwn[091-100] | No
nvidiak40 | GPU | Same as above | 4x NVIDIA Kepler K40 GPU/node with NO NVLINK | 6 | 16 | gpu_gce | wcgwn[001-007] | Yes
p100nvlink | GPU | 2.4GHz Dual CPU Fourteen Core Intel | 2x NVIDIA P100 with NVLINK | 1 | 56 | gpu_gce | wcgpu01 | Yes
p100 | GPU | 1.7GHz Dual CPU Eight Core Intel | 8x NVIDIA P100 with NO NVLINK | 1 | 16 | gpu_gce | wcgpu02 | Yes
v100 | GPU | 2.5GHz Dual CPU Twenty Core Intel | 2x NVIDIA V100 with NO NVLINK | 4 | 40 | gpu_gce | wcgpu[03-06] | Yes
v100nvlinkppc64 | GPU | 3.8GHz Dual CPU Sixteen Core IBM Power9 | 4x NVIDIA V100 with NVLINK | 1 | 128 | gpu_gce_ppc | wcibmpower01 | Yes
intelknl | CPU | 1.30GHz Single CPU 64 Core Intel Xeon Phi (Developer Edition) | None | 1 | 256 | knl_gce | wcknl01 | Yes
** Once assigned to a user job, this resource is either shared by default (Yes), provided sufficient resources are available, or allocated exclusively (No) to a single user job.
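
The options listed in the table header can be combined in a single request. A sketch (the GPU and node counts are illustrative) asking for one v100 node in the gpu_gce partition, first shared and then exclusively:

[@wc ~]$ sbatch --partition=gpu_gce --constraint=v100 --nodes=1 --gres=gpu:2 myscript.sh
[@wc ~]$ sbatch --partition=gpu_gce --constraint=v100 --nodes=1 --gres=gpu:2 --exclusive myscript.sh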

Using SLURM: examples

Do not launch batch jobs from /nashome (your home directory). They will silently fail since your batch job will not have a valid Kerberos ticket. Likewise, your batch job should not depend on /nashome access in any way!

Submit an interactive job requesting 12 “cpu_gce” nodes 

[@wc]$ srun --pty --nodes=12 --ntasks-per-node=16 --partition cpu_gce bash
[user@wcwn001]$ env | grep NTASKS
SLURM_NTASKS_PER_NODE=16
SLURM_NTASKS=192
[user@wcwn001]$ exit
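
Alternatively, salloc (listed above) obtains an allocation while leaving you on the submit host; srun then launches commands on the allocated nodes. A minimal sketch with illustrative counts:

[@wc]$ salloc --nodes=2 --ntasks-per-node=16 --partition=cpu_gce
[@wc]$ srun hostname        # runs once per task on the allocated nodes
[@wc]$ exit                 # release the allocation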

Submit an interactive job requesting two “gpu_gce” nodes with 4 GPUs per node

[@wc]$ srun --pty --nodes=2 --partition gpu_gce --gres=gpu:4 bash
[user@wcgwn001]$ PBS_NODEFILE=`generate_pbs_nodefile`
[user@wcgwn001]$ rgang --rsh=/usr/bin/rsh $PBS_NODEFILE nvidia-smi -L 
wcgwn001= 
GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
wcgwn002= 
GPU 0: Tesla K40m (UUID: GPU-b20a4059-56c2-b36a-ba31-1403fa6de2dc)
GPU 1: Tesla K40m (UUID: GPU-af290605-caeb-50e8-a4ca-fd533098c302)
GPU 2: Tesla K40m (UUID: GPU-16ab19e4-9835-5eb2-9b8b-1e479753d20b)
GPU 3: Tesla K40m (UUID: GPU-2b3d082e-3113-617a-dcc6-26eee33e3b2d)
[user@wcgwn001]$ exit
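
If you only need the list of hosts in the allocation (for example, to feed to rgang or a script), scontrol can expand the Slurm node list directly; a small sketch run inside the allocation above:

[user@wcgwn001]$ scontrol show hostnames $SLURM_NODELIST
wcgwn001
wcgwn002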

Submit a batch job requesting 4 GPUs, i.e. one “gpu_gce” node

[@wc]$ cat myscript.sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --partition=gpu_gce
#SBATCH --nodes=1
#SBATCH --gres=gpu:4 

nvidia-smi -L
sleep 5
exit 

[@wc]$ sbatch myscript.sh
Submitted batch job 46

[@wc]$ squeue
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 46   gpu_gce     test  johndoe R        0:03      1 wcgwn001

Once the batch job completes, the output is available as follows:

[@wc]$ cat slurm-46.out
GPU 0: Tesla K40m (UUID: GPU-2fe2a84f-3de9-2ca0-60f0-db011d53a20c)
GPU 1: Tesla K40m (UUID: GPU-9afce23b-cfdf-2318-ed00-2b23c14337f1)
GPU 2: Tesla K40m (UUID: GPU-782960ea-d854-e6ee-26ce-363a4c9c01e2)
GPU 3: Tesla K40m (UUID: GPU-ee804701-10ac-919e-ae64-27888dcb4645)
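
By default Slurm writes the job output to slurm-<jobid>.out in the submit directory. To direct stdout and stderr to named files instead, the standard directives below can be added to the script (the file names are only examples; %j expands to the job ID):

#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err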

SLURM Environment variables

Variable Name | Description | Example Value | PBS/Torque analog
$SLURM_JOB_ID | Job ID | 5741192 | $PBS_JOBID
$SLURM_JOBID | Deprecated; same as $SLURM_JOB_ID | |
$SLURM_JOB_NAME | Job name | myjob | $PBS_JOBNAME
$SLURM_SUBMIT_DIR | Submit directory | /work1/user | $PBS_O_WORKDIR
$SLURM_JOB_NODELIST | Nodes assigned to the job | wcwn[001-005] | cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST | Host submitted from | wc.fnal.gov | $PBS_O_HOST
$SLURM_JOB_NUM_NODES | Number of nodes allocated to the job | 2 | $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE | Number of cores per node | 8,3 | $PBS_NUM_PPN
$SLURM_NTASKS | Total number of cores for the job | 11 | $PBS_NP
$SLURM_NODEID | Index of the node relative to the nodes assigned to the job | 0 | $PBS_O_NODENUM
$SLURM_LOCALID | Index of the core within the node | 4 | $PBS_O_VNODENUM
$SLURM_PROCID | Index of the task relative to the job | 0 | $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID | Job array index | 0 | $PBS_ARRAYID
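
As an illustration of the last row, a job array script can use $SLURM_ARRAY_TASK_ID to select its work item; the array range and script below are only a sketch:

[@wc]$ cat array.sh
#!/bin/sh
#SBATCH --job-name=array_test
#SBATCH --partition=cpu_gce
#SBATCH --array=0-3

# Each array element runs this script with its own index
echo "This is array task $SLURM_ARRAY_TASK_ID of job $SLURM_JOB_ID"

[@wc]$ sbatch array.sh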
Launching MPI processes
Please refer to the following page for recommended MPI launch options.
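
As a generic sketch only (the executable name and task counts are placeholders, and the recommended Wilson cluster options may differ), an srun-based launch of a Slurm-aware MPI build looks like this:

#!/bin/sh
#SBATCH --job-name=mpi_test
#SBATCH --partition=cpu_gce
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# srun starts one MPI rank per task across the allocated nodes
srun ./my_mpi_app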
Binding and Distribution of tasks
There is a good description of MPI process affinity and binding with srun in the SLURM documentation.

Reasonable affinity choices by partition types on the Wilson cluster are: 

Intel (lq1): --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no
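
A minimal batch sketch applying these options (the partition, node counts, and executable are placeholders):

#!/bin/sh
#SBATCH --partition=cpu_gce
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Cyclic task distribution with socket-level CPU binding and no memory binding
srun --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no ./my_mpi_app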
Additional useful information
A quick two-page summary of SLURM Commands
Quick Start SLURM User Guide
Comparison between SLURM and other popular batch schedulers
Official SLURM documentation