Slurm on Fermilab USQCD Clusters

The Slurm workload manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is an open-source, fault-tolerant, and highly scalable resource manager and job scheduling system developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, Slurm is used extensively on most of the Top 500 supercomputers around the globe.

If you have questions about job dispatch priorities on the Fermilab LQCD clusters, please visit this page or email us at hpc-admin@fnal.gov.

Slurm Commands

You must log in to the appropriate submit host (see Start Here in the graphics above) in order to run Slurm commands for your accounts and resources.

  • scontrol and squeue: Job control and monitoring.
  • sbatch: Submit batch jobs.
  • salloc: Request an interactive job session.
  • srun: Launch a job or job step.
  • sinfo: Node information and cluster status.
  • sacct: Job and job step accounting data.
  • Useful environment variables include $SLURM_NODELIST and $SLURM_JOBID.
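A few common invocations are shown below (using the placeholder user johndoe and the job ID 46 from the batch example later on this page):

[@lq ~]$ squeue -u johndoe        # list your pending and running jobs
[@lq ~]$ scontrol show job 46     # detailed information about a specific job
[@lq ~]$ sinfo -p lq1_cpu         # node and partition status for lq1_cpu
[@lq ~]$ sacct -j 46 --format=JobID,JobName,State,Elapsed   # accounting data for a job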

Slurm User Accounts

To check your “default” Slurm account, use the following command:

[@lq ~]$ sacctmgr list user name=johndoe
      User   Def Acct       Admin  
----------  ---------- ----------
   johndoe    project       None 

To check “all” the Slurm accounts you are associated with, use the following command:

[@lq ~]$ sacctmgr list user name=johndoe withassoc       
User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------    
johndoe   projectx      None     lq   projecta                    1                                                                            opp,regular,test       opp 
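If you are associated with more than one account, a job can be charged to a specific (non-default) account and QoS at submission time. A sketch, using the account name from the listing above:

[@lq ~]$ sbatch --account=projecta --qos=opp myscript.sh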

Slurm Resource Types

For details on available resource types and their associated constraints, please visit this page. In summary, resources are available across two partitions and through three different Slurm QoS, as shown in the table below.

Partition: lq1_cpu
  • Resource type: 2.50GHz Intel Xeon Gold 6248 “Cascade Lake”, 196GB memory per node (4.9GB/core), EDR Omni-Path
  • Number of resources: 179 worker nodes with 40 CPU cores each
  • Available QoS: normal (for approved allocations), opp (for opportunistic usage), test (for quick testing)
  • Slurm directives for access: --partition=lq1_cpu for partition selection; node allocation is done using the standard directives such as --nodes, --ntasks-per-node, etc.

Partition: lq2_gpu
  • Resource type: NVIDIA A100 GPU nodes
  • Number of resources: 18 worker nodes with 4 GPUs each
  • Available QoS: normal (for approved allocations), opp (for opportunistic usage), test (for quick testing)
  • Slurm directives for access: --partition=lq2_gpu for partition selection; to request two GPUs use --gres=gpu:2, --gres=gpu:a100:2, or --gpus=a100:2
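The QoS is selected with the --qos directive. As a sketch, a quick-test job on the CPU partition might be requested with directives along these lines (the account name is a placeholder; use one of your own Slurm accounts):

#SBATCH --partition=lq1_cpu
#SBATCH --qos=test
#SBATCH --account=myproject      # placeholder account name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40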

Using Slurm: examples

Submit an interactive job requesting 12 worker nodes:

[@lq ~]$ srun --pty --nodes=12 --ntasks-per-node=40 --partition lq1_cpu bash
[user@lq1wn001:~]$ env | grep NTASKS
SLURM_NTASKS_PER_NODE=40
SLURM_NTASKS=480
[user@lq1wn001:~]$ exit
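Alternatively, salloc can be used to obtain an allocation and then launch commands on the allocated nodes with srun. A minimal sketch:

[@lq ~]$ salloc --nodes=2 --ntasks-per-node=40 --partition=lq1_cpu
[@lq ~]$ srun --ntasks-per-node=1 hostname   # prints one hostname per allocated node
[@lq ~]$ exit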

Submit a batch job requesting 4 worker nodes:

[@lq ~]$ cat myscript.sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=4
 
# print hostname of worker node
hostname
sleep 5
exit
 
[@lq ~]$ sbatch myscript.sh
Submitted batch job 46

Once the batch job completes, the output is available as follows:

[@lq ~]$ cat slurm-46.out
 lq1wn053.fnal.gov
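Accounting data for the completed job can then be queried with sacct, for example:

[@lq ~]$ sacct -j 46 --format=JobID,JobName,Partition,State,Elapsed,NodeList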

Submit a batch job requesting 2 GPUs:

[@lq ~]$ cat myscript.sh
#!/bin/sh
#SBATCH --account=scd_csi
#SBATCH --qos=normal
#SBATCH --partition=lq2_gpu
#SBATCH --gpus=a100:2

# print hostname of worker node
hostname
sleep 5
exit
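To confirm which GPUs were assigned, the script can report the GPU-related environment and the device list before the workload starts. A sketch, assuming Slurm exports CUDA_VISIBLE_DEVICES for GPU allocations and that nvidia-smi is available on the GPU worker nodes:

# add to the GPU job script above, before the workload
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L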

Slurm Reporting

The lquota command run on lq.fnal.gov will provide allocation usage reporting as shown below.

[amitoj@lq ~]$ lquota
 Last Updated on: Tue Feb 23 15:00:01 CST 2021
 |--------------- |--------------------- |-------------- |--------------- |--------
 | Account        | Used Sky-ch on LQ1   | Pace          | Allocation     | % Used
 |                | since Jul-1-2020     | MM-DD-YYYY    | in Sky-ch      |
 |--------------- |--------------------- |-------------- |--------------- |--------
 | chiqcd         | 7,825,317            | Jun-29-2021   | 11,992,366     | 65%
 | gluonpdf       | 18,525               | Jan-4-2038    | 500,000        | 4%
 | hadstruc       | 3,269,152            | Apr-26-2021   | 4,137,963      | 79%
 | hiq2ff         | 36                   | Dec-15-3822   | 100,000        | 0%
 | lattsusy       | 452,441              | May-11-2021   | 600,000        | 75%
 | lp3            | 6,178,070            | Oct-23-2021   | 12,500,000     | 49%
<---snip--->

lq1-ch = lq1-core-hour, Sky-ch = Sky-core-hour, 1 lq1-ch = 1.05 Sky-ch
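As a worked example of the conversion, a 4-node, 24-hour job on lq1_cpu (40 cores per node) consumes 4 × 40 × 24 = 3,840 lq1-core-hours, i.e. 3,840 × 1.05 = 4,032 Sky-core-hours under the conversion above.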

For questions regarding the reports, or if you notice discrepancies in the data, please email us at hpc-admin@fnal.gov.

Slurm Environment Variables

Variable Name         | Description                                                 | Example Value        | PBS/Torque analog
----------------------|-------------------------------------------------------------|----------------------|-------------------
$SLURM_JOB_ID         | Job ID                                                      | 5741192              | $PBS_JOBID
$SLURM_JOBID          | Deprecated. Same as $SLURM_JOB_ID                           |                      |
$SLURM_JOB_NAME       | Job Name                                                    | myjob                | $PBS_JOBNAME
$SLURM_SUBMIT_DIR     | Submit Directory                                            | /project/charmonium  | $PBS_O_WORKDIR
$SLURM_JOB_NODELIST   | Nodes assigned to job                                       | lq1wn00[1-5]         | cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST    | Host submitted from                                         | lq.fnal.gov          | $PBS_O_HOST
$SLURM_JOB_NUM_NODES  | Number of nodes allocated to job                            | 2                    | $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE   | Number of cores/node                                        | 8,3                  | $PBS_NUM_PPN
$SLURM_NTASKS         | Total number of cores for job                               | 11                   | $PBS_NP
$SLURM_NODEID         | Index to node running on relative to nodes assigned to job  | 0                    | $PBS_O_NODENUM
$SLURM_LOCALID        | Index to core running on within node                        | 4                    | $PBS_O_VNODENUM
$SLURM_PROCID         | Index to task relative to job                               | 0                    | $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID  | Job Array Index                                             | 0                    | $PBS_ARRAYID
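As a quick sketch, these variables can be inspected from inside a batch job:

[@lq ~]$ cat envtest.sh
#!/bin/sh
#SBATCH --job-name=envtest
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

# report a few of the Slurm-provided variables for this job
echo "Job ID:      $SLURM_JOB_ID"
echo "Job name:    $SLURM_JOB_NAME"
echo "Node list:   $SLURM_JOB_NODELIST"
echo "Num nodes:   $SLURM_JOB_NUM_NODES"
echo "Num tasks:   $SLURM_NTASKS"
echo "Submit dir:  $SLURM_SUBMIT_DIR"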

Binding and Distribution of tasks

A good description of MPI process affinity and binding for srun is available here.

A reasonable distribution/affinity choice for the lq1_cpu partition of the Fermilab LQCD clusters is:

srun --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no
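Inside a batch script this might look like the following sketch (./my_mpi_app is a placeholder for your MPI executable; see the MPI launch page referenced below for the recommended launch options):

#!/bin/sh
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40

# launch the MPI application with the distribution/binding choices above
srun --distribution=cyclic:cyclic --cpu_bind=sockets --mem_bind=no ./my_mpi_app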

Launching MPI processes

Please refer to the following page for recommended MPI launch options.

Additional useful information