Dispatch priority under Slurm

The Scientific Program Committee allocates resources during each program year. Each project is allocated a certain number of hours on various resources, such as CPU core hours, GPU hours, or time on an Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage both project access to partitions and job dispatch priority. This is all part of maintaining Fair Share usage of allocated resources.

Slurm prioritization

The only prioritization that is managed by Slurm is the dispatch or scheduling priority. All users submit their jobs to be run by Slurm on a particular resource, such as a partition. On a billable or allocated partition, the projects that have allocated time available should run before those that do not have an allocation. This is true regardless of whether it is a Type A, B or C allocation. An unallocated project is said to be running opportunistically.

Partitions at Fermilab

We currently have two partitions within the Fermilab Lattice QCD Computing Facility as shown in the table below. Both these partitions are billable against an allocation. There are limits in place to make sure that at least two (in some cases three) projects can be active at any given time. These limits are enforced through Slurm QoS as described in the next section.

| Name | Description | Billable | Total Nodes | Maximum Runtime | Default Runtime |
|---|---|---|---|---|---|
| lq1_cpu | LQ1 CPU CascadeLake | Yes | 179 | 1-00:00:00 | 8:00:00 |
| lq2_gpu | LQ2 A100 GPU cluster | Yes | 18 | 1-00:00:00 | 8:00:00 |

Both these partitions are accessible from the login node (lq.fnal.gov) and can be selected using the --partition directive of Slurm submit commands like sbatch and srun.
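
As a minimal sketch of partition selection, the batch script below requests the lq1_cpu partition through an #SBATCH directive. The node count, time limit, and job name are hypothetical placeholders; only the partition names come from the table above. Run by hand, outside Slurm, the script simply reports that no partition is set.

```shell
#!/bin/bash
# Partition to run on; use lq2_gpu for the A100 GPU cluster.
#SBATCH --partition=lq1_cpu
# Hypothetical job size, time limit and name; the time limit must stay
# within the partition maximum of 1-00:00:00.
#SBATCH --nodes=1
#SBATCH --time=02:00:00
#SBATCH --job-name=partition-demo

# Slurm exports SLURM_JOB_PARTITION inside a job; fall back when run by hand.
partition="${SLURM_JOB_PARTITION:-none (not running under Slurm)}"
echo "Partition: ${partition}"
```

Saved as, say, myjob.sh (a hypothetical file name), the script would be submitted with "sbatch myjob.sh". Options given on the sbatch command line, such as "sbatch --partition=lq2_gpu myjob.sh", take precedence over the directives in the script.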

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (or Quality of Service) configuration. Admins assign parameters to a QoS that are used to manage dispatch priority and resource use limits.

| Name | Description | Priority | Global Resource Constraints | Max Walltime | Per Account Constraints | Per User Constraints |
|---|---|---|---|---|---|---|
| admin | admin testing | 600 | None | None | None | None |
| test | quick tests of scripts | 500 | Max nodes = 2; Max GPUs = 4 | 00:30:00 | Max jobs = 3 | Max jobs = 1 |
| normal | QoS available to accounts with allocations | 250 | None | 1-00:00:00 | Max jobs = 125; Max nodes = 128; Max GPUs = 40 | |
| opp | QoS available to all accounts for opportunistic usage | 10 | None | 08:00:00 | Max jobs = 125; Max nodes = 64; Max GPUs = 4 | |
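
Because these limits can change, it may help to compare the table with the live configuration. The sketch below queries the QoS definitions with sacctmgr. The mapping of table columns to Slurm fields is an assumption (global constraints as GrpTRES, per account constraints as MaxTRESPA and MaxJobsPA, per user constraints as MaxJobsPU), and the site may restrict what ordinary users can query.

```shell
#!/bin/sh
# Columns to request. The mapping to the table columns is an assumption,
# not something the site documentation confirms.
QOS_FIELDS="Name,Priority,MaxWall,GrpTRES,MaxTRESPA,MaxJobsPA,MaxJobsPU"

if command -v sacctmgr >/dev/null 2>&1; then
    # -P prints pipe-delimited output without a trailing delimiter.
    sacctmgr -P show qos format="${QOS_FIELDS}"
else
    echo "sacctmgr not found; run this on a Slurm login node such as lq.fnal.gov"
fi
```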

A few notes about the resource constraints:

  • Global resource constraints are enforced across all the accounts in the cluster. For example, test QoS restricts access to 4 GPUs globally. If account A is using 4 GPUs, account B has to wait until the resources are freed up.
  • Per account constraints are enforced on an account basis. For example, normal QoS restricts access to 128 nodes. If user A in account X is using 128 nodes, user B has to wait until resources are freed up. User C belonging to a different account Y can continue to run in such a scenario.
  • Similarly, per user constraints are enforced on a user basis. For example, test QoS restricts the number of jobs per user to 1. This means a single user, regardless of their account, cannot run more than one job at a time under this QoS.
  • Finally, these constraints may be relaxed or adjusted from time to time based on the job mix and to maximize cluster utilization.
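
The per account example above can be worked through with a small sketch. This only illustrates the arithmetic of a limit check; it is not how Slurm implements enforcement. The fits_limit helper is hypothetical, and the 128-node figure is the normal QoS per account limit from the table.

```shell
#!/bin/sh
# fits_limit USED REQUESTED MAX
# Succeeds when the request still fits under the limit; otherwise the job waits.
fits_limit() {
    [ $(( $1 + $2 )) -le "$3" ]
}

# Per account node limit of the normal QoS, taken from the QoS table.
NORMAL_MAX_NODES_PER_ACCOUNT=128

# Account X already uses 128 nodes, so one more node for any of its users must wait.
if fits_limit 128 1 "$NORMAL_MAX_NODES_PER_ACCOUNT"; then x_state="runs"; else x_state="waits"; fi

# Account Y uses no nodes, so the same request can be scheduled.
if fits_limit 0 1 "$NORMAL_MAX_NODES_PER_ACCOUNT"; then y_state="runs"; else y_state="waits"; fi

echo "account X: ${x_state}"
echo "account Y: ${y_state}"
```

The same check applies at the other two scopes; only the set of jobs counted toward the used total changes (all accounts for a global limit, one user for a per user limit).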

A few notes about available QoS:

  • Users can select a QoS with the --qos option of their submit commands. The default QoS for all projects is opp. Jobs running under this QoS have the lowest priority and will only start when there are no eligible normal QoS jobs waiting in the queue. When a project uses up all of the hours it was allocated for the program year, its jobs are limited to the opp QoS.
  • We have defined the test QoS for users to run small test jobs to see that their scripts work and their programs run as expected. These test jobs run at a relatively high priority so that they will start as soon as nodes are available. Any user can have no more than three jobs submitted and no more than one job running at any given time.
  • The normal QoS is only available to projects (or accounts) with approved allocations. As long as an account stays within its resource constraint limits, its jobs are scheduled as soon as resources become available.
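
A minimal sketch of selecting a QoS at submit time follows. The max_walltime helper is hypothetical and simply repeats the Max Walltime values from the QoS table, which may be adjusted over time; the account name and script name are placeholders. The sketch only prints the command so that it can be reviewed before being run on the login node.

```shell
#!/bin/sh
# max_walltime QOS: print the Max Walltime listed for a QoS in the table.
max_walltime() {
    case "$1" in
        test)   echo "00:30:00" ;;
        normal) echo "1-00:00:00" ;;
        opp)    echo "08:00:00" ;;
        *)      echo "unknown QoS: $1" >&2; return 1 ;;
    esac
}

QOS="test"
# Hypothetical account and batch script names.
ACCOUNT="myproject"
JOB_SCRIPT="myjob.sh"

# Request the longest walltime the chosen QoS allows.
SUBMIT_CMD="sbatch --qos=${QOS} --account=${ACCOUNT} --time=$(max_walltime "$QOS") ${JOB_SCRIPT}"
echo "$SUBMIT_CMD"
```

Requesting less time than the QoS maximum is usually preferable when the job is known to be short, since shorter jobs are easier for the scheduler to place.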

Slurm commands to see current priorities

To see the list of jobs currently in the queue by partition, visit our HPC cluster status monitoring page and select the LQCD Cluster Status dashboard. By default, jobs are sorted by state, but clicking a column header changes the sort order. A Username filter at the top of the dashboard shows the queued jobs for a given user. The dashboard also shows the submit time, start time, and end time of each job along with its priority.

From the command line, Slurm's squeue command lists the jobs in the queue, both running jobs and those waiting to be started (dispatched). Changing the format of the command output reveals further details about each job, such as:

  • Start time – actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Current calculated dispatch priority of the job
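
One way to request these fields is a custom squeue format string, sketched below. The account name is a hypothetical placeholder and the column widths are arbitrary choices; the format codes themselves are standard squeue specifiers.

```shell
#!/bin/sh
# Hypothetical account; use your own project name.
ACCOUNT="myproject"

# %i job ID, %P partition, %u user, %q QoS, %T state,
# %S actual or expected start time, %Q integer priority, %r pending reason
SQUEUE_FORMAT="%.10i %.9P %.10u %.8q %.10T %.20S %.10Q %r"

if command -v squeue >/dev/null 2>&1; then
    squeue -A "$ACCOUNT" -o "$SQUEUE_FORMAT"
else
    echo "squeue not found; run this on a Slurm login node such as lq.fnal.gov"
fi
```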

The following is just a sample output. Use your project name after the -A option to get a listing of jobs for your account.