Dispatch priority under Slurm

The Software Program Committee allocates resources during each program year. Each project is allocated a certain amount of time on various resources, such as CPU core-hours, GPU-hours, or time on an Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage which projects may access each partition and with what job dispatch priority. This is all part of maintaining fair-share usage of allocated resources.

Slurm prioritization

The only prioritization managed by Slurm is the dispatch, or scheduling, priority. All users submit their jobs to Slurm to be run on a particular resource, such as a Partition. On a billable or allocated partition, the projects that have allocated time available should run before those that do not have an allocation. This is true regardless of whether it is a Type A, B or C allocation. An unallocated project is said to be running opportunistically.
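
For illustration, a batch job is tied to a project account (and through it a QoS) when it is submitted. A minimal sbatch invocation might look like the following sketch; the account name "myproject" and the script name are placeholders, not real identifiers:

    # Submit a batch script against a specific project account on the lq1csl partition.
    # "myproject" and "my_job.sh" are placeholders for illustration only.
    sbatch --account=myproject --partition=lq1csl --nodes=4 --time=04:00:00 my_job.sh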

Partitions at Fermilab

We currently have a single partition within the Fermilab Lattice QCD Computing Facility: the LQ1 cluster's ‘lq1csl’ CPU computing partition. It is billable against an allocation. Limits are in place to ensure that at least two (in some cases three) projects can be active at any given time.

Name      Description            Billable   TotalNodes   MaxNodes   MaxTime      DefaultTime
lq1csl    LQ1 CPU CascadeLake    Yes        183          64         1-00:00:00   8:00:00
LQ1 cluster – Submit host: lq.fnal.gov
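
To check the configured partition limits yourself, standard Slurm commands can report them; for example:

    # Show the limits configured for the lq1csl partition (MaxTime, DefaultTime, node counts, etc.)
    scontrol show partition lq1csl

    # One-line summary of node availability in the partition
    sinfo -p lq1csl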

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (or Quality of Service) configuration. Admins assign parameters to a QoS that are used to manage dispatch priority and resource use limits. Additional limits can be defined at the Account or Partition level.

Name     Description                  Priority   GrpTRES   MaxWall    MaxJobsPU   MaxSubmitPA
admin    admin testing                600
test     quick tests of scripts       500        cpu=80    00:30:00   1           3
normal   Normal QoS (default)         250                                         125
opp      unallocated/opportunistic    10                   08:00:00               125
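
The QoS definitions can also be listed from the command line; a query along these lines, using standard sacctmgr field names, should show the same fields as the table above:

    # List the defined QoS entries with their priorities and limits
    sacctmgr show qos format=Name,Priority,GrpTRES,MaxWall,MaxJobsPU,MaxSubmitPA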

The default QoS for all allocated projects is called normal. The default QoS for all projects without a current allocation is called opp (opportunistic). Jobs running in the opp QoS are all dispatched at the same priority, but they will not start if there are normal jobs waiting in the queue. Both run with a default wall-time of 8 hours. Jobs in the normal QoS have a maximum wall-time of 24 hours.

We have defined a test QoS for users to run small test jobs to verify that their scripts work and their programs run as expected. These test jobs run at a relatively high priority so that they start as soon as nodes are available. Any user can have no more than three test jobs submitted and no more than one running at any given time. Test jobs are limited to 30 minutes of wall-time and two nodes (a limit of cpu=80).
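
For example, a small test job might be submitted along these lines (the script name is a placeholder):

    # Submit a two-node test job under the test QoS, within the 30-minute wall-time limit
    sbatch --qos=test --partition=lq1csl --nodes=2 --time=00:30:00 test_job.sh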

We also have a QoS named opp for opportunistic or unallocated running. This QoS has a simple priority of 10 and a wall-time limit of just 8 hours. Opportunistic jobs will only run when there are nodes sitting idle. When a project uses up all of the hours it was allocated for the program year, its jobs are limited to the opp QoS.
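
To check which accounts and QoS values your user may submit under, one option is to list your Slurm associations:

    # Show the account/partition/QoS associations defined for the current user
    sacctmgr show associations user=$USER format=Account,Partition,QOS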

Slurm commands to see current priorities

To see the list of jobs currently in the queue, by partition, visit our cluster status web page. Click on the “Start Time” column header to sort the table by start time. For running jobs, this is the actual time the job started. Following those are the pending jobs, in the predicted order in which they will start.

From a command line, Slurm’s ‘squeue’ command lists the jobs that are queued. It includes running jobs as well as those waiting to be started, aka dispatched. By changing the format of the command’s output, one can get a lot of additional information (see the sketch after this list), such as:

  • Start time – actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Calculated dispatch real-time priority of the job
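
A sketch of such a query, using standard squeue format codes, is shown below; adjust the fields and column widths to suit:

    # List jobs in the lq1csl partition, showing the QoS, calculated dispatch priority,
    # actual or expected start time, job state, and the reason a pending job is waiting
    squeue -p lq1csl -o "%.10i %.9u %.8q %.10Q %.20S %.8T %.20r"

    # sprio breaks down the calculated priority of each pending job into its components
    sprio -l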

The following is just a sample output. Use your project name after the “-A” option to get a listing of jobs for your account.