Job dispatch explained

Each VO/Experiment is granted access to various resources, such as CPU core hours, GPU hours, or other hardware platforms, on our Institutional Cluster. We use a Slurm feature called QoS (Quality of Service) to manage both project access to partitions and job dispatch priority. This is all part of maintaining Fair Share usage of the available resources.

Project Types

We have two types of projects defined in Slurm. Basic projects are defined for each Fermilab experiment (also known as a VO, or Virtual Organization); these projects can submit opportunistic jobs via Slurm. Groups that have filed a more detailed and specific project request have higher-priority access via the regular QoS.

Slurm prioritization

The only prioritization that is managed by Slurm is the dispatch or scheduling priority. All users submit their jobs to be run by Slurm on a particular resource within one of several partitions. We do not use any form of preemption on our cluster.

Partitions at Fermilab on the Wilson Cluster

We currently have several partitions within the Fermilab Wilson Cluster. There are limits in place to make sure that at least two (in some cases three) projects can be active on a partition at any given time.

| Partition (queue) | Description | Total Nodes / Max Nodes | Max Time | Default Time |
|---|---|---|---|---|
| cpu_gce | 2.6 GHz Intel E5-2650v2 “Ivy Bridge” (16 cores/node), 8GB/core memory, ~280GB local scratch disk, inter-node QDR (40Gbps) Infiniband. | 90 / 50 | 48:00:00 | 8:00:00 |
| cpu_gce_test | Same as cpu_gce above | 10 / 5 | 4:00:00 | 1:00:00 |
| gpu_gce | Same as cpu_gce above and with 4x NVIDIA Kepler K40 GPU/node | 12 | 48:00:00 | 8:00:00 |
| gpu_gce_ppc | 3.8GHz Dual CPU Sixteen Core IBM Power9 | 1 | 48:00:00 | 8:00:00 |
| gpu_ose | Same as above but only authorized to accept jobs from the Open Science Grid. | 21 | 48:00:00 | 8:00:00 |
| knl_gce | 1.30GHz Single CPU 64 Core Intel Xeon Phi | 11 | 48:00:00 | 8:00:00 |

Slurm QoS defined at Fermilab

Jobs submitted to Slurm are associated with an appropriate QoS (or Quality of Service) configuration. Admins assign parameters to a QoS that are used to manage dispatch priority and resource use limits. Additional limits can be defined at the Account or Partition level.

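| Name | Description | Priority | MaxWall | MaxJobs Per User / MaxSubmit Per User |
|---|---|---|---|---|
| admin | admin testing | 100 | | |
| test | quick tests of scripts | 75 | 04:00:00 | 1 / 3 |
| pilot | OSG Pilot (gpu_ose) | 25 | 48:00:00 | |
| regular | Normal QoS (default) | 25 | 24:00:00 | |
| opp | opportunistic | 0 | 08:00:00 | 50 |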

The default QoS for all projects is called opp. The opp QoS has shorter wall-time limits so that jobs turn over faster, allowing others to share the system resources. Projects that have requested and been approved for greater access can run in the regular QoS. Jobs in the opp QoS are dispatched at low priority and generally will not start while higher-priority jobs are waiting in the regular QoS. The exception is backfill: a job that needs only a small amount of resources can run while the system is waiting for enough resources to free up for a larger, higher-priority job.
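As a rough illustration of how a job ends up in one QoS or the other, here is a minimal batch script sketch that requests the opp QoS explicitly. The project name myproject and the job name are hypothetical placeholders, and the script body is only a stand-in for real work. A project approved for greater access would presumably request --qos=regular instead and could ask for a longer wall time, up to the regular MaxWall.

```shell
#!/bin/bash
#SBATCH --job-name=opp_example
#SBATCH --account=myproject      # hypothetical project name; use your own
#SBATCH --partition=cpu_gce
#SBATCH --qos=opp                # opportunistic QoS described above
#SBATCH --nodes=1
#SBATCH --time=08:00:00          # opp MaxWall from the QoS table

# Slurm sets SLURM_JOB_ID inside a job; the fallback lets the script
# also run outside Slurm as a dry run.
job_id="${SLURM_JOB_ID:-dry-run}"
node_name="$(uname -n)"
echo "Job ${job_id} started on ${node_name}"
```

Saved under a name such as opp_example.sh (also hypothetical), the script would be submitted with `sbatch opp_example.sh`.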

We have defined a test QoS for users to run small test jobs to check that their scripts work and their programs run as expected. A separate cpu_gce_test partition has nodes dedicated to test jobs. These test jobs are dispatched at a relatively high priority so that they start as soon as nodes are available. Each user can have at most three test jobs submitted and at most one test job running at any given time. Test jobs are limited to 4 hours of wall time and up to five nodes.
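The test limits can be checked before submitting. The sketch below is only an illustration: the helper names to_seconds and fits_test_qos are made up for this example, and it checks just the per-job limits quoted above (five nodes, 4 hours), not the per-user job counts.

```shell
#!/bin/sh
# Convert a Slurm time string (HH:MM:SS or D-HH:MM:SS) to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed only if the request stays within the test QoS per-job limits:
# at most 5 nodes and at most 4 hours (14400 seconds) of wall time.
fits_test_qos() {
    nodes=$1
    seconds=$(to_seconds "$2")
    [ "$nodes" -le 5 ] && [ "$seconds" -le 14400 ]
}

# Example: a 2-node, 30-minute request is within the limits.
if fits_test_qos 2 00:30:00; then
    echo "request fits the test QoS limits"
fi
```

A request that passes could then be submitted with something like `sbatch --partition=cpu_gce_test --qos=test --nodes=2 --time=00:30:00 job.sh`, where job.sh is a placeholder for your own script.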

Slurm Commands to see current priorities

To see the list of jobs currently in the queue by partition, visit our cluster status web page. Click on the “Start Time” column header to sort the table by start time. For running jobs, this is the actual time that the jobs started. Following those are the pending jobs, in the order they are predicted to start. You can also click on the “Submit Time” column header to see which jobs have been waiting the longest.

From the command line, Slurm's squeue command lists the jobs that are queued, including running jobs as well as those waiting to be started (dispatched). By changing the format of the command's output, you can get information such as:

  • Start time – actual or predicted
  • QoS the job is running under
  • Reason that the job is pending
  • Calculated real-time dispatch priority of the job

Add your project name after the --account= option to list the jobs for your account. You can use a comma-separated list of account names, as in the first command below. The longer command after it is shown with sample output.

squeue --sort=P,-p --account=dune,dunepro

[kschu@wc ~]$ /usr/bin/squeue --sort=P,-p --Format=Account:.10,UserName:.10,JobID:.8,Name:.12,MaxNodes:.8,PriorityLong:.9,State:.2,QOS:.8,SubmitTime:.16,StartTime:.16,TimeLimit:.11,Reason --account=novapro,triad_trg,desy3,qis_algo,g4p
Wed Mar 24 12:39:44 CDT 2021
ACCOUNT   USER     JOBID NAME         MAX_NODE PRIORITY ST QOS     SUBMIT_TIME      START_TIME       TIME_LIMIT REASON
novapro   grohmc   83327 CVN_BASE     1        81569    PE regular 2021-03-24T11:34 2021-03-24T17:16 1-00:00:00 Resources
novapro   grohmc   83326 CVN_BASE     1        81295    RU regular 2021-03-24T11:34 2021-03-24T11:34 1-00:00:00 None
novapro   grohmc   83325 cvn          1        81295    RU regular 2021-03-24T11:32 2021-03-24T11:32 1-00:00:00 None
triad_trg junmuthy 83287 run_t0.1992_ 1        66634    RU regular 2021-03-23T15:50 2021-03-24T03:49 12:00:00   None
desy3     kherner  83314 desy3-lowz_l 40       63925    PE regular 2021-03-24T10:35 2021-03-24T13:28 1-00:00:00 Resources
qis_algo  macridin 83313 d32k32.0_res 16       63531    RU regular 2021-03-24T10:20 2021-03-24T10:21 8:00:00    None
g4p       syjun    83309 bash         1        257      RU opp     2021-03-24T09:16 2021-03-24T09:16 8:00:00    None
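For scripting, a delimiter-separated format is easier to process than the fixed-width listing above. The sketch below is one possible approach, not a site-provided tool: the function name pending_by_priority and the pipe-delimited layout are choices made for this example. It uses squeue's lowercase --format option, where %a is the account, %i the job ID, %Q the integer priority, %t the compact state (PD for pending, R for running), %q the QoS, and %r the reason.

```shell
#!/bin/sh
# Print "JOBID PRIORITY REASON" for pending jobs, highest priority first.
# Input: pipe-delimited lines "account|jobid|priority|state|qos|reason".
pending_by_priority() {
    awk -F'|' '$4 == "PD" { print $2, $3, $6 }' | sort -k2,2nr
}

# On a login node where squeue is available, feed it live queue data.
# Elsewhere this step is skipped so the function can be tried on saved input.
if command -v squeue >/dev/null 2>&1; then
    squeue --noheader --format='%a|%i|%Q|%t|%q|%r' | pending_by_priority
fi
```

Adding an --account= option to the squeue call would restrict the listing to your own projects, as in the commands above.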