MPI: affinity and binding

Launching MPI processes with srun

The MPI implementations Open MPI, MVAPICH, and Intel MPI are Slurm-aware: they detect Slurm and use its services to distribute and start MPI binaries. The Slurm srun command must be told which API to use for MPI. The command

$ srun --mpi=list
MPI plugin types are...
specific pmix plugin versions available: pmix_v4,pmix_v5

lists the supported APIs.

The table below lists the recommended launcher for each MPI implementation. These combinations have been shown to work. Combinations that are not listed either fail or do not launch MPI properly.

MPI implementation    Launcher
------------------    ---------------
Open MPI              srun --mpi=pmix
Intel MPI             srun --mpi=pmi2
nvhpc (Open MPI)      srun --mpi=pmix
MVAPICH               srun --mpi=pmi2
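
For job scripts that are shared across several MPI stacks, the recommendations above can be encoded in a small shell helper. This is only a minimal sketch: the function name mpi_launcher and the short keys (openmpi, intelmpi, nvhpc, mvapich) are hypothetical and chosen for illustration; the launcher strings are the ones recommended above.

```bash
# Hypothetical helper: print the recommended launcher for an MPI stack.
# The keys are illustrative names, not module names.
mpi_launcher() {
  case "$1" in
    openmpi|nvhpc)    echo "srun --mpi=pmix" ;;
    intelmpi|mvapich) echo "srun --mpi=pmi2" ;;
    *)
      # Unlisted combinations are not known to work, so refuse to guess.
      echo "no recommended launcher for '$1'" >&2
      return 1
      ;;
  esac
}
```

A script could then write launcher=$(mpi_launcher openmpi) and stop early if the helper returns a non-zero status.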

Binding and distribution of tasks

The srun command provides command-line options that specify how MPI ranks are distributed and bound to CPU cores and local memory. Careful specification of distribution and affinity is especially important for hybrid jobs that combine MPI with thread parallelism. TU Dresden has a nice compendium illustrating different distribution and binding options for MPI ranks and threads.
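
As an illustration, the sketch below assembles an srun command line for a hybrid MPI plus threads job. The options --cpus-per-task, --cpu-bind, and --distribution are standard srun options; the helper name hybrid_srun_cmd, the values cores and block:block, and the binary name are assumptions made for this example and are not site recommendations.

```bash
# Hypothetical helper: build (but do not run) an srun command for a hybrid job.
#   $1  cores reserved for each MPI rank (one per thread)
#   $2  value for --distribution, e.g. block:block or block:cyclic
#   $3  path to the MPI binary
hybrid_srun_cmd() {
  local cpus_per_task="$1" distribution="$2" bin="$3"
  echo "srun --mpi=pmix --cpus-per-task=${cpus_per_task} --cpu-bind=cores --distribution=${distribution} ${bin}"
}

# Print the command first so it can be checked in the batch output.
echo "CMD: $(hybrid_srun_cmd 16 block:block ./my_hybrid_app)"
```

Printing the command before running it makes it easy to compare the requested binding with the affinities that the job actually reports.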

LQ2 GPU workers

Each LQ2 worker is equipped with four NVIDIA A100-80 GPU devices interconnected by an NVLink mesh. Each worker is a dual-socket system with 3rd Gen AMD EPYC 7543 32-core processors (64 cores total) and has two InfiniBand adapters. The figure below shows the topology reported by the hwloc-ls command.

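The nvidia-smi command reports the affinities between each GPU and the other system resources. In the output below, each GPU row lists the link type to every GPU and to the two NICs, followed by the CPU affinity and NUMA affinity of that GPU.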

$ nvidia-smi topo -m
GPU0	 X 	NV4	NV4	NV4	PXB	SYS	0-31	0		N/A
GPU1	NV4	 X 	NV4	NV4	PXB	SYS	0-31	0		N/A
GPU2	NV4	NV4	 X 	NV4	SYS	PXB	32-63	1		N/A
GPU3	NV4	NV4	NV4	 X 	SYS	PXB	32-63	1		N/A

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
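
The CPU and NUMA affinity of each GPU can be pulled out of this output with a small filter, which is convenient when deciding which cores to bind to which GPU. The sketch below assumes the row layout shown above, where the first purely numeric field in a GPU row is the CPU affinity and the next field is the NUMA affinity; the function name gpu_affinity is hypothetical.

```bash
# Hypothetical helper: read "nvidia-smi topo -m" output on stdin and print
# one line per GPU with its CPU affinity and NUMA affinity.
gpu_affinity() {
  awk '/^GPU[0-9]+[ \t]/ {
    # Link-type fields (X, NV4, PXB, SYS, ...) never look like a CPU list,
    # so the first field shaped like "0-31" or "0-15,32-47" is the CPU affinity.
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$/) {
        print $1, "cpus=" $i, "numa=" $(i + 1)
        break
      }
    }
  }'
}
```

On a worker node it would be used as nvidia-smi topo -m | gpu_affinity. For the first row above it prints GPU0 cpus=0-31 numa=0.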

Example LQ2 batch script

#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq2_gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00

module purge
module load gompi ucx_cuda ucc_cuda

# enable RDMA and performance tuning options
export UCX_RNDV_THRESH=1mb


(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}

cat /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/gpu-topo.txt

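# benchmark binary and its arguments (none), as echoed in the batch output
bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
args=""

# four tasks per node: one mask of 16 consecutive cores per task
cpumask=0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000

bind="--gpu-bind=none --cpus-per-task=${SLURM_CPUS_PER_TASK} --cpu-bind=mask_cpu:${cpumask}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}

exit 0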

Here is the batch output from the script above

GPU    bus-id    CPU-affinity  preferred-NIC  NUMA-affinity
---    --------  ------------  -------------  -------------
 0     00:2F:00     0-31          mlx5_0      0
 1     00:30:00     0-31          mlx5_0      0
 2     00:AF:00     32-63         mlx5_1      1
 3     00:B0:00     32-63         mlx5_1      1

CMD: srun --mpi=pmix --gpu-bind=none --cpus-per-task=16 --cpu-bind=mask_cpu:0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 1 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 3 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 5 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 7 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 9 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=11 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=13 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=15 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 0 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 1 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 2 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 3 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 4 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 5 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 6 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 7 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 9 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=10 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=11 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=12 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=13 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=14 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=15 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 1 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 2 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 3 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 4 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 5 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 6 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 7 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 8 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 9 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=10 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=11 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=12 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=13 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=14 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=15 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 0 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 1 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 2 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 3 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 4 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 5 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 6 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 7 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 8 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 9 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=10 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=11 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=12 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=13 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=14 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=15 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 1 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 3 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 5 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 7 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 9 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=11 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=13 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=15 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 0 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 1 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 2 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 3 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 4 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 5 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 6 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 7 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 9 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=10 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=11 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=12 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=13 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=14 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=15 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 1 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 2 CPU=42 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 3 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 4 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 5 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 6 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 7 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 8 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 9 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=10 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=11 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=12 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=13 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=14 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=15 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 0 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 1 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 2 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 3 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 4 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 5 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 6 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 7 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 8 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 9 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=10 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=11 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=12 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=13 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=14 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=15 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
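
Each of the four masks passed to --cpu-bind=mask_cpu in the command above covers 16 consecutive cores, which is why ranks 0 to 3 on each host report the affinities 0-15, 16-31, 32-47, and 48-63. Such masks can be generated instead of typed by hand. The sketch below is one way to do it; the function names cpu_mask and block_masks are hypothetical. The masks are built digit by digit as strings, so the result does not depend on the shell's integer width, and they are printed without leading zeros, so they are numerically equal to the zero-padded masks above.

```bash
# Print a hex mask with bits first..last (inclusive) set,
# e.g. "cpu_mask 16 31" prints 0xFFFF0000.
cpu_mask() {
  local first="$1" last="$2" mask="" nibble digit bit cpu
  nibble=$(( last / 4 ))
  # Walk the hex digits from most to least significant.
  while [ "${nibble}" -ge 0 ]; do
    digit=0
    for bit in 0 1 2 3; do
      cpu=$(( nibble * 4 + bit ))
      if [ "${cpu}" -ge "${first}" ] && [ "${cpu}" -le "${last}" ]; then
        digit=$(( digit | (1 << bit) ))
      fi
    done
    mask="${mask}$(printf '%X' "${digit}")"
    nibble=$(( nibble - 1 ))
  done
  echo "0x${mask}"
}

# Print a comma-separated mask list: task i gets cores
# i*cpus_per_task .. (i+1)*cpus_per_task - 1.
block_masks() {
  local ntasks="$1" cpus_per_task="$2" task=0 masks="" first last
  while [ "${task}" -lt "${ntasks}" ]; do
    first=$(( task * cpus_per_task ))
    last=$(( first + cpus_per_task - 1 ))
    masks="${masks:+${masks},}$(cpu_mask "${first}" "${last}")"
    task=$(( task + 1 ))
  done
  echo "${masks}"
}
```

In a batch script the list could be produced with cpumask=$(block_masks "${SLURM_NTASKS_PER_NODE}" "${SLURM_CPUS_PER_TASK}"). This simple block layout assumes that consecutive core numbers belong to the same socket, as they do on the LQ2 workers.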


LQ1 CPU-only workers

Each LQ1 worker is a dual-socket system with Intel “Cascade Lake” Xeon Gold 6248 CPUs, for a total of 40 cores per worker. The hardware topology is shown in the diagram below generated by hwloc-ls.

Example LQ1 batch script

#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=5
#SBATCH --time=00:10:00

module purge
module load gompi


(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}

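# benchmark binary and srun options, as echoed in the batch output
bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
args=""
bind="--cpus-per-task=${SLURM_CPUS_PER_TASK}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}

exit 0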

Here is the batch output from running this script

CMD: srun --mpi=pmix --cpus-per-task=5 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
Host=lq1wn001  MPI Rank= 0  OMP Thread=0  CPU= 0  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=1  CPU= 2  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=2  CPU= 4  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=3  CPU= 3  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=4  CPU= 1  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 1  OMP Thread=0  CPU=20  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=2  CPU=24  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=3  CPU=23  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=4  CPU=21  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 2  OMP Thread=0  CPU= 6  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=1  CPU= 9  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=2  CPU= 5  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=3  CPU= 8  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=4  CPU= 7  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 3  OMP Thread=0  CPU=25  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=1  CPU=27  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=2  CPU=28  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=3  CPU=29  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=4  CPU=26  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 4  OMP Thread=0  CPU=10  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=1  CPU=14  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=2  CPU=12  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=3  CPU=11  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=4  CPU=13  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 5  OMP Thread=0  CPU=31  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=1  CPU=33  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=2  CPU=34  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=3  CPU=30  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=4  CPU=32  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 6  OMP Thread=0  CPU=16  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=1  CPU=18  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=2  CPU=19  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=3  CPU=15  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=4  CPU=17  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 7  OMP Thread=0  CPU=36  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=1  CPU=38  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=2  CPU=39  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=3  CPU=35  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=4  CPU=37  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank= 8  OMP Thread=0  CPU= 1  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=1  CPU= 0  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=2  CPU= 3  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=3  CPU= 2  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=4  CPU= 4  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 9  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=2  CPU=23  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=3  CPU=24  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=4  CPU=22  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank=10  OMP Thread=0  CPU= 6  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=1  CPU= 5  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=2  CPU= 7  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=3  CPU= 9  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=4  CPU= 8  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=11  OMP Thread=0  CPU=25  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=1  CPU=29  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=2  CPU=27  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=3  CPU=26  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=4  CPU=28  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=12  OMP Thread=0  CPU=10  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=1  CPU=13  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=2  CPU=12  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=3  CPU=14  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=4  CPU=11  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=13  OMP Thread=0  CPU=30  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=1  CPU=33  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=2  CPU=34  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=3  CPU=32  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=4  CPU=31  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=14  OMP Thread=0  CPU=15  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=1  CPU=16  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=2  CPU=18  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=3  CPU=17  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=4  CPU=19  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=15  OMP Thread=0  CPU=39  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=1  CPU=38  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=2  CPU=37  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=3  CPU=36  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=4  CPU=35  NUMA Node=1  CPU Affinity=35-39
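
Listings like the one above are long, and the placement is easier to check when it is reduced to one line per rank. In the output above, consecutive ranks alternate between NUMA nodes 0 and 1, and a summary makes that pattern visible at a glance. The sketch below is one possible filter; the function name rank_summary is hypothetical, and the pattern is written to accept both the LQ1 field names ("MPI Rank") and the LQ2 field names ("MPI-Rank").

```bash
# Hypothetical helper: reduce xthi output on stdin to one line per MPI rank
# with host, rank, NUMA node, and CPU affinity.
rank_summary() {
  sed -E -n 's/^Host=([^ ]+) +MPI[ -]Rank= *([0-9]+) .*NUMA[ -]Node=([0-9]+) +CPU[ -]Affinity= *([0-9,-]+).*/\1 rank=\2 numa=\3 cpus=\4/p' |
    awk '!seen[$0]++'   # drop the repeated per-thread lines, keep the order
}
```

It could be applied to the job output file, for example rank_summary < slurm-JOBID.out, where the file name is a placeholder for the actual output file of the job.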