MPI: affinity and binding

Launching MPI processes with srun

The Open MPI, MVAPICH, and Intel MPI implementations are Slurm aware: they detect Slurm and use its services to distribute and start MPI binaries. The srun command must be told, via its --mpi option, which API to use. The command

$ srun --mpi=list
MPI plugin types are...
	pmix
	cray_shasta
	none
	pmi2
specific pmix plugin versions available: pmix_v4,pmix_v5

lists the supported APIs.

The table below lists the recommended launch command for each MPI implementation. These combinations have been proven to work; combinations that are not listed either fail or do not launch MPI properly.

MPI                 command
------------------  ---------------
Open MPI            srun --mpi=pmix
Intel MPI           srun --mpi=pmi2
nvhpc (Open MPI)    srun --mpi=pmix
MVAPICH             srun --mpi=pmi2
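
For example, inside a batch allocation, an Open MPI binary (here ./my_mpi_app, a placeholder for your application) can be launched on all allocated tasks with:

$ srun --mpi=pmix ./my_mpi_app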

Binding and distribution of tasks

The srun command provides command-line options to specify the distribution of MPI ranks and their binding to CPU cores and local memory. Careful specification of the distribution and affinities is especially important when running in the hybrid approach that combines MPI with thread parallelism; a sketch of such a launch follows below. TU Dresden maintains a helpful compendium illustrating the different distribution and binding options for MPI ranks and threads.
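
As a minimal sketch of a hybrid launch (./my_hybrid_app is a placeholder), the following runs 4 ranks per node with 16 cores per rank, distributes ranks block-wise across nodes and sockets, and binds each rank to its allocated cores:

export OMP_NUM_THREADS=16
srun --mpi=pmix --ntasks-per-node=4 --cpus-per-task=16 \
     --distribution=block:block --cpu-bind=cores ./my_hybrid_app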

LQ2 GPU workers

Each LQ2 worker is equipped with four NVIDIA A100-80 GPU devices interconnected by an NVLink mesh. The system is dual socket, with two 3rd Gen AMD EPYC 7543 32-core processors (64 cores total). Each worker also has two InfiniBand adapters. The figure below shows the topology reported by the hwloc-ls command.

The nvidia-smi command can be used to query the affinities between each GPU and other system resources. In the output below, GPU0 and GPU1 are local to NUMA node 0 (cores 0-31) and to NIC0 (mlx5_0), while GPU2 and GPU3 are local to NUMA node 1 (cores 32-63) and to NIC1 (mlx5_1).

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV4	NV4	NV4	PXB	SYS	0-31	0		N/A
GPU1	NV4	 X 	NV4	NV4	PXB	SYS	0-31	0		N/A
GPU2	NV4	NV4	 X 	NV4	SYS	PXB	32-63	1		N/A
GPU3	NV4	NV4	NV4	 X 	SYS	PXB	32-63	1		N/A
NIC0	PXB	PXB	SYS	SYS	 X 	SYS
NIC1	SYS	SYS	PXB	PXB	SYS	 X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1

Example LQ2 batch script

#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq2_gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00

module purge
module load gompi ucx_cuda ucc_cuda

# enable GPUDirect RDMA and performance tuning options
export QUDA_ENABLE_GDR=1           # GPUDirect RDMA in QUDA
export UCX_IB_GPU_DIRECT_RDMA=yes  # GPUDirect RDMA in UCX
export UCX_MAX_RNDV_RAILS=1        # use a single rail for rendezvous transfers
export UCX_RNDV_THRESH=1mb         # message size at which UCX switches to rendezvous

bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
args=""

(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}

cat /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/gpu-topo.txt

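# CPU masks matching the GPU affinities shown above: each 16-bit mask
# confines one rank's threads to a block of 16 cores on the NUMA node
# local to its GPU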
if [ ${SLURM_NTASKS_PER_NODE} -eq 1 ] ; then
    cpumask="0x000000000000FFFF"
else
    cpumask="0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000"
fi

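# --gpu-bind=none disables Slurm's GPU binding (each task still receives
# one GPU via --gpus-per-task=1); CPU binding is set explicitly by mask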
bind="--gpu-bind=none --cpus-per-task=${SLURM_CPUS_PER_TASK} --cpu-bind=mask_cpu:${cpumask}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}
echo

echo BATCH JOB EXIT
exit 0

Here is the batch output from the script above:

GPU    bus-id    CPU-affinity  preferred-NIC  NUMA-affinity
---    --------  ------------  -------------  -------------
 0     00:2F:00     0-31          mlx5_0      0
 1     00:30:00     0-31          mlx5_0      0
 2     00:AF:00     32-63         mlx5_1      1
 3     00:B0:00     32-63         mlx5_1      1

CMD: srun --mpi=pmix --gpu-bind=none --cpus-per-task=16 --cpu-bind=mask_cpu:0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 1 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 3 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 5 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 7 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 9 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=11 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=13 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=15 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 0 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 1 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 2 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 3 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 4 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 5 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 6 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 7 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 9 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=10 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=11 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=12 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=13 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=14 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=15 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 1 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 2 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 3 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 4 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 5 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 6 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 7 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 8 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 9 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=10 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=11 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=12 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=13 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=14 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=15 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 0 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 1 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 2 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 3 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 4 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 5 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 6 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 7 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 8 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 9 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=10 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=11 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=12 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=13 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=14 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=15 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 1 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 3 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 5 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 7 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 9 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=11 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=13 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=15 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 0 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 1 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 2 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 3 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 4 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 5 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 6 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 7 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 9 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=10 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=11 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=12 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=13 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=14 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=15 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 1 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 2 CPU=42 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 3 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 4 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 5 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 6 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 7 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 8 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 9 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=10 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=11 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=12 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=13 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=14 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=15 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 0 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 1 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 2 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 3 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 4 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 5 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 6 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 7 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 8 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 9 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=10 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=11 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=12 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=13 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=14 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=15 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00

BATCH JOB EXIT

LQ1 CPU-only workers

Each LQ1 worker is a dual-socket system with Intel “Cascade Lake” Xeon Gold 6248 CPUs. Each system has a total of 40 cores. The hardware topology, generated by hwloc-ls, is shown in the diagram below.

Example LQ1 batch script

#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=5
#SBATCH --time=00:10:00

module purge
module load gompi

bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
args=""

(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}

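# no explicit CPU mask: Slurm's default binding confines each task to its
# allocated block of 5 cores (see the output below)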
bind="--cpus-per-task=${SLURM_CPUS_PER_TASK}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}

echo
echo BATCH JOB EXIT
exit 0

Here is the batch output from running this script:

CMD: srun --mpi=pmix --cpus-per-task=5 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
Host=lq1wn001  MPI Rank= 0  OMP Thread=0  CPU= 0  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=1  CPU= 2  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=2  CPU= 4  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=3  CPU= 3  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 0  OMP Thread=4  CPU= 1  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn001  MPI Rank= 1  OMP Thread=0  CPU=20  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=2  CPU=24  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=3  CPU=23  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 1  OMP Thread=4  CPU=21  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn001  MPI Rank= 2  OMP Thread=0  CPU= 6  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=1  CPU= 9  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=2  CPU= 5  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=3  CPU= 8  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 2  OMP Thread=4  CPU= 7  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn001  MPI Rank= 3  OMP Thread=0  CPU=25  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=1  CPU=27  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=2  CPU=28  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=3  CPU=29  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 3  OMP Thread=4  CPU=26  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn001  MPI Rank= 4  OMP Thread=0  CPU=10  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=1  CPU=14  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=2  CPU=12  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=3  CPU=11  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 4  OMP Thread=4  CPU=13  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn001  MPI Rank= 5  OMP Thread=0  CPU=31  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=1  CPU=33  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=2  CPU=34  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=3  CPU=30  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 5  OMP Thread=4  CPU=32  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn001  MPI Rank= 6  OMP Thread=0  CPU=16  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=1  CPU=18  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=2  CPU=19  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=3  CPU=15  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 6  OMP Thread=4  CPU=17  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn001  MPI Rank= 7  OMP Thread=0  CPU=36  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=1  CPU=38  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=2  CPU=39  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=3  CPU=35  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn001  MPI Rank= 7  OMP Thread=4  CPU=37  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank= 8  OMP Thread=0  CPU= 1  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=1  CPU= 0  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=2  CPU= 3  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=3  CPU= 2  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 8  OMP Thread=4  CPU= 4  NUMA Node=0  CPU Affinity=  0-4
Host=lq1wn006  MPI Rank= 9  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=2  CPU=23  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=3  CPU=24  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank= 9  OMP Thread=4  CPU=22  NUMA Node=1  CPU Affinity=20-24
Host=lq1wn006  MPI Rank=10  OMP Thread=0  CPU= 6  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=1  CPU= 5  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=2  CPU= 7  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=3  CPU= 9  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=10  OMP Thread=4  CPU= 8  NUMA Node=0  CPU Affinity=  5-9
Host=lq1wn006  MPI Rank=11  OMP Thread=0  CPU=25  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=1  CPU=29  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=2  CPU=27  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=3  CPU=26  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=11  OMP Thread=4  CPU=28  NUMA Node=1  CPU Affinity=25-29
Host=lq1wn006  MPI Rank=12  OMP Thread=0  CPU=10  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=1  CPU=13  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=2  CPU=12  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=3  CPU=14  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=12  OMP Thread=4  CPU=11  NUMA Node=0  CPU Affinity=10-14
Host=lq1wn006  MPI Rank=13  OMP Thread=0  CPU=30  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=1  CPU=33  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=2  CPU=34  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=3  CPU=32  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=13  OMP Thread=4  CPU=31  NUMA Node=1  CPU Affinity=30-34
Host=lq1wn006  MPI Rank=14  OMP Thread=0  CPU=15  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=1  CPU=16  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=2  CPU=18  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=3  CPU=17  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=14  OMP Thread=4  CPU=19  NUMA Node=0  CPU Affinity=15-19
Host=lq1wn006  MPI Rank=15  OMP Thread=0  CPU=39  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=1  CPU=38  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=2  CPU=37  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=3  CPU=36  NUMA Node=1  CPU Affinity=35-39
Host=lq1wn006  MPI Rank=15  OMP Thread=4  CPU=35  NUMA Node=1  CPU Affinity=35-39