Launching MPI processes with srun
The MPI implementations Open MPI, MVAPICH, and Intel MPI are Slurm-aware: they detect Slurm and use its services to distribute and start MPI binaries. The srun command must be told which API to use for MPI. The command
$ srun --mpi=list
MPI plugin types are...
    pmix
    cray_shasta
    none
    pmi2
specific pmix plugin versions available: pmix_v4,pmix_v5
lists the supported APIs.
The table below lists the recommended launchers for the different MPI implementations. These combinations have been proven to work; combinations that are not listed either fail or do not launch MPI properly.
MPI | command
--- | ---
Open MPI | srun --mpi=pmix
Intel MPI | srun --mpi=pmi2
nvhpc (Open MPI) | srun --mpi=pmix
MVAPICH | srun --mpi=pmi2
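For example, an Open MPI application is typically launched from within a batch allocation with a command of the form (myapp is a placeholder for your own binary; the node and task counts are illustrative):
$ srun --mpi=pmix --nodes=2 --ntasks-per-node=8 ./myapp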
Binding and distribution of tasks
The srun command provides command-line options to specify the distribution of MPI ranks across nodes and sockets and their binding to CPU cores and local memory. Careful specification of the distribution and affinities is especially important when running in the hybrid approach that combines MPI with thread parallelism. TU Dresden has a nice compendium illustrating the different rank and thread distribution and binding options, and a minimal example is sketched below.
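As a minimal sketch (myapp and the counts are placeholders, not a recommendation), a hybrid MPI+OpenMP run that gives each rank its own block of cores and binds it there could look like:
export OMP_NUM_THREADS=8
srun --mpi=pmix --ntasks-per-node=8 --cpus-per-task=8 --cpu-bind=cores --distribution=block:block ./myapp
The first component of --distribution controls how ranks are spread across nodes and the second how they are spread across the sockets within a node.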
LQ2 GPU workers
Each LQ2 worker is equipped with four NVIDIA A100-80GB GPUs interconnected by an NVLink mesh. Each system is a dual-socket node with 3rd Gen AMD EPYC 7543 32-core processors (64 cores total), and each worker has two InfiniBand adapters. The figure below shows the topology reported by the hwloc-ls command.
The nvidia-smi command is used to interrogate the affinities between each GPU and system resources.
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 NV4 NV4 PXB SYS 0-31 0 N/A
GPU1 NV4 X NV4 NV4 PXB SYS 0-31 0 N/A
GPU2 NV4 NV4 X NV4 SYS PXB 32-63 1 N/A
GPU3 NV4 NV4 NV4 X SYS PXB 32-63 1 N/A
NIC0 PXB PXB SYS SYS X SYS
NIC1 SYS SYS PXB PXB SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
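One way to act on this topology is a small per-rank wrapper that selects the GPU and the preferred InfiniBand adapter from the Slurm local rank. The script below is only an illustrative sketch: it assumes one task per GPU, hard-codes the GPU-to-NIC pairing reported above, and assumes UCX is the transport layer. The batch example further down uses CPU masks instead.
#!/bin/bash
# hypothetical per-rank wrapper (not an installed script): pick one GPU per rank
# and the HCA on the same NUMA node (GPU0/GPU1 -> mlx5_0, GPU2/GPU3 -> mlx5_1)
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
if [ ${SLURM_LOCALID} -lt 2 ] ; then
    export UCX_NET_DEVICES=mlx5_0:1
else
    export UCX_NET_DEVICES=mlx5_1:1
fi
exec "$@"
It would be invoked as srun --mpi=pmix ./wrapper.sh ./myapp with four tasks per node.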
Example LQ2 batch script
#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq2_gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00
module purge
module load gompi ucx_cuda ucc_cuda
# enable RDMA and performance tuning options
export QUDA_ENABLE_GDR=1
export UCX_IB_GPU_DIRECT_RDMA=yes
export UCX_MAX_RNDV_RAILS=1
export UCX_RNDV_THRESH=1mb
bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
args=""
(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}
cat /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/gpu-topo.txt
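# one 16-core CPU mask per local rank; each mask lies on the NUMA node of that rank's GPU (see the affinity table printed above)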
if [ ${SLURM_NTASKS_PER_NODE} -eq 1 ] ; then
cpumask="0x000000000000FFFF"
else
cpumask="0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000"
fi
bind="--gpu-bind=none --cpus-per-task=${SLURM_CPUS_PER_TASK} --cpu-bind=mask_cpu:${cpumask}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}
echo
echo BATCH JOB EXIT
exit 0
Here is the batch output from the script above
GPU bus-id CPU-affinity preferred-NIC NUMA-affinity
--- -------- ------------ ------------- -------------
0 00:2F:00 0-31 mlx5_0 0
1 00:30:00 0-31 mlx5_0 0
2 00:AF:00 32-63 mlx5_1 1
3 00:B0:00 32-63 mlx5_1 1
CMD: srun --mpi=pmix --gpu-bind=none --cpus-per-task=16 --cpu-bind=mask_cpu:0x000000000000FFFF,0x00000000FFFF0000,0x0000FFFF00000000,0xFFFF000000000000 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-gpu
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 1 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 3 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 5 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 7 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread= 9 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=11 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=13 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=0 OMP-Thread=15 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 0 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 1 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 2 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 3 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 4 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 5 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 6 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 7 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread= 9 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=10 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=11 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=12 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=13 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=14 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=1 OMP-Thread=15 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 1 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 2 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 3 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 4 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 5 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 6 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 7 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 8 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread= 9 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=10 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=11 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=12 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=13 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=14 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=2 OMP-Thread=15 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 0 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 1 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 2 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 3 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 4 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 5 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 6 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 7 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 8 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread= 9 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=10 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=11 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=12 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=13 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=14 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu03 MPI-Rank=3 OMP-Thread=15 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 0 CPU= 0 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 1 CPU=11 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 2 CPU= 6 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 3 CPU=15 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 4 CPU= 1 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 5 CPU= 8 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 6 CPU= 5 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 7 CPU=14 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 8 CPU= 2 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread= 9 CPU=10 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=10 CPU= 4 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=11 CPU=13 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=12 CPU= 3 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=13 CPU= 9 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=14 CPU= 7 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=4 OMP-Thread=15 CPU=12 NUMA-Node=0 CPU-Affinity= 0-15 GPU-IDs=00:2F:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 0 CPU=17 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 1 CPU=28 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 2 CPU=23 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 3 CPU=26 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 4 CPU=18 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 5 CPU=30 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 6 CPU=25 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 7 CPU=22 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 8 CPU=19 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread= 9 CPU=29 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=10 CPU=27 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=11 CPU=20 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=12 CPU=16 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=13 CPU=31 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=14 CPU=24 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=5 OMP-Thread=15 CPU=21 NUMA-Node=0 CPU-Affinity=16-31 GPU-IDs=00:30:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 0 CPU=32 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 1 CPU=36 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 2 CPU=42 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 3 CPU=44 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 4 CPU=33 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 5 CPU=39 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 6 CPU=40 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 7 CPU=47 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 8 CPU=38 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread= 9 CPU=34 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=10 CPU=43 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=11 CPU=45 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=12 CPU=37 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=13 CPU=41 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=14 CPU=46 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=6 OMP-Thread=15 CPU=35 NUMA-Node=1 CPU-Affinity=32-47 GPU-IDs=00:AF:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 0 CPU=57 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 1 CPU=51 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 2 CPU=52 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 3 CPU=58 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 4 CPU=60 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 5 CPU=48 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 6 CPU=55 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 7 CPU=62 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 8 CPU=59 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread= 9 CPU=49 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=10 CPU=54 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=11 CPU=63 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=12 CPU=56 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=13 CPU=50 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=14 CPU=53 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
Host=lq2gpu04 MPI-Rank=7 OMP-Thread=15 CPU=61 NUMA-Node=1 CPU-Affinity=48-63 GPU-IDs=00:B0:00
BATCH JOB EXIT
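The CPU masks in the LQ2 script above are hard-coded for four ranks of 16 cores each. A minimal sketch of how the same masks could be derived from the Slurm environment instead (untested; assumes each rank owns a contiguous block of cores starting at core 0, as on the LQ2 workers):
cpumask=""
for (( rank = 0; rank < SLURM_NTASKS_PER_NODE; rank++ )); do
    mask=0
    for (( c = rank * SLURM_CPUS_PER_TASK; c < (rank + 1) * SLURM_CPUS_PER_TASK; c++ )); do
        mask=$(( mask | (1 << c) ))   # set one bit per core owned by this rank
    done
    cpumask="${cpumask}${cpumask:+,}$(printf '0x%X' "${mask}")"
done
The resulting string is passed to --cpu-bind=mask_cpu: exactly as in the script.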
LQ1 CPU-only workers
Each LQ1 worker is a dual-socket system with Intel “Cascade Lake” Xeon Gold 6248 CPUs, for a total of 40 cores per system. The hardware topology is shown in the diagram below, generated by hwloc-ls.
Example LQ1 batch script
#! /bin/bash
#SBATCH --account=yourAccountName
#SBATCH --qos=normal
#SBATCH --partition=lq1_cpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=5
#SBATCH --time=00:10:00
module purge
module load gompi
bin=/project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
args=""
(( nthreads = SLURM_CPUS_PER_TASK ))
export OMP_NUM_THREADS=${nthreads}
bind="--cpus-per-task=${SLURM_CPUS_PER_TASK}"
cmd="srun --mpi=pmix ${bind} ${bin} ${args}"
echo CMD: ${cmd}
${cmd}
echo
echo BATCH JOB EXIT
exit 0
Here is the batch output from running this script
CMD: srun --mpi=pmix --cpus-per-task=5 /project/admin/benchmark_FNAL/el8/x86_64/apps/xthi/build_gnu12_cuda12_ompi/xthi-cpu
Host=lq1wn001 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn001 MPI Rank= 0 OMP Thread=1 CPU= 2 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn001 MPI Rank= 0 OMP Thread=2 CPU= 4 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn001 MPI Rank= 0 OMP Thread=3 CPU= 3 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn001 MPI Rank= 0 OMP Thread=4 CPU= 1 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn001 MPI Rank= 1 OMP Thread=0 CPU=20 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn001 MPI Rank= 1 OMP Thread=1 CPU=22 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn001 MPI Rank= 1 OMP Thread=2 CPU=24 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn001 MPI Rank= 1 OMP Thread=3 CPU=23 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn001 MPI Rank= 1 OMP Thread=4 CPU=21 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn001 MPI Rank= 2 OMP Thread=0 CPU= 6 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn001 MPI Rank= 2 OMP Thread=1 CPU= 9 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn001 MPI Rank= 2 OMP Thread=2 CPU= 5 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn001 MPI Rank= 2 OMP Thread=3 CPU= 8 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn001 MPI Rank= 2 OMP Thread=4 CPU= 7 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn001 MPI Rank= 3 OMP Thread=0 CPU=25 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn001 MPI Rank= 3 OMP Thread=1 CPU=27 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn001 MPI Rank= 3 OMP Thread=2 CPU=28 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn001 MPI Rank= 3 OMP Thread=3 CPU=29 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn001 MPI Rank= 3 OMP Thread=4 CPU=26 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn001 MPI Rank= 4 OMP Thread=0 CPU=10 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn001 MPI Rank= 4 OMP Thread=1 CPU=14 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn001 MPI Rank= 4 OMP Thread=2 CPU=12 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn001 MPI Rank= 4 OMP Thread=3 CPU=11 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn001 MPI Rank= 4 OMP Thread=4 CPU=13 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn001 MPI Rank= 5 OMP Thread=0 CPU=31 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn001 MPI Rank= 5 OMP Thread=1 CPU=33 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn001 MPI Rank= 5 OMP Thread=2 CPU=34 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn001 MPI Rank= 5 OMP Thread=3 CPU=30 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn001 MPI Rank= 5 OMP Thread=4 CPU=32 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn001 MPI Rank= 6 OMP Thread=0 CPU=16 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn001 MPI Rank= 6 OMP Thread=1 CPU=18 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn001 MPI Rank= 6 OMP Thread=2 CPU=19 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn001 MPI Rank= 6 OMP Thread=3 CPU=15 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn001 MPI Rank= 6 OMP Thread=4 CPU=17 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn001 MPI Rank= 7 OMP Thread=0 CPU=36 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn001 MPI Rank= 7 OMP Thread=1 CPU=38 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn001 MPI Rank= 7 OMP Thread=2 CPU=39 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn001 MPI Rank= 7 OMP Thread=3 CPU=35 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn001 MPI Rank= 7 OMP Thread=4 CPU=37 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn006 MPI Rank= 8 OMP Thread=0 CPU= 1 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn006 MPI Rank= 8 OMP Thread=1 CPU= 0 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn006 MPI Rank= 8 OMP Thread=2 CPU= 3 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn006 MPI Rank= 8 OMP Thread=3 CPU= 2 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn006 MPI Rank= 8 OMP Thread=4 CPU= 4 NUMA Node=0 CPU Affinity= 0-4
Host=lq1wn006 MPI Rank= 9 OMP Thread=0 CPU=21 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn006 MPI Rank= 9 OMP Thread=1 CPU=20 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn006 MPI Rank= 9 OMP Thread=2 CPU=23 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn006 MPI Rank= 9 OMP Thread=3 CPU=24 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn006 MPI Rank= 9 OMP Thread=4 CPU=22 NUMA Node=1 CPU Affinity=20-24
Host=lq1wn006 MPI Rank=10 OMP Thread=0 CPU= 6 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn006 MPI Rank=10 OMP Thread=1 CPU= 5 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn006 MPI Rank=10 OMP Thread=2 CPU= 7 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn006 MPI Rank=10 OMP Thread=3 CPU= 9 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn006 MPI Rank=10 OMP Thread=4 CPU= 8 NUMA Node=0 CPU Affinity= 5-9
Host=lq1wn006 MPI Rank=11 OMP Thread=0 CPU=25 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn006 MPI Rank=11 OMP Thread=1 CPU=29 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn006 MPI Rank=11 OMP Thread=2 CPU=27 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn006 MPI Rank=11 OMP Thread=3 CPU=26 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn006 MPI Rank=11 OMP Thread=4 CPU=28 NUMA Node=1 CPU Affinity=25-29
Host=lq1wn006 MPI Rank=12 OMP Thread=0 CPU=10 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn006 MPI Rank=12 OMP Thread=1 CPU=13 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn006 MPI Rank=12 OMP Thread=2 CPU=12 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn006 MPI Rank=12 OMP Thread=3 CPU=14 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn006 MPI Rank=12 OMP Thread=4 CPU=11 NUMA Node=0 CPU Affinity=10-14
Host=lq1wn006 MPI Rank=13 OMP Thread=0 CPU=30 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn006 MPI Rank=13 OMP Thread=1 CPU=33 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn006 MPI Rank=13 OMP Thread=2 CPU=34 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn006 MPI Rank=13 OMP Thread=3 CPU=32 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn006 MPI Rank=13 OMP Thread=4 CPU=31 NUMA Node=1 CPU Affinity=30-34
Host=lq1wn006 MPI Rank=14 OMP Thread=0 CPU=15 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn006 MPI Rank=14 OMP Thread=1 CPU=16 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn006 MPI Rank=14 OMP Thread=2 CPU=18 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn006 MPI Rank=14 OMP Thread=3 CPU=17 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn006 MPI Rank=14 OMP Thread=4 CPU=19 NUMA Node=0 CPU Affinity=15-19
Host=lq1wn006 MPI Rank=15 OMP Thread=0 CPU=39 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn006 MPI Rank=15 OMP Thread=1 CPU=38 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn006 MPI Rank=15 OMP Thread=2 CPU=37 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn006 MPI Rank=15 OMP Thread=3 CPU=36 NUMA Node=1 CPU Affinity=35-39
Host=lq1wn006 MPI Rank=15 OMP Thread=4 CPU=35 NUMA Node=1 CPU Affinity=35-39
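In this output srun's default placement alternates consecutive ranks between the two sockets: rank 0 lands on cores 0-4 of NUMA node 0, rank 1 on cores 20-24 of NUMA node 1, and so on. If consecutive ranks should instead fill one socket before moving to the other, the socket-level distribution can be requested explicitly; a sketch that has not been verified on LQ1:
srun --mpi=pmix --cpus-per-task=${SLURM_CPUS_PER_TASK} --distribution=block:block ${bin} ${args}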