Fermilab Computing Sector

Lattice QCD Computing

Quantum chromodynamics (QCD) is the theory of how quarks and gluons interact through the strong force. Studying it requires probing interactions at distances smaller than the diameter of a proton, about 10⁻¹⁵ meters.

Experimental physicists do not observe quarks in isolation in their detectors; they observe them bound together into particles such as protons, kaons, and pions. Predicting the properties of these particles requires Lattice QCD.

In Lattice QCD computations, physicists replace continuous space-time with a four-dimensional lattice representing the three dimensions of space plus one dimension of time. The space-time box is made large enough for a proton to fit inside. Markov chain Monte Carlo simulations evolve the QCD gauge fields in a fictitious simulation "time", and each gauge configuration file captures a snapshot of that evolution. Quark correlation functions, such as two- and three-point functions, must be computed on every gauge configuration and then averaged over the whole ensemble of configurations to produce quantities such as particle masses or decay rates.
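
As a minimal illustration of that last step, the Python sketch below averages a set of simulated two-point correlators over gauge configurations and forms an effective mass. The synthetic data, array shapes, and names are assumptions made for the example, not Fermilab's actual analysis code.

    import numpy as np

    # Hypothetical input: one two-point correlator C(t) per gauge configuration,
    # stored as an (n_configs, n_timeslices) array. A real analysis would read
    # these from measurement files produced on each configuration.
    rng = np.random.default_rng(0)
    n_configs, n_t, true_mass = 400, 48, 0.5
    t = np.arange(n_t)
    correlators = np.exp(-true_mass * t) * (1.0 + 0.05 * rng.standard_normal((n_configs, n_t)))

    # Average over the ensemble of gauge configurations.
    c_mean = correlators.mean(axis=0)

    # Effective mass m_eff(t) = log[C(t) / C(t+1)] plateaus at the ground-state
    # mass once excited-state contamination has died away.
    m_eff = np.log(c_mean[:-1] / c_mean[1:])
    print(m_eff[5:15])   # values scatter around true_mass = 0.5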

Physicists use Lattice QCD to predict masses and decay rates, and they compare those predictions to measurements from experiments, looking carefully for any inconsistencies between theory and experiment. Such inconsistencies might be an exciting hint of new physics beyond the Standard Model.


Computing Requirements

Lattice QCD codes spend much of their time inverting very large, sparse matrices. For example, a 48x48x48x144 lattice, typical of current simulations, leads to a complex matrix with 47.8 million rows and columns and about 1.15 billion non-zero elements (roughly one entry in every 2 million).
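
The counting behind those figures can be reproduced directly. The sketch below assumes a staggered-type operator with 3 color components per lattice site and couplings only to the 8 nearest-neighbor sites in 4D, each a 3x3 complex block; other fermion actions change the neighbor count.

    # Size of the fermion matrix for a 48x48x48x144 lattice, assuming 3 color
    # components per site and 3x3 complex couplings to the 8 nearest neighbors.
    sites = 48 * 48 * 48 * 144           # 15,925,248 lattice sites
    dimension = 3 * sites                # ~47.8 million rows and columns
    nonzeros = dimension * 8 * 3         # ~1.15 billion non-zero entries
    fill = nonzeros / dimension**2       # ~5e-7, i.e. about 1 entry in 2 million
    print(dimension, nonzeros, round(1 / fill))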

Iterative techniques such as conjugate gradient are used to perform these inversions. Nearly all of the floating-point operations are small matrix-vector multiplies (3x3 complex matrices acting on 3x1 complex vectors); the matrices describe gluons and the vectors describe quarks. On a single computer, memory bandwidth limits the speed of the calculation.
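
The sketch below shows a schematic conjugate-gradient solver written against a matrix-free operator. Because conjugate gradient requires a Hermitian positive-definite matrix, it is applied here to the normal equations D†D x = D†b, a common choice in LQCD solvers; the small dense stand-in matrix is an assumption for illustration. Production codes exploit the sparse SU(3) block structure and never form the matrix explicitly.

    import numpy as np

    def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=1000):
        """Solve A x = b for a Hermitian positive-definite A, given only a
        routine that applies A to a vector (matrix-free, as in LQCD codes)."""
        x = np.zeros_like(b)
        r = b - apply_A(x)
        p = r.copy()
        rs_old = np.vdot(r, r).real
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs_old / np.vdot(p, Ap).real
            x += alpha * p
            r -= alpha * Ap
            rs_new = np.vdot(r, r).real
            if np.sqrt(rs_new) < tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Small dense stand-in: solve the normal equations (D†D) x = D†b so that
    # the operator handed to CG is Hermitian positive definite.
    rng = np.random.default_rng(1)
    n = 300
    D = np.eye(n) + 0.01 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    A = D.conj().T @ D
    x = conjugate_gradient(lambda v: A @ v, D.conj().T @ b)
    print(np.linalg.norm(D @ x - b))   # residual should be small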

Individual LQCD calculations require many TFlop/sec-years of computation and can only be carried out on large-scale parallel machines.

The four-dimensional Lattice QCD simulations are divided across hundreds to thousands of cores. On each iteration of the inverter, every core exchanges the data on the faces of its 4D sub-volume with its nearest neighbors. The codes use MPI or other message-passing libraries for these communications, and networks such as InfiniBand provide the required high bandwidth and low latency.
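
The sketch below illustrates this nearest-neighbor face exchange for a one-dimensional decomposition using mpi4py. It is a simplified stand-in for the real codes, which exchange faces in all four dimensions and typically overlap communication with computation.

    # One-dimensional sketch of the 4D halo (face) exchange described above.
    # Assumes mpi4py is installed; run with e.g.: mpirun -n 4 python halo.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank owns a local sub-volume plus one ghost layer on each end.
    local_sites = 16
    field = np.full(local_sites + 2, float(rank))

    up = (rank + 1) % size      # neighbor in the + direction (periodic lattice)
    down = (rank - 1) % size    # neighbor in the - direction

    # Send my upper face up while receiving my lower ghost layer from below,
    # and vice versa; MPI's combined send-receive avoids deadlock.
    field[0:1] = comm.sendrecv(field[-2:-1], dest=up, source=down)
    field[-1:] = comm.sendrecv(field[1:2], dest=down, source=up)

    print(f"rank {rank}: lower ghost = {field[0]}, upper ghost = {field[-1]}")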



LQCD Clusters at Fermilab

Fermilab currently operates five LQCD clusters: three conventional InfiniBand clusters based on multicore CPUs and two InfiniBand clusters that use GPU accelerators. The performance values listed are sustained GFlops per worker node (per GPU for the GPU clusters) on parallel LQCD applications run on 128 cores, averaged over improved staggered and domain wall fermion actions.

Cluster   Processors                                                      Nodes   CPU cores   GPUs   Performance*
Ds        Quad 2.0 GHz 8-Core Opteron 6128                                  196        6272      -   50.9 GF/node
Dsg       Dual 2.53 GHz 4-Core Intel E5630 + Dual NVIDIA M2050 GPU           76         608    152   93.4 GF/GPU
Bc        Quad 2.8 GHz 8-Core Opteron 6320                                  224        7168      -   57.0 GF/node
Pi0       Dual 2.6 GHz 8-Core Intel E5-2650 v2                              314        5024      -   67.0 GF/node
Pi0g      Dual 2.6 GHz 8-Core Intel E5-2650 v2 + Quad NVIDIA K40m GPU        32         512    128   1040 GF/GPU

* Performance based on LQCD computing benchmarks




LQCD HPC Storage

The LQCD clusters need fast, reliable storage for reading and writing large files. Fermilab currently operates a Lustre filesystem for LQCD that provides 1 petabyte of high-speed storage capacity over InfiniBand.

Lustre is a high-performance, highly scalable, distributed parallel file system used as volatile storage space by parallel jobs on the LQCD clusters. Once a project is complete, its data is archived permanently to tape (dCache + Enstore) or moved offsite.


The Pi0g GPU-Accelerated Cluster

In 2014, Fermilab designed and purchased a GPU-accelerated cluster with optimal strong-scaling performance on calculations that require many GPUs operating in parallel. To meet this requirement, vendor-proposed host systems were restricted to those providing sufficient PCI Express bandwidth to support both the installed GPUs (16 gen3 lanes per GPU) and quad-data-rate InfiniBand (8 gen3 lanes per HCA).

 Hardware details of this cluster:
  • Vendor: Supermicro server hosts
    • Dual socket, 8 cores/socket, 2.6 GHz, Intel “Ivy Bridge” processors
    • 128 GBytes memory and 4 GPUs per host, 32 hosts, 128 GPUs in total
  • GPUs:
    • NVIDIA Tesla K40m, 4.29 TFlop/sec peak single precision, 2880 cores
    • 12 GBytes memory per GPU, ECC-capable, hardware double precision

Large-scale (96³ × 192 and 64³ × 96) gauge configuration generation has been demonstrated on this cluster using parallel runs with 128 GPUs. The performance of GPUs varies with the specific LQCD application. In production running, users have reported application-dependent speed-ups of between 2.1 and 13.3, comparing Ds node hours to Dsg node hours for performing equivalent calculations.




