Lattice QCD Computing
Quantum chromodynamics is the study of how quarks and gluons interact
through the strong force. This requires the study of interactions at
distances smaller than the diameter of a proton, about 10^-15 meters across.
Experimental physicists do not observe
quarks in isolation in their detectors but instead bound together to
form particles such as protons, kaons and pions. Predicting the
properties of these particles requires Lattice QCD.
In Lattice QCD computations, physicists
replace continuous spacetime with a four-dimensional lattice that
represents the three dimensions of space plus a dimension of time. The
spacetime box is made big enough for a proton to fit inside. Markov chain Monte Carlo simulations evolve QCD gauge fields in a fictitious simulation "time" sequence, and each gauge configuration file captures a snapshot of that evolution. Quark interactions, such as two- and three-point functions, must be computed on every gauge configuration and then averaged over the whole set of configurations to produce quantities such as particle masses or decay rates.
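The measure-then-average workflow described above can be sketched in a few lines. This is a hypothetical toy example, not production LQCD code: the correlator values and the `ensemble_average` helper are invented for illustration.

```python
# Hypothetical sketch: a two-point function C(t) is measured on each
# gauge configuration, then averaged over the ensemble to give the
# Monte Carlo estimate that physicists fit for particle masses.
def ensemble_average(correlators):
    """correlators: per-configuration C(t) lists, all the same length."""
    n = len(correlators)
    timeslices = len(correlators[0])
    return [sum(c[t] for c in correlators) / n for t in range(timeslices)]

# Toy data: 3 "configurations", 4 time slices each.
data = [[1.0, 0.5, 0.25, 0.125],
        [1.2, 0.6, 0.30, 0.150],
        [0.8, 0.4, 0.20, 0.100]]
avg = ensemble_average(data)   # ~[1.0, 0.5, 0.25, 0.125]
```

In practice the average is accompanied by a statistical error estimate (e.g. jackknife or bootstrap over configurations), which the fit then propagates into the quoted mass or decay rate.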
Physicists use Lattice QCD to make
predictions of masses and decay rates. They then compare those
predictions to measurements from experiments. Physicists look carefully
for any inconsistencies between experiment and theoretical
predictions. Such inconsistencies might be an exciting hint of new
physics beyond the Standard Model.
Computing Requirements
Lattice QCD codes spend much of their time
inverting very large and sparse matrices. For example, a 48x48x48x144
problem, typical for current simulations, has a complex matrix of size
47.8 million x 47.8 million. The matrix has 1.15 billion nonzero
elements (about one in every 2 million).
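The quoted figures can be reproduced with simple arithmetic. The sketch below assumes a staggered-fermion operator (one 3-component color vector per site, couplings to 8 nearest neighbors as 3x3 complex blocks); that assumption is ours, inferred from the numbers, not stated in the text.

```python
# Back-of-envelope check of the matrix size and sparsity quoted above
# for a 48 x 48 x 48 x 144 lattice, assuming a staggered-fermion
# operator: 3 colors per site, 8 neighbor couplings of 3x3 blocks.
sites = 48**3 * 144            # lattice sites
rank = 3 * sites               # matrix dimension: 3 colors per site
nnz = sites * 8 * (3 * 3)      # 8 neighbors, each a 3x3 complex block
density = nnz / rank**2

print(f"rank     = {rank / 1e6:.1f} million")   # ~47.8 million
print(f"nonzeros = {nnz / 1e9:.2f} billion")    # ~1.15 billion
print(f"one nonzero in every {1 / density / 1e6:.1f} million entries")
```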
Iterative techniques such as the conjugate
gradient algorithm are used to perform these inversions. Nearly all
floating-point operations performed are matrix-vector multiplies (3x3
matrices times 3x1 vectors, both complex). The matrices describe
gluons, and the vectors describe quarks. Memory bandwidth limits the
speed of the calculation on a single computer.
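The inner loop of such an inverter can be illustrated with a minimal conjugate-gradient sketch. This is a hypothetical pure-Python toy for a tiny Hermitian positive-definite complex system, not an LQCD operator; production codes iterate the same recurrence over matrices with tens of millions of rows.

```python
# Minimal conjugate-gradient sketch (hypothetical, pure Python) for a
# small Hermitian positive-definite complex system A x = b.
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):                       # complex inner product <u, v>
    return sum(a.conjugate() * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iter=200):
    x = [0j] * len(b)
    r = list(b)                      # residual r = b - A x, with x = 0
    p = list(r)                      # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if abs(rr_new) < tol:
            break
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# Tiny Hermitian positive-definite test system (not an LQCD operator).
A = [[4 + 0j, 1 - 1j], [1 + 1j, 3 + 0j]]
b = [1 + 0j, 2 + 0j]
x = conjugate_gradient(A, b)
```

Note that every iteration is dominated by the matrix-vector product, which is why memory bandwidth, not arithmetic throughput, limits single-node performance.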
Individual LQCD calculations require many
TFlop/sec-years of computation. They can only be achieved by using
large-scale parallel machines.
The four-dimensional Lattice QCD simulations are
divided across hundreds to thousands of cores. On each iteration
of the inverter, each core exchanges data on the faces of its
4D subvolume with its nearest neighbors. The codes employ MPI or other
message-passing libraries for these communications. Networks such as
InfiniBand provide the required high bandwidth and low latency.
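The nearest-neighbor pattern above requires each rank to know its face-exchange partners in all four dimensions. The sketch below shows that bookkeeping for a periodic 4D process grid; it is a hypothetical illustration of what MPI's Cartesian topology routines (e.g. MPI_Cart_shift) compute for such codes, with invented helper names.

```python
# Hypothetical sketch: finding halo-exchange partner ranks on a
# periodic 4D process grid (the bookkeeping MPI_Cart_shift performs).
def coords_to_rank(coords, dims):
    """Row-major linearization of 4D grid coordinates into an MPI rank."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

def neighbors(coords, dims):
    """Ranks of the -/+ neighbor in each dimension, with periodic wrap."""
    result = []
    for axis in range(len(dims)):
        for step in (-1, +1):
            shifted = list(coords)
            shifted[axis] = (shifted[axis] + step) % dims[axis]
            result.append(coords_to_rank(shifted, dims))
    return result

# Example: a 2 x 2 x 2 x 2 process grid (16 ranks).
dims = (2, 2, 2, 2)
partners = neighbors((0, 0, 0, 0), dims)
```

On each inverter iteration, a rank posts sends and receives with exactly these partners, one surface's worth of quark-field data per face.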
LQCD Clusters at Fermilab
Fermilab currently operates five LQCD
clusters: three conventional InfiniBand clusters based on multi-core CPU processors, and two InfiniBand clusters that use GPU accelerators. The
performance values listed are sustained GFlops per worker node on LQCD
parallel applications using 128 cores, averaged over improved staggered
and domain wall fermion actions.
Cluster Name | Processors                                                  | Nodes | CPU cores | GPUs | Performance*
Ds           | Quad 2.0 GHz 8-core Opteron 6128                            | 196   | 6272      |      | 50.9 GF/node
Dsg          | Dual 2.53 GHz 4-core Intel E5630, dual NVIDIA M2050 GPU     | 76    | 608       | 152  | 93.4 GF/GPU
Bc           | Quad 2.8 GHz 8-core Opteron 6320                            | 224   | 7168      |      | 57.0 GF/node
Pi0          | Dual 2.6 GHz 8-core Intel E5-2650 v2                        | 314   | 5024      |      | 67.0 GF/node
Pi0g         | Dual 2.6 GHz 8-core Intel E5-2650 v2, quad NVIDIA K40M GPU  | 32    | 512       | 128  | 1040 GF/GPU
* Performance based on LQCD computing benchmarks
LQCD HPC Storage
The LQCD clusters need a fast and reliable storage solution to read and write large files. Fermilab currently operates a Lustre LQCD filesystem that offers 1 petabyte of high-speed InfiniBand-attached storage capacity.
Lustre is a high-performance, highly scalable, distributed parallel file system used as volatile storage space by parallel jobs on the LQCD clusters. Once a project is done, its data are permanently stored on tape (dCache + Enstore) or moved offsite.
The Pi0g GPUAccelerated Cluster
In 2014, Fermilab designed and purchased a GPU-accelerated cluster with optimal strong-scaling performance on calculations that require many GPUs operating in parallel. To meet this requirement, vendor-proposed host systems were restricted to those that provided sufficient PCI Express bandwidth to support the installed GPUs (16 gen3 lanes per GPU) and quad-data-rate InfiniBand (8 gen3 lanes per HCA).
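The lane counts above can be sanity-checked with standard link-rate arithmetic. The sketch below assumes the usual published rates (PCIe gen3 at 8 GT/s per lane with 128b/130b encoding; QDR InfiniBand signaling at 40 Gbit/s with 8b/10b encoding); it is a back-of-envelope illustration, not a statement from the original procurement.

```python
# Back-of-envelope bandwidth check (assumptions: PCIe gen3 = 8 GT/s
# per lane with 128b/130b encoding; QDR InfiniBand = 40 Gbit/s signal
# rate with 8b/10b encoding, i.e. 32 Gbit/s of data).
GEN3_GB_PER_LANE = 8e9 * (128 / 130) / 8 / 1e9   # ~0.985 GB/s per lane

gpu_bw = 16 * GEN3_GB_PER_LANE    # per-GPU host link, ~15.8 GB/s
hca_bw = 8 * GEN3_GB_PER_LANE     # per-HCA host link, ~7.9 GB/s
qdr_ib = 40e9 * (8 / 10) / 8 / 1e9  # QDR data rate, 4.0 GB/s

# 8 gen3 lanes comfortably exceed what a QDR link can carry, so the
# HCA, not the PCIe slot, remains the bottleneck on the network path.
```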
Hardware details of this cluster:
Large-scale (96^3 × 192 and 64^3 × 96) gauge
configuration generation has been demonstrated on this cluster using
parallel runs with 128 GPUs. The performance of GPUs varies with the
specific LQCD application. In production running, users have reported
application-dependent speedups of between 2.1 and 13.3, comparing Ds
node hours to Dsg node hours for performing equivalent calculations.
