Introduction

The Wilson cluster (WC) is a High-Performance Computing (HPC) cluster available to the entire Fermilab scientific and engineering community. The WC is designed to efficiently run and scale parallel workloads across hundreds of CPU cores and/or multiple GPUs. The Wilson cluster provides HPC services typical of larger HPC centers such as NERSC, OLCF, or ALCF. The WC is a medium-scale HPC facility that can serve as a development on-ramp to these larger HPC centers.

Features include:

  • Up to O(800) CPU cores per job for tightly coupled parallel computations (MPI, OpenMP, …).
  • Access to multiple A100, V100, and P100 NVIDIA GPUs (CUDA, NVIDIA HPC SDK).
  • Workers equipped with multiple GPUs so that jobs can efficiently scale across several devices.
  • Ability to run containerized HPC and AI applications with Apptainer (see the example after this list).
  • High-bandwidth, low-latency InfiniBand networking among workers and storage.
  • High-performance Lustre parallel filesystem for efficient access to large data sets and files.
  • NFS /work1 filesystem allowing shared access among users in the same compute project.
  • Slurm batch system designed to run HPC workloads at scale (an example batch script follows this list).
  • Optional interactive access to worker nodes via a shell launched by Slurm.
  • High-bandwidth data transfer node with Globus for transfers among data centers.
  • Access to the CernVM-FS software distribution service.
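
As a brief sketch of the containerized-application feature above: Apptainer runs a command inside a container image, and the --nv option makes the host NVIDIA driver and GPUs visible inside the container. The image path and command below are placeholders, not an officially provided image.

    # Run a hypothetical training script inside a container image on a GPU worker.
    # --nv passes the host NVIDIA GPUs through to the container.
    apptainer exec --nv /path/to/my_image.sif python train.py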
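
To illustrate the Slurm items above, a minimal MPI batch script might look like the sketch below. The partition, account, module, and executable names are placeholders; consult the batch-system documentation and your project allocation for the actual values.

    #!/bin/bash
    #SBATCH --job-name=mpi_test          # job name shown by squeue
    #SBATCH --partition=cpu_example      # placeholder partition name
    #SBATCH --account=my_project         # placeholder Wilson project account
    #SBATCH --nodes=2                    # number of worker nodes
    #SBATCH --ntasks-per-node=16         # MPI ranks per node
    #SBATCH --time=01:00:00              # wall-clock limit

    module load openmpi                  # placeholder module name
    srun ./my_mpi_app                    # launch the MPI ranks under Slurm

An interactive shell on a worker node can be requested in a similar way, for example:

    # Allocate one task on one node and open an interactive shell (values are placeholders).
    srun --partition=cpu_example --account=my_project --nodes=1 --ntasks=1 \
         --time=00:30:00 --pty /bin/bash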

Use cases include:

  • Code development and performance testing of parallel CPU codes.
  • GPU code development, including the ability to test performance while running on multiple GPUs.
  • AI model training when the convergence of HPC and AI features is critical to performance.
  • Testbed to rapidly explore new algorithms and methods with minimal barriers to getting started and obtaining the needed computing resources.
  • Platform for small- to medium-scale, non-critical parallel computing campaigns.
  • A development on-ramp for HPC workflows to be run at scale at larger HPC centers.
  • A reservable compute resource for workflows with tight deadlines or for use during hands-on workshops.

Q&A

  • Who has access to Wilson? In short, everyone in the Fermilab community with a Kerberos identity has opportunistic access to cluster resources. Opportunistic access means your HPC jobs run at lower priority and are subject to more restrictive limits on compute resources.
  • How do I obtain resources beyond what opportunistic access permits? Groups of users whose scientific or engineering goals require more resources are asked to provide a justification and apply for a Wilson project account. See Projects and User Requests.
  • How do I log in to the Wilson Cluster? Use ssh to log in to either wc.fnal.gov or wc2.fnal.gov (an example follows this list).
  • Is my workload suitable for Wilson? Wilson is specifically designed to efficiently run High-Performance Computing (HPC) workloads consisting of tightly coupled parallel applications. Examples of HPC applications include Lattice QCD, computational fluid dynamics, molecular dynamics simulations, and training large AI models. If your workload instead consists of many independent single-core tasks that can execute concurrently or in a distributed manner, then a High-Throughput Computing (HTC) facility such as HepCloud or FermiGrid is better matched to your needs.
  • What are the advantages of doing AI training on Wilson? Jobs on Wilson are given whole data-center GPU devices, not a partition of a device or a lower-performance “gamer” GPU. Large training jobs can take advantage of training on multiple GPUs, and Lustre and InfiniBand provide low-latency, high-bandwidth access to very large data sets.
  • I prefer to use JupyterHub for my computing; can I run Jupyter on Wilson? Yes, it is possible to run Jupyter on Wilson worker nodes and login nodes via ssh tunneling (see the sketch after this list), but it requires extra steps and you may need to wait in a batch queue before your session starts on a worker. Please note that Wilson does not officially support this mode of operation. Fermilab offers the Elastic Analysis Facility (EAF), which is specifically designed for JupyterHub; note that VPN is required to access EAF offsite.
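
For example, logging in from a terminal looks like the following, assuming your Fermilab Kerberos credentials are set up ("username" is a placeholder for your Fermilab user name):

    kinit username@FNAL.GOV      # obtain a Kerberos ticket if you do not already have one
    ssh username@wc.fnal.gov     # or wc2.fnal.gov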
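
As an unsupported sketch of the ssh-tunneling approach mentioned above; the worker hostname, port, and exact steps are placeholders and depend on the node Slurm assigns to you:

    # On the allocated node: start Jupyter without opening a browser.
    jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

    # On your desktop: forward local port 8888 to that node through the login node
    # ("wcwn001" is a placeholder worker hostname).
    ssh -L 8888:wcwn001:8888 username@wc.fnal.gov

    # Then point a local browser at http://localhost:8888 and use the token Jupyter prints.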

Schematic layout