What is a container?
Containers are a way to package software in a format that can run in an isolated environment on a host operating system. Unlike virtual machines (VMs), containers do not emulate a full operating system – they share the host OS kernel, and only the libraries and settings required to make the software work are packaged. This makes for efficient, lightweight, self-contained environments and helps ensure that software runs the same way regardless of where it is deployed. The best known container technology is Docker.
Unlike Docker, Singularity is designed for unprivileged users to run containers on a shared host system, such as an HPC cluster. Singularity enables users to have full control of their environment: for example, the environment inside the container might be Ubuntu 18.04 or CentOS 8 while the container runs on an SL7 host system. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. Singularity containers have proven particularly useful for supporting machine learning (ML) frameworks on the Fermilab HPC clusters, since ML software frameworks evolve rapidly and ML software development is typically done on an operating system such as Ubuntu rather than Scientific Linux. Containers allow users to select from a wide range of ML frameworks and versions with confidence that their selected environment is isolated from changes to the underlying host OS.
Where to find containers
Pre-built containers are available online, and Singularity can build a local copy of a container, including converting a Docker container into a Singularity container. Every user should be extremely cautious about the security implications of downloading binary code within containers. Hence, only download containers that are provided by verified repositories and publishers, or that you have built yourself from official Linux package repositories. Docker-format containers can be found at:
DockerHub: Please ensure you filter your choices by selecting either “Verified Publisher” or “Official Images”.
NVIDIA NGC: Be aware that many of the “latest” containers built by NVIDIA no longer support older K40 (sm_35) GPUs. You may be able to find a suitable container by searching the available container tags.
Example: A TensorFlow container from DockerHub
First, set up Singularity on the cluster login host. Then, build the Singularity container in sandbox (directory) format in Lustre. We use Lustre since every project has a default storage allocation there. The container is an official TensorFlow container from DockerHub.
module load singularity
singularity --version
cd /wclustre/your_project/images
export SINGULARITY_CACHEDIR=/wclustre/your_project/images/.singularity/cache
singularity build --fix-perms --sandbox \
    tensorflow_latest_gpu \
    docker://tensorflow/tensorflow:latest-gpu
The sandbox directory is called tensorflow_latest_gpu. Lustre does not handle large numbers of small files well, such as the files that make up the sandbox. Hence, first build a compressed tarball from the sandbox and then remove the sandbox directory. The tarball is stored as one large file on /wclustre, and the sandbox directory is restored from it as needed in subsequent batch jobs.
tar cf tensorflow_latest_gpu.tar tensorflow_latest_gpu/
bzip2 -9 tensorflow_latest_gpu.tar
rm -rf tensorflow_latest_gpu/
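Equivalently, the archive and compression steps can be combined into a single command using tar's -j (bzip2) option; this is a minor variation on the steps above, not part of the original recipe:

```shell
# Create the compressed tarball in one step (equivalent to tar followed by bzip2 -9)
tar cjf tensorflow_latest_gpu.tar.bz2 tensorflow_latest_gpu/
rm -rf tensorflow_latest_gpu/
```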
We will start an interactive Slurm job and run this container on a GPU-accelerated worker node. We request a single GPU of any architecture and two CPU threads per task.
cd /work1/your_project/singularity
srun --unbuffered --pty -A your_project --partition=gpu_gce \
    --qos=regular --time=08:00:00 \
    --nodes=1 --ntasks-per-node=1 --gres=gpu:1 \
    --cpus-per-task=2 /bin/bash
The work directory when the interactive job starts is /work1/your_project/singularity. We extract the sandbox directory into the local /scratch directory on the worker. Note that the Singularity sandbox directory and all other files are removed from /scratch at the end of your batch session.
module load singularity
export SINGULARITY_CACHEDIR=/scratch/.singularity/cache
tar xf /wclustre/your_project/images/tensorflow_latest_gpu.tar.bz2 --directory /scratch
The sandbox has now been restored to /scratch on the worker node. Run the Singularity sandbox from the directory of your choice, e.g., your /work1 area. Singularity uses the workdir for /tmp and /var/tmp, and the --home option makes the /work1 area the home directory in the container environment.
The singularity options include:
--userns to run in unprivileged user-namespace mode,
--nv to map the host NVIDIA GPU drivers into the container,
--home to remap the location of the home directory inside the container,
--workdir to remap the location used for /tmp and /var/tmp.
mkdir /scratch/work
singularity shell --userns --nv \
    --workdir=/scratch/work \
    --home=/work1/your_project/singularity/home \
    /scratch/tensorflow_latest_gpu
The environment within the container is Ubuntu 18.04:
Singularity> cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
From within the container, query for the attached GPU:
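One way to do this (a sketch, not necessarily the exact command the original used) is to run the host's nvidia-smi utility, which the --nv option makes available inside the container:

```shell
# Report the attached GPU, driver version, and memory usage
nvidia-smi
```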
Run the Tensorflow / Keras MNIST example to train a network on the MNIST dataset.
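A minimal sketch of such a training run, using the Keras API bundled with TensorFlow; the network architecture and epoch count here are illustrative choices, not taken from the original example:

```shell
# Train a small dense network on MNIST (downloads the dataset on first use)
python - <<'EOF'
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2)
model.evaluate(x_test, y_test, verbose=2)
EOF
```

With the --nv option in effect, TensorFlow should report the GPU among its visible devices and place the training on it automatically.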
Example: A PyTorch container from NVIDIA
Please note that recent pre-built PyTorch images available from NVIDIA do not support older K40 GPUs. In this example we will request a worker equipped with more capable P100 GPUs. The NVIDIA site also hosts older container builds of earlier PyTorch versions that will run on K40 GPUs.
From the cluster login node, configure Singularity and relocate the cache directory used when building the container. We build the container sandbox directory in Lustre.
module load singularity
cd /wclustre/your_project/images
export SINGULARITY_CACHEDIR=/wclustre/your_project/images/.singularity/cache
singularity build --fix-perms --sandbox \
    pytorch_20.10-py3 \
    docker://nvcr.io/nvidia/pytorch:20.10-py3
The container sandbox is called pytorch_20.10-py3. We build a compressed tarball of the sandbox that will later be unpacked for use in batch jobs.
tar cf pytorch_20.10-py3.tar pytorch_20.10-py3/
bzip2 -9 pytorch_20.10-py3.tar
rm -rf pytorch_20.10-py3/
Start an interactive job requesting a worker equipped with an NVIDIA P100 GPU.
cd /work1/your_project/singularity/
# Note: cores/GPU per worker type: gpu3 28/2, gpu4 16/8
srun --unbuffered --pty -A your_project --partition=gpu_gce \
    --qos=regular --time=08:00:00 \
    --nodes=1 --ntasks-per-node=1 --gres=gpu:p100:1 \
    --cpus-per-task=2 /bin/bash
From the worker node, set up Singularity and extract the container sandbox to /scratch:
module load cuda10/10.1
module load singularity
export SINGULARITY_CACHEDIR=/scratch/.singularity/cache
tar xf /wclustre/your_project/images/pytorch_20.10-py3.tar.bz2 --directory /scratch
Now, we can go to our work directory and activate a shell within the Ubuntu 18.04 container environment.
mkdir /scratch/work
singularity shell --userns --nv \
    --home=/work1/your_project/singularity \
    --workdir=/scratch/work \
    /scratch/pytorch_20.10-py3
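Before launching a training job it can be useful to confirm that the container's PyTorch build sees the P100; this quick sanity check is an addition, not part of the original recipe:

```shell
# Print the PyTorch version and whether CUDA can see the GPU
python -c "import torch; print(torch.__version__); print('CUDA available:', torch.cuda.is_available())"
```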
Run the PyTorch MNIST example from the container:
source /opt/conda/etc/profile.d/conda.sh
conda activate
python examples/pytorch/mnist_main.py
Additional useful information
- An OSG Helpdesk article: Docker and Singularity Containers