Containers

What is a container?

Containers are a way to package software in a format that can run in an isolated environment on a host operating system. Unlike virtual machines (VMs), containers do not emulate a full machine or ship their own kernel – they share the host kernel and include only the libraries and settings required to make the software work. This makes for efficient, lightweight, self-contained environments and helps ensure that software runs the same regardless of where it is deployed. The best-known container technology is Docker.

Singularity

Unlike Docker, Singularity is designed for unprivileged users to run containers on a shared host system, such as an HPC cluster. Singularity gives users full control of their environment: for example, the environment inside the container might be Ubuntu 18.04 or CentOS 8 while the container runs on an SL7 host system. Singularity containers can package entire scientific workflows, software and libraries, and even data. They have proven particularly useful for supporting machine learning (ML) frameworks on the Fermilab HPC clusters, since ML frameworks evolve rapidly and ML software development is typically done on an operating system such as Ubuntu rather than Scientific Linux. Containers allow users to select from a wide range of ML frameworks and versions with confidence that their chosen environment is isolated from changes to the underlying host OS.

Where to find containers

Pre-built containers are available online, and Singularity can build a local copy of a container, including by converting a Docker container into a Singularity container. Every user should be extremely cautious about the security implications of downloading binary code within containers: only download containers that are provided by verified repositories and publishers, or that you have built yourself from official Linux package repositories. Docker-format containers can be found at:

DockerHub: Please ensure you filter your choices by selecting either “Verified Publisher” or “Official Images”.

NVIDIA NGC: Be aware that many of the “latest” version containers built by NVIDIA no longer support older K40 (sm_35) GPUs. You may be able to find a suitable container by searching the available container tags.
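As a quick check before committing to the sandbox workflow described below, you can pull a single-file (SIF) copy of an official image and inspect its metadata. This is a minimal sketch; the tensorflow/tensorflow image is used purely as an example:

```shell
module load singularity

# Pull an official image from DockerHub into a single SIF file
# (creates tensorflow_latest-gpu.sif in the current directory)
singularity pull docker://tensorflow/tensorflow:latest-gpu

# Inspect the labels recorded in the image
singularity inspect tensorflow_latest-gpu.sif
```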

Example: A TensorFlow container from DockerHub

First, set up Singularity on the cluster login host. Then, build the Singularity container in sandbox (directory) format in Lustre; we use Lustre since every project has a default storage allocation there. The container is the official TensorFlow container from DockerHub.

module load singularity
singularity --version

cd /wclustre/your_project/images
export SINGULARITY_CACHEDIR=/wclustre/your_project/images/.singularity/cache

singularity build --fix-perms --sandbox \
    tensorflow_latest_gpu \
    docker://tensorflow/tensorflow:latest-gpu

The sandbox directory is called tensorflow_latest_gpu. Lustre does not handle large numbers of small files well, and a sandbox consists of many small files. Hence, first build a compressed tarball from the sandbox and then remove the sandbox directory. The tarball is stored as one large file on /wclustre, and the sandbox directory is restored from it as needed in subsequent batch jobs.

tar cf tensorflow_latest_gpu.tar tensorflow_latest_gpu/
bzip2 -9 tensorflow_latest_gpu.tar
rm -rf tensorflow_latest_gpu/
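The restore-and-run pattern used in the interactive session below can also be scripted as a non-interactive batch job. The following is a sketch only: the project name and paths follow the examples on this page, and my_train.py is a hypothetical training script you would replace with your own.

```shell
#!/bin/bash
#SBATCH -A your_project
#SBATCH --partition=gpu_gce
#SBATCH --qos=regular
#SBATCH --time=08:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2

module load singularity
export SINGULARITY_CACHEDIR=/scratch/.singularity/cache

# Restore the sandbox from the cached tarball on Lustre
tar xf /wclustre/your_project/images/tensorflow_latest_gpu.tar.bz2 --directory /scratch

# Run a (hypothetical) training script inside the container
mkdir -p /scratch/work
singularity exec --userns --nv \
    --workdir=/scratch/work \
    --home=/work1/your_project/singularity/home \
    /scratch/tensorflow_latest_gpu \
    python my_train.py
```

Using singularity exec rather than singularity shell lets the job run a single command non-interactively and exit when it completes.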

We will start an interactive Slurm job and run this container interactively on a GPU-accelerated worker node. We request a single GPU of any architecture and two CPU threads per task.

cd /work1/your_project/singularity

srun --unbuffered --pty -A your_project --partition=gpu_gce \
     --qos=regular --time=08:00:00 \
     --nodes=1 --ntasks-per-node=1 --gres=gpu:1 \
     --cpus-per-task=2 /bin/bash

The working directory when the interactive job starts is /work1/your_project/singularity. We extract the sandbox directory into the local /scratch directory on the worker. Note that the Singularity sandbox directory and all other files are removed from /scratch at the end of your batch session.

module load singularity
export SINGULARITY_CACHEDIR=/scratch/.singularity/cache

tar xf /wclustre/your_project/images/tensorflow_latest_gpu.tar.bz2 --directory /scratch

The sandbox has been restored to /scratch on the worker node. Run the Singularity sandbox from the directory of your choice, e.g., your /work1 area.

The singularity options used below are: --userns to run in unprivileged user-namespace mode, --nv to map the host NVIDIA GPU drivers into the container, --home to remap the container's home directory to the given /work1 path, and --workdir to give singularity a /scratch location to use for the container's /tmp and /var/tmp.

mkdir /scratch/work

singularity shell --userns --nv \
        --workdir=/scratch/work \
        --home=/work1/your_project/singularity/home \
          /scratch/tensorflow_latest_gpu

The environment within the container is Ubuntu 18.04:

Singularity> cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"

From within the container, query for the attached GPU:

nvidia-smi

Run the TensorFlow/Keras MNIST example to train a network on the MNIST dataset.

python examples/tf_keras/mnist_convnet.py
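If the training example fails to find the GPU, a quick sanity check (run from inside the container) is to ask TensorFlow directly which devices it sees:

```shell
# List the GPUs visible to TensorFlow inside the container;
# an empty list usually means the container was started without --nv
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```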

Example: A PyTorch container from NVIDIA

Please note that recent pre-built PyTorch images available from NVIDIA do not support older K40 GPUs. In this example we will request a worker equipped with more capable P100 GPUs. The NVIDIA site also hosts older container builds of earlier PyTorch versions that will run on K40 GPUs.

From the cluster login node, configure Singularity and relocate the cache directory used in building the container. We build the container sandbox directory in Lustre.

module load singularity
cd /wclustre/your_project/images
export SINGULARITY_CACHEDIR=/wclustre/your_project/images/.singularity/cache

singularity build --fix-perms --sandbox \
            pytorch_20.10-py3 \
            docker://nvcr.io/nvidia/pytorch:20.10-py3

The container sandbox is called pytorch_20.10-py3. We build a compressed tarball of the sandbox that will later be unpacked for use in batch jobs.

tar cf pytorch_20.10-py3.tar pytorch_20.10-py3/
bzip2 -9 pytorch_20.10-py3.tar
rm -rf pytorch_20.10-py3/

Start an interactive job requesting a worker equipped with an NVIDIA P100 GPU.

cd /work1/your_project/singularity/

srun --unbuffered --pty -A your_project --partition=gpu_gce --qos=regular --time=08:00:00 \
     --nodes=1 --ntasks-per-node=1 --gres=gpu:p100:1 --cpus-per-task=2 /bin/bash

From the worker node, set up Singularity and extract the container sandbox to /scratch.

module load cuda10/10.1
module load singularity
export SINGULARITY_CACHEDIR=/scratch/.singularity/cache

tar xf /wclustre/your_project/images/pytorch_20.10-py3.tar.bz2 --directory /scratch

Now, we can go to our work directory and activate a shell within the Ubuntu 18.04 container environment.

mkdir /scratch/work
singularity shell --userns --nv \
       --home=/work1/your_project/singularity \
       --workdir=/scratch/work \
         /scratch/pytorch_20.10-py3

Run the PyTorch MNIST example from within the container:

source /opt/conda/etc/profile.d/conda.sh
conda activate

python examples/pytorch/mnist_main.py
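As with the TensorFlow example, it can be worth confirming that PyTorch sees the GPU before launching longer jobs. This one-liner is run from inside the container:

```shell
# True and a nonzero count indicate the --nv GPU mapping is working
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```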

Additional useful information