{"id":1640,"date":"2020-12-02T12:50:03","date_gmt":"2020-12-02T18:50:03","guid":{"rendered":"https:\/\/computing.fnal.gov\/wilsoncluster\/?page_id=1640"},"modified":"2024-02-02T14:13:11","modified_gmt":"2024-02-02T20:13:11","slug":"containers","status":"publish","type":"page","link":"https:\/\/computing.fnal.gov\/wilsoncluster\/containers\/","title":{"rendered":"Containers"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">What is a container?<\/h2>\n\n\n\n<p>Containers are a way to package a software in a format that can run in a highly isolated environment on a host operating system. Unlike <a href=\"https:\/\/en.wikipedia.org\/wiki\/Virtual_machine\" target=\"_blank\" rel=\"noreferrer noopener\">virtual machines<\/a> (VMs), <a href=\"https:\/\/en.wikipedia.org\/wiki\/OS-level_virtualization\" target=\"_blank\" rel=\"noreferrer noopener\">containers<\/a> do not emulate the full OS kernel &#8211; only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained environments and guarantees that software will always run the same, regardless of where it\u2019s deployed. The best known container technology is <a href=\"https:\/\/www.docker.com\/what-container\">Docker<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Apptainer and Singularity<\/h2>\n\n\n\n<p>Singularity was renamed <a href=\"https:\/\/apptainer.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apptainer<\/a> and put under the stewardship of the <a href=\"https:\/\/www.linuxfoundation.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Linux Foundation<\/a> to differentiate it from the other like-named projects and commercial products [<a href=\"https:\/\/apptainer.org\/news\/community-announcement-20211130\/\" target=\"_blank\" rel=\"noreferrer noopener\">announcement<\/a>]. The Fermilab HPC clusters support the use of Apptainer. 
<\/p>\n\n\n\n<p>Unlike the Docker system, Apptainer is designed for regular users to securely run containers on a shared host system, such as an HPC cluster. Apptainer enables users to have full control of their environment. For example, the environment inside the container might be Ubuntu 24.04 or Alma Linux 9.x while the container runs on an Alma Linux 8.x host system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Containers for portability and reproducibility<\/h2>\n\n\n\n<p>Singularity-format containers can be used to package entire scientific workflows, software and libraries, and even data. They have proven particularly useful for supporting machine learning (ML) frameworks on the Fermilab HPC clusters, since ML frameworks evolve rapidly and ML software development is typically done on an operating system such as Ubuntu rather than Scientific Linux. Containers allow users to select from a wide range of ML frameworks and versions with confidence that their selected environment is isolated from changes to the underlying host OS.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where to find containers<\/h2>\n\n\n\n<p>We recommend using standard &#8220;off the shelf&#8221; containers whenever possible rather than building and maintaining customized containers. Pre-built containers are available online, and Apptainer can build a local copy of a container. Apptainer can also convert a Docker container into a Singularity-format container. <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\"><strong>Every user should be extremely cautious about the security implications of downloading and running binary code within containers<\/strong>.<\/mark> Hence, a user should only download containers that are provided by verified repositories and publishers or that they have built themselves from official Linux package repositories. 
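<\/p>\n\n\n\n<p>As a minimal illustration (once apptainer is set up as described below; the image name here is only an example), an official image can be pulled from Docker Hub and converted to a local Singularity-format file in a single step:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer pull almalinux_9.sif docker:\/\/almalinux:9<\/code><\/pre>\n\n\n\n<p>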
Docker-format containers are found at:<\/p>\n\n\n\n<p><a href=\"https:\/\/hub.docker.com\/search?q=&amp;type=image\">DockerHub<\/a>: Please ensure you filter your choices by selecting either &#8220;<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Verified Publisher<\/mark><\/strong>&#8221; or &#8220;<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Official Images<\/mark><\/strong>&#8221;.<\/p>\n\n\n\n<p><a href=\"https:\/\/ngc.nvidia.com\/catalog\/containers?orderBy=modifiedDESC&amp;pageNumber=0&amp;query=&amp;quickFilter=containers&amp;filters=\" target=\"_blank\" rel=\"noreferrer noopener\">NVIDIA NGC<\/a>: Be aware that many of the &#8220;latest&#8221; version containers built by NVIDIA may no longer support older P100 (sm60) GPUs. You may be able to find a suitable container by searching the available container tags.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Apptainer setup<\/h2>\n\n\n\n<p>Apptainer is available via software modules. We list the available versions of apptainer and then load the default version.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ module avail apptainer\n--------- \/srv\/software\/el8\/x86_64\/hpc\/lmod\/Core ----------------------\n   apptainer\/1.2.1\n\n$ module load apptainer\n$ apptainer --version\napptainer version 1.2.1<\/code><\/pre>\n\n\n\n<p>Apptainer caches the overlay pieces needed to build a container. We set an environment variable to place the cache in Lustre rather than defaulting to your \/nashome home directory. The latter has very limited storage that may not be big enough for the cache. Below, replace my_project_dir with the name of your project area. Apptainer also uses a temporary directory to assemble the image file when building a Singularity image. 
The <code>APPTAINER_TMPDIR<\/code> variable controls the location used for this temporary space, and <code>APPTAINER_CONFIGDIR<\/code> similarly relocates apptainer&#8217;s per-user configuration directory.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ export APPTAINER_CACHEDIR=\/wclustre\/my_project_dir\/apptainer\/.apptainer\/cache\n$ export APPTAINER_TMPDIR=\/wclustre\/my_project_dir\/apptainer\/.apptainer\/tmp\n$ export APPTAINER_CONFIGDIR=\/wclustre\/my_project_dir\/apptainer\/config\n$ mkdir -p $APPTAINER_CACHEDIR $APPTAINER_TMPDIR $APPTAINER_CONFIGDIR<\/code><\/pre>\n\n\n\n<p>We recommend building large and complex container images on the WC worker nodes in a batch job. The build can be done on the <code>\/scratch<\/code> partition, which typically has a few hundred GB of free space. From a shell on the worker node, use:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cd \/scratch\n$ mkdir $USER\n$ cd $USER\n$ export APPTAINER_CACHEDIR=\/scratch\/$USER\/apptainer\/.apptainer\/cache\n$ export APPTAINER_TMPDIR=\/scratch\/$USER\/apptainer\/.apptainer\/tmp\n$ export APPTAINER_CONFIGDIR=\/scratch\/$USER\/apptainer\/config\n$ mkdir -p $APPTAINER_CACHEDIR $APPTAINER_TMPDIR $APPTAINER_CONFIGDIR<\/code><\/pre>\n\n\n\n<p>Please note that you must copy any containers you build in <code>\/scratch<\/code> to either <code>\/wclustre<\/code> or <code>\/work1<\/code> before you exit the batch job, since <code>\/scratch<\/code> is cleaned at the end of the job.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Cleaning the cache<\/h4>\n\n\n\n<p>This command lists the currently cached files.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer cache list\nThere are 0 container file(s) using 0.00 KiB and 0 oci blob file(s) using 0.00 KiB of space\nTotal space used: 0.00 KiB<\/code><\/pre>\n\n\n\n<p>The block below illustrates the command used to clean the cache. Remove the <code>--dry-run<\/code> flag to actually remove cached files.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer cache clean --dry-run\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Example: Building a container from Docker Hub<\/h2>\n\n\n\n<p>Your project area in Lustre is a convenient place to store very large images you need for your work. 
Since the container in this example is less than 90 MB and requires little cache space during the build, we can build it directly in Lustre from the login node rather than from a batch job.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cd \/wclustre\/my_project_dir\n$ mkdir images\n$ cd images<\/code><\/pre>\n\n\n\n<p>The build command below creates a local Singularity-format image from a Docker container. It downloads the storage overlays for the <a rel=\"noreferrer noopener\" href=\"https:\/\/hub.docker.com\/r\/godlovedc\/lolcow\/dockerfile\" target=\"_blank\">lolcow<\/a> container from Docker Hub and creates a copy of the container called <code>lolcow.sif<\/code> in the Lustre directory.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ HOME=\/work1\/my_project_dir apptainer build lolcow.sif docker:\/\/godlovedc\/lolcow<\/code><\/pre>\n\n\n\n<p>Note that above we have reset the <code>HOME<\/code> variable while running the command. This is needed when <code>\/nashome<\/code> is not accessible to <code>apptainer<\/code>, as when doing the build from a batch job.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Running the container<\/h4>\n\n\n\n<p>Containers often define a default action when activated by the <code>run<\/code> command. Running the lolcow container prints a random fortune.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer run lolcow.sif\n _____________________________________\n\/ You are not dead yet. But watch for \\\n\\ further reports.                    \/\n -------------------------------------\n        \\   ^__^\n         \\  (oo)\\_______\n            (__)\\       )\\\/\\\n                ||----w |\n                ||     ||<\/code><\/pre>\n\n\n\n<p>It is also possible to start a shell within the container. 
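<\/p>\n\n\n\n<p>Similarly, a single command can be executed inside the container without an interactive shell by using <code>apptainer exec<\/code>. A brief sketch (this assumes <code>cowsay<\/code> is installed inside the lolcow image):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer exec lolcow.sif cowsay moo<\/code><\/pre>\n\n\n\n<p>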
Below we start a shell and type a command to determine the guest OS (Ubuntu) within the container.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer shell --home=\/work1\/my_project_dir lolcow.sif\nApptainer&gt; cat \/etc\/os-release\nNAME=\"Ubuntu\"\nVERSION=\"16.04.3 LTS (Xenial Xerus)\"\nID=ubuntu\nID_LIKE=debian\nPRETTY_NAME=\"Ubuntu 16.04.3 LTS\"\nVERSION_ID=\"16.04\"\nHOME_URL=\"http:\/\/www.ubuntu.com\/\"\nSUPPORT_URL=\"http:\/\/help.ubuntu.com\/\"\nBUG_REPORT_URL=\"http:\/\/bugs.launchpad.net\/ubuntu\/\"\nVERSION_CODENAME=xenial\nUBUNTU_CODENAME=xenial\nApptainer&gt; ^D\n$<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Building a container from an Apptainer \/ Singularity recipe<\/h2>\n\n\n\n<p>The steps needed to build a container can be described in a text file. The file <code>alma_9.x.def<\/code> describes an Alma Linux 9.x (Enterprise Linux 9) container that also provides access to the EPEL RPM repository. The recipe starts from an Alma 9 image obtained from Docker Hub.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cat alma_9.x.def\n# An Alma Linux 9.x container\n\nBootstrap: docker\nFrom: almalinux:9\n\n%post\n    dnf install -y epel-release<\/code><\/pre>\n\n\n\n<p>Additional software packages can be added to this container at build time by adding <code>dnf install<\/code> commands to the <code>%post<\/code> section of the basic <code>alma_9.x.def<\/code> file.<\/p>\n\n\n\n<p>The build command will create a container named <code>alma_9.x.sif<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer build alma_9.x.sif alma_9.x.def<\/code><\/pre>\n\n\n\n<p>We can start a shell in the resulting container to verify that the guest operating system is Alma Linux 9.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer shell alma_9.x.sif\nApptainer&gt; cat \/etc\/os-release\nNAME=\"AlmaLinux\"\nVERSION=\"9.3 (Shamrock Pampas Cat)\"\nID=\"almalinux\"\nID_LIKE=\"rhel centos fedora\"\nVERSION_ID=\"9.3\"\nPLATFORM_ID=\"platform:el9\"\nPRETTY_NAME=\"AlmaLinux 9.3 
(Shamrock Pampas Cat)\"\nREDHAT_SUPPORT_PRODUCT=\"AlmaLinux\"\nREDHAT_SUPPORT_PRODUCT_VERSION=\"9.3\"\nApptainer&gt; ^D\n$\n\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Using a PyTorch container from NVIDIA<\/h2>\n\n\n\n<p>NVIDIA provides many versions of PyTorch containers. See <a href=\"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/pytorch\" data-type=\"URL\" data-id=\"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/pytorch\" target=\"_blank\" rel=\"noreferrer noopener\">PyTorch NGC<\/a> for a list of available containers. <\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Building a local copy of the container<\/h4>\n\n\n\n<p>We use apptainer to build a copy of the PyTorch container from NVIDIA. Since this is a large, complex container, we do the build in a batch job and then copy the container to Lustre for future reuse. Although PyTorch is GPU-accelerated, we do not need a GPU worker node to do the build. We start an interactive batch session on a CPU-only worker, set up apptainer, and set up the HTTP proxy.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ srun --unbuffered --pty -A myAccount --qos=regular \\\n       --partition=wc_cpu --nodes=1 --time=02:00:00 \\\n       --ntasks-per-node=1 --cpus-per-task=16 \/bin\/bash\n# batch job has started\n$ cd \/scratch\/\n$ mkdir $USER\n$ cd $USER\n$ module load apptainer\n$ export APPTAINER_CACHEDIR=\/scratch\/$USER\/apptainer\/.apptainer\/cache\n$ export APPTAINER_TMPDIR=\/scratch\/$USER\/apptainer\/.apptainer\/tmp\n$ export APPTAINER_CONFIGDIR=\/scratch\/$USER\/apptainer\/config\n$ mkdir -p $APPTAINER_CACHEDIR $APPTAINER_TMPDIR $APPTAINER_CONFIGDIR\n$ export https_proxy=http:\/\/squid.fnal.gov:3128\n$ export http_proxy=http:\/\/squid.fnal.gov:3128\n<\/code><\/pre>\n\n\n\n<p>We next build the PyTorch container using the setup above. 
See <a href=\"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/pytorch\" data-type=\"URL\" data-id=\"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/pytorch\" target=\"_blank\" rel=\"noreferrer noopener\">PyTorch NGC<\/a> for the current list of available containers. For the build below, we have reset <code>HOME<\/code> to mitigate an issue affecting apptainer when <code>$HOME<\/code> is not accessible.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ HOME=\/scratch\/$USER apptainer pull \\\n                      pytorch-23.12-py3.sif \\\n                      docker:\/\/nvcr.io\/nvidia\/pytorch:23.12-py3\n(several minutes and lots of screen output from the build)\nINFO:    Creating SIF file...\n$ ls -sh pytorch-23.12-py3.sif\n9.4G pytorch-23.12-py3.sif<\/code><\/pre>\n\n\n\n<p>Remember to make a copy of your image before ending your batch job.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cp pytorch-23.12-py3.sif \/wclustre\/my_project_dir\/images\/\n<\/code><\/pre>\n\n\n\n<p>We do a simple test of PyTorch in the container.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ apptainer shell --home=\/scratch\/$USER pytorch-23.12-py3.sif\nApptainer&gt; python\nPython 3.10.12 (main, Nov 20 2023, 15:14:05) &#91;GCC 11.4.0] on linux\nType \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n&gt;&gt;&gt; import torch\n&gt;&gt;&gt; torch.__version__\n'2.2.0a0+81ea7a4'\n&gt;&gt;&gt; torch.cuda.is_available()\nFalse\n&gt;&gt;&gt; ^D\nApptainer&gt; ^D\n$ exit # from batch<\/code><\/pre>\n\n\n\n<p>Above, the torch module was loaded from python, but CUDA is not available since the build job was run on a system without GPUs. Note that you can still run torch; however, performance will be slower without a GPU.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Using the PyTorch container on a GPU worker from batch<\/h4>\n\n\n\n<p>As an example, we train a neural network on the MNIST training data. 
The PyTorch examples are found on <a href=\"https:\/\/github.com\/pytorch\/examples\">GitHub<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ cd \/work1\/my_project_dir\/torch\n$ module load git\n$ git clone https:\/\/github.com\/pytorch\/examples.git\n<\/code><\/pre>\n\n\n\n<p>We start an interactive job on a GPU worker, asking for a single NVIDIA P100 GPU.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ module load apptainer\n$ srun --unbuffered --pty -A myAccount --qos=regular --time=1:00:00 \\\n       --partition=wc_gpu --gres=gpu:p100:1 --nodes=1 \\\n       --ntasks-per-node=1 --cpus-per-gpu=4 --mem-per-gpu=64G \\\n       \/bin\/bash\n$ hostname\nwcgpu02.fnal.gov\n$ pwd\n\/work1\/my_project_dir\/torch<\/code><\/pre>\n\n\n\n<p>We start a shell within the PyTorch container and run the <a href=\"https:\/\/github.com\/pytorch\/examples\/blob\/main\/mnist\/main.py\">example<\/a>. The <code>--nv<\/code> flag is needed to allow the container to access GPUs on the host system. The <code>--home=\/work1\/my_project_dir<\/code> option sets the home directory inside the container to <code>\/work1<\/code> rather than your <code>\/nashome<\/code> directory.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ export https_proxy=http:\/\/squid.fnal.gov:3128\n$ export http_proxy=http:\/\/squid.fnal.gov:3128\n$ mkdir test\n$ cd test\n$ apptainer shell --nv --home=\/work1\/my_project_dir \\\n        \/wclustre\/my_project_dir\/images\/pytorch-23.12-py3.sif\n# check that torch detects the GPU\nApptainer&gt; python\nPython 3.10.12 (main, Nov 20 2023, 15:14:05) &#91;GCC 11.4.0] on linux\n&gt;&gt;&gt; import torch\n&gt;&gt;&gt; torch.cuda.is_available()\nTrue\n&gt;&gt;&gt; ^D\n# run the example\nApptainer&gt; python ..\/examples\/mnist\/main.py\n(the MNIST data set is downloaded to directory ..\/data)\n(training progress is reported)\nTrain Epoch: 14 &#91;59520\/60000 (99%)]\tLoss: 0.002743\nTest set: Average loss: 0.0259, Accuracy: 9917\/10000 
(99%)\nApptainer&gt;<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Using a TensorFlow container from NVIDIA<\/h2>\n\n\n\n<p>The steps needed to obtain and use a TensorFlow image are analogous to those for a PyTorch container. See <a href=\"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/tensorflow\">NGC TensorFlow<\/a> for the available containers.<\/p>\n\n\n\n<p>The example <a href=\"https:\/\/github.com\/keras-team\/keras-io\/blob\/master\/examples\/nlp\/addition_rnn.py\">addition_rnn.py<\/a>, written in <a href=\"https:\/\/www.tensorflow.org\/guide\/keras\">Keras<\/a>, trains an RNN to perform addition of integers presented as strings.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Additional Information<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/apptainer.org\/docs\/user\/latest\/\">Apptainer user guide<\/a><\/li>\n\n\n\n<li>NVIDIA <a href=\"https:\/\/github.com\/NVIDIA\/hpc-container-maker\">HPC Container Maker<\/a> &#8212; an open source tool to make it easier to generate container specification files.<\/li>\n\n\n\n<li>Ten simple rules for writing Dockerfiles for reproducible data science &#8212; <a href=\"https:\/\/journals.plos.org\/ploscompbiol\/article\/file?id=10.1371\/journal.pcbi.1008316&amp;type=printable\">PLoS Comput Biol 16(11): e1008316<\/a><\/li>\n\n\n\n<li>Slack <a href=\"http:\/\/apptainer.slack.com\">Apptainer<\/a><\/li>\n\n\n\n<li>Slack <a href=\"http:\/\/hpc-containers.slack.com\">hpc-containers<\/a><\/li>\n\n\n\n<li>Open Science Grid <a href=\"https:\/\/portal.osg-htc.org\/documentation\/htc_workloads\/using_software\/containers-singularity\/\">Containers &#8211; Apptainer\/Singularity<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is a container? Containers are a way to package software in a format that can run in a highly isolated environment on a host operating system. 
Unlike virtual machines (VMs), containers do not emulate the full OS kernel &#8211; only libraries and settings required to make the software work are needed. This makes&#8230; <a class=\"more-link\" href=\"https:\/\/computing.fnal.gov\/wilsoncluster\/containers\/\"> More &#187;<\/a><\/p>\n","protected":false},"author":24,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1640","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/1640","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/comments?post=1640"}],"version-history":[{"count":126,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/1640\/revisions"}],"predecessor-version":[{"id":7637,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/1640\/revisions\/7637"}],"wp:attachment":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/media?parent=1640"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}