{"id":114,"date":"2020-08-13T11:19:39","date_gmt":"2020-08-13T16:19:39","guid":{"rendered":"http:\/\/computing.fnal.gov\/wilsoncluster\/?page_id=114"},"modified":"2024-06-13T13:56:56","modified_gmt":"2024-06-13T18:56:56","slug":"welcome-to-the-fermilab-institutional-wilson-cluster","status":"publish","type":"page","link":"https:\/\/computing.fnal.gov\/wilsoncluster\/","title":{"rendered":"Introduction"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">NOTICE: The Wilson Cluster will be decommissioned beginning June 28, 2024<\/mark><\/h1>\n\n\n\n<p><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">All HPC services provided by the Wilson cluster will cease. Users will have to consider alternatives for their computing needs. Users must migrate their data from the NFS project filesystem <code>\/work1<\/code> and off of <code>\/wclustre<\/code> to another storage system. 
Slides about the alternatives to the Wilson cluster from the June 11, 2024 meeting with Wilson users are on <a href=\"https:\/\/indico.fnal.gov\/event\/65115\/\">indico<\/a>.<\/mark><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Timeline<\/mark><\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Date<\/mark><\/strong><\/td><td><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Milestone<\/mark><\/strong><\/td><\/tr><tr><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">2024-06-28<\/mark><\/td><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">End all batch operations<\/mark><\/td><\/tr><tr><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">2024-07-15<\/mark><\/td><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Deadline to transfer data out of \/wclustre<\/mark><\/td><\/tr><tr><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">2024-07-15<\/mark><\/td><td><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Deadline to transfer data out of \/work1<\/mark><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Decommissioning updates:<\/mark><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">2024-06-12 &#8212; Ceph disk areas are mounted on data transfer node wcio.fnal.gov. 
Use wcio to copy data you wish to keep from <code>\/work1<\/code> or <code>\/wclustre<\/code> to your organization&#8217;s Ceph area.<\/mark> <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">See the file transfer tips below.<\/mark><\/li>\n\n\n\n<li><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">2024-06-11 &#8212; Wilson cluster all-hands meeting <a href=\"https:\/\/indico.fnal.gov\/event\/65115\/\">slides<\/a><\/mark><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">File transfer tips<\/mark><\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Copy to your organization&#8217;s Ceph areas from either \/work1 or \/wclustre<\/mark><\/h4>\n\n\n\n<p>Direct copies can be done from the Wilson cluster data transfer server <code>wcio.fnal.gov<\/code>, since the experiment Ceph areas, work1, and Lustre are all mounted there. Log in to <code>wcio.fnal.gov<\/code> to perform your copies. We recommend using <a href=\"https:\/\/www.redhat.com\/sysadmin\/sync-rsync\">rsync<\/a> with the archive option enabled (<code>rsync -a<\/code>). For long copies, we recommend using a <a href=\"https:\/\/github.com\/tmux\/tmux\/wiki\/Getting-Started\">tmux<\/a> terminal session so that a terminal disconnect does not interrupt the copy. 
On wcio, the Ceph areas are mounted as <code>\/exp<\/code>, Lustre is mounted as <code>\/GRIDFTPROOT\/wclustre<\/code>, and the NFS work partition is mounted as <code>\/GRIDFTPROOT\/work1<\/code>. You can view your organization&#8217;s Ceph quotas in Landscape on the <a href=\"https:\/\/landscape.fnal.gov\/monitor\/d\/d4qZ8JSSz\/cephfs-experiment-usage?orgId=1\">CephFS Usage page<\/a>. Please coordinate with your organization&#8217;s computing liaison on managing your organization&#8217;s Ceph areas and where you should put your data. Ceph is operated by the Scientific Data Services department. Questions about quotas or issues affecting Ceph operations should be filed in <a href=\"https:\/\/fermi.servicenowservices.com\/nav_to.do?uri=%2Fservice_offering.do%3Fsys_id%3Df3907a4e1b1321906ee0ea42f54bcb0e%26sysparm_view%3Dess%26sysparm_affiliation%3D\">Service Now (Ceph)<\/a>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-red-color\">Globus copies offsite (e.g. to NERSC)<\/mark><\/h4>\n\n\n\n<p>Instructions for using Globus transfers are <a href=\"https:\/\/docs.globus.org\/guides\/tutorials\/manage-files\/transfer-files\/\">here<\/a>. In the endpoint search box on the File Manager page, search for the Wilson cluster Globus endpoint, which is called &#8220;<code>Wilson Cluster Globus Endpoint<\/code>&#8221;. The Wilson endpoint uses <a href=\"https:\/\/www.cilogon.org\/\">CILogon<\/a> for authentication. The NERSC instructions for accessing their Globus endpoints are <a href=\"https:\/\/docs.nersc.gov\/services\/globus\/\">here<\/a>. 
Please be aware that the current storage allocation at NERSC was intended for staging data and is not suitable for parking your data long term.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<div class=\"alignnormal\"><div id=\"metaslider-id-1128\" style=\"width: 100%;\" class=\"ml-slider-3-107-0 metaslider metaslider-flex metaslider-1128 ml-slider ms-theme-default nav-hidden\" role=\"region\" aria-label=\"New Slideshow\" data-height=\"150\" data-width=\"700\">\n    <div id=\"metaslider_container_1128\">\n        <div id=\"metaslider_1128\" class=\"flexslider\">\n            <ul class='slides'>\n                <li style=\"display: block; width: 100%;\" class=\"slide-1134 ms-image \" aria-roledescription=\"slide\" data-date=\"2020-10-30 14:51:00\" data-filename=\"19-0150-03.hr_-scaled-e1604088080970-700x150.jpg\" data-slide-type=\"image\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/10\/19-0150-03.hr_-scaled-e1604088080970-700x150.jpg\" height=\"150\" width=\"700\" alt=\"\" class=\"slider-1128 slide-1134 msDefaultImage\" title=\"19-0150-03.hr\" \/><\/li>\n                <li style=\"display: none; width: 100%;\" class=\"slide-1137 ms-image \" aria-roledescription=\"slide\" data-date=\"2020-10-30 14:51:00\" data-filename=\"08-0338-27D-700x150.jpg\" data-slide-type=\"image\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/10\/08-0338-27D-700x150.jpg\" height=\"150\" width=\"700\" alt=\"\" class=\"slider-1128 slide-1137 msDefaultImage\" title=\"08-0338-27D\" \/><\/li>\n                <li style=\"display: none; width: 100%;\" class=\"slide-1140 ms-image \" aria-roledescription=\"slide\" data-date=\"2020-10-30 14:51:13\" data-filename=\"08-0186-09D.hr_-scaled-700x150.jpg\" data-slide-type=\"image\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/10\/08-0186-09D.hr_-scaled-700x150.jpg\" height=\"150\" width=\"700\" alt=\"\" class=\"slider-1128 slide-1140 msDefaultImage\" title=\"08-0186-09D.hr\" \/><\/li>\n                <li style=\"display: none; width: 100%;\" class=\"slide-1161 ms-image \" aria-roledescription=\"slide\" data-date=\"2020-10-30 15:01:52\" data-filename=\"13-0289-03D.hr_-scaled-700x150.jpg\" data-slide-type=\"image\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/10\/13-0289-03D.hr_-scaled-700x150.jpg\" height=\"150\" width=\"700\" alt=\"\" class=\"slider-1128 slide-1161 msDefaultImage\" title=\"13-0289-03D.hr\" \/><\/li>\n            <\/ul>\n        <\/div>\n        \n    <\/div>\n<\/div><\/div>\n\n\n\n<p>The Wilson cluster&nbsp;(WC) is a High-Performance Computing (HPC) cluster available to the entire Fermilab scientific and engineering community. The WC is designed to efficiently run and scale parallel workloads across hundreds of CPU cores and\/or multiple GPUs. The Wilson cluster provides HPC services typical of larger HPC centers such as NERSC, OLCF, or ALCF. 
The WC is considered a medium-scale HPC facility that can provide a development on-ramp to the larger HPC centers.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Features include:<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Up to O(800) CPU cores per job for tightly coupled parallel computations (MPI,&nbsp;<a href=\"https:\/\/www.openmp.org\/\">OpenMP<\/a>, &#8230;).<\/li>\n\n\n\n<li>Access to multiple A100, V100, and P100&nbsp;<a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/products\/\">NVIDIA GPUs<\/a>&nbsp;(<a href=\"https:\/\/developer.nvidia.com\/cuda-toolkit\">CUDA<\/a>, NVIDIA&nbsp;<a href=\"https:\/\/developer.nvidia.com\/hpc-sdk\">HPC SDK<\/a>).<\/li>\n\n\n\n<li>Workers equipped with multiple GPUs, so jobs can scale efficiently across GPUs.<\/li>\n\n\n\n<li>Ability to run <a href=\"https:\/\/en.wikipedia.org\/wiki\/OS-level_virtualization\">containerized<\/a> HPC and AI applications&nbsp;with <a href=\"https:\/\/apptainer.org\/\">Apptainer<\/a>.<\/li>\n\n\n\n<li>High-bandwidth, low-latency&nbsp;<a href=\"https:\/\/community.fs.com\/blog\/infiniband-vs-ethernet-which-is-right-for-your-data-center-network.html\">InfiniBand<\/a>&nbsp;networking among workers and storage.<\/li>\n\n\n\n<li>High-performance&nbsp;<a href=\"https:\/\/doc.lustre.org\/lustre_manual.xhtml#understandinglustre.tab1\">Lustre<\/a>&nbsp;parallel filesystem for efficient access to large data sets and files.<\/li>\n\n\n\n<li>NFS&nbsp;<code>\/work1<\/code>&nbsp;filesystem allowing shared access among users in the same compute project.<\/li>\n\n\n\n<li><a href=\"https:\/\/slurm.schedmd.com\/SLUG18\/slurm_overview.pdf\">Slurm<\/a>&nbsp;batch system designed to run HPC workloads at scale.<\/li>\n\n\n\n<li>Optional interactive access to worker nodes via a shell launched by Slurm.<\/li>\n\n\n\n<li>High-bandwidth data transfer node with <a href=\"https:\/\/www.globus.org\/data-transfer\">Globus<\/a> for transfers among data centers.<\/li>\n\n\n\n<li>Access to the <a 
href=\"https:\/\/cvmfs.readthedocs.io\/en\/stable\/\">CernVM-FS<\/a> software distribution service.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Use cases include:<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Code development and performance testing of parallel CPU codes.<\/li>\n\n\n\n<li>GPU code development, including the ability to test performance while running on multiple GPUs.<\/li>\n\n\n\n<li>AI model training when the convergence of HPC and AI features is critical to performance.<\/li>\n\n\n\n<li>Testbed to rapidly explore new algorithms and methods with minimal barriers to getting started and obtaining the needed computing resources.<\/li>\n\n\n\n<li>Platform for small- to medium-scale, non-critical parallel computing campaigns.<\/li>\n\n\n\n<li>A development on-ramp for HPC workflows to be run at scale at larger HPC centers.<\/li>\n\n\n\n<li>A reservable compute resource for workflows with tight deadlines or for use during hands-on workshops.<\/li>\n<\/ul>\n\n\n\n<h5 class=\"wp-block-heading\">Q&amp;A<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Who has access to Wilson?<\/strong> In short, everyone in the Fermilab community with a Kerberos identity has opportunistic access to cluster resources. Opportunistic access means your HPC jobs run at lower priority and have more restrictive limits on compute resources.<\/li>\n\n\n\n<li><strong>How do I obtain resources beyond what opportunistic access permits?<\/strong> Groups of users whose scientific or engineering goals require more resources are asked to provide justification and apply for a Wilson project account. 
See <a href=\"https:\/\/computing.fnal.gov\/wilsoncluster\/project-and-user-requests\/\">Projects and User Requests<\/a>.<\/li>\n\n\n\n<li><strong>How do I log in to the Wilson Cluster?<\/strong> Use ssh to log in to either <code>wc.fnal.gov<\/code> or <code>wc2.fnal.gov<\/code>.<\/li>\n\n\n\n<li><strong>Is my workload suitable for Wilson?<\/strong> Wilson is specifically designed to efficiently run High-Performance Computing (HPC) workloads consisting of tightly coupled parallel applications. Examples of HPC applications include Lattice QCD, computational fluid dynamics, molecular dynamics simulations, and training large AI models. If, instead, your workload consists of many independent single-core tasks that can execute concurrently or in a distributed manner, then a High Throughput Computing (HTC) facility such as <a href=\"https:\/\/hepcloud.fnal.gov\/\">HepCloud<\/a> or <a href=\"https:\/\/computing.fnal.gov\/computing-facilities-and-middleware\/\">FermiGrid<\/a> is better matched to your needs.<\/li>\n\n\n\n<li><strong>What are the advantages of doing AI training on Wilson?<\/strong> Jobs on Wilson are provided whole <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/data-center-gpus\/\">data center<\/a> GPU devices, not a partition of a device or a lower-performance &#8220;gamer&#8221; GPU. Large training jobs can take advantage of training on multiple GPUs. Lustre and InfiniBand provide low-latency, high-bandwidth access to very large data sets.<\/li>\n\n\n\n<li><strong>I prefer to use JupyterHub for my computing; can I run <a href=\"https:\/\/jupyter.org\/hub\">Jupyter<\/a> on Wilson?<\/strong> Yes, it is possible to run Jupyter from Wilson worker nodes and login nodes via ssh tunneling, but it requires extra steps, and you may need to wait in a batch queue before your session starts on a worker. Please note that Wilson does not officially support this mode of operation. 
Fermilab offers the <a href=\"https:\/\/analytics-hub.fnal.gov\/hub\/login?next=%2Fhub%2F\">Elastic Analysis Facility<\/a> specifically designed for JupyterHub. Note that VPN is required to access EAF offsite.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Schematic layout<\/h4>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"545\" src=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-1024x545.png\" alt=\"\" class=\"wp-image-2030\" srcset=\"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-1024x545.png 1024w, https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-300x160.png 300w, https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-768x409.png 768w, https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-1536x818.png 1536w, https:\/\/computing.fnal.gov\/wilsoncluster\/wp-content\/uploads\/2020\/12\/Screen-Shot-2020-12-17-at-2.48.32-PM-1-2048x1091.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>NOTICE: The Wilson Cluster will be decommissioned beginning June 28, 2024 All HPC services provided by the Wilson cluster will cease. Users will have to consider alternatives for their computing needs. Users must migrate their data from the NFS project filesystem \/work1 and off of \/wclustre to another storage system. 
Slides about the alternatives to&#8230; <a class=\"more-link\" href=\"https:\/\/computing.fnal.gov\/wilsoncluster\/\"> More &#187;<\/a><\/p>\n","protected":false},"author":15,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-114","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/114","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/comments?post=114"}],"version-history":[{"count":105,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/114\/revisions"}],"predecessor-version":[{"id":8366,"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/pages\/114\/revisions\/8366"}],"wp:attachment":[{"href":"https:\/\/computing.fnal.gov\/wilsoncluster\/wp-json\/wp\/v2\/media?parent=114"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}