Fermilab Computational Science Internships (FCSI) for 2021

Please check the FCSI program page for additional information about the internship and how to apply.

Big Data analysis of data transfers in multi-petabyte distributed storage system

Fermilab develops and operates a tiered data storage system that stores, manages, and delivers hundreds of petabytes of scientific data to scientists. A tape backend and a distributed disk cache frontend form the two tiers of the system. The frontend provides fast access to a subset of the data but is limited by the aggregate cache size, while the backend provides large-scale “cold data” storage with bandwidth limited by the number of tape drives serving the system.

The disk cache, tape drives, and available network bandwidth are limited resources, requiring clever data flow management to achieve maximum data delivery efficiency. Going forward, the amount of data generated by physics experiments is expected to increase dramatically, while resource levels are expected to grow at a much slower pace. It is therefore important to understand how the system is used over time in order to find and eventually eliminate inefficiencies or pathological behavior.

Each data transfer in the system is recorded, producing data samples of about 10 GB per month. We want to perform Big Data analysis on these data with the goal of identifying patterns that could lead to system configuration improvements. Ultimately, we’d like to develop Machine Learning-based tools that will aid in implementing self-correcting behavior in the storage system.

The intern will work with storage developers and operators to create a system to analyze the data transfers using Apache Spark, Python 3, and Jupyter Notebook.
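
For a flavor of the intended workflow, the minimal PySpark sketch below aggregates per-pool transfer volume and rate by hour to surface hot spots. The input path and the record schema (timestamp, pool, file_size, duration) are assumptions for illustration; the actual fields would come from the storage system's transfer logs.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transfer-analysis").getOrCreate()

    # One month of transfer records; the path and JSON format are placeholders.
    transfers = spark.read.json("transfers-2021-01.json")

    # Per-pool, per-hour transfer counts, bytes moved, and average rate,
    # sorted to surface the busiest pools first.
    hourly = (
        transfers
        .withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
        .groupBy("pool", "hour")
        .agg(
            F.count("*").alias("n_transfers"),
            F.sum("file_size").alias("bytes_moved"),
            F.avg(F.col("file_size") / F.col("duration")).alias("avg_rate"),
        )
        .orderBy(F.desc("bytes_moved"))
    )

    hourly.show(20)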


Desired Experience/Qualifications

  • Experience with Python is required.
  • Experience in a UNIX/Linux environment is desired.
  • Experience with shell scripting languages is desired.
  • Experience with SQL is desired.
  • Prior experience with Big Data analysis technology is beneficial, but not required.
  • Master’s student (year 1 or year 2) in Computer Science.


Enhanced I/O systems for particle physics experiments

Current and next-generation neutrino and precision muon experiments at Fermilab will collect data at the multi-petabyte level and will require computing resources of a wide variety of types at sites distributed around the world. One major challenge is data movement and management. With such a large array of possible resource types and storage elements, it can be extremely difficult for end users to consistently make optimal choices when performing analysis, and what works on one set of resources may not work on another.


The intern will work with offline production and data management personnel to design and build enhanced I/O services for use by experiments within jobs. These services would perform functions such as automatically choosing the optimal location from which to read inputs for a given job (or even steering jobs to known data locations) and choosing the most efficient location for initial job output storage. These services would interface with existing and planned data management and replication systems to allow for optimal data movement in and out of jobs with minimal user intervention.
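
As a rough illustration of the replica-selection piece, the toy Python sketch below picks the cheapest location to read from under a simple cost model. The Replica fields, the cost function, and the example URLs are all hypothetical; a real service would query the experiments' data management and replication systems for this information.

    from dataclasses import dataclass

    @dataclass
    class Replica:
        site: str
        url: str
        network_cost: float  # relative cost of reaching this site from the job (hypothetical metric)
        load: float          # current load on the storage element, 0.0-1.0 (hypothetical metric)

    def best_replica(replicas: list[Replica]) -> Replica:
        """Pick the replica minimizing a combined network-cost and load score."""
        return min(replicas, key=lambda r: r.network_cost * (1.0 + r.load))

    # Invented example inputs; a real catalog query would supply these.
    candidates = [
        Replica("site_a", "root://site-a.example//store/file.root", 1.0, 0.7),
        Replica("site_b", "root://site-b.example//store/file.root", 3.5, 0.1),
    ]
    print(best_replica(candidates).url)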


Desired Experience/Qualifications

  • Strong organizational skills
  • Familiarity with one or more scripting languages (e.g., Python) desired
  • Experience within a Linux environment highly desired
  • Experience with a distributed computing environment desirable, but not required


HEP Data Science at Exascale

A research tool for physics analysis named PandAna has emerged from the neutrino community over the past year. It combines HPC processing using MPI, Python, and Pandas into a collaborative analysis environment. This internship entails contributing to a revision of PandAna that uses compiler optimization techniques to build and evaluate expressions containing user-written filter, transformation, and reduction functions. The work includes demonstrating PandAna in realistic analysis scenarios for challenge problems at CMS and DUNE, using thousands of nodes available at the NERSC and ALCF facilities to process terabytes of data at interactive speeds. Terabytes of experimental data will need to be collected and reorganized to fit PandAna’s HDF5 parallel processing model. Collaboration with the CCE and the SciDAC RAPIDS institute, including Northwestern University and Argonne National Laboratory, will be necessary for assessing and optimizing PandAna’s I/O and use of HDF5.
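
To make the HDF5 parallel processing model concrete, the sketch below shows the per-rank slab decomposition it implies: each MPI rank reads a disjoint slice of an event table, and the ranks then combine their partial results. The file and dataset names are placeholders, not PandAna's actual layout.

    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Placeholder file and dataset names, not PandAna's actual layout.
    with h5py.File("events.h5", "r") as f:
        dset = f["events/energy"]
        n = dset.shape[0]
        lo = rank * n // size        # contiguous slab owned by this rank
        hi = (rank + 1) * n // size
        local = dset[lo:hi]          # each rank reads only its slab

    # Local partial sum, then a global reduction across ranks.
    total = comm.allreduce(np.sum(local), op=MPI.SUM)
    if rank == 0:
        print("total energy:", total)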

A primary goal of PandAna’s computational model is to take a large collection of filter and transformation functions contributed by many physicists and to organize them into an executable expression. The complete expression formed from these functions is executed using multiple levels of parallelism (multi-node and vectorization) to yield a greatly reduced dataset that is used in the final measurements presented in science papers.
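
In Python terms, the model resembles the following sketch, in which contributed cut and variable functions over a pandas DataFrame are combined into a single expression and then reduced. The function and column names here are invented for illustration and do not reflect PandAna's actual API.

    import pandas as pd

    # User-contributed pieces: cuts return boolean masks, variables return columns.
    def cut_fiducial(df: pd.DataFrame) -> pd.Series:
        return df["vtx_z"].abs() < 50.0

    def cut_energy(df: pd.DataFrame) -> pd.Series:
        return df["energy"] > 0.5

    def var_calibrated_energy(df: pd.DataFrame) -> pd.Series:
        return 1.02 * df["energy"]

    def evaluate(df, cuts, var, reduce):
        """Combine all cuts into one mask, apply the variable, then reduce."""
        mask = pd.Series(True, index=df.index)
        for cut in cuts:
            mask &= cut(df)
        return reduce(var(df[mask]))

    # Tiny invented dataset standing in for a large experimental sample.
    df = pd.DataFrame({"vtx_z": [10.0, 80.0, -30.0], "energy": [1.2, 2.0, 0.3]})
    mean_e = evaluate(df, [cut_fiducial, cut_energy], var_calibrated_energy, pd.Series.mean)
    print(mean_e)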

The intern will work with the experiments to produce CMS and DUNE datasets for this challenge, work with Fermilab and university researchers to introduce compiler optimization techniques into the system, and participate in performance evaluation of the system at scale at ALCF and NERSC.


Desired Experience/Qualifications

  • Understanding of and interest in the fundamentals of automata theory, formal languages, or compilers and compiler optimization techniques
  • Familiarity with Python programming
  • Experience with MPI
  • Experience with using a revision control system, especially git
  • Experience with numpy and pandas is beneficial, but not necessary


Please check the FCSI program page for additional information about the internship and how to apply.