Big Data

Experimental particle physics has been at the forefront of analyzing the world’s largest datasets for decades. The high-energy physics (HEP) community was amongst the first to develop suitable software and computing tools for this task.

In recent times, new open source toolkits and systems collectively called “Big Data” technologies have emerged to support the analysis of petabyte and exabyte datasets. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets and could potentially reduce the time-to-physics with increased interactivity.

Several projects are investigating new technologies related to Big Data.

The aim of our first big data project is to understand the role of big data technologies, such as Spark and others, on HPC platforms for high-energy physics data-processing tasks (non-traditional HPC), and to define the role of incorporating exascale-capable visualization tools for algorithm development and visual debugging. Our HEP use cases will be from CMS and LArTPC experiments.

The goals of our project are:

  • Bring HPC and exascale tools (e.g. Paraview for visualization and HDF5 as data format) into big data technologies.
  • Work with science drivers from the CMS and LArTPC-based experiments.

We are currently working with the CMS big data science project as our first use case; we have provided tools for converting CMS bacon files (in ROOT) to the HDF5 format, and set up to load and analyze the converted data using Spark on NERSC (Edison, Cori). Our code and recent talks are available in a GitHub repository.

The second project investigates the direct usage of CMS data in ROOT format, the dominant data format in HEP, from the open source Apache Spark system. The project investigates the possible scaling behavior through a CERN openlab project together with Intel, and studies the usability by comparing the experiences of different user groups with the new tool. The main documentation can be found here:

The third project takes a different approach and loads CMS data into a NoSQL database. Using numpy arrays to represent columns of data and a multi-tiered caching infrastructure, the data can be analyzed from a single notebook at very fast speed. The main documentation can be found here: