Big Data

Experimental particle physics has been at the forefront of analyzing the world’s largest datasets for decades. The high-energy physics (HEP) community was amongst the first to develop suitable software and computing tools for this task.

In recent years, new open-source toolkits and systems, collectively called “Big Data” technologies, have emerged to support the analysis of petabyte- and exabyte-scale datasets. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these technologies take different approaches, offer a fresh look at analyzing very large datasets, and could reduce time-to-physics through increased interactivity.

The aim of our big data project is to understand the role of big data technologies such as Spark on HPC platforms for high-energy physics data-processing tasks (a non-traditional HPC workload), and to define how exascale-capable visualization tools can be incorporated for algorithm development and visual debugging. Our HEP use cases come from CMS and LArTPC-based experiments.

The goals of our project are:

  • Bring HPC and exascale tools (e.g. ParaView for visualization and HDF5 as a data format) into big data technologies (see the sketch after this list).
  • Work with science drivers from the CMS and LArTPC-based experiments.
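As a concrete illustration of the first goal, the following is a minimal sketch, not the project's actual tooling, of exposing HDF5-resident event data to Spark using PySpark and h5py. The file name `events.h5`, the dataset names `MET_pt` and `MET_phi`, and the 100 GeV cut are assumptions for illustration.

```python
# Minimal sketch: load flat, per-event HDF5 datasets into a Spark DataFrame.
# File and dataset names are hypothetical; this is not the project's code.
import h5py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-to-spark-sketch").getOrCreate()

# Read the columns on the driver (fine for a small demo file; a production
# workflow would instead read partitioned data on the executors).
with h5py.File("events.h5", "r") as f:
    met_pt = f["MET_pt"][:].tolist()
    met_phi = f["MET_phi"][:].tolist()

df = spark.createDataFrame(list(zip(met_pt, met_phi)),
                           schema=["met_pt", "met_phi"])

# Example analysis step: count high missing-transverse-energy events.
n_high_met = df.filter(df.met_pt > 100.0).count()
print(f"events with MET > 100 GeV: {n_high_met}")
```

Reading the whole file on the driver is only suitable for a demonstration; analyzing full-size datasets requires distributing the reads across executors, which is part of what running Spark on HPC platforms is meant to address.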

We are currently working with the CMS big data science project as our first use case: we have provided tools for converting CMS bacon files (stored in ROOT) to the HDF5 format, and have set up workflows to load and analyze the converted data with Spark on the NERSC systems Edison and Cori. Our code and recent talks are available in a GitHub repository.
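For context, a conversion step of this kind might look like the hedged sketch below, written with uproot and h5py rather than the project's actual converter; the file, tree, and branch names are assumptions.

```python
# Hedged sketch of a ROOT-to-HDF5 conversion step (not the project's actual
# bacon-file converter). File, tree, and branch names are assumptions.
import h5py
import uproot

with uproot.open("bacon.root") as f:
    tree = f["Events"]
    # Read a few flat (one value per event) branches into NumPy arrays.
    arrays = tree.arrays(["MET_pt", "MET_phi", "nMuon"], library="np")

with h5py.File("bacon.h5", "w") as out:
    for name, values in arrays.items():
        # One HDF5 dataset per branch; gzip compression keeps the output compact.
        out.create_dataset(name, data=values, compression="gzip")
```

Variable-length (jagged) branches require extra handling, for example flattening the values alongside an offsets dataset, which a full converter would need to address.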