IceCube: Detecting Cosmic Rays
Who am I?
I’m James Bourbeau, a graduate student in the Physics department at the University of Wisconsin at Madison. I work at the IceCube South Pole Neutrino Observatory, studying the cosmic-ray energy spectrum.
What problem am I trying to solve?
Cosmic rays are energetic particles that originate from outer space. While they have been studied since the early 1900s, the sources of high-energy cosmic rays are still not well known. I analyze data collected by IceCube to study how the cosmic-ray spectrum changes with energy and particle mass; this can help provide valuable insight into our understanding of the origin of cosmic rays.
This involves developing algorithms to perform energy reconstruction as well as particle mass group classification for events detected by IceCube. In addition, we use detector simulation and an iterative unfolding algorithm to correct for inherent detector biases and the finite resolution of our reconstructions.
How Dask helps us
I originally chose to use Dask because of the Dask Array and Dask DataFrame data structures. I use Dask DataFrame to load thousands of HDF files and then apply further feature engineering and filtering as data preprocessing steps. The final dataset can be up to 100 GB in size, which is too large to load into our available RAM. Being able to easily distribute this load while still using the familiar pandas API has become invaluable in my research.
Later I discovered the Dask delayed interface and now use it to parallelize code that doesn’t easily conform to the Dask Array or Dask DataFrame use cases. For example, I often need to perform thousands of independent calculations, one for each pixel in a HEALPix sky map. I’ve found Dask delayed to be really useful for parallelizing these types of embarrassingly parallel calculations with minimal hassle.
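The per-pixel pattern can be sketched like this, with a toy stand-in for the real per-pixel analysis function (`process_pixel` and the squaring it does are purely illustrative):

```python
import dask

# Stand-in for a per-pixel HEALPix calculation; the real function would
# run an independent analysis for each pixel of the sky map.
def process_pixel(pixel_id):
    return pixel_id ** 2

# Wrapping each call with dask.delayed builds the task graph lazily
tasks = [dask.delayed(process_pixel)(i) for i in range(8)]

# dask.compute executes the embarrassingly parallel tasks together
results = dask.compute(*tasks)
```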
I also use several of the diagnostic tools Dask offers, such as the progress bar and resource profiler. Working in a large collaboration with shared computing resources, it’s great to be able to monitor how many resources I’m using and scale back or scale up accordingly.
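The progress bar, for instance, is just a context manager around a computation (the toy array below is only there to give it something to track; `dask.diagnostics` also provides `ResourceProfiler`, which requires psutil, for CPU and memory usage):

```python
import dask.array as da
from dask.diagnostics import ProgressBar

# A toy computation with several chunks so there is progress to report
x = da.ones((1000, 1000), chunks=(250, 250))

# The context manager prints a progress bar while the graph executes
with ProgressBar():
    total = x.sum().compute()
```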
Pain points of using Dask
There were two main pain points I encountered when first using Dask:
- Getting used to the idea of lazy computation. While this isn’t an issue that is specific to Dask, it was something that took time to get used to.
- Dask is a fairly large project with many components, and it took some time to figure out how all the various pieces fit together. Luckily, the user documentation for Dask is quite good and I was able to get over this initial learning curve.
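The lazy-computation idea from the first point fits in a few lines: building a Dask expression only records a task graph, and nothing actually runs until `compute()` is called.

```python
import dask.array as da

# Building the expression only records a task graph; nothing runs yet
x = da.arange(10, chunks=5)
y = (x * 2).sum()      # y is a lazy dask object, not a number

# Execution happens only when compute() is called
value = y.compute()
```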
Technology that we use around Dask
We store our data in HDF files, which Dask has nice read and write support for. We also use several other Python data stack tools like Jupyter, scikit-learn, matplotlib, seaborn, etc. Recently, we’ve started experimenting with using HTCondor and the Dask distributed scheduler to scale up to hundreds of workers on a cluster.
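A rough sketch of the distributed-scheduler side: here a local, in-process cluster stands in for the real HTCondor-backed deployment, where one would instead point `Client` at the scheduler's address (the `tcp://...` address in the comment is hypothetical).

```python
from dask.distributed import Client

# A local, in-process cluster stands in for the HTCondor-backed one;
# on the real cluster, Client would be given the scheduler address,
# e.g. Client("tcp://scheduler:8786") (hypothetical address).
client = Client(processes=False)

# Work submitted through the client is distributed across the workers
future = client.submit(sum, range(10))
result = future.result()
client.close()
```

The same code then scales from a laptop to hundreds of cluster workers just by changing what the `Client` connects to.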