Pangeo: Earth Science
Who Am I?
I am Ryan Abernathey, a physical oceanographer and professor at Columbia University / Lamont-Doherty Earth Observatory.
I am a founding member of the Pangeo Project, an initiative aimed at coordinating and supporting the development of open source software for the analysis of very large geoscientific datasets such as satellite observations or climate simulation outputs. Pangeo is funded by National Science Foundation Grant 1740648, of which I am the principal investigator.
What Problem are We Trying to Solve?
Many oceanographic and atmospheric science datasets consist of multi-dimensional arrays of numerical data, such as temperature sampled on a regular latitude, longitude, depth, time grid. These can be real data, observed by instruments like weather balloons, satellites, or other sensors; or they can be “virtual” data, produced by simulations. Scientists in these fields perform an extremely wide range of different analyses on these datasets. For example:
- simple statistics like mean and standard deviation
- principal component analysis of spatio-temporal variability
- intercomparison of datasets with different spatio-temporal sampling
- spectral analysis (Fourier transforms) over various space and time dimensions
- budget diagnostics (e.g. calculating terms in the equation for heat conservation)
- machine learning for pattern recognition and prediction
Scientists like to work interactively and iteratively, trying out calculations, visualizing the results, and tweaking their code until they eventually settle on a result that is worthy of publication.
The traditional workflow is to download datasets to a personal laptop or workstation and perform all analysis there. As sensor technology and computer power continue to develop, the volume of our datasets is growing exponentially. This workflow is not feasible or efficient with multi-terabyte datasets, and it is impossible with petabyte-scale datasets. The fundamental problem we are trying to solve in Pangeo is: how do we maintain the ability to perform rapid, interactive analysis in the face of extremely large datasets? Dask is an essential part of our solution.
How Dask Helps
Our large multi-dimensional arrays map very well to Dask’s array model. Our users tend to interact with Dask via Xarray, which adds additional label-aware operations and group-by / resample capabilities. The Xarray data model is explicitly inspired by the Common Data Model format widely used in geosciences. Xarray has incorporated Dask from very early in its development, leading to close integration between these packages.
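To make this concrete, here is a minimal sketch of the lazy, label-aware workflow (the file name, variable name, and chunk size are hypothetical, not taken from a real Pangeo deployment):

```python
import xarray as xr

# Opening with a `chunks` argument backs each variable with a chunked
# dask array rather than loading it into memory; "sst.nc" is hypothetical.
ds = xr.open_dataset("sst.nc", chunks={"time": 100})

# Label-aware, dask-backed operations: this builds a task graph
# but does not compute anything yet.
monthly_mean = ds["sst"].groupby("time.month").mean(dim="time")

# Trigger the parallel computation.
result = monthly_mean.compute()
```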
Pangeo provides configurations for deploying Jupyter, Xarray and Dask on high-performance computing clusters and cloud platforms. On these platforms, our users load data lazily using Xarray from a variety of different storage formats and perform analysis inside Jupyter notebooks. Working closely with the Dask development team, we have tried to simplify the process of launching Dask clusters interactively by using packages such as dask-kubernetes and dask-jobqueue. Users employ those packages to interactively launch their own Dask clusters across many nodes of the compute system. Dask then automatically parallelizes the Xarray-based computations without users having to write much specialized parallel code. Users appreciate the Dask dashboard, which provides a visual indication of the progress and efficiency of their ongoing analysis. When everything is working well, Dask is largely transparent to the user.
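On an HPC system, launching a cluster from a notebook looks roughly like this (a hedged sketch using dask-jobqueue; the queue name and resource numbers are illustrative, not a recommended configuration):

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Resources for each worker job; these values are purely illustrative
# and depend on the HPC system's batch scheduler configuration.
cluster = PBSCluster(cores=36, memory="100GB", queue="regular")
cluster.scale(10)  # ask the batch system for ten workers

# Connecting a client routes subsequent xarray/dask computations to the
# cluster and exposes the dashboard link.
client = Client(cluster)
```

dask-kubernetes provides an analogous workflow on cloud platforms.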
Why We Chose Dask Originally
Pangeo emerged from the Xarray development group, so Dask was a natural choice. Beyond this, Dask’s flexibility is a good fit for our applications; as described above, scientists in this domain perform a huge range of different types of analysis. We need a parallel computing engine which does not strongly constrain the type of computations that can be performed nor require the user to engage with the details of parallelization.
Pain Points
Dask’s flexibility comes with some overhead. I have the impression that the size of the graphs our users generate, which can easily exceed a million tasks, is pushing the limits of the Dask scheduler. It is not uncommon for the scheduler to crash, or to take an uncomfortably long time to process, when these graphs are submitted. Our workaround is mostly to fall back on the sort of loop-based iteration over large datasets that we had to do pre-Dask. All of this undermines the interactive experience we are trying to achieve.
However, the first year of this project has made me optimistic about the future. I think the interaction between Pangeo users and Dask developers has been pretty successful. Our use cases have helped identify several performance bottlenecks that have been fixed at the Dask level. If this trend can continue, I’m confident we will be able to reach our desired scale (petabytes) and speed.
A broader issue relates to the onboarding of new users. While I said above that Dask operates transparently to the users, this is not always the case. Users accustomed to writing loop-based code to process datasets have to be retrained around the delayed-evaluation paradigm. It can be a challenge to translate legacy code into a Dask-friendly format. Some sort of “cheat sheet” might help with this.
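As a hedged illustration of the retraining involved (file and variable names are hypothetical), compare an eager loop with its lazy equivalent:

```python
import xarray as xr

files = ["sst_2000.nc", "sst_2001.nc"]  # hypothetical files

# Eager, loop-based style: each file is read and reduced immediately.
means = [xr.open_dataset(f)["sst"].mean() for f in files]

# Lazy, dask-friendly style: build one graph over the whole dataset
# and evaluate it in parallel with a single compute() call.
ds = xr.open_mfdataset(files, chunks={"time": 100})
mean = ds["sst"].mean().compute()
```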
Technology around Dask
Xarray is the main way we interact with Dask. We use the dask-jobqueue and dask-kubernetes projects heavily.
We also use Zarr extensively for storage, especially on the cloud, where we also employ gcsfs and s3fs to interface with cloud storage.
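For example, reading a cloud-hosted Zarr store might look like this (a sketch; the bucket path is hypothetical and anonymous access is assumed):

```python
import gcsfs
import xarray as xr

# Map a Google Cloud Storage path to a key-value store interface.
# "pangeo-bucket/dataset.zarr" is a hypothetical path.
fs = gcsfs.GCSFileSystem(token="anon")
store = fs.get_mapper("pangeo-bucket/dataset.zarr")

# Open the Zarr store lazily as a dask-backed xarray dataset.
ds = xr.open_zarr(store)
```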