Overview: Why xarray?
What labels enable
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called“tensors”) are an essential part of computational science.They are encountered in a wide range of fields, including physics, astronomy,geoscience, bioinformatics, engineering, finance, and deep learning.In Python, NumPy provides the fundamental data structure and API forworking with raw ND arrays.However, real-world datasets are usually more than just raw numbers;they have labels which encode information about how the array values mapto locations in space, time, etc.
Xarray doesn’t just keep track of labels on arrays – it uses them to provide apowerful and concise interface. For example:
Apply operations over dimensions by name:
x.sum('time')
.Select values by label instead of integer location:
x.loc['2014-01-01']
orx.sel(time='2014-01-01')
.Mathematical operations (e.g.,
x - y
) vectorize across multipledimensions (array broadcasting) based on dimension names, not shape.Flexible split-apply-combine operations with groupby:
x.groupby('time.dayofyear').mean()
.Database like alignment based on coordinate labels that smoothlyhandles missing values:
x, y = xr.align(x, y, join='outer')
.Keep track of arbitrary metadata in the form of a Python dictionary:
x.attrs
.
The N-dimensional nature of xarray’s data structures makes it suitable for dealingwith multi-dimensional scientific data, and its use of dimension namesinstead of axis labels (dim='time'
instead of axis=0
) makes sucharrays much more manageable than the raw numpy ndarray: with xarray, you don’tneed to keep track of the order of arrays dimensions or insert dummy dimensions(e.g., np.newaxis
) to align arrays.
The immediate payoff of using xarray is that you’ll write less code. Thelong-term payoff is that you’ll understand what you were thinking when you comeback to look at it weeks or months later.
Core data structures
xarray has two core data structures, which build upon and extend the corestrengths of NumPy and pandas. Both are fundamentally N-dimensional:
DataArray
is our implementation of a labeled, N-dimensionalarray. It is an N-D generalization of apandas.Series
. The nameDataArray
itself is borrowed from Fernando Perez’s datarray project,which prototyped a similar data structure.Dataset
is a multi-dimensional, in-memory array database.It is a dict-like container ofDataArray
objects aligned along any number ofshared dimensions, and serves a similar purpose in xarray to thepandas.DataFrame
.
The value of attaching labels to numpy’s numpy.ndarray
may befairly obvious, but the dataset may need more motivation.
The power of the dataset over a plain dictionary is that, in addition topulling out arrays by name, it is possible to select or combine data along adimension across all arrays simultaneously. Like aDataFrame
, datasets facilitate array operations withheterogeneous data – the difference is that the arrays in a dataset can notonly have different data types, but can also have different numbers ofdimensions.
This data model is borrowed from the netCDF file format, which also providesxarray with a natural and portable serialization format. NetCDF is very popularin the geosciences, and there are existing libraries for reading and writingnetCDF in many programming languages, including Python.
xarray distinguishes itself from many tools for working with netCDF datain-so-far as it provides data structures for in-memory analytics that bothutilize and preserve labels. You only need to do the tedious work of addingmetadata once, not every time you save a file.
Goals and aspirations
Xarray contributes domain-agnostic data-structures and tools for labeledmulti-dimensional arrays to Python’s SciPy ecosystem for numerical computing.In particular, xarray builds upon and integrates with NumPy and pandas:
Our user-facing interfaces aim to be more explicit verisons of those found inNumPy/pandas.
Compatibility with the broader ecosystem is a major goal: it should be easyto get your data in and out.
We try to keep a tight focus on functionality and interfaces related tolabeled data, and leverage other Python libraries for everything else, e.g.,NumPy/pandas for fast arrays/indexing (xarray itself contains no compiledcode), Dask for parallel computing, matplotlib for plotting, etc.
Xarray is a collaborative and community driven project, run entirely onvolunteer effort (see Contributing to xarray).Our target audience is anyone who needs N-dimensional labeled arrays in Python.Originally, development was driven by the data analysis needs of physicalscientists (especially geoscientists who already know and lovenetCDF), but it has become a much more broadly useful tool, and is stillunder active development.See our technical Development roadmap for more details, and feel free to reach outwith questions about whether xarray is the right tool for your needs.