Quick overview
Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.
To begin, import numpy, pandas and xarray using their customary abbreviations:
- In [1]: import numpy as np
- In [2]: import pandas as pd
- In [3]: import xarray as xr
Create a DataArray
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
- In [4]: data = xr.DataArray(np.random.randn(2, 3),
- ...: dims=('x', 'y'),
- ...: coords={'x': [10, 20]})
- ...:
- In [5]: data
- Out[5]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
In this case, we have generated a 2D array, assigned the names x and y to the two dimensions respectively, and associated two coordinate labels ‘10’ and ‘20’ with the two locations along the x dimension. If you supply a pandas Series or DataFrame, metadata is copied directly:
- In [6]: xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
- Out[6]:
- <xarray.DataArray 'foo' (dim_0: 3)>
- array([0, 1, 2])
- Coordinates:
- * dim_0 (dim_0) object 'a' 'b' 'c'
Here are the key properties for a DataArray:
- # like in pandas, values is a numpy array that you can modify in-place
- In [7]: data.values
- Out[7]:
- array([[-1.04 , 0.272, -0.425],
- [ 0.567, 0.276, -1.087]])
- In [8]: data.dims
- Out[8]: ('x', 'y')
- In [9]: data.coords
- Out[9]:
- Coordinates:
- * x (x) int64 10 20
- # you can use this dictionary to store arbitrary metadata
- In [10]: data.attrs
- Out[10]: OrderedDict()
Indexing
xarray supports four kinds of indexing. Since we have assigned coordinate labels to the x dimension, we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result, but at varying levels of convenience and intuitiveness.
- # positional and by integer label, like numpy
- In [11]: data[[0, 1]]
- Out[11]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- # positional and by coordinate label, like pandas
- In [12]: data.loc[10:20]
- Out[12]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- # by dimension name and integer label
- In [13]: data.isel(x=slice(2))
- Out[13]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- # by dimension name and coordinate label
- In [14]: data.sel(x=[10, 20])
- Out[14]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
Unlike positional indexing, label-based indexing frees us from having to know how our array is organized. All we need to know are the dimension name and the label we wish to index, i.e. data.sel(x=10) works regardless of whether x is the first or second dimension of the array and regardless of whether 10 is the first or second element of x. We have already told xarray that x is the first dimension when we created data: xarray keeps track of this so we don’t have to. For more, see Indexing and selecting data.
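As a small illustration of the point above, the sketch below builds a stand-in array (the names arr and flipped are made up for this example, and the values are deterministic rather than random) and selects the same label before and after transposing. Positional indexing has to change with the axis order; the label-based call does not.

```python
import numpy as np
import xarray as xr

# Stand-in for `data`: same dims and x labels, but deterministic values.
arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)

# Same data with the dimension order swapped to (y, x).
flipped = arr.transpose("y", "x")

# Label-based selection is written identically for both layouts.
row_a = arr.sel(x=10)
row_b = flipped.sel(x=10)

# Positional selection must track which axis x currently occupies.
pos_a = arr[0]         # x is the first axis here
pos_b = flipped[:, 0]  # x is the second axis here

print(row_a.values, row_b.values, pos_a.values, pos_b.values)
```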
Attributes
While you’re setting up your DataArray, it’s often a good idea to set metadata attributes. A useful choice is to set data.attrs['long_name'] and data.attrs['units'], since xarray will use these, if present, to automatically label your plots. These special names were chosen following the NetCDF Climate and Forecast (CF) Metadata Conventions. attrs is just a Python dictionary, so you can assign anything you wish.
- In [15]: data.attrs['long_name'] = 'random velocity'
- In [16]: data.attrs['units'] = 'metres/sec'
- In [17]: data.attrs['description'] = 'A random variable created as an example.'
- In [18]: data.attrs['random_attribute'] = 123
- In [19]: data.attrs
- Out[19]:
- OrderedDict([('long_name', 'random velocity'),
- ('units', 'metres/sec'),
- ('description', 'A random variable created as an example.'),
- ('random_attribute', 123)])
- # you can add metadata to coordinates too
- In [20]: data.x.attrs['units'] = 'x units'
Computation
Data arrays work very similarly to numpy ndarrays:
- In [21]: data + 10
- Out[21]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[ 8.960425, 10.27186 , 9.575028],
- [10.56702 , 10.276232, 8.912599]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- In [22]: np.sin(data)
- Out[22]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-0.862189, 0.268523, -0.412296],
- [ 0.537121, 0.272732, -0.885422]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- # transpose
- In [23]: data.T
- Out[23]:
- <xarray.DataArray (y: 3, x: 2)>
- array([[-1.039575, 0.56702 ],
- [ 0.27186 , 0.276232],
- [-0.424972, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- Attributes:
- long_name: random velocity
- units: metres/sec
- description: A random variable created as an example.
- random_attribute: 123
- In [24]: data.sum()
- Out[24]:
- <xarray.DataArray ()>
- array(-1.436836)
However, aggregation operations can use dimension names instead of axis numbers:
- In [25]: data.mean(dim='x')
- Out[25]:
- <xarray.DataArray (y: 3)>
- array([-0.236277, 0.274046, -0.756187])
- Dimensions without coordinates: y
Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:
- In [26]: a = xr.DataArray(np.random.randn(3), [data.coords['y']])
- In [27]: b = xr.DataArray(np.random.randn(4), dims='z')
- In [28]: a
- Out[28]:
- <xarray.DataArray (y: 3)>
- array([-0.67369 , 0.113648, -1.478427])
- Coordinates:
- * y (y) int64 0 1 2
- In [29]: b
- Out[29]:
- <xarray.DataArray (z: 4)>
- array([ 0.524988, 0.404705, 0.577046, -1.715002])
- Dimensions without coordinates: z
- In [30]: a + b
- Out[30]:
- <xarray.DataArray (y: 3, z: 4)>
- array([[-0.148702, -0.268984, -0.096644, -2.388692],
- [ 0.638636, 0.518354, 0.690694, -1.601354],
- [-0.953439, -1.073721, -0.901381, -3.193429]])
- Coordinates:
- * y (y) int64 0 1 2
- Dimensions without coordinates: z
It also means that in most cases you do not need to worry about the order of dimensions:
- In [31]: data - data.T
- Out[31]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[0., 0., 0.],
- [0., 0., 0.]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
Operations also align based on index labels:
- In [32]: data[:-1] - data[:1]
- Out[32]:
- <xarray.DataArray (x: 1, y: 3)>
- array([[0., 0., 0.]])
- Coordinates:
- * x (x) int64 10
- Dimensions without coordinates: y
For more, see Computation.
GroupBy
xarray supports grouped operations using a very similar API to pandas (see GroupBy: split-apply-combine):
- In [33]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
- In [34]: labels
- Out[34]:
- <xarray.DataArray 'labels' (y: 3)>
- array(['E', 'F', 'E'], dtype='<U1')
- Coordinates:
- * y (y) int64 0 1 2
- In [35]: data.groupby(labels).mean('y')
- Out[35]:
- <xarray.DataArray (x: 2, labels: 2)>
- array([[-0.732274, 0.27186 ],
- [-0.26019 , 0.276232]])
- Coordinates:
- * x (x) int64 10 20
- * labels (labels) object 'E' 'F'
- In [36]: data.groupby(labels).apply(lambda x: x - x.min())
- Out[36]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[0.047826, 0. , 0.662428],
- [1.654421, 0.004372, 0. ]])
- Coordinates:
- * x (x) int64 10 20
- * y (y) int64 0 1 2
- labels (y) <U1 'E' 'F' 'E'
Plotting
Visualizing your datasets is quick and convenient:
- In [37]: data.plot()
- Out[37]: <matplotlib.collections.QuadMesh at 0x7f33f267b710>
Note the automatic labeling with names and units. Our effort in adding metadata attributes has paid off! Many aspects of these figures are customizable: see Plotting.
pandas
Xarray objects can be easily converted to and from pandas objects using the to_series(), to_dataframe() and to_xarray() methods:
- In [38]: series = data.to_series()
- In [39]: series
- Out[39]:
- x y
- 10 0 -1.039575
- 1 0.271860
- 2 -0.424972
- 20 0 0.567020
- 1 0.276232
- 2 -1.087401
- dtype: float64
- # convert back
- In [40]: series.to_xarray()
- Out[40]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- * y (y) int64 0 1 2
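The session above only exercises to_series(). A comparable round trip through to_dataframe() might look like the sketch below. It uses a stand-in array (arr) and an illustrative column name ("foo"); the name argument is passed because the array itself is unnamed. Note that calling to_xarray() on a DataFrame gives back a Dataset, with one variable per column.

```python
import numpy as np
import xarray as xr

# Stand-in array with labels on both dimensions.
arr = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [0, 1, 2]},
)

# The array has no name, so supply one for the DataFrame column.
df = arr.to_dataframe(name="foo")

# Converting back yields a Dataset holding a 'foo' variable.
roundtrip = df.to_xarray()

print(df)
print(roundtrip)
```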
Datasets
xarray.Dataset is a dict-like container of aligned DataArray objects. You can think of it as a multi-dimensional generalization of the pandas.DataFrame:
- In [41]: ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})
- In [42]: ds
- Out[42]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 3)
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 -1.04 0.2719 -0.425 0.567 0.2762 -1.087
- bar (x) int64 1 2
- baz float64 3.142
This creates a dataset with three DataArrays named foo, bar and baz. Use dictionary or dot indexing to pull out Dataset variables as DataArray objects, but note that assignment only works with dictionary indexing:
- In [43]: ds['foo']
- Out[43]:
- <xarray.DataArray 'foo' (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- Attributes:
- long_name: random velocity
- units: metres/sec
- description: A random variable created as an example.
- random_attribute: 123
- In [44]: ds.foo
- Out[44]:
- <xarray.DataArray 'foo' (x: 2, y: 3)>
- array([[-1.039575, 0.27186 , -0.424972],
- [ 0.56702 , 0.276232, -1.087401]])
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- Attributes:
- long_name: random velocity
- units: metres/sec
- description: A random variable created as an example.
- random_attribute: 123
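To illustrate the remark about assignment, the sketch below adds a variable with dictionary-style indexing and then attempts the same with dot notation. The dataset and the variable names (doubled, tripled) are made up for this example. The try/except reflects an assumption about recent xarray releases, which are expected to raise AttributeError here; either way, the dot-style attempt does not create a data variable.

```python
import numpy as np
import xarray as xr

# Small stand-in dataset with one variable.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.zeros((2, 3)))},
    coords={"x": [10, 20]},
)

# Dictionary-style assignment adds a new data variable.
ds["doubled"] = ds["foo"] * 2 + 1

# Dot-style assignment is not a supported way to add variables.
# Assumption: recent xarray versions raise AttributeError here.
try:
    ds.tripled = ds["foo"] * 3
except AttributeError as err:
    print("attribute assignment rejected:", err)

print(list(ds.data_vars))
```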
When creating ds, we specified that foo is identical to data created earlier, bar is one-dimensional with single dimension x and associated values ‘1’ and ‘2’, and baz is a scalar not associated with any dimension in ds. Variables in datasets can have different dtype and even different dimensions, but all dimensions are assumed to refer to points in the same shared coordinate system, i.e. if two variables have dimension x, that dimension must be identical in both variables.
For example, when creating ds, xarray automatically aligns bar with DataArray foo, i.e., they share the same coordinate system so that ds.bar['x'] == ds.foo['x'] == ds['x']. Consequently, the following works without explicitly specifying the coordinate x when creating ds['bar']:
- In [45]: ds.bar.sel(x=10)
- Out[45]:
- <xarray.DataArray 'bar' ()>
- array(1)
- Coordinates:
- x int64 10
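The shared-coordinate rule can also be seen from the other side. The sketch below, using a made-up dataset rather than the ds from the session, first repeats the label lookup on a variable that was given no coordinates of its own, and then tries to build a dataset in which two variables disagree on the length of x. The expectation that xarray rejects this with a ValueError is stated here as an assumption about its behaviour, not quoted from the text above.

```python
import xarray as xr

# 'bar' is declared only with a dimension name; it picks up the
# dataset-level x coordinate.
ds = xr.Dataset(
    {
        "foo": (("x", "y"), [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
        "bar": ("x", [1, 2]),
    },
    coords={"x": [10, 20]},
)
bar_at_10 = int(ds["bar"].sel(x=10).values)

# Two variables that disagree on the size of x cannot share a dataset.
try:
    xr.Dataset({"foo": ("x", [1, 2]), "bar": ("x", [1, 2, 3])})
    sizes_conflict = False
except ValueError as err:
    sizes_conflict = True
    print("rejected:", err)

print(bar_at_10, sizes_conflict)
```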
You can do almost everything you can do with DataArray objects with Dataset objects (including indexing and arithmetic) if you prefer to work with multiple variables at once.
Read & write netCDF files
NetCDF is the recommended file format for xarray objects. Users from the geosciences will recognize that the Dataset data model looks very similar to a netCDF file (which, in fact, inspired it).
You can directly read and write xarray objects to disk using to_netcdf(), open_dataset() and open_dataarray():
- In [46]: ds.to_netcdf('example.nc')
- In [47]: xr.open_dataset('example.nc')
- Out[47]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 3)
- Coordinates:
- * x (x) int64 10 20
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 ...
- bar (x) int64 ...
- baz float64 ...
It is common for datasets to be distributed across multiple files (commonly one file per timestep). xarray supports this use case by providing the open_mfdataset() and save_mfdataset() functions. For more, see Reading and writing files.