GroupBy: split-apply-combine
xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
Split your data into multiple independent groups.
Apply some function to each group.
Combine your groups back into a single data object.
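The three steps above can be sketched by hand and compared with the groupby shortcut. This is an illustrative sketch only: the dataset values and the names pieces, means and manual are assumptions chosen for easy checking, not part of the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small deterministic dataset (illustrative values, not the random data used later).
ds = xr.Dataset(
    {"foo": (("x", "y"), np.array([[1.0, 2.0, 3.0],
                                   [10.0, 20.0, 30.0],
                                   [30.0, 40.0, 50.0],
                                   [3.0, 4.0, 5.0]]))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

labels = ["a", "b"]

# Split: pick the positions along x that carry each label.
pieces = [
    ds["foo"].isel(x=np.flatnonzero(ds["letters"].values == label))
    for label in labels
]

# Apply: reduce each piece independently.
means = [piece.mean("x") for piece in pieces]

# Combine: join the per-group results along a new "letters" dimension.
manual = xr.concat(means, dim=pd.Index(labels, name="letters"))

# The same pipeline expressed with groupby.
grouped = ds["foo"].groupby("letters").mean("x")
```

Both manual and grouped hold one row of per-column means for each label.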
Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.
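For one-dimensional data, the round trip through pandas might look like the following sketch. The array values and the names series, by_letter and result are illustrative assumptions.

```python
import pandas as pd
import xarray as xr

da = xr.DataArray(
    [1.0, 2.0, 4.0, 8.0],
    dims="x",
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

# Convert to a pandas Series (indexed by x), group by the label values,
# and aggregate with pandas.
series = da.to_series()
by_letter = series.groupby(da["letters"].values).mean()
by_letter.index.name = "letters"

# Convert the result back to a DataArray indexed by "letters".
result = by_letter.to_xarray()
```

Whether this is actually faster depends on the data size, so it is worth timing both routes on your own data.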
Split
Let’s create a simple example dataset:
- In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
- ...: coords={'x': [10, 20, 30, 40],
- ...: 'letters': ('x', list('abba'))})
- ...:
- In [2]: arr = ds['foo']
- In [3]: ds
- Out[3]:
- <xarray.Dataset>
- Dimensions: (x: 4, y: 3)
- Coordinates:
- * x (x) int64 10 20 30 40
- letters (x) <U1 'a' 'b' 'b' 'a'
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 0.127 0.9667 0.2605 0.8972 ... 0.543 0.373 0.448
If we group by the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:
- In [4]: ds.groupby('letters')
- Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7f3425a45a20>
This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:
- In [5]: ds.groupby('letters').groups
- Out[5]: {'a': [0, 3], 'b': [1, 2]}
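Because each entry of groups holds integer positions along the grouped dimension, it can be passed to isel() to pull out a single group by hand. A minimal sketch, using illustrative data rather than the random values above:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(12.0).reshape(4, 3))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Mapping from each label to the integer positions of its members along x.
groups = ds.groupby("letters").groups

# Select the members of group "a" by position.
group_a = ds.isel(x=groups["a"])
```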
You can also iterate over groups in (label, group) pairs:
- In [6]: list(ds.groupby('letters'))
- Out[6]:
- [('a', <xarray.Dataset>
- Dimensions: (x: 2, y: 3)
- Coordinates:
- * x (x) int64 10 40
- letters (x) <U1 'a' 'a'
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
- ('b', <xarray.Dataset>
- Dimensions: (x: 2, y: 3)
- Coordinates:
- * x (x) int64 20 30
- letters (x) <U1 'b' 'b'
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Binning
Sometimes you don’t want to use all the unique values to determine the groups but instead want to “bin” the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the groupby_bins() method.
- In [7]: x_bins = [0,25,50]
- In [8]: ds.groupby_bins('x', x_bins).groups
- Out[8]:
- {Interval(0, 25, closed='right'): [0, 1],
- Interval(25, 50, closed='right'): [2, 3]}
The binning is implemented via pandas.cut, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with pandas Interval objects that precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:
- In [9]: x_bin_labels = [12.5,37.5]
- In [10]: ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
- Out[10]: {12.5: [0, 1], 37.5: [2, 3]}
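To see how pandas.cut assigns values to these bins, it can be called directly on the coordinate values. The sketch below uses the same x values and bin edges as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([10, 20, 30, 40])
x_bins = [0, 25, 50]

# Default behaviour: right-closed intervals (0, 25] and (25, 50].
default_bins = pd.cut(x, x_bins)
codes = default_bins.codes.tolist()  # bin index of each value

# Explicit labels replace the Interval objects.
labeled = pd.cut(x, x_bins, labels=[12.5, 37.5])

# Edge values: the lowest edge is excluded by default (code -1 means "no bin"),
# while each right edge belongs to the bin it closes.
edge_codes = pd.cut(np.array([0, 25, 50]), x_bins).codes.tolist()
```

Values that fall outside every bin, such as 0 here, are not assigned to any group.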
Apply
To apply a function to each group, you can use the flexible apply() method. The resulting objects are automatically concatenated back together along the group axis:
- In [11]: def standardize(x):
- ....:     return (x - x.mean()) / x.std()
- ....:
- In [12]: arr.groupby('letters').apply(standardize)
- Out[12]:
- <xarray.DataArray 'foo' (x: 4, y: 3)>
- array([[-1.229778, 1.93741 , -0.726247],
- [ 1.419796, -0.460192, -0.606579],
- [-0.190642, 1.21398 , -1.376362],
- [ 0.339417, -0.301806, -0.018995]])
- Coordinates:
- * x (x) int64 10 20 30 40
- letters (x) <U1 'a' 'b' 'b' 'a'
- Dimensions without coordinates: y
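Conceptually, applying a function per group amounts to iterating over the groups, calling the function on each, and concatenating the results. The following sketch spells that out by hand; the seeded random data and the names pieces and standardized are illustrative assumptions, and the real method also handles details such as restoring the original order for you.

```python
import numpy as np
import xarray as xr

rng = np.random.RandomState(0)  # seeded so the example is reproducible
arr = xr.DataArray(
    rng.rand(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

def standardize(x):
    # Zero mean and unit standard deviation over all values in the group.
    return (x - x.mean()) / x.std()

# Apply the function to every group, then put the pieces back together.
pieces = [standardize(group) for _, group in arr.groupby("letters")]
standardized = xr.concat(pieces, dim="x").sortby("x")
```

After this, the values belonging to each letter have mean 0 and standard deviation 1.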
GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:
- In [13]: arr.groupby('letters').mean(dim='x')
- Out[13]:
- <xarray.DataArray 'foo' (letters: 2, y: 3)>
- array([[0.334998, 0.669865, 0.354236],
- [0.674306, 0.608502, 0.229662]])
- Coordinates:
- * letters (letters) object 'a' 'b'
- Dimensions without coordinates: y
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
- In [14]: ds.groupby('x').std(xr.ALL_DIMS)
- Out[14]:
- <xarray.Dataset>
- Dimensions: (x: 4)
- Coordinates:
- * x (x) int64 10 20 30 40
- letters (x) <U1 'a' 'b' 'b' 'a'
- Data variables:
- foo (x) float64 0.3684 0.2554 0.2931 0.06957
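The reduce() method mentioned earlier accepts any function that takes an array and an axis argument, which is useful when no named shortcut exists. A small sketch with illustrative values, using np.mean to mirror the mean() shortcut and np.ptp (the max-minus-min range) as an example of a function without one:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0],
              [30.0, 40.0, 50.0],
              [3.0, 4.0, 5.0]]),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

grouped = arr.groupby("letters")

# reduce() with np.mean gives the same result as the mean() shortcut.
via_reduce = grouped.reduce(np.mean, dim="x")
via_mean = grouped.mean(dim="x")

# Any reduction that accepts an axis argument can be used the same way.
value_range = grouped.reduce(np.ptp, dim="x")
```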
First and last
There are two special aggregation operations that are currently only found on groupby objects: first and last. These return the first or last value in each group along the grouped dimension:
- In [15]: ds.groupby('letters').first(xr.ALL_DIMS)
- Out[15]:
- <xarray.Dataset>
- Dimensions: (letters: 2, y: 3)
- Coordinates:
- * letters (letters) object 'a' 'b'
- Dimensions without coordinates: y
- Data variables:
- foo (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362
By default, they skip missing values (control this with skipna).
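The effect of skipna is easiest to see on a group whose first element is missing. In this sketch (illustrative data; the names first_valid and first_raw are assumptions), group "a" starts with a NaN:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    [np.nan, 2.0, 3.0, 4.0],
    dims="x",
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Default: the NaN at the start of group "a" is skipped.
first_valid = da.groupby("letters").first()

# skipna=False returns the first element of each group as-is.
first_raw = da.groupby("letters").first(skipna=False)
```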
Grouped arithmetic
GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:
- In [16]: alt = arr.groupby('letters').mean(xr.ALL_DIMS)
- In [17]: alt
- Out[17]:
- <xarray.DataArray 'foo' (letters: 2)>
- array([0.453033, 0.504157])
- Coordinates:
- * letters (letters) object 'a' 'b'
- In [18]: ds.groupby('letters') - alt
- Out[18]:
- <xarray.Dataset>
- Dimensions: (x: 4, y: 3)
- Coordinates:
- * x (x) int64 10 20 30 40
- letters (x) <U1 'a' 'b' 'b' 'a'
- Dimensions without coordinates: y
- Data variables:
- foo (x, y) float64 -0.3261 0.5137 -0.1926 ... -0.08002 -0.005036
This last line is roughly equivalent to the following:
- results = []
- for label, group in ds.groupby('letters'):
-     # alt is indexed by 'letters', so select each group's mean by its label
-     results.append(group - alt.sel(letters=label))
- xr.concat(results, dim='x')
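The equivalence can be checked on a small deterministic dataset. This is a sketch under stated assumptions: the values are illustrative, the per-group mean is computed in two steps (over x, then over y) to avoid version-specific ways of spelling "all dimensions", and the loop result is sorted by x because concatenation returns the groups in label order.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.array([[1.0, 2.0, 3.0],
                                   [10.0, 20.0, 30.0],
                                   [30.0, 40.0, 50.0],
                                   [3.0, 4.0, 5.0]]))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)
arr = ds["foo"]

# One mean per label. Averaging over x and then y equals the overall group
# mean here because every group is a full rectangular block of values.
alt = arr.groupby("letters").mean("x").mean("y")

# Grouped arithmetic shortcut.
shortcut = ds.groupby("letters") - alt

# Explicit loop: subtract each group's own mean, selected by label.
results = []
for label, group in ds.groupby("letters"):
    results.append(group - alt.sel(letters=label))
manual = xr.concat(results, dim="x").sortby("x")
```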
Squeezing
When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:
- In [19]: next(iter(arr.groupby('x')))
- Out[19]:
- (10, <xarray.DataArray 'foo' (y: 3)>
- array([0.12697 , 0.966718, 0.260476])
- Coordinates:
- x int64 10
- letters <U1 'a'
- Dimensions without coordinates: y)
- In [20]: next(iter(arr.groupby('x', squeeze=False)))
- Out[20]:
- (10, <xarray.DataArray 'foo' (x: 1, y: 3)>
- array([[0.12697 , 0.966718, 0.260476]])
- Coordinates:
- * x (x) int64 10
- letters (x) <U1 'a'
- Dimensions without coordinates: y)
Although xarray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.
You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
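As a minimal sketch of squeezing explicitly, the example below builds a length-one slice by hand instead of taking it from a groupby; the data and the names kept and squeezed are illustrative.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40]},
)

# Selecting with a list keeps x as a length-one dimension.
kept = arr.isel(x=[0])

# squeeze() drops that dimension; x survives as a scalar coordinate.
squeezed = kept.squeeze("x")
```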
Multidimensional Grouping
Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. Xarray supports groupby operations over multidimensional coordinate variables:
- In [21]: da = xr.DataArray([[0,1],[2,3]],
- ....: coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
- ....: 'lat': (['ny','nx'], [[10,10],[20,20]] ),},
- ....: dims=['ny','nx'])
- ....:
- In [22]: da
- Out[22]:
- <xarray.DataArray (ny: 2, nx: 2)>
- array([[0, 1],
- [2, 3]])
- Coordinates:
- lon (ny, nx) int64 30 40 40 50
- lat (ny, nx) int64 10 10 20 20
- Dimensions without coordinates: ny, nx
- In [23]: da.groupby('lon').sum(xr.ALL_DIMS)
- Out[23]:
- <xarray.DataArray (lon: 3)>
- array([0, 3, 3])
- Coordinates:
- * lon (lon) int64 30 40 50
- In [24]: da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
- Out[24]:
- <xarray.DataArray (ny: 2, nx: 2)>
- array([[ 0. , -0.5],
- [ 0.5, 0. ]])
- Coordinates:
- lon (ny, nx) int64 30 40 40 50
- lat (ny, nx) int64 10 10 20 20
- Dimensions without coordinates: ny, nx
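Grouping by a two-dimensional coordinate can be thought of as flattening the grid so that every cell becomes one sample labeled by its coordinate value. The NumPy-only sketch below reproduces the two results above under that reading; it illustrates the idea and is not a description of how xarray implements it internally.

```python
import numpy as np

data = np.array([[0, 1], [2, 3]])
lon = np.array([[30, 40], [40, 50]])

# Flatten both arrays so every grid cell becomes one sample.
flat_data = data.ravel()
flat_lon = lon.ravel()

# Unique longitudes are the group labels; `inverse` maps each cell to its label.
labels, inverse = np.unique(flat_lon, return_inverse=True)

# Per-group sum, as in groupby('lon').sum(...).
sums = np.bincount(inverse, weights=flat_data)

# Per-group anomaly, as in the x - x.mean() example, reshaped back to the grid.
counts = np.bincount(inverse)
anomalies = (flat_data - (sums / counts)[inverse]).reshape(data.shape)
```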
Because multidimensional groups can generate a very large number of bins, coarse-binning via groupby_bins() may be desirable:
- In [25]: da.groupby_bins('lon', [0,45,50]).sum()
- Out[25]:
- <xarray.DataArray (lon_bins: 2)>
- array([3, 3])
- Coordinates:
- * lon_bins (lon_bins) object (0, 45] (45, 50]
These methods group by lon values. It is also possible to group by each cell in a grid, regardless of value, by stacking multiple dimensions, applying your function, and then unstacking the result:
- In [26]: stacked = da.stack(gridcell=['ny', 'nx'])
- In [27]: stacked.groupby('gridcell').sum().unstack('gridcell')
- Out[27]:
- <xarray.DataArray (ny: 2, nx: 2)>
- array([[0, 1],
- [2, 3]])
- Coordinates:
- lon (ny, nx) int64 30 40 40 50
- lat (ny, nx) int64 10 10 20 20
- * ny (ny) int64 0 1
- * nx (nx) int64 0 1