Reshaping and reorganizing data
These methods allow you to reorganize your data.
Reordering dimensions
To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose():
- In [1]: ds = xr.Dataset({'foo': (('x', 'y', 'z'), [[[42]]]), 'bar': (('y', 'z'), [[24]])})
- In [2]: ds.transpose('y', 'z', 'x')
- Out[2]:
- <xarray.Dataset>
- Dimensions: (x: 1, y: 1, z: 1)
- Dimensions without coordinates: x, y, z
- Data variables:
- foo (y, z, x) int64 42
- bar (y, z) int64 24
- In [3]: ds.transpose() # reverses all dimensions
- Out[3]:
- <xarray.Dataset>
- Dimensions: (x: 1, y: 1, z: 1)
- Dimensions without coordinates: x, y, z
- Data variables:
- foo (z, y, x) int64 42
- bar (z, y) int64 24
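The same method applies to a single DataArray. The following is a minimal sketch, assuming xarray and NumPy are installed; the array name `da` and its shape are illustrative and not taken from the session above:

```python
import numpy as np
import xarray as xr

# Hypothetical 3-D array used only for illustration.
da = xr.DataArray(np.zeros((2, 3, 4)), dims=("x", "y", "z"))

# Explicit order: the data axes are permuted to match the names given.
reordered = da.transpose("z", "x", "y")
print(reordered.dims, reordered.shape)

# No arguments: the dimension order is reversed.
reversed_dims = da.transpose()
print(reversed_dims.dims, reversed_dims.shape)
```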
Expand and squeeze dimensions
To expand a DataArray or all variables on a Dataset along a new dimension, use expand_dims():
- In [4]: expanded = ds.expand_dims('w')
- In [5]: expanded
- Out[5]:
- <xarray.Dataset>
- Dimensions: (w: 1, x: 1, y: 1, z: 1)
- Dimensions without coordinates: w, x, y, z
- Data variables:
- foo (w, x, y, z) int64 42
- bar (w, y, z) int64 24
This method attaches a new dimension with size 1 to all data variables.
To remove such a size-1 dimension from the DataArray or Dataset, use squeeze():
- In [6]: expanded.squeeze('w')
- Out[6]:
- <xarray.Dataset>
- Dimensions: (x: 1, y: 1, z: 1)
- Dimensions without coordinates: x, y, z
- Data variables:
- foo (x, y, z) int64 42
- bar (y, z) int64 24
Converting between datasets and arrays
To convert from a Dataset to a DataArray, use to_array():
- In [7]: arr = ds.to_array()
- In [8]: arr
- Out[8]:
- <xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
- array([[[[42]]],
- [[[24]]]])
- Coordinates:
- * variable (variable) <U3 'foo' 'bar'
- Dimensions without coordinates: x, y, z
This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.
To convert back from a DataArray to a Dataset, use to_dataset():
- In [9]: arr.to_dataset(dim='variable')
- Out[9]:
- <xarray.Dataset>
- Dimensions: (x: 1, y: 1, z: 1)
- Dimensions without coordinates: x, y, z
- Data variables:
- foo (x, y, z) int64 42
- bar (x, y, z) int64 24
The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:
- In [10]: ds2 = xr.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})
- # the input dataset has 4 elements
- In [11]: ds2
- Out[11]:
- <xarray.Dataset>
- Dimensions: (x: 3)
- Dimensions without coordinates: x
- Data variables:
- a int64 0
- b (x) int64 3 4 5
- # the resulting array has 6 elements
- In [12]: ds2.to_array()
- Out[12]:
- <xarray.DataArray (variable: 2, x: 3)>
- array([[0, 0, 0],
- [3, 4, 5]])
- Coordinates:
- * variable (variable) <U1 'a' 'b'
- Dimensions without coordinates: x
Otherwise, the result could not be represented as an orthogonal array.
If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:
- In [13]: arr.to_dataset(name='combined')
- Out[13]:
- <xarray.Dataset>
- Dimensions: (variable: 2, x: 1, y: 1, z: 1)
- Coordinates:
- * variable (variable) <U3 'foo' 'bar'
- Dimensions without coordinates: x, y, z
- Data variables:
- combined (variable, x, y, z) int64 42 24
Stack and unstack
As part of xarray’s nascent support for pandas.MultiIndex, we have implemented stack() and unstack() methods for combining or splitting dimensions:
- In [14]: array = xr.DataArray(np.random.randn(2, 3),
- ....: coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
- ....:
- In [15]: stacked = array.stack(z=('x', 'y'))
- In [16]: stacked
- Out[16]:
- <xarray.DataArray (z: 6)>
- array([ 0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215])
- Coordinates:
- * z (z) MultiIndex
- - x (z) object 'a' 'a' 'a' 'b' 'b' 'b'
- - y (z) int64 0 1 2 0 1 2
- In [17]: stacked.unstack('z')
- Out[17]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[ 0.469112, -0.282863, -1.509059],
- [-1.135632, 1.212112, -0.173215]])
- Coordinates:
- * x (x) object 'a' 'b'
- * y (y) int64 0 1 2
These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.
Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:
- In [18]: stacked2 = stacked[::2]
- In [19]: stacked2
- Out[19]:
- <xarray.DataArray (z: 3)>
- array([ 0.469112, -1.509059, 1.212112])
- Coordinates:
- * z (z) MultiIndex
- - x (z) object 'a' 'a' 'b'
- - y (z) int64 0 2 1
- In [20]: stacked2.unstack('z')
- Out[20]:
- <xarray.DataArray (x: 2, y: 3)>
- array([[ 0.469112, nan, -1.509059],
- [ nan, 1.212112, nan]])
- Coordinates:
- * x (x) object 'a' 'b'
- * y (y) int64 0 1 2
However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:
- In [21]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=['x', 'y'])
- In [22]: array.stack(z=('x', 'y'))
- Out[22]:
- <xarray.DataArray (z: 4)>
- array([nan, 1., 2., 3.])
- Coordinates:
- * z (z) MultiIndex
- - x (z) int64 0 0 1 1
- - y (z) int64 0 1 0 1
- In [23]: array.to_pandas().stack()
- Out[23]:
- x y
- 0 1 1.0
- 1 0 2.0
- 1 3.0
- dtype: float64
We departed from pandas’s behavior here because predictable shapes for new array dimensions are necessary for Parallel computing with Dask.
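If the pandas-style result is what you need, one option is to drop the missing entries explicitly after stacking. This is a minimal sketch, assuming dropna() along the stacked dimension suits your data; the names `stacked` and `dense` are illustrative:

```python
import numpy as np
import xarray as xr

array = xr.DataArray([[np.nan, 1], [2, 3]], dims=["x", "y"])

# stack() keeps every element, including the NaN, so the length is 2 * 2 = 4.
stacked = array.stack(z=("x", "y"))

# Dropping missing values along the stacked dimension is an explicit extra step.
dense = stacked.dropna("z")
print(stacked.sizes["z"], dense.sizes["z"])
```

Keeping the drop as a separate call means the shape after stack() depends only on the input dimensions, not on the values.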
Stacking different variables together
These stacking and unstacking operations are particularly useful for reshaping xarray objects for use in machine learning packages, such as scikit-learn, that usually require two-dimensional numpy arrays as inputs. For datasets with only one variable, we only need stack and unstack, but combining multiple variables in an xarray.Dataset is more complicated. If the variables in the dataset have matching numbers of dimensions, we can call to_array() and then stack along the new coordinate. But to_array() will broadcast the data arrays together, which will effectively tile the lower dimensional variable along the missing dimensions. The method xarray.Dataset.to_stacked_array() allows combining variables of differing dimensions without this wasteful copying, while xarray.DataArray.to_unstacked_dataset() reverses this operation. Just as with xarray.Dataset.stack(), the stacked coordinate is represented by a pandas.MultiIndex object. These methods are used like this:
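A minimal sketch of such a round trip, using a small illustrative dataset. The variable names `a` and `b`, the dimension names, and the values are assumptions chosen for the example:

```python
import xarray as xr

# Illustrative dataset: "a" is 2-D, "b" is 1-D, and both share the sample dimension "x".
data = xr.Dataset(
    data_vars={"a": (("x", "y"), [[0, 1, 2], [3, 4, 5]]), "b": ("x", [6, 7])},
    coords={"y": ["u", "v", "w"]},
)

# Stack everything except "x" into a new dimension "z".
stacked = data.to_stacked_array("z", sample_dims=["x"])
print(stacked.dims, stacked.shape)

# Reverse the operation to recover one variable per original name.
unstacked = stacked.to_unstacked_dataset("z")
print(list(unstacked.data_vars))
```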
In this example, stacked is a two-dimensional array that we can easily pass to scikit-learn or another generic numerical method.
Note
Unlike with stack, in to_stacked_array the user specifies the dimensions they do not want stacked. For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling dimensions.
Set and reset index
Complementary to stack / unstack, xarray’s .set_index, .reset_index and .reorder_levels allow easy manipulation of DataArray or Dataset multi-indexes without modifying the data and its dimensions.
You can create a multi-index from several 1-dimensional variables and/or coordinates using set_index():
- In [24]: da = xr.DataArray(np.random.rand(4),
- ....: coords={'band': ('x', ['a', 'a', 'b', 'b']),
- ....: 'wavenumber': ('x', np.linspace(200, 400, 4))},
- ....: dims='x')
- ....:
- In [25]: da
- Out[25]:
- <xarray.DataArray (x: 4)>
- array([0.123102, 0.543026, 0.373012, 0.447997])
- Coordinates:
- band (x) <U1 'a' 'a' 'b' 'b'
- wavenumber (x) float64 200.0 266.7 333.3 400.0
- Dimensions without coordinates: x
- In [26]: mda = da.set_index(x=['band', 'wavenumber'])
- In [27]: mda
- Out[27]:
- <xarray.DataArray (x: 4)>
- array([0.123102, 0.543026, 0.373012, 0.447997])
- Coordinates:
- * x (x) MultiIndex
- - band (x) object 'a' 'a' 'b' 'b'
- - wavenumber (x) float64 200.0 266.7 333.3 400.0
These coordinates can now be used for indexing, e.g.,
- In [28]: mda.sel(band='a')
- Out[28]:
- <xarray.DataArray (wavenumber: 2)>
- array([0.123102, 0.543026])
- Coordinates:
- * wavenumber (wavenumber) float64 200.0 266.7
Conversely, you can use reset_index() to extract multi-index levels as coordinates (this is mainly useful for serialization):
- In [29]: mda.reset_index('x')
- Out[29]:
- <xarray.DataArray (x: 4)>
- array([0.123102, 0.543026, 0.373012, 0.447997])
- Coordinates:
- band (x) object 'a' 'a' 'b' 'b'
- wavenumber (x) float64 200.0 266.7 333.3 400.0
- Dimensions without coordinates: x
reorder_levels() allows changing the order of multi-index levels:
- In [30]: mda.reorder_levels(x=['wavenumber', 'band'])
- Out[30]:
- <xarray.DataArray (x: 4)>
- array([0.123102, 0.543026, 0.373012, 0.447997])
- Coordinates:
- * x (x) MultiIndex
- - wavenumber (x) float64 200.0 266.7 333.3 400.0
- - band (x) object 'a' 'a' 'b' 'b'
As of xarray v0.9 coordinate labels for each dimension are optional. You can also use .set_index / .reset_index to add / remove labels for one or several dimensions:
- In [31]: array = xr.DataArray([1, 2, 3], dims='x')
- In [32]: array
- Out[32]:
- <xarray.DataArray (x: 3)>
- array([1, 2, 3])
- Dimensions without coordinates: x
- In [33]: array['c'] = ('x', ['a', 'b', 'c'])
- In [34]: array.set_index(x='c')
- Out[34]:
- <xarray.DataArray (x: 3)>
- array([1, 2, 3])
- Coordinates:
- * x (x) object 'a' 'b' 'c'
- In [35]: array = array.set_index(x='c')
- In [36]: array = array.reset_index('x', drop=True)
Shift and roll
To adjust coordinate labels, you can use the shift() and roll() methods:
- In [37]: array = xr.DataArray([1, 2, 3, 4], dims='x')
- In [38]: array.shift(x=2)
- Out[38]:
- <xarray.DataArray (x: 4)>
- array([nan, nan, 1., 2.])
- Dimensions without coordinates: x
- In [39]: array.roll(x=2, roll_coords=True)
- Out[39]:
- <xarray.DataArray (x: 4)>
- array([3, 4, 1, 2])
- Dimensions without coordinates: x
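The arrays above carry no labels, so the effect on coordinates is not visible. The following minimal sketch uses an array with explicit labels on x to compare the two methods; the labels and variable names are illustrative assumptions:

```python
import xarray as xr

# Illustrative array with explicit labels on "x".
labeled = xr.DataArray([1, 2, 3, 4], coords=[("x", [10, 20, 30, 40])])

# shift() moves the data and pads with NaN; the labels stay in place.
shifted = labeled.shift(x=1)

# roll() wraps the data around; roll_coords controls whether labels move too.
rolled_with_labels = labeled.roll(x=1, roll_coords=True)
rolled_data_only = labeled.roll(x=1, roll_coords=False)

print(shifted.x.values, rolled_with_labels.x.values, rolled_data_only.x.values)
```

Passing roll_coords explicitly avoids relying on its default value.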
Sort
One may sort a DataArray or Dataset via DataArray.sortby() and Dataset.sortby(). The input can be a single 1D DataArray object or a list of them:
- In [40]: ds = xr.Dataset({'A': (('x', 'y'), [[1, 2], [3, 4]]),
- ....: 'B': (('x', 'y'), [[5, 6], [7, 8]])},
- ....: coords={'x': ['b', 'a'], 'y': [1, 0]})
- ....:
- In [41]: dax = xr.DataArray([100, 99], [('x', [0, 1])])
- In [42]: day = xr.DataArray([90, 80], [('y', [0, 1])])
- In [43]: ds.sortby([day, dax])
- Out[43]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 2)
- Coordinates:
- * x (x) object 'b' 'a'
- * y (y) int64 1 0
- Data variables:
- A (x, y) int64 1 2 3 4
- B (x, y) int64 5 6 7 8
As a shortcut, you can refer to existing coordinates by name:
- In [44]: ds.sortby('x')
- Out[44]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 2)
- Coordinates:
- * x (x) object 'a' 'b'
- * y (y) int64 1 0
- Data variables:
- A (x, y) int64 3 4 1 2
- B (x, y) int64 7 8 5 6
- In [45]: ds.sortby(['y', 'x'])
- Out[45]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 2)
- Coordinates:
- * x (x) object 'a' 'b'
- * y (y) int64 0 1
- Data variables:
- A (x, y) int64 4 3 2 1
- B (x, y) int64 8 7 6 5
- In [46]: ds.sortby(['y', 'x'], ascending=False)
- Out[46]:
- <xarray.Dataset>
- Dimensions: (x: 2, y: 2)
- Coordinates:
- * x (x) object 'b' 'a'
- * y (y) int64 1 0
- Data variables:
- A (x, y) int64 1 2 3 4
- B (x, y) int64 5 6 7 8
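The examples above all sort a Dataset. As a minimal sketch of the same calls on a single DataArray, the following uses illustrative names and values (`da`, `key`) that are assumptions for this example only:

```python
import xarray as xr

# Illustrative array with unsorted labels on "x".
da = xr.DataArray([3, 1, 2], coords=[("x", ["c", "a", "b"])])

# Sort by the existing coordinate, ascending (default) and descending.
by_label = da.sortby("x")
by_label_desc = da.sortby("x", ascending=False)

# Sort by a separate 1D key that shares the "x" labels.
key = xr.DataArray([0, 2, 1], coords=[("x", ["c", "a", "b"])])
by_key = da.sortby(key)

print(by_label.values, by_label_desc.values, by_key.x.values)
```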