Grouping with multiple levels #1569

jjpr-mit · 2017-09-12T14:46:12Z

http://xarray.pydata.org/en/stable/groupby.html says:

xarray supports “group by” operations with the same API as pandas

but when I supply the level keyword argument as described at https://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex, I get:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-566fc67c0151> in <module>()
----> 1 hvm_it_v6_obj = hvm_it_v6.groupby(level=["category","obj"]).mean(dim="presentation")
      2 hvm_it_v6_obj

TypeError: groupby() got an unexpected keyword argument 'level'

shoyer · 2017-09-12T18:21:36Z

We should clarify: we support a subset of the pandas API. The level argument and grouping by multiple variables at once are not yet supported.

jjpr-mit · 2017-09-12T19:17:24Z

Makes sense. Just needs a doc update, then.

What's the preferred way to contribute doc changes (including little edits like this)? Pull requests?

shoyer · 2017-09-12T20:05:39Z

The "Edit on GitHub" link at the top right of the doc page is probably the most convenient way to edit the primary doc pages. It should walk you through making a pull request if you hit the "Edit this file" icon (next to the trash bin) at the top right of the file on GitHub.

To clarify on level: the keyword itself is not supported, but you can simply pass the name of a MultiIndex level directly as the group name. Grouping by multiple levels or variables at once is truly not supported yet (#324).
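Until grouping by several variables is supported, one possible workaround is to fold the variables into a single key and group by that. The sketch below is only an illustration under stated assumptions: both coordinates are non-negative integers along the same dimension, and the names `bc_key` and `n_c` are made up for this example rather than taken from xarray.

```python
import numpy as np
import xarray as xr

# Small Dataset with two integer coordinates along the same dimension "z".
ds = xr.Dataset(
    {"a": ("z", np.ones(10))},
    coords={
        "b": ("z", np.arange(2).repeat(5)),
        "c": ("z", np.arange(5).repeat(2)),
    },
)

# Hypothetical composite key: one integer per distinct (b, c) pair.
# Assumes b and c are non-negative integers; n_c is the number of
# possible c values, so b * n_c + c is unique for each pair.
n_c = int(ds["c"].max()) + 1
bc_key = ds["b"].values * n_c + ds["c"].values

# Attach the key as an ordinary coordinate and group by its name.
grouped_sum = ds.assign_coords(bc_key=("z", bc_key)).groupby("bc_key").sum()
print(grouped_sum)
```

If the original pair is needed afterwards, `divmod(key, n_c)` should recover `(b, c)` from each key. For non-integer labels a different encoding would be required, so treat this as a sketch rather than a general solution.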

max-sixty · 2018-05-30T22:52:38Z

Are we making silent mistakes at the moment?

In [1]: import xarray as xr
In [2]: import pandas as pd

In [4]: import numpy as np

In [5]: ds=xr.Dataset(dict(a=(('z',), np.ones(10))), coords=dict(b=(('z'),np.arange(2).repeat(5)), c=(('z'),np.arange(5).repeat(2))))

In [6]: ds
Out[6]:
<xarray.Dataset>
Dimensions:  (z: 10)
Coordinates:
    b        (z) int64 0 0 0 0 0 1 1 1 1 1
    c        (z) int64 0 0 1 1 2 2 3 3 4 4
Dimensions without coordinates: z
Data variables:
    a        (z) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

In [7]: ds.set_index(bc=['b','c']).groupby('bc').sum()
Out[7]:
<xarray.Dataset>
Dimensions:     (bc: 6)
Coordinates:
  * bc          (bc) MultiIndex
  - bc_level_0  (bc) int64 0 0 0 1 1 1
  - bc_level_1  (bc) int64 0 1 2 2 3 4
Data variables:
    a           (bc) float64 10.0 10.0 10.0 10.0 10.0 10.0   # <--- this is wrong?

With a DataArray, we do raise:

In [19]: ds['a'].set_index(bc=['b','c'])
Out[19]:
<xarray.DataArray 'a' (z: 10)>
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Coordinates:
  * bc       (bc) MultiIndex
  - b        (bc) int64 0 0 0 0 0 1 1 1 1 1
  - c        (bc) int64 0 0 1 1 2 2 3 3 4 4
Dimensions without coordinates: z

In [20]: ds['a'].set_index(bc=['b','c']).groupby('bc')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-20-82bf6782525c> in <module>()
----> 1 ds['a'].set_index(bc=['b','c']).groupby('bc')

/usr/local/lib/python3.6/site-packages/xarray/core/common.py in groupby(self, group, squeeze)
    472         core.groupby.DatasetGroupBy
    473         """
--> 474         return self._groupby_cls(self, group, squeeze=squeeze)
    475
    476     def groupby_bins(self, group, bins, right=True, labels=None, precision=3,

/usr/local/lib/python3.6/site-packages/xarray/core/groupby.py in __init__(self, obj, group, squeeze, grouper, bins, cut_kwargs)
    213         group_dim, = group.dims
    214
--> 215         expected_size = obj.sizes[group_dim]
    216         if group.size != expected_size:
    217             raise ValueError('the group variable\'s length does not '

/usr/local/lib/python3.6/site-packages/xarray/core/utils.py in __getitem__(self, key)
    306
    307     def __getitem__(self, key):
--> 308         return self.mapping[key]
    309
    310     def __iter__(self):

KeyError: 'bc'

ray306 · 2019-04-30T20:55:01Z

I have a solution that is not very flexible, but it works.

Data:

import numpy as np
import pandas as pd

# Two label arrays that become the MultiIndex levels 'first' and 'second'
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
da = pd.DataFrame(np.random.randn(6, 4)).to_xarray().to_array()
da.coords['index'] = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
da

<xarray.DataArray (variable: 4, index: 6)>
array([[ 0.379189,  1.082292, -2.073478, -0.84626 , -1.529927, -0.837407],
       [-0.267983, -0.2516  , -1.016653, -0.085762, -0.058382, -0.667891],
       [-0.013488, -0.855332, -0.038072, -0.385211, -2.149742, -0.304361],
       [ 1.749561, -0.606031,  1.914146,  1.6292  , -0.515519,  1.996283]])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'bar' 'baz' 'baz' 'foo' 'foo'
  - second    (index) object 'one' 'two' 'one' 'two' 'one' 'two'
  * variable  (variable) int32 0 1 2 3

Groupby on levels:

def multi_groupby(data, levels):
    # Yield ((key0, key1), subset): group by levels[0], then by levels[1].
    return (((k0, k1), v1)
            for k0, v0 in data.groupby(levels[0])
            for k1, v1 in v0.groupby(levels[1]))

for k, v in multi_groupby(da, ['second', 'variable']):
    print(k, v)

('one', 0) <xarray.DataArray (index: 3)>
array([ 0.379189, -2.073478, -1.529927])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'one' 'one' 'one'
    variable  int32 0
('one', 1) <xarray.DataArray (index: 3)>
array([-0.267983, -1.016653, -0.058382])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'one' 'one' 'one'
    variable  int32 1
('one', 2) <xarray.DataArray (index: 3)>
array([-0.013488, -0.038072, -2.149742])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'one' 'one' 'one'
    variable  int32 2
('one', 3) <xarray.DataArray (index: 3)>
array([ 1.749561,  1.914146, -0.515519])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'one' 'one' 'one'
    variable  int32 3
('two', 0) <xarray.DataArray (index: 3)>
array([ 1.082292, -0.84626 , -0.837407])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'two' 'two' 'two'
    variable  int32 0
('two', 1) <xarray.DataArray (index: 3)>
array([-0.2516  , -0.085762, -0.667891])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'two' 'two' 'two'
    variable  int32 1
('two', 2) <xarray.DataArray (index: 3)>
array([-0.855332, -0.385211, -0.304361])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'two' 'two' 'two'
    variable  int32 2
('two', 3) <xarray.DataArray (index: 3)>
array([-0.606031,  1.6292  ,  1.996283])
Coordinates:
  * index     (index) MultiIndex
  - first     (index) object 'bar' 'baz' 'foo'
  - second    (index) object 'two' 'two' 'two'
    variable  int32 3
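
The same nesting idea can probably be extended to any number of levels with a small recursive generator. This is a rough sketch, not something xarray provides: `nested_groupby` is a hypothetical helper, and the demo uses plain (non-index) coordinates named `first` and `second` on a made-up dimension `obs` to keep it independent of MultiIndex handling.

```python
import numpy as np
import xarray as xr


def nested_groupby(data, levels):
    """Yield (key_tuple, subset) pairs by grouping on each name in turn.

    Hypothetical helper: it only relies on data.groupby(name) being
    iterable as (label, subset) pairs.
    """
    if not levels:
        yield (), data
        return
    first, rest = levels[0], levels[1:]
    for key, subset in data.groupby(first):
        for sub_key, leaf in nested_groupby(subset, rest):
            yield (key,) + sub_key, leaf


# Demo data: six observations labelled by two plain coordinates.
da = xr.DataArray(
    np.arange(6.0),
    dims="obs",
    coords={
        "first": ("obs", ["bar", "bar", "baz", "baz", "foo", "foo"]),
        "second": ("obs", ["one", "two", "one", "two", "one", "two"]),
    },
)

# np.atleast_1d guards against single-element groups that some
# xarray versions squeeze to 0-d while iterating.
groups = {
    key: np.atleast_1d(leaf.values).tolist()
    for key, leaf in nested_groupby(da, ["second", "first"])
}
for key, values in groups.items():
    print(key, values)
```

Like the two-level version, this loops in Python over every group, so it is likely to be slow for large numbers of groups.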

max-sixty · 2022-04-09T15:25:06Z

The example above is now fixed:

In [1]: ds=xr.Dataset(dict(a=(('z',), np.ones(10))), coords=dict(b=(('z'),np.arange(2).repeat(5)), c=(('z'),np.arange(5).repeat(2))))
   ...:

In [2]: ds.set_index(bc=['b','c']).groupby('bc').sum()
Out[2]:
<xarray.Dataset>
Dimensions:  (bc: 6, z: 10)
Coordinates:
  * bc       (bc) object MultiIndex
  * b        (bc) int64 0 0 0 1 1 1
  * c        (bc) int64 0 1 2 2 3 4
Dimensions without coordinates: z
Data variables:
    a        (bc, z) float64 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0   # <- correct

shoyer added contrib-help-wanted topic-documentation labels Sep 12, 2017

dcherian added the topic-groupby label Oct 26, 2018

OriolAbril mentioned this issue Mar 24, 2021

xarray usage teddygroves/figure_skating#1

Merged

max-sixty closed this as completed Apr 9, 2022
