-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass indexes directly to the DataArray and Dataset constructors #7214
Changes from 13 commits
41f4fd8
4baa8af
dbc058a
16a9983
3c076d5
70e7a5d
00e1766
3bf92cd
c9b6363
82dc5cc
a58c9d0
9beeea7
c6e94b4
ddd505e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -440,8 +440,10 @@ class Dataset( | |
Dataset implements the mapping interface with keys given by variable | ||
names and values given by DataArray objects for each variable name. | ||
|
||
One dimensional variables with name equal to their dimension are | ||
index coordinates used for label based indexing. | ||
By default, pandas indexes are created for one dimensional variables with | ||
name equal to their dimension so those variables can be used as coordinates | ||
for label based indexing. Xarray-compatible indexes may also be provided | ||
via the `indexes` argument. | ||
|
||
To load data from a file or file-like object, use the `open_dataset` | ||
function. | ||
|
@@ -492,6 +494,11 @@ class Dataset( | |
|
||
attrs : dict-like, optional | ||
Global attributes to save on this dataset. | ||
indexes : py:class:`~xarray.Indexes` or dict-like, optional | ||
A collection of :py:class:`~xarray.indexes.Index` objects and | ||
their coordinates variables. If an empty collection is given, | ||
it will skip the creation of default (pandas) indexes for | ||
dimension coordinates. | ||
|
||
Examples | ||
-------- | ||
|
@@ -551,6 +558,7 @@ class Dataset( | |
precipitation float64 8.326 | ||
Attributes: | ||
description: Weather related data. | ||
|
||
""" | ||
|
||
_attrs: dict[Hashable, Any] | None | ||
|
@@ -581,14 +589,26 @@ def __init__( | |
data_vars: Mapping[Any, Any] | None = None, | ||
coords: Mapping[Any, Any] | None = None, | ||
attrs: Mapping[Any, Any] | None = None, | ||
indexes: Mapping[Any, Index] | None = None, | ||
) -> None: | ||
# TODO(shoyer): expose indexes as a public argument in __init__ | ||
|
||
if data_vars is None: | ||
data_vars = {} | ||
if coords is None: | ||
coords = {} | ||
|
||
if indexes is None: | ||
create_default_indexes = True | ||
indexes = Indexes() | ||
elif len(indexes) == 0: | ||
create_default_indexes = False | ||
indexes = Indexes() | ||
else: | ||
create_default_indexes = True | ||
if not isinstance(indexes, Indexes): | ||
raise TypeError("non-empty indexes must be an instance of `Indexes`") | ||
elif indexes._index_type != Index: | ||
raise TypeError("indexes must only contain Xarray `Index` objects") | ||
|
||
both_data_and_coords = set(data_vars) & set(coords) | ||
if both_data_and_coords: | ||
raise ValueError( | ||
|
@@ -598,17 +618,34 @@ def __init__( | |
if isinstance(coords, Dataset): | ||
coords = coords.variables | ||
|
||
variables, coord_names, dims, indexes, _ = merge_data_and_coords( | ||
data_vars, coords, compat="broadcast_equals" | ||
variables, coord_names, dims, ds_indexes, _ = merge_data_and_coords( | ||
data_vars, | ||
coords, | ||
compat="broadcast_equals", | ||
create_default_indexes=create_default_indexes, | ||
) | ||
|
||
both_indexes_and_coords = set(indexes) & coord_names | ||
if both_indexes_and_coords: | ||
raise ValueError( | ||
f"{both_indexes_and_coords} are found in both indexes and coords" | ||
) | ||
Comment on lines
+640
to
+644
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This check surprises me. As a user, I would expect that I can write something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. Would it be reasonable that the coordinates provided via There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can you give an example of how this would happen? Do indexes already provide enough information to create a coordinate (this would slightly surprise me)? I think an error in these cases or creating a Dataset with inconsistent coordinates/indexes (if it is hard to check) would be fine. I would not silently override coordinates. |
||
|
||
variables.update({k: v.copy(deep=False) for k, v in indexes.variables.items()}) | ||
coord_names.update(indexes.variables) | ||
ds_indexes.update(indexes) | ||
|
||
# re-calculate dimensions if indexes are given explicitly | ||
if indexes: | ||
dims = calculate_dimensions(variables) | ||
|
||
self._attrs = dict(attrs) if attrs is not None else None | ||
self._close = None | ||
self._encoding = None | ||
self._variables = variables | ||
self._coord_names = coord_names | ||
self._dims = dims | ||
self._indexes = indexes | ||
self._indexes = ds_indexes | ||
|
||
@classmethod | ||
def load_store(cls: type[T_Dataset], store, decoder=None) -> T_Dataset: | ||
|
@@ -6052,6 +6089,30 @@ def assign( | |
data.update(results) | ||
return data | ||
|
||
def assign_indexes(self, indexes: Indexes[Index]): | ||
"""Assign new indexes to this dataset. | ||
|
||
Returns a new dataset with all the original data in addition to the new | ||
indexes (and their corresponding coordinates). | ||
|
||
Parameters | ||
---------- | ||
indexes : :py:class:`~xarray.Indexes`. | ||
A collection of :py:class:`~xarray.indexes.Index` objects | ||
to assign (including their coordinate variables). | ||
|
||
Returns | ||
------- | ||
assigned : Dataset | ||
A new dataset with the new indexes and coordinates in addition to | ||
the existing data. | ||
""" | ||
ds_indexes = Dataset(indexes=indexes) | ||
dropped = self.drop_vars(indexes, errors="ignore") | ||
return dropped.merge( | ||
ds_indexes, compat="minimal", join="override", combine_attrs="no_conflicts" | ||
) | ||
|
||
def to_array( | ||
self, dim: Hashable = "variable", name: Hashable | None = None | ||
) -> DataArray: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the special case for size 0 indexes. These sort of special cases that violate type stability can lead to very hard to debug bugs.
Instead, how about never creating default pandas indexes if
indexes
is provided? Maybe there could also be a special method (e.g.,assign_default_indexes()
) for manually adding in pandas indexes later.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I like that!
I also like
assign_default_indexes()
or maybeassign_indexes(indexes=None)
if we don't want to add another method.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern I have is when a user wants to explicitly provide a pandas multi-index or a custom single coordinate index for a dimension and still wants the default pandas indexes for the other dimension coordinates.
In this case, calling
assign_default_indexes()
orassign_indexes(indexes=None)
later will likely overwrite the explicitly provided indexes?After removing the pandas multi-index dimension coordinate this won't be an issue anymore, but the issue remains for any custom 1-d dimension coordinate index.
Should we add an
overwrite=True
keyword argument toassign_(default_)indexes
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assign_default_indexes()
should probably include an optional list of dimension names for which to assign default indexes.By default, we might make it only add indexes for dimensions that doesn't already have an index.