From d26894bdc3d80acd117a16a528663bcdf26bab32 Mon Sep 17 00:00:00 2001 From: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> Date: Tue, 22 Feb 2022 19:49:52 -0800 Subject: [PATCH] Move Zarr up in io.rst (#6289) * Move Zarr up in io.rst The existing version had it right down the page, below Iris / Pickle / et al. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- doc/contributing.rst | 4 +- doc/user-guide/io.rst | 858 ++++++++++++++++++------------------- doc/whats-new.rst | 4 +- xarray/core/computation.py | 2 +- 4 files changed, 434 insertions(+), 434 deletions(-) diff --git a/doc/contributing.rst b/doc/contributing.rst index df279caa54f..0913702fd83 100644 --- a/doc/contributing.rst +++ b/doc/contributing.rst @@ -274,13 +274,13 @@ Some other important things to know about the docs: .. ipython:: python x = 2 - x ** 3 + x**3 will be rendered as:: In [1]: x = 2 - In [2]: x ** 3 + In [2]: x**3 Out[2]: 8 Almost all code examples in the docs are run (and the output saved) during the diff --git a/doc/user-guide/io.rst b/doc/user-guide/io.rst index 28eeeeda99b..834e9ad2464 100644 --- a/doc/user-guide/io.rst +++ b/doc/user-guide/io.rst @@ -498,596 +498,596 @@ and currently raises a warning unless ``invalid_netcdf=True`` is set: Note that this produces a file that is likely to be not readable by other netCDF libraries! -.. _io.iris: +.. _io.zarr: -Iris +Zarr ---- -The Iris_ tool allows easy reading of common meteorological and climate model formats -(including GRIB and UK MetOffice PP files) into ``Cube`` objects which are in many ways very -similar to ``DataArray`` objects, while enforcing a CF-compliant data model. If iris is -installed, xarray can convert a ``DataArray`` into a ``Cube`` using -:py:meth:`DataArray.to_iris`: - -.. ipython:: python +`Zarr`_ is a Python package that provides an implementation of chunked, compressed, +N-dimensional arrays. +Zarr has the ability to store arrays in a range of ways, including in memory, +in files, and in cloud-based object storage such as `Amazon S3`_ and +`Google Cloud Storage`_. +Xarray's Zarr backend allows xarray to leverage these capabilities, including +the ability to store and analyze datasets far too large fit onto disk +(particularly :ref:`in combination with dask `). - da = xr.DataArray( - np.random.rand(4, 5), - dims=["x", "y"], - coords=dict(x=[10, 20, 30, 40], y=pd.date_range("2000-01-01", periods=5)), - ) +Xarray can't open just any zarr dataset, because xarray requires special +metadata (attributes) describing the dataset dimensions and coordinates. +At this time, xarray can only open zarr datasets that have been written by +xarray. For implementation details, see :ref:`zarr_encoding`. - cube = da.to_iris() - cube +To write a dataset with zarr, we use the :py:meth:`Dataset.to_zarr` method. -Conversely, we can create a new ``DataArray`` object from a ``Cube`` using -:py:meth:`DataArray.from_iris`: +To write to a local directory, we pass a path to a directory: .. ipython:: python + :suppress: - da_cube = xr.DataArray.from_iris(cube) - da_cube + ! rm -rf path/to/directory.zarr +.. ipython:: python -.. 
_Iris: https://scitools.org.uk/iris + ds = xr.Dataset( + {"foo": (("x", "y"), np.random.rand(4, 5))}, + coords={ + "x": [10, 20, 30, 40], + "y": pd.date_range("2000-01-01", periods=5), + "z": ("x", list("abcd")), + }, + ) + ds.to_zarr("path/to/directory.zarr") +(The suffix ``.zarr`` is optional--just a reminder that a zarr store lives +there.) If the directory does not exist, it will be created. If a zarr +store is already present at that path, an error will be raised, preventing it +from being overwritten. To override this behavior and overwrite an existing +store, add ``mode='w'`` when invoking :py:meth:`~Dataset.to_zarr`. -OPeNDAP -------- +To store variable length strings, convert them to object arrays first with +``dtype=object``. -Xarray includes support for `OPeNDAP`__ (via the netCDF4 library or Pydap), which -lets us access large datasets over HTTP. +To read back a zarr dataset that has been created this way, we use the +:py:func:`open_zarr` method: -__ https://www.opendap.org/ +.. ipython:: python -For example, we can open a connection to GBs of weather data produced by the -`PRISM`__ project, and hosted by `IRI`__ at Columbia: + ds_zarr = xr.open_zarr("path/to/directory.zarr") + ds_zarr -__ https://www.prism.oregonstate.edu/ -__ https://iri.columbia.edu/ +Cloud Storage Buckets +~~~~~~~~~~~~~~~~~~~~~ -.. ipython source code for this section - we don't use this to avoid hitting the DAP server on every doc build. +It is possible to read and write xarray datasets directly from / to cloud +storage buckets using zarr. This example uses the `gcsfs`_ package to provide +an interface to `Google Cloud Storage`_. - remote_data = xr.open_dataset( - 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods', - decode_times=False) - tmax = remote_data.tmax[:500, ::3, ::3] - tmax +From v0.16.2: general `fsspec`_ URLs are parsed and the store set up for you +automatically when reading, such that you can open a dataset in a single +call. You should include any arguments to the storage backend as the +key ``storage_options``, part of ``backend_kwargs``. - @savefig opendap-prism-tmax.png - tmax[0].plot() +.. code:: python -.. ipython:: - :verbatim: + ds_gcs = xr.open_dataset( + "gcs:///path.zarr", + backend_kwargs={ + "storage_options": {"project": "", "token": None} + }, + engine="zarr", + ) - In [3]: remote_data = xr.open_dataset( - ...: "http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods", - ...: decode_times=False, - ...: ) - In [4]: remote_data - Out[4]: - - Dimensions: (T: 1422, X: 1405, Y: 621) - Coordinates: - * X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ... - * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ... - * Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ... - Data variables: - ppt (T, Y, X) float64 ... - tdmean (T, Y, X) float64 ... - tmax (T, Y, X) float64 ... - tmin (T, Y, X) float64 ... - Attributes: - Conventions: IRIDL - expires: 1375315200 +This also works with ``open_mfdataset``, allowing you to pass a list of paths or +a URL to be interpreted as a glob string. -.. TODO: update this example to show off decode_cf? +For older versions, and for writing, you must explicitly set up a ``MutableMapping`` +instance and pass this, as follows: -.. note:: +.. code:: python - Like many real-world datasets, this dataset does not entirely follow - `CF conventions`_. Unexpected formats will usually cause xarray's automatic - decoding to fail. 
The way to work around this is to either set - ``decode_cf=False`` in ``open_dataset`` to turn off all use of CF - conventions, or by only disabling the troublesome parser. - In this case, we set ``decode_times=False`` because the time axis here - provides the calendar attribute in a format that xarray does not expect - (the integer ``360`` instead of a string like ``'360_day'``). + import gcsfs -We can select and slice this data any number of times, and nothing is loaded -over the network until we look at particular values: + fs = gcsfs.GCSFileSystem(project="", token=None) + gcsmap = gcsfs.mapping.GCSMap("", gcs=fs, check=True, create=False) + # write to the bucket + ds.to_zarr(store=gcsmap) + # read it back + ds_gcs = xr.open_zarr(gcsmap) -.. ipython:: - :verbatim: +(or use the utility function ``fsspec.get_mapper()``). - In [4]: tmax = remote_data["tmax"][:500, ::3, ::3] +.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/ +.. _Zarr: https://zarr.readthedocs.io/ +.. _Amazon S3: https://aws.amazon.com/s3/ +.. _Google Cloud Storage: https://cloud.google.com/storage/ +.. _gcsfs: https://github.com/fsspec/gcsfs - In [5]: tmax - Out[5]: - - [48541500 values with dtype=float64] - Coordinates: - * Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ... - * X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ... - * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ... - Attributes: - pointwidth: 120 - standard_name: air_temperature - units: Celsius_scale - expires: 1443657600 +Zarr Compressors and Filters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - # the data is downloaded automatically when we make the plot - In [6]: tmax[0].plot() +There are many different options for compression and filtering possible with +zarr. These are described in the +`zarr documentation `_. +These options can be passed to the ``to_zarr`` method as variable encoding. +For example: -.. image:: ../_static/opendap-prism-tmax.png +.. ipython:: python + :suppress: -Some servers require authentication before we can access the data. For this -purpose we can explicitly create a :py:class:`backends.PydapDataStore` -and pass in a `Requests`__ session object. For example for -HTTP Basic authentication:: + ! rm -rf foo.zarr - import xarray as xr - import requests +.. ipython:: python - session = requests.Session() - session.auth = ('username', 'password') + import zarr - store = xr.backends.PydapDataStore.open('http://example.com/data', - session=session) - ds = xr.open_dataset(store) + compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2) + ds.to_zarr("foo.zarr", encoding={"foo": {"compressor": compressor}}) -`Pydap's cas module`__ has functions that generate custom sessions for -servers that use CAS single sign-on. For example, to connect to servers -that require NASA's URS authentication:: +.. note:: - import xarray as xr - from pydata.cas.urs import setup_session + Not all native zarr compression and filtering options have been tested with + xarray. - ds_url = 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc' +.. _io.zarr.consolidated_metadata: - session = setup_session('username', 'password', check_url=ds_url) - store = xr.backends.PydapDataStore.open(ds_url, session=session) +Consolidated Metadata +~~~~~~~~~~~~~~~~~~~~~ - ds = xr.open_dataset(store) +Xarray needs to read all of the zarr metadata when it opens a dataset. +In some storage mediums, such as with cloud object storage (e.g. 
amazon S3), +this can introduce significant overhead, because two separate HTTP calls to the +object store must be made for each variable in the dataset. +As of xarray version 0.18, xarray by default uses a feature called +*consolidated metadata*, storing all metadata for the entire dataset with a +single key (by default called ``.zmetadata``). This typically drastically speeds +up opening the store. (For more information on this feature, consult the +`zarr docs `_.) -__ https://docs.python-requests.org -__ https://www.pydap.org/en/latest/client.html#authentication +By default, xarray writes consolidated metadata and attempts to read stores +with consolidated metadata, falling back to use non-consolidated metadata for +reads. Because this fall-back option is so much slower, xarray issues a +``RuntimeWarning`` with guidance when reading with consolidated metadata fails: -.. _io.pickle: + Failed to open Zarr store with consolidated metadata, falling back to try + reading non-consolidated metadata. This is typically much slower for + opening a dataset. To silence this warning, consider: -Pickle ------- + 1. Consolidating metadata in this existing store with + :py:func:`zarr.consolidate_metadata`. + 2. Explicitly setting ``consolidated=False``, to avoid trying to read + consolidate metadata. + 3. Explicitly setting ``consolidated=True``, to raise an error in this case + instead of falling back to try reading non-consolidated metadata. -The simplest way to serialize an xarray object is to use Python's built-in pickle -module: +.. _io.zarr.appending: -.. ipython:: python +Appending to existing Zarr stores +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - import pickle +Xarray supports several ways of incrementally writing variables to a Zarr +store. These options are useful for scenarios when it is infeasible or +undesirable to write your entire dataset at once. - # use the highest protocol (-1) because it is way faster than the default - # text based pickle format - pkl = pickle.dumps(ds, protocol=-1) +.. tip:: - pickle.loads(pkl) + If you can load all of your data into a single ``Dataset`` using dask, a + single call to ``to_zarr()`` will write all of your data in parallel. -Pickling is important because it doesn't require any external libraries -and lets you use xarray objects with Python modules like -:py:mod:`multiprocessing` or :ref:`Dask `. However, pickling is -**not recommended for long-term storage**. +.. warning:: -Restoring a pickle requires that the internal structure of the types for the -pickled data remain unchanged. Because the internal design of xarray is still -being refined, we make no guarantees (at this point) that objects pickled with -this version of xarray will work in future versions. - -.. note:: + Alignment of coordinates is currently not checked when modifying an + existing Zarr store. It is up to the user to ensure that coordinates are + consistent. - When pickling an object opened from a NetCDF file, the pickle file will - contain a reference to the file on disk. If you want to store the actual - array values, load it into memory first with :py:meth:`Dataset.load` - or :py:meth:`Dataset.compute`. +To add or overwrite entire variables, simply call :py:meth:`~Dataset.to_zarr` +with ``mode='a'`` on a Dataset containing the new variables, passing in an +existing Zarr store or path to a Zarr store. -.. _dictionary io: +To resize and then append values along an existing dimension in a store, set +``append_dim``. 
This is a good option if data always arives in a particular +order, e.g., for time-stepping a simulation: -Dictionary ----------- +.. ipython:: python + :suppress: -We can convert a ``Dataset`` (or a ``DataArray``) to a dict using -:py:meth:`Dataset.to_dict`: + ! rm -rf path/to/directory.zarr .. ipython:: python - d = ds.to_dict() - d + ds1 = xr.Dataset( + {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))}, + coords={ + "x": [10, 20, 30, 40], + "y": [1, 2, 3, 4, 5], + "t": pd.date_range("2001-01-01", periods=2), + }, + ) + ds1.to_zarr("path/to/directory.zarr") + ds2 = xr.Dataset( + {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))}, + coords={ + "x": [10, 20, 30, 40], + "y": [1, 2, 3, 4, 5], + "t": pd.date_range("2001-01-03", periods=2), + }, + ) + ds2.to_zarr("path/to/directory.zarr", append_dim="t") -We can create a new xarray object from a dict using -:py:meth:`Dataset.from_dict`: +Finally, you can use ``region`` to write to limited regions of existing arrays +in an existing Zarr store. This is a good option for writing data in parallel +from independent processes. + +To scale this up to writing large datasets, the first step is creating an +initial Zarr store without writing all of its array data. This can be done by +first creating a ``Dataset`` with dummy values stored in :ref:`dask `, +and then calling ``to_zarr`` with ``compute=False`` to write only metadata +(including ``attrs``) to Zarr: .. ipython:: python + :suppress: - ds_dict = xr.Dataset.from_dict(d) - ds_dict + ! rm -rf path/to/directory.zarr -Dictionary support allows for flexible use of xarray objects. It doesn't -require external libraries and dicts can easily be pickled, or converted to -json, or geojson. All the values are converted to lists, so dicts might -be quite large. +.. ipython:: python -To export just the dataset schema without the data itself, use the -``data=False`` option: + import dask.array + + # The values of this dask array are entirely irrelevant; only the dtype, + # shape and chunks are used + dummies = dask.array.zeros(30, chunks=10) + ds = xr.Dataset({"foo": ("x", dummies)}) + path = "path/to/directory.zarr" + # Now we write the metadata without computing any array values + ds.to_zarr(path, compute=False) + +Now, a Zarr store with the correct variable shapes and attributes exists that +can be filled out by subsequent calls to ``to_zarr``. The ``region`` provides a +mapping from dimension names to Python ``slice`` objects indicating where the +data should be written (in index space, not coordinate space), e.g., .. ipython:: python - ds.to_dict(data=False) + # For convenience, we'll slice a single dataset, but in the real use-case + # we would create them separately possibly even from separate processes. + ds = xr.Dataset({"foo": ("x", np.arange(30))}) + ds.isel(x=slice(0, 10)).to_zarr(path, region={"x": slice(0, 10)}) + ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": slice(10, 20)}) + ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)}) -This can be useful for generating indices of dataset contents to expose to -search indices or other automated data discovery tools. +Concurrent writes with ``region`` are safe as long as they modify distinct +chunks in the underlying Zarr arrays (or use an appropriate ``lock``). -.. ipython:: python - :suppress: +As a safety check to make it harder to inadvertently override existing values, +if you set ``region`` then *all* variables included in a Dataset must have +dimensions included in ``region``. 
Other variables (typically coordinates) +need to be explicitly dropped and/or written in a separate calls to ``to_zarr`` +with ``mode='a'``. - import os +.. _io.iris: - os.remove("saved_on_disk.nc") +Iris +---- -.. _io.rasterio: +The Iris_ tool allows easy reading of common meteorological and climate model formats +(including GRIB and UK MetOffice PP files) into ``Cube`` objects which are in many ways very +similar to ``DataArray`` objects, while enforcing a CF-compliant data model. If iris is +installed, xarray can convert a ``DataArray`` into a ``Cube`` using +:py:meth:`DataArray.to_iris`: -Rasterio --------- +.. ipython:: python -GeoTIFFs and other gridded raster datasets can be opened using `rasterio`_, if -rasterio is installed. Here is an example of how to use -:py:func:`open_rasterio` to read one of rasterio's `test files`_: + da = xr.DataArray( + np.random.rand(4, 5), + dims=["x", "y"], + coords=dict(x=[10, 20, 30, 40], y=pd.date_range("2000-01-01", periods=5)), + ) -.. deprecated:: 0.20.0 + cube = da.to_iris() + cube - Deprecated in favor of rioxarray. - For information about transitioning, see: - https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html +Conversely, we can create a new ``DataArray`` object from a ``Cube`` using +:py:meth:`DataArray.from_iris`: -.. ipython:: - :verbatim: +.. ipython:: python - In [7]: rio = xr.open_rasterio("RGB.byte.tif") + da_cube = xr.DataArray.from_iris(cube) + da_cube - In [8]: rio - Out[8]: - - [1703814 values with dtype=uint8] - Coordinates: - * band (band) int64 1 2 3 - * y (y) float64 2.827e+06 2.826e+06 2.826e+06 2.826e+06 2.826e+06 ... - * x (x) float64 1.021e+05 1.024e+05 1.027e+05 1.03e+05 1.033e+05 ... - Attributes: - res: (300.0379266750948, 300.041782729805) - transform: (300.0379266750948, 0.0, 101985.0, 0.0, -300.041782729805, 28... - is_tiled: 0 - crs: +init=epsg:32618 +.. _Iris: https://scitools.org.uk/iris -The ``x`` and ``y`` coordinates are generated out of the file's metadata -(``bounds``, ``width``, ``height``), and they can be understood as cartesian -coordinates defined in the file's projection provided by the ``crs`` attribute. -``crs`` is a PROJ4 string which can be parsed by e.g. `pyproj`_ or rasterio. -See :ref:`/examples/visualization_gallery.ipynb#Parsing-rasterio-geocoordinates` -for an example of how to convert these to longitudes and latitudes. +OPeNDAP +------- -Additionally, you can use `rioxarray`_ for reading in GeoTiff, netCDF or other -GDAL readable raster data using `rasterio`_ as well as for exporting to a geoTIFF. -`rioxarray`_ can also handle geospatial related tasks such as re-projecting and clipping. +Xarray includes support for `OPeNDAP`__ (via the netCDF4 library or Pydap), which +lets us access large datasets over HTTP. -.. ipython:: - :verbatim: +__ https://www.opendap.org/ - In [1]: import rioxarray +For example, we can open a connection to GBs of weather data produced by the +`PRISM`__ project, and hosted by `IRI`__ at Columbia: - In [2]: rds = rioxarray.open_rasterio("RGB.byte.tif") +__ https://www.prism.oregonstate.edu/ +__ https://iri.columbia.edu/ - In [3]: rds - Out[3]: - - [1703814 values with dtype=uint8] - Coordinates: - * band (band) int64 1 2 3 - * y (y) float64 2.827e+06 2.826e+06 ... 2.612e+06 2.612e+06 - * x (x) float64 1.021e+05 1.024e+05 ... 
3.389e+05 3.392e+05 - spatial_ref int64 0 - Attributes: - STATISTICS_MAXIMUM: 255 - STATISTICS_MEAN: 29.947726688477 - STATISTICS_MINIMUM: 0 - STATISTICS_STDDEV: 52.340921626611 - transform: (300.0379266750948, 0.0, 101985.0, 0.0, -300.0417827... - _FillValue: 0.0 - scale_factor: 1.0 - add_offset: 0.0 - grid_mapping: spatial_ref +.. ipython source code for this section + we don't use this to avoid hitting the DAP server on every doc build. - In [4]: rds.rio.crs - Out[4]: CRS.from_epsg(32618) + remote_data = xr.open_dataset( + 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods', + decode_times=False) + tmax = remote_data.tmax[:500, ::3, ::3] + tmax - In [5]: rds4326 = rds.rio.reproject("epsg:4326") + @savefig opendap-prism-tmax.png + tmax[0].plot() - In [6]: rds4326.rio.crs - Out[6]: CRS.from_epsg(4326) +.. ipython:: + :verbatim: - In [7]: rds4326.rio.to_raster("RGB.byte.4326.tif") + In [3]: remote_data = xr.open_dataset( + ...: "http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods", + ...: decode_times=False, + ...: ) + In [4]: remote_data + Out[4]: + + Dimensions: (T: 1422, X: 1405, Y: 621) + Coordinates: + * X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ... + * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ... + * Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ... + Data variables: + ppt (T, Y, X) float64 ... + tdmean (T, Y, X) float64 ... + tmax (T, Y, X) float64 ... + tmin (T, Y, X) float64 ... + Attributes: + Conventions: IRIDL + expires: 1375315200 -.. _rasterio: https://rasterio.readthedocs.io/en/latest/ -.. _rioxarray: https://corteva.github.io/rioxarray/stable/ -.. _test files: https://github.com/rasterio/rasterio/blob/master/tests/data/RGB.byte.tif -.. _pyproj: https://github.com/pyproj4/pyproj +.. TODO: update this example to show off decode_cf? -.. _io.zarr: +.. note:: -Zarr ----- + Like many real-world datasets, this dataset does not entirely follow + `CF conventions`_. Unexpected formats will usually cause xarray's automatic + decoding to fail. The way to work around this is to either set + ``decode_cf=False`` in ``open_dataset`` to turn off all use of CF + conventions, or by only disabling the troublesome parser. + In this case, we set ``decode_times=False`` because the time axis here + provides the calendar attribute in a format that xarray does not expect + (the integer ``360`` instead of a string like ``'360_day'``). -`Zarr`_ is a Python package that provides an implementation of chunked, compressed, -N-dimensional arrays. -Zarr has the ability to store arrays in a range of ways, including in memory, -in files, and in cloud-based object storage such as `Amazon S3`_ and -`Google Cloud Storage`_. -Xarray's Zarr backend allows xarray to leverage these capabilities, including -the ability to store and analyze datasets far too large fit onto disk -(particularly :ref:`in combination with dask `). +We can select and slice this data any number of times, and nothing is loaded +over the network until we look at particular values: -Xarray can't open just any zarr dataset, because xarray requires special -metadata (attributes) describing the dataset dimensions and coordinates. -At this time, xarray can only open zarr datasets that have been written by -xarray. For implementation details, see :ref:`zarr_encoding`. +.. ipython:: + :verbatim: -To write a dataset with zarr, we use the :py:meth:`Dataset.to_zarr` method. 
+ In [4]: tmax = remote_data["tmax"][:500, ::3, ::3] -To write to a local directory, we pass a path to a directory: + In [5]: tmax + Out[5]: + + [48541500 values with dtype=float64] + Coordinates: + * Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ... + * X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ... + * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ... + Attributes: + pointwidth: 120 + standard_name: air_temperature + units: Celsius_scale + expires: 1443657600 -.. ipython:: python - :suppress: + # the data is downloaded automatically when we make the plot + In [6]: tmax[0].plot() - ! rm -rf path/to/directory.zarr +.. image:: ../_static/opendap-prism-tmax.png -.. ipython:: python +Some servers require authentication before we can access the data. For this +purpose we can explicitly create a :py:class:`backends.PydapDataStore` +and pass in a `Requests`__ session object. For example for +HTTP Basic authentication:: - ds = xr.Dataset( - {"foo": (("x", "y"), np.random.rand(4, 5))}, - coords={ - "x": [10, 20, 30, 40], - "y": pd.date_range("2000-01-01", periods=5), - "z": ("x", list("abcd")), - }, - ) - ds.to_zarr("path/to/directory.zarr") + import xarray as xr + import requests -(The suffix ``.zarr`` is optional--just a reminder that a zarr store lives -there.) If the directory does not exist, it will be created. If a zarr -store is already present at that path, an error will be raised, preventing it -from being overwritten. To override this behavior and overwrite an existing -store, add ``mode='w'`` when invoking :py:meth:`~Dataset.to_zarr`. + session = requests.Session() + session.auth = ('username', 'password') -To store variable length strings, convert them to object arrays first with -``dtype=object``. + store = xr.backends.PydapDataStore.open('http://example.com/data', + session=session) + ds = xr.open_dataset(store) -To read back a zarr dataset that has been created this way, we use the -:py:func:`open_zarr` method: +`Pydap's cas module`__ has functions that generate custom sessions for +servers that use CAS single sign-on. For example, to connect to servers +that require NASA's URS authentication:: -.. ipython:: python + import xarray as xr + from pydata.cas.urs import setup_session - ds_zarr = xr.open_zarr("path/to/directory.zarr") - ds_zarr + ds_url = 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc' -Cloud Storage Buckets -~~~~~~~~~~~~~~~~~~~~~ + session = setup_session('username', 'password', check_url=ds_url) + store = xr.backends.PydapDataStore.open(ds_url, session=session) -It is possible to read and write xarray datasets directly from / to cloud -storage buckets using zarr. This example uses the `gcsfs`_ package to provide -an interface to `Google Cloud Storage`_. + ds = xr.open_dataset(store) -From v0.16.2: general `fsspec`_ URLs are parsed and the store set up for you -automatically when reading, such that you can open a dataset in a single -call. You should include any arguments to the storage backend as the -key ``storage_options``, part of ``backend_kwargs``. +__ https://docs.python-requests.org +__ https://www.pydap.org/en/latest/client.html#authentication -.. code:: python +.. 
_io.pickle: - ds_gcs = xr.open_dataset( - "gcs:///path.zarr", - backend_kwargs={ - "storage_options": {"project": "", "token": None} - }, - engine="zarr", - ) +Pickle +------ +The simplest way to serialize an xarray object is to use Python's built-in pickle +module: -This also works with ``open_mfdataset``, allowing you to pass a list of paths or -a URL to be interpreted as a glob string. +.. ipython:: python -For older versions, and for writing, you must explicitly set up a ``MutableMapping`` -instance and pass this, as follows: + import pickle -.. code:: python + # use the highest protocol (-1) because it is way faster than the default + # text based pickle format + pkl = pickle.dumps(ds, protocol=-1) - import gcsfs + pickle.loads(pkl) - fs = gcsfs.GCSFileSystem(project="", token=None) - gcsmap = gcsfs.mapping.GCSMap("", gcs=fs, check=True, create=False) - # write to the bucket - ds.to_zarr(store=gcsmap) - # read it back - ds_gcs = xr.open_zarr(gcsmap) +Pickling is important because it doesn't require any external libraries +and lets you use xarray objects with Python modules like +:py:mod:`multiprocessing` or :ref:`Dask `. However, pickling is +**not recommended for long-term storage**. -(or use the utility function ``fsspec.get_mapper()``). +Restoring a pickle requires that the internal structure of the types for the +pickled data remain unchanged. Because the internal design of xarray is still +being refined, we make no guarantees (at this point) that objects pickled with +this version of xarray will work in future versions. -.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/ -.. _Zarr: https://zarr.readthedocs.io/ -.. _Amazon S3: https://aws.amazon.com/s3/ -.. _Google Cloud Storage: https://cloud.google.com/storage/ -.. _gcsfs: https://github.com/fsspec/gcsfs +.. note:: -Zarr Compressors and Filters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + When pickling an object opened from a NetCDF file, the pickle file will + contain a reference to the file on disk. If you want to store the actual + array values, load it into memory first with :py:meth:`Dataset.load` + or :py:meth:`Dataset.compute`. -There are many different options for compression and filtering possible with -zarr. These are described in the -`zarr documentation `_. -These options can be passed to the ``to_zarr`` method as variable encoding. -For example: +.. _dictionary io: -.. ipython:: python - :suppress: +Dictionary +---------- - ! rm -rf foo.zarr +We can convert a ``Dataset`` (or a ``DataArray``) to a dict using +:py:meth:`Dataset.to_dict`: .. ipython:: python - import zarr - - compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2) - ds.to_zarr("foo.zarr", encoding={"foo": {"compressor": compressor}}) + d = ds.to_dict() + d -.. note:: +We can create a new xarray object from a dict using +:py:meth:`Dataset.from_dict`: - Not all native zarr compression and filtering options have been tested with - xarray. +.. ipython:: python -.. _io.zarr.consolidated_metadata: + ds_dict = xr.Dataset.from_dict(d) + ds_dict -Consolidated Metadata -~~~~~~~~~~~~~~~~~~~~~ +Dictionary support allows for flexible use of xarray objects. It doesn't +require external libraries and dicts can easily be pickled, or converted to +json, or geojson. All the values are converted to lists, so dicts might +be quite large. -Xarray needs to read all of the zarr metadata when it opens a dataset. -In some storage mediums, such as with cloud object storage (e.g. 
amazon S3), -this can introduce significant overhead, because two separate HTTP calls to the -object store must be made for each variable in the dataset. -As of xarray version 0.18, xarray by default uses a feature called -*consolidated metadata*, storing all metadata for the entire dataset with a -single key (by default called ``.zmetadata``). This typically drastically speeds -up opening the store. (For more information on this feature, consult the -`zarr docs `_.) +To export just the dataset schema without the data itself, use the +``data=False`` option: -By default, xarray writes consolidated metadata and attempts to read stores -with consolidated metadata, falling back to use non-consolidated metadata for -reads. Because this fall-back option is so much slower, xarray issues a -``RuntimeWarning`` with guidance when reading with consolidated metadata fails: +.. ipython:: python - Failed to open Zarr store with consolidated metadata, falling back to try - reading non-consolidated metadata. This is typically much slower for - opening a dataset. To silence this warning, consider: + ds.to_dict(data=False) - 1. Consolidating metadata in this existing store with - :py:func:`zarr.consolidate_metadata`. - 2. Explicitly setting ``consolidated=False``, to avoid trying to read - consolidate metadata. - 3. Explicitly setting ``consolidated=True``, to raise an error in this case - instead of falling back to try reading non-consolidated metadata. +This can be useful for generating indices of dataset contents to expose to +search indices or other automated data discovery tools. -.. _io.zarr.appending: +.. ipython:: python + :suppress: -Appending to existing Zarr stores -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + import os -Xarray supports several ways of incrementally writing variables to a Zarr -store. These options are useful for scenarios when it is infeasible or -undesirable to write your entire dataset at once. + os.remove("saved_on_disk.nc") -.. tip:: +.. _io.rasterio: - If you can load all of your data into a single ``Dataset`` using dask, a - single call to ``to_zarr()`` will write all of your data in parallel. +Rasterio +-------- -.. warning:: +GeoTIFFs and other gridded raster datasets can be opened using `rasterio`_, if +rasterio is installed. Here is an example of how to use +:py:func:`open_rasterio` to read one of rasterio's `test files`_: - Alignment of coordinates is currently not checked when modifying an - existing Zarr store. It is up to the user to ensure that coordinates are - consistent. +.. deprecated:: 0.20.0 -To add or overwrite entire variables, simply call :py:meth:`~Dataset.to_zarr` -with ``mode='a'`` on a Dataset containing the new variables, passing in an -existing Zarr store or path to a Zarr store. + Deprecated in favor of rioxarray. + For information about transitioning, see: + https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html -To resize and then append values along an existing dimension in a store, set -``append_dim``. This is a good option if data always arives in a particular -order, e.g., for time-stepping a simulation: +.. ipython:: + :verbatim: -.. ipython:: python - :suppress: + In [7]: rio = xr.open_rasterio("RGB.byte.tif") - ! rm -rf path/to/directory.zarr + In [8]: rio + Out[8]: + + [1703814 values with dtype=uint8] + Coordinates: + * band (band) int64 1 2 3 + * y (y) float64 2.827e+06 2.826e+06 2.826e+06 2.826e+06 2.826e+06 ... + * x (x) float64 1.021e+05 1.024e+05 1.027e+05 1.03e+05 1.033e+05 ... 
+ Attributes: + res: (300.0379266750948, 300.041782729805) + transform: (300.0379266750948, 0.0, 101985.0, 0.0, -300.041782729805, 28... + is_tiled: 0 + crs: +init=epsg:32618 -.. ipython:: python - ds1 = xr.Dataset( - {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))}, - coords={ - "x": [10, 20, 30, 40], - "y": [1, 2, 3, 4, 5], - "t": pd.date_range("2001-01-01", periods=2), - }, - ) - ds1.to_zarr("path/to/directory.zarr") - ds2 = xr.Dataset( - {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))}, - coords={ - "x": [10, 20, 30, 40], - "y": [1, 2, 3, 4, 5], - "t": pd.date_range("2001-01-03", periods=2), - }, - ) - ds2.to_zarr("path/to/directory.zarr", append_dim="t") +The ``x`` and ``y`` coordinates are generated out of the file's metadata +(``bounds``, ``width``, ``height``), and they can be understood as cartesian +coordinates defined in the file's projection provided by the ``crs`` attribute. +``crs`` is a PROJ4 string which can be parsed by e.g. `pyproj`_ or rasterio. +See :ref:`/examples/visualization_gallery.ipynb#Parsing-rasterio-geocoordinates` +for an example of how to convert these to longitudes and latitudes. -Finally, you can use ``region`` to write to limited regions of existing arrays -in an existing Zarr store. This is a good option for writing data in parallel -from independent processes. -To scale this up to writing large datasets, the first step is creating an -initial Zarr store without writing all of its array data. This can be done by -first creating a ``Dataset`` with dummy values stored in :ref:`dask `, -and then calling ``to_zarr`` with ``compute=False`` to write only metadata -(including ``attrs``) to Zarr: +Additionally, you can use `rioxarray`_ for reading in GeoTiff, netCDF or other +GDAL readable raster data using `rasterio`_ as well as for exporting to a geoTIFF. +`rioxarray`_ can also handle geospatial related tasks such as re-projecting and clipping. -.. ipython:: python - :suppress: +.. ipython:: + :verbatim: - ! rm -rf path/to/directory.zarr + In [1]: import rioxarray -.. ipython:: python + In [2]: rds = rioxarray.open_rasterio("RGB.byte.tif") - import dask.array + In [3]: rds + Out[3]: + + [1703814 values with dtype=uint8] + Coordinates: + * band (band) int64 1 2 3 + * y (y) float64 2.827e+06 2.826e+06 ... 2.612e+06 2.612e+06 + * x (x) float64 1.021e+05 1.024e+05 ... 3.389e+05 3.392e+05 + spatial_ref int64 0 + Attributes: + STATISTICS_MAXIMUM: 255 + STATISTICS_MEAN: 29.947726688477 + STATISTICS_MINIMUM: 0 + STATISTICS_STDDEV: 52.340921626611 + transform: (300.0379266750948, 0.0, 101985.0, 0.0, -300.0417827... + _FillValue: 0.0 + scale_factor: 1.0 + add_offset: 0.0 + grid_mapping: spatial_ref - # The values of this dask array are entirely irrelevant; only the dtype, - # shape and chunks are used - dummies = dask.array.zeros(30, chunks=10) - ds = xr.Dataset({"foo": ("x", dummies)}) - path = "path/to/directory.zarr" - # Now we write the metadata without computing any array values - ds.to_zarr(path, compute=False) + In [4]: rds.rio.crs + Out[4]: CRS.from_epsg(32618) -Now, a Zarr store with the correct variable shapes and attributes exists that -can be filled out by subsequent calls to ``to_zarr``. The ``region`` provides a -mapping from dimension names to Python ``slice`` objects indicating where the -data should be written (in index space, not coordinate space), e.g., + In [5]: rds4326 = rds.rio.reproject("epsg:4326") -.. 
ipython:: python + In [6]: rds4326.rio.crs + Out[6]: CRS.from_epsg(4326) - # For convenience, we'll slice a single dataset, but in the real use-case - # we would create them separately possibly even from separate processes. - ds = xr.Dataset({"foo": ("x", np.arange(30))}) - ds.isel(x=slice(0, 10)).to_zarr(path, region={"x": slice(0, 10)}) - ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": slice(10, 20)}) - ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)}) + In [7]: rds4326.rio.to_raster("RGB.byte.4326.tif") -Concurrent writes with ``region`` are safe as long as they modify distinct -chunks in the underlying Zarr arrays (or use an appropriate ``lock``). -As a safety check to make it harder to inadvertently override existing values, -if you set ``region`` then *all* variables included in a Dataset must have -dimensions included in ``region``. Other variables (typically coordinates) -need to be explicitly dropped and/or written in a separate calls to ``to_zarr`` -with ``mode='a'``. +.. _rasterio: https://rasterio.readthedocs.io/en/latest/ +.. _rioxarray: https://corteva.github.io/rioxarray/stable/ +.. _test files: https://github.com/rasterio/rasterio/blob/master/tests/data/RGB.byte.tif +.. _pyproj: https://github.com/pyproj4/pyproj .. _io.cfgrib: diff --git a/doc/whats-new.rst b/doc/whats-new.rst index 37abf931eb4..aa48bd619e8 100644 --- a/doc/whats-new.rst +++ b/doc/whats-new.rst @@ -5133,7 +5133,7 @@ Enhancements .. ipython:: python ds = xray.Dataset(coords={"x": range(100), "y": range(100)}) - ds["distance"] = np.sqrt(ds.x ** 2 + ds.y ** 2) + ds["distance"] = np.sqrt(ds.x**2 + ds.y**2) @savefig where_example.png width=4in height=4in ds.distance.where(ds.distance < 100).plot() @@ -5341,7 +5341,7 @@ Enhancements .. ipython:: python ds = xray.Dataset({"y": ("x", [1, 2, 3])}) - ds.assign(z=lambda ds: ds.y ** 2) + ds.assign(z=lambda ds: ds.y**2) ds.assign_coords(z=("x", ["a", "b", "c"])) These methods return a new Dataset (or DataArray) with updated data or diff --git a/xarray/core/computation.py b/xarray/core/computation.py index 88eefbdc441..c11bd1a78a4 100644 --- a/xarray/core/computation.py +++ b/xarray/core/computation.py @@ -944,7 +944,7 @@ def apply_ufunc( Calculate the vector magnitude of two arguments: >>> def magnitude(a, b): - ... func = lambda x, y: np.sqrt(x ** 2 + y ** 2) + ... func = lambda x, y: np.sqrt(x**2 + y**2) ... return xr.apply_ufunc(func, a, b) ...
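
Editorial note on the relocated Zarr section: it tells readers to convert variable-length strings to object arrays with ``dtype=object`` and to add or overwrite whole variables with ``mode='a'``, but neither point is accompanied by code. The following sketch (not part of the patch itself) illustrates both; the store path ``string-demo.zarr`` is a hypothetical local directory, and it assumes xarray with the zarr backend installed.

.. code:: python

    import numpy as np
    import xarray as xr

    # Variable-length strings: store them as object arrays so the zarr
    # backend can encode them with a variable-length string codec.
    ds = xr.Dataset(
        {"labels": ("x", np.array(["short", "a longer string"], dtype=object))},
        coords={"x": [0, 1]},
    )
    ds.to_zarr("string-demo.zarr", mode="w")

    # Add (or overwrite) an entire variable in the existing store with mode="a".
    ds_new = xr.Dataset(
        {"values": ("x", np.array([1.5, 2.5]))},
        coords={"x": [0, 1]},
    )
    ds_new.to_zarr("string-demo.zarr", mode="a")

    # Read the combined store back lazily.
    xr.open_zarr("string-demo.zarr")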
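The consolidated-metadata discussion in the moved section describes the ``consolidated=True`` / ``consolidated=False`` options only in prose. Below is a minimal sketch of how those keywords are passed; the path ``consolidated-demo.zarr`` is hypothetical, and the defaults assumed are those described for xarray 0.18 and later.

.. code:: python

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"foo": ("x", np.arange(10))})

    # Write consolidated metadata explicitly (the default in recent versions).
    ds.to_zarr("consolidated-demo.zarr", mode="w", consolidated=True)

    # Require consolidated metadata on read: raise instead of silently
    # falling back to the slower non-consolidated path.
    ds_fast = xr.open_zarr("consolidated-demo.zarr", consolidated=True)

    # Or skip the consolidated-metadata lookup entirely.
    ds_plain = xr.open_zarr("consolidated-demo.zarr", consolidated=False)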
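The cloud-storage subsection mentions ``fsspec.get_mapper()`` as a shortcut for building the ``MutableMapping`` passed to ``to_zarr`` / ``open_zarr``, without showing it in use. The sketch below demonstrates the same pattern against fsspec's in-memory filesystem so it runs without cloud credentials; ``memory://demo.zarr`` is an illustrative URL, and a real deployment would substitute a ``gcs://`` or ``s3://`` URL plus the appropriate ``storage_options``. It assumes fsspec and the zarr backend are installed.

.. code:: python

    import fsspec
    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"foo": ("x", np.arange(5))})

    # Build a MutableMapping from an fsspec URL; swap in "gcs://bucket/path.zarr"
    # (plus credentials) for real object storage.
    store = fsspec.get_mapper("memory://demo.zarr")

    ds.to_zarr(store=store, mode="w")
    ds_back = xr.open_zarr(store)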