Saving an extracted netCDF file is far too time consuming #10061

Open
3 of 5 tasks
susiebrn opened this issue Feb 19, 2025 · 5 comments
Labels
plan to close May be closeable, needs more eyeballs

Comments

@susiebrn

What happened?

[Image 1: the .sel version]

time extracting area: 10.947124004364014
time saving: 487.40945315361023

[Image 2: the .where version]

time extracting area: 514.2160515785217
time saving: 0.14027118682861328

I am trying to subset my netCDF files because they are too big. From the 50 variables in one file, I extract the 7 most important ones, and saving the result takes an unexpectedly long time. What is odd is that if I use .where (2nd image), extracting the subset is slow but saving is quick. If I use .sel (1st image), it is the opposite. I don't know what to do.

My file has the dimensions N_PROF: 54212, N_CALIB: 1, N_PARAM: 5, N_LEVELS: 400, N_HISTORY: 0, but has no coordinates.

Thank you!

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import time

import xarray as xr

# Bounding box of the region to keep (degrees).
lat_min = 20
lat_max = 60
lon_min = -80
lon_max = -30

start = time.time()
ds = xr.open_dataset('./data/'+'EN.4.2.2.f.profiles.g10.201604.nc')
# Keep only the 7 variables of interest.
ds = ds[['LATITUDE', 'LONGITUDE', 'TEMP', 'PSAL_CORRECTED', 'DEPH_CORRECTED','JULD_LOCATION','QC_FLAGS_PROFILES']]
# .where version (active); the two commented-out lines are the .sel version.
#profiles = ds['N_PROF'].where((ds.LATITUDE >= lat_min) & (ds.LATITUDE <= lat_max) & (ds.LONGITUDE >= lon_min) & (ds.LONGITUDE <= lon_max), drop=True)
ds = ds.where((ds.LATITUDE >= lat_min) & (ds.LATITUDE <= lat_max) & (ds.LONGITUDE >= lon_min) & (ds.LONGITUDE <= lon_max), drop=True)
#ds = ds.sel(N_PROF=profiles.astype(int))
end = time.time()
print(f' time extracting area: {end-start}')

# Time the write separately from the extraction.
start = time.time()
ds.load().to_netcdf('./data/test1.nc')
end = time.time()
print(f' time saving: {end-start}')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-27-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.2.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
Nio: None
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.10.0
bottleneck: 1.4.0
dask: 2024.9.0
distributed: 2024.9.0
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: 0.24.3
sparse: 0.15.4
flox: None
numpy_groupies: None
setuptools: 73.0.1
pip: 24.2
conda: 24.7.1
pytest: 8.3.3
mypy: None
IPython: 8.27.0
sphinx: 7.4.7

susiebrn added the "bug" and "needs triage" labels Feb 19, 2025

welcome bot commented Feb 19, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas
Member

The difference in timing between your two cases is due to xarray lazily evaluating operations. It's waiting as long as possible before loading any data, but in general .where is an unusual method in that it requires loading all the data. .sel does not require loading the data until you later try to save.

The same work is being done in both cases though, which is why both examples take a similar amount of total time to execute.

From your example it's not clear if there is some other reason why it takes 50 seconds to evaluate. How big is your data file?
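
To make the two selection styles concrete, here is a minimal sketch on a small synthetic dataset. The sizes and values are made up, and only the variable and dimension names are borrowed from the report. It builds one boolean mask over N_PROF and applies it both with .where(..., drop=True) and with positional .isel. On an in-memory dataset like this one nothing is lazy, so it only illustrates that both routes pick the same profiles. The deferred-read behaviour described above applies to datasets opened from a file.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the profile file: sizes and values are made up.
rng = np.random.default_rng(0)
n_prof, n_levels = 1000, 40
ds = xr.Dataset(
    {
        "LATITUDE": ("N_PROF", rng.uniform(-90, 90, n_prof)),
        "LONGITUDE": ("N_PROF", rng.uniform(-180, 180, n_prof)),
        "TEMP": (("N_PROF", "N_LEVELS"), rng.normal(10, 5, (n_prof, n_levels))),
        "QC_FLAGS_PROFILES": ("N_PROF", rng.integers(0, 4, n_prof)),
    }
)

# Boolean mask over N_PROF built from the two 1-D position variables.
in_box = (
    (ds.LATITUDE >= 20) & (ds.LATITUDE <= 60)
    & (ds.LONGITUDE >= -80) & (ds.LONGITUDE <= -30)
)

# Route 1: .where masks every variable, then drops the profiles outside the box.
by_where = ds.where(in_box, drop=True)

# Route 2: turn the mask into integer positions and index with .isel.
by_isel = ds.isel(N_PROF=np.flatnonzero(in_box.values))

print(by_where.sizes["N_PROF"], by_isel.sizes["N_PROF"])
print(by_where.QC_FLAGS_PROFILES.dtype, by_isel.QC_FLAGS_PROFILES.dtype)
```

One visible difference: .where has to be able to fill with NaN, so the integer flag variable comes back as floating point, while .isel keeps its original dtype.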

TomNicholas added the "usage question" label and removed the "bug" and "needs triage" labels Feb 19, 2025
susiebrn reopened this Feb 19, 2025
@susiebrn
Author

The data file is about 80 MB, and the same is true of all my data files.

max-sixty added the "plan to close" label and removed the "usage question" label Feb 19, 2025
@max-sixty
Collaborator

Unfortunately I'm not sure there's much we can do from here without much more info. If you want to debug why it takes longer than you expect, you can try varying the problem — change the amount of data that is queried from that file; try making a smaller sample, etc. (the attribution of load and save was a good step...)
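
One way to vary the problem along these lines is to time the same extraction on the first n profiles for a few values of n. The sketch below is only an illustration: time_step and sweep_profile_counts are hypothetical helper names, and the dataset is synthetic (made-up sizes and values, names borrowed from the report). With the real file you would pass in the result of xr.open_dataset instead, and the timed step could also include the save.

```python
import time

import numpy as np
import xarray as xr


def time_step(step):
    """Return how many seconds the zero-argument callable `step` takes."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def sweep_profile_counts(ds, counts, lat_range=(20, 60), lon_range=(-80, -30)):
    """Time the .where extraction on the first n profiles, for each n in counts."""
    timings = {}
    for n in counts:
        head = ds.isel(N_PROF=slice(0, n))
        in_box = (
            (head.LATITUDE >= lat_range[0]) & (head.LATITUDE <= lat_range[1])
            & (head.LONGITUDE >= lon_range[0]) & (head.LONGITUDE <= lon_range[1])
        )
        timings[n] = time_step(lambda: head.where(in_box, drop=True))
    return timings


# Synthetic stand-in data; replace with the dataset opened from the real file.
rng = np.random.default_rng(0)
n_prof, n_levels = 1000, 40
ds = xr.Dataset(
    {
        "LATITUDE": ("N_PROF", rng.uniform(0, 80, n_prof)),
        "LONGITUDE": ("N_PROF", rng.uniform(-100, 0, n_prof)),
        "TEMP": (("N_PROF", "N_LEVELS"), rng.normal(10, 5, (n_prof, n_levels))),
    }
)

timings = sweep_profile_counts(ds, [250, 500, 1000])
for n, seconds in timings.items():
    print(f"{n} profiles: {seconds:.4f} s")
```

If the time grows much faster than the number of profiles, or one variable dominates when the sweep is repeated per variable, that narrows down where to look.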

@mathause
Collaborator

You could also try calling ds.load() on the Dataset - sometimes that helps.
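
A minimal sketch of that suggestion, assuming the kept variables fit in memory: load them right after choosing them, so the later selection works on in-memory arrays. extract_box and KEEP are hypothetical names, and the dataset is synthetic with made-up values. Whether this actually speeds up the real file is not something this thread establishes.

```python
import numpy as np
import xarray as xr

# Variables to keep; a shortened, illustrative list.
KEEP = ["LATITUDE", "LONGITUDE", "TEMP"]


def extract_box(ds, lat_min, lat_max, lon_min, lon_max):
    """Load the kept variables into memory, then cut out the lat/lon box."""
    loaded = ds[KEEP].load()  # read the data once, up front
    in_box = (
        (loaded.LATITUDE >= lat_min) & (loaded.LATITUDE <= lat_max)
        & (loaded.LONGITUDE >= lon_min) & (loaded.LONGITUDE <= lon_max)
    )
    return loaded.where(in_box, drop=True)


# Synthetic stand-in data, including one variable that is not kept.
rng = np.random.default_rng(0)
n_prof, n_levels = 1000, 40
ds = xr.Dataset(
    {
        "LATITUDE": ("N_PROF", rng.uniform(-90, 90, n_prof)),
        "LONGITUDE": ("N_PROF", rng.uniform(-180, 180, n_prof)),
        "TEMP": (("N_PROF", "N_LEVELS"), rng.normal(10, 5, (n_prof, n_levels))),
        "UNUSED": ("N_PROF", rng.normal(0, 1, n_prof)),
    }
)

subset = extract_box(ds, 20, 60, -80, -30)
print(subset.sizes["N_PROF"], list(subset.data_vars))
```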
