Performance issues with small chunks #601
@cofinoa You indicated in one of the related postings that in netCDF-3 making larger chunks for the time coordinate means that it can't be declared "unlimited". In netCDF-4 is that also true, or can it be declared "unlimited" and still be given bigger chunks?
@mauzey1 is there a preset chunking value set in the code somewhere? I recall going over this in some detail many years ago, but a quick search of the repo for "chunk" doesn't appear to show any defaults, at least in my viewing.
@taylor13 to mitigate the problem in netCDF-3 the only solution is not to make the time dimension unlimited. In netCDF-4/HDF5 you can select different chunk sizes: a bigger chunk size for the time coordinate variable and a chunk size of 1 for the principal variable.
@cofinoa - In netCDF-4/HDF5, if you want a chunk size larger than 1 for an unlimited time dimension, do you have to pass multiple time slices (equal to or more than the chunk size) to be written in a single call to the netCDF library? If so, then I would say we shouldn't change the default from 1, because many people write their files one time slice at a time (i.e., they write a single time coordinate value and a corresponding data field that applies to that single time slice).
@cofinoa we tried to optimize the deflation, shuffling and chunking settings for the best performance vs file sizes. It is a difficult balancing act, as the only way to squeeze the best performance out of the output formats is to know both 1) the data that you're writing and 2) the use of this data once written, before the file is created. We focused more on deflation (to minimize file sizes) than on chunking (reading written data). Some of the history about this can be found in #135 (comment), #164 and #403. Long story short, we opted to prioritize file size first. If you have a better suggestion as to how these should be set, e.g. by deploying an algorithm to assess the data being written, this would be a useful update. I note there are some comments about the version of the netCDF library playing a role in slow read speeds. This ref was also an interesting find: https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_4_chunking_performance_results
@taylor13 with respect to your question above (whether multiple time slices have to be passed in a single call to get a larger chunk size): No. The logical size of the unlimited dimension will increase independently of the chunk size.
@durack1, about your point on balancing these settings: I agree, and I'm not proposing to modify that. To support my point, I have defined a netCDF-4/HDF5 file with just one unlimited dimension and 2 variables with 2 different chunk sizes (sketched below):
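An illustrative CDL along these lines (the chunk sizes here, 512 for time and 1 for par, are assumed for the sketch and may differ from the original example; _Storage/_ChunkSizes are the special attributes ncdump -s and ncgen use for netCDF-4 chunking):
netcdf example {
dimensions:
    time = UNLIMITED ;
variables:
    double time(time) ;
        time:_Storage = "chunked" ;
        time:_ChunkSizes = 512 ;   // assumed: large chunk for the coordinate variable
    double par(time) ;
        par:_Storage = "chunked" ;
        par:_ChunkSizes = 1 ;      // assumed: chunk size 1 for the principal variable
}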
You can generate the actual netCDF file from the above CDL: $ ncgen -7 example.cdl
Then compile this simplistic (no error control) program, which adds one value to each variable along the unlimited dimension every time it is executed:
#include <netcdf.h>
int main() {
    int ncid, time_dimid, time_varid, par_varid;
    size_t time_len, pos[1];
    double value;
    nc_open("example.nc", NC_WRITE, &ncid);
    /* current length of the unlimited dimension = next write position */
    nc_inq_dimid(ncid, "time", &time_dimid);
    nc_inq_dimlen(ncid, time_dimid, &time_len);
    pos[0] = time_len;
    /* append one value to the time coordinate ... */
    value = (double) time_len * 2;
    nc_inq_varid(ncid, "time", &time_varid);
    nc_put_var1_double(ncid, time_varid, pos, &value);
    /* ... and one value to the principal variable */
    value = value * 2;
    nc_inq_varid(ncid, "par", &par_varid);
    nc_put_var1_double(ncid, par_varid, pos, &value);
    nc_close(ncid);
}
If you execute it: $ ./addOneValue
each run appends one value to each variable, so the content of the existing netCDF file grows along the unlimited dimension.
With respect to the Unidata/netcdf-c#489 issue: it mentions performance issues with metadata, but that relates to the number of netCDF entities themselves (variables, attributes, dimensions) and the library's strategy for caching them when a netCDF file is opened. Hope this helps. Let me know if you need more info.
@cofinoa in the #601 (comment) above there was no obvious next step regarding chunking coordinate variables. Have I missed something? As noted in #164, this is currently set at 1; what is your proposal (and what is the performance improvement with it)?
thank you @cofinoa for providing all this good background and information and for bringing to our attention the performance issue in reading time coordinates. If we can write individual time slices and their associated time-coordinate values one at a time to a file (i.e., in separate calls to the nc "write" function), then I agree that a vector of coordinate values should probably never be "chunked", i.e., the entire vector of coordinate values should be written as a single chunk. I wouldn't think changing the default for chunking of coordinates would be that difficult, and it would apply to the "unlimited" time coordinate as well as other "limited" coordinates. It appears no changes would be needed for the chunking of the data array itself. Please let us know if this would be satisfactory.
@taylor13, yes, the data array (principal variable) is not affected.
@durack1 my proposal is to define a chunk size for the coordinate variable that balances size issues (#164) and performance. The performance issue is explained in the issue description, with an excerpt from the HDF5 documentation on the cost of small chunk sizes. Currently, the netcdf-c library defines a default chunking scheme (see [1]). For the time coordinate variable, the chunk size can instead be 512, for example (a minimal API sketch is shown below). This will reduce chunk lookups and I/O by a factor of (at most) 512 (see [2]).
[1] https://github.com/Unidata/netcdf-c/blob/15e1bbbd43e5deede72c34ad0674083c7805b6bd/libhdf5/hdf5var.c#L191-L227
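For illustration, a minimal sketch (not CMOR's actual code; the helper function name and the chunk size of 512 are assumptions) of how a larger chunk size for an unlimited time coordinate can be requested through the netCDF-C API:
#include <netcdf.h>
/* Sketch, no error checking: define an unlimited "time" dimension whose
   coordinate variable uses a chunk size of 512 instead of 1. */
int define_time_with_larger_chunks(int ncid, int *time_varid) {
    int time_dimid;
    size_t chunksizes[1] = {512};   /* assumed value, per the proposal above */
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dimid);
    nc_def_var(ncid, "time", NC_DOUBLE, 1, &time_dimid, time_varid);
    nc_def_var_chunking(ncid, *time_varid, NC_CHUNKED, chunksizes);   /* explicit chunk size */
    return 0;
}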
@cofinoa this issue has been stale for ~4 years, so will close. If there are additional tweaks that make sense, please comment and reopen |
Perhaps the suggested changes should be implemented prior to closing?
@cofinoa to be honest, your suggestions are probably better directed at updating defaults for the netcdf-c library, as CMOR is a downstream user of it. If there are some obvious defaults that could be updated in CMOR to optimize file sizes and file/variable access, then this would be useful to incorporate. Reading the above, it is not obvious to me what is required to fully address the issue - if you wanted to submit a PR for consideration, this would be the fastest path to a solution. As I noted, feel free to reopen if you want to submit a PR.
@durack1 I have opened PR #733, where I guess the fix for CMOR should be applied. The issue is not with the netCDF-C library; the issue is with CMOR itself, where the assumption of having unlimited dimensions enforces chunking A) with size 1 on the unlimited dimension and B) with the same chunk size for all netCDF variables that share the unlimited dimension in the same file. This assumption is right for the netCDF-3 data and storage model, but not any more for the netCDF-4 data and storage model.
@taylor13 and @durack1 I would also like to suggest introducing a recommendation on this issue for DATA producers when they start to encode data for the next CMIP7, but I don't know what the appropriate forum is: https://pcmdi.llnl.gov/CMIP6/Guide/modelers.html#7-archivingpublishing-output
@durack1 I can't re-open this issue; can you re-open it for me?
#733 merges the changes, but we need to add a test to ensure that we're a) not breaking anything, and b) not causing performance issues for "standard" datasets - targeting 3.9
We are facing some performance issues accessing metadata, i.e. the values of the time variable, because of the number of I/O read operations required to access all the chunks.
In particular, the time coordinate variable is created with chunk size 1, requiring one chunk per time value. Therefore, if the netCDF-4 file has a lot of time steps (for 6-hr or 3-hr data, > 10k), the netCDF-4 library has to look up and read each chunk individually (i.e. 8 bytes per chunk).
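As a minimal illustration (not CMOR code; the file name and the absence of error handling are assumptions of the sketch), reading the whole time coordinate in one call still forces the library to locate and read one tiny chunk per time step when the chunk size is 1:
#include <stdlib.h>
#include <netcdf.h>
int main() {
    int ncid, time_dimid, time_varid;
    size_t time_len;
    double *time_vals;
    nc_open("data.nc", NC_NOWRITE, &ncid);       /* hypothetical file name */
    nc_inq_dimid(ncid, "time", &time_dimid);
    nc_inq_dimlen(ncid, time_dimid, &time_len);  /* e.g. > 10k steps for 3-hr/6-hr data */
    time_vals = malloc(time_len * sizeof(double));
    nc_inq_varid(ncid, "time", &time_varid);
    /* with chunk size 1, this single call triggers one chunk lookup and read per value */
    nc_get_var_double(ncid, time_varid, time_vals);
    free(time_vals);
    nc_close(ncid);
}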
A better explanation of this pitfall can be found in [1].
This relates to: #99, #100, #164
[1] https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/