-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support block gzip for streams #9175
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tshauck -- this code looks great to me.
I do think this PR needs a test otherwise we could potentially end up breaking this feature by accident
here some the existing coverage
Which I think is referencing this file https://github.com/apache/arrow-testing/blob/master/data/csv/aggregate_test_100.csv.gz
Perhaps you can check in a small .bgzip'd file in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/data and then add a small test to https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/csv_files.slt ?
Thanks @alamb -- certainly down to add a test. Before I go the route you mentioned, I wanted to ask if you knew that would follow the stream decoding path? Looking at https://github.com/apache/arrow-datafusion/blob/ae882356171513c9d6c22b3bd966898fb4e8cac0/datafusion/core/src/datasource/physical_plan/csv.rs#L393-L441 in the CsvOpener, I think local files would result in If not I'll code that up, otherwise perhaps adding a specific unittest that calls |
I think a specific unit test for convert stream would be great and easiest (and you can probably just generate some data directly) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @tshauck
Thanks! |
Which issue does this PR close?
Closes #9156
Rationale for this change
Updates the stream gzip decoding to check for multiple members within a file.
What changes are included in this PR?
Doesn't update the
Read
decoder because it already supports multiple members in a file.Are these changes tested?
The gzip decoding still seems to work, and I checked a manually bgzipped file, but I'm not sure if this code path is tested already.
Are there any user-facing changes?
No