
[BUG] Parquet chunked reader can throw an 'unexpected short subpass' exception under certain conditions. #18043

Open
nvdbaranec opened this issue Feb 19, 2025 · 0 comments
Labels: bug (Something isn't working) · Spark (Functionality that helps Spark RAPIDS)

@nvdbaranec (Contributor) commented:

This was discovered as a byproduct of changes to nvcomp temporary memory usage for decompression. The change caused us to produce a slightly different set of chunks, exposing the underlying bug in the chunked reader itself (nvcomp was not doing anything wrong). Spark RAPIDS customers have experienced this as well, under difficult-to-reproduce conditions, so having a clean repro case here is valuable.

To reproduce, build cudf using nvcomp 4.2.0.11 (#18042) and run the tests. Two of the list tests, ParquetChunkedReaderInputLimitConstrainedTest.MixedColumns and ParquetChunkedReaderInputLimitTest.List, will throw the exception.

@nvdbaranec nvdbaranec added bug Something isn't working Spark Functionality that helps Spark RAPIDS labels Feb 19, 2025
raydouglass pushed a commit that referenced this issue Feb 20, 2025
…18019)

Fixes  #18043

An incorrect computation in the subpass generation code would conclude,
under certain circumstances, that there weren't enough rows to decode
for list columns.

This PR fixes the issue and includes some variable-naming cleanup in the
surrounding code. Ultimately, the true source of the bug was poorly
named variables causing them to be used incorrectly.

Edit: I've disabled various checks in the chunked reader tests that
expect specific chunk counts to be returned from chunking operations.
Changes to decompression temporary memory usage can make those counts
unreliable. We will need a smarter solution down the road.
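The fragility of exact chunk-count assertions can be illustrated with a small sketch (plain Python; all names here are hypothetical and not cudf's actual API). A reader that packs rows into chunks under an output-size budget produces a different number of chunks when per-chunk decompression scratch overhead changes, even though the total number of rows decoded is invariant; the row total is the property worth asserting.

```python
def read_chunked(num_rows, row_bytes, output_limit, scratch_overhead):
    """Toy model of a chunked reader: pack rows into chunks whose
    estimated size stays under output_limit.

    scratch_overhead models per-chunk temporary memory charged by the
    decompressor (e.g. nvcomp); changing it shifts chunk boundaries.
    """
    budget = output_limit - scratch_overhead
    rows_per_chunk = max(1, budget // row_bytes)
    chunks = []
    remaining = num_rows
    while remaining > 0:
        n = min(rows_per_chunk, remaining)
        chunks.append(n)
        remaining -= n
    return chunks

# Same data, two different temporary-memory overheads.
a = read_chunked(num_rows=1000, row_bytes=8, output_limit=256, scratch_overhead=0)
b = read_chunked(num_rows=1000, row_bytes=8, output_limit=256, scratch_overhead=64)

# The chunk counts differ...
assert len(a) != len(b)
# ...but the total number of rows decoded must not.
assert sum(a) == sum(b) == 1000
```

A test that pins `len(a)` to a specific value breaks whenever the decompressor's memory behavior changes, which is the failure mode described above.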