[BUG] polars[gpu] is unable to scan_parquet() a file that is able to be read by cudf.pandas and polars[cpu]. #18063

Open
will-hill opened this issue Feb 21, 2025 · 2 comments
Labels
bug Something isn't working cudf.polars Issues specific to cudf.polars

Comments

@will-hill

https://gist.github.com/will-hill/24200d675b10537027bda96772dce277

Bug Description
polars[gpu] is unable to scan_parquet() a file that can be read by both cudf.pandas and polars[cpu].

Steps/Code to reproduce bug
Here is a gist with a Colab link that reproduces the error:
https://gist.github.com/will-hill/24200d675b10537027bda96772dce277

Expected behavior
I would expect the parquet file to be scanned, given that it can be read by polars without RAPIDS and by cudf.pandas. Failing that, a more descriptive error message would help.

Environment overview (please complete the following information)
I ran this locally on an RTX Ada 6000. Details are in the gist.

@will-hill will-hill added the bug Something isn't working label Feb 21, 2025
@davidwendt davidwendt changed the title [BUG] [BUG] polars[gpu] is unable to scan_parquet() a file that is able to be read by cudf.pandas and polars[cpu]. Feb 21, 2025
@davidwendt davidwendt added the cudf.polars Issues specific to cudf.polars label Feb 21, 2025
@wence-
Contributor

wence- commented Feb 21, 2025

This appears to be an issue with the schema polars reads for the file.

# get the file 
# wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2024-01.parquet
import polars as pl

df = pl.read_parquet("fhvhv_tripdata_2024-01.parquet", columns=["request_datetime"], n_rows=10)
df.schema # datetime[ns]

But pyarrow reports a different resolution for the same column:

import pyarrow.parquet as pq
table = pq.read_table("fhvhv_tripdata_2024-01.parquet", columns=["request_datetime"])
table.schema # timestamp[us]

In cudf-polars we check that the schema of the data we read matches the schema polars expects. pylibcudf reads the datetime columns correctly with microsecond (us) resolution, whereas polars expects nanosecond (ns) resolution, so the check fails.
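
To make the failing check concrete, here is a minimal sketch of that kind of comparison using plain dictionaries. It is not the actual cudf-polars implementation: the function name `schema_mismatches`, the dtype strings, and the message format are all assumptions chosen for illustration.

```python
def schema_mismatches(expected, actual):
    """Return {column: (expected_dtype, actual_dtype)} for columns that differ.

    Hypothetical helper; both arguments map column names to dtype strings.
    """
    return {
        name: (dtype, actual.get(name))
        for name, dtype in expected.items()
        if actual.get(name) != dtype
    }


# Illustrative dtypes based on the resolutions reported in this thread.
expected = {"request_datetime": "datetime[ns]"}  # what polars believes
actual = {"request_datetime": "datetime[us]"}    # what the reader produced

mismatches = schema_mismatches(expected, actual)
if mismatches:
    details = ", ".join(
        f"{name}: expected {want}, got {got}"
        for name, (want, got) in mismatches.items()
    )
    message = f"schema mismatch after read: {details}"
    print(message)
```

Naming the offending column and both dtypes in the message is one way to give the more descriptive error asked for in the report.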

If I rewrite the table with pyarrow:

import pyarrow.parquet as pq

table = pq.read_table("fhvhv_tripdata_2024-01.parquet")
pq.write_table(table, "rewritten.parquet")

Then the rewritten file runs fine with cudf-polars.

So I think this is a polars bug.

@wence-
Contributor

wence- commented Feb 21, 2025

pola-rs/polars#21392

Projects
Status: Todo
Development

No branches or pull requests

3 participants