[BUG] polars[gpu] is unable to scan_parquet() a file that is able to be read by cudf.pandas and polars[cpu]. #18063

will-hill · 2025-02-21T17:46:57Z

Bug Description
polars[gpu] is unable to scan_parquet() a file that is able to be read by cudf.pandas and polars[cpu].

Steps/Code to reproduce bug
Here is a gist with Colab link to reproduce the error:
https://gist.github.com/will-hill/24200d675b10537027bda96772dce277

Expected behavior
I would expect the Parquet file to be scanned successfully, given that it can be read both by polars without RAPIDS and by cudf.pandas. Failing that, a more descriptive error message would help.

Environment overview
I ran this locally on an RTX Ada 6000. Details are in the gist.

wence- · 2025-02-21T17:53:08Z

This appears to be an issue with the schema polars reads for the file.

# Get the file first:
# wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2024-01.parquet
import polars as pl

# Read only the affected column, limited to the first 10 rows
df = pl.read_parquet("fhvhv_tripdata_2024-01.parquet", columns=["request_datetime"], n_rows=10)
df.schema  # datetime[ns]

But pyarrow reports microsecond resolution for the same column:

import pyarrow.parquet as pq
table = pq.read_table("fhvhv_tripdata_2024-01.parquet", columns=["request_datetime"])
table.schema # timestamp[us]

In cudf-polars we check that the data we read from the file has the same schema that polars thinks it has. Pylibcudf correctly reads the datetime columns with microsecond (us) resolution, while polars reports nanoseconds (ns), so this check fails.
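To make the failing check concrete, here is a minimal sketch of this kind of schema comparison. It is not the cudf-polars implementation: `schema_mismatches` is a hypothetical helper, and the schemas are reduced to plain dicts mapping column name to time unit, filled in by hand from the two outputs above.

```python
from __future__ import annotations


def schema_mismatches(
    expected: dict[str, str], actual: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Return {column: (expected, actual)} for every column whose entries differ.

    Hypothetical helper for illustration only; a column missing on one
    side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(expected.keys() | actual.keys()):
        exp, act = expected.get(name), actual.get(name)
        if exp != act:
            mismatches[name] = (exp, act)
    return mismatches


# Time units copied by hand from the outputs above (assumed inputs):
polars_view = {"request_datetime": "ns"}  # what polars reports
file_view = {"request_datetime": "us"}    # what pyarrow / pylibcudf read

print(schema_mismatches(polars_view, file_view))
```

An empty result means the two views agree. For this file the `request_datetime` entry differs, which is the disagreement that trips the check.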

If I rewrite the table with pyarrow:

import pyarrow.parquet as pq

table = pq.read_table("fhvhv_tripdata_2024-01.parquet")
pq.write_table(table, "rewritten.parquet")

Then I can run things fine with cudf-polars.

So I think this is a polars bug.

wence- · 2025-02-21T18:08:47Z

pola-rs/polars#21392

will-hill added the bug Something isn't working label Feb 21, 2025

davidwendt changed the title ~~[BUG]~~ [BUG] polars[gpu] is unable to scan_parquet() a file that is able to be read by cudf.pandas and polars[cpu]. Feb 21, 2025

davidwendt added the cudf.polars Issues specific to cudf.polars label Feb 21, 2025

github-project-automation bot added this to cuDF Python Feb 21, 2025

github-project-automation bot moved this to Todo in cuDF Python Feb 21, 2025
