Wrong result when operating on parquet #2044
Comments
@alitrack which version?
@jiangzhx 0.5.1
I tested your parquet file yellow_taxi_sample.parquet.zip with the Rust DataFusion master version; query response: is the parquet file right?
Also tested with Python 0.5.1.
Please try pandas, pyarrow, or vaex; they all give the same (correct) result:

```python
import pandas as pd
#pd.read_parquet("yellow_taxi_sample.parquet", engine='pyarrow')
pd.read_parquet("yellow_taxi_sample.parquet", engine='fastparquet')
```
Yes, but yellow_taxi_2009_2015_f32.parquet is about 28 GB, so I want to use register_parquet rather than reading it with pandas or vaex first.
@alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata - it contains a schema that treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond) instead of the Timestamp(Microsecond) in the actual file schema. I suppose removing this tag from the file metadata should help.
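For reference, a small pyarrow sketch of how to see the mismatch described above; the file name comes from this issue, and the column index is a guess that may need adjusting:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("yellow_taxi_sample.parquet").metadata

# File-level key-value metadata; a b'ARROW:schema' key means an Arrow schema
# (in this case carrying nanosecond timestamps) is embedded next to the Parquet schema.
print(list(meta.metadata.keys()))

# The Parquet schema for one of the datetime columns (column index is a guess).
col = meta.schema.column(1)
print(col.name, col.physical_type, col.logical_type)
```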
I did more research and read the parquet metadata with parquet = { version = "9.0.0" }. @korowa was right: the column pickup_datetime's datatype was `datetime64[ns]`.
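The same thing can be checked from Python; a quick sketch with pyarrow (the column name is taken from the comments above and may differ in the actual file):

```python
import pyarrow.parquet as pq

# read_schema reconstructs the Arrow schema, preferring the embedded
# ARROW:schema metadata when it is present.
schema = pq.read_schema("yellow_taxi_sample.parquet")
print(schema.field("pickup_datetime").type)  # expected to show timestamp[ns]
```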
Confused... the print_row_with_parquet test case gets the right result, while print_row_with_datafusion gets the wrong result:

```rust
use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::convert::TryFrom;
use std::fs::File;
use std::path::Path;

// Reads the file with the parquet crate's row iterator: prints the right result.
#[tokio::test]
async fn print_row_with_parquet() -> Result<()> {
    let path = Path::new("yellow_taxi_sample.parquet");
    let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();
    for row in row_iter {
        let s = row.to_string();
        println!("{}", s);
    }
    Ok(())
}

// Reads the same file through DataFusion SQL: prints the wrong result.
#[tokio::test]
async fn print_row_with_datafusion() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
        .await?;
    let df = ctx.sql("SELECT * from taxi_sample").await?;
    df.show().await?;
    Ok(())
}
```
This is likely related to apache/arrow-rs#1459 |
I think this should have been resolved by apache/arrow-rs#1682, could you let me know if the issue still persists? |
@tustvold I tested the latest version of roapi and it's fixed; only the datafusion Python binding still has the issue. Thanks!
Describe the bug
When using register_parquet, datetime columns come back wrong, but register_csv has no such problem. Reading the file into a dataframe with pandas and using register_record_batches is also OK (a sketch of both paths follows below).
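A minimal sketch of the two paths, assuming the datafusion 0.5.x Python API (ExecutionContext, register_parquet, register_record_batches) and the file name used in this issue:

```python
import pandas as pd
import pyarrow as pa
from datafusion import ExecutionContext

ctx = ExecutionContext()

# Path 1: register the parquet file directly -- datetime columns come back wrong.
ctx.register_parquet("taxi_parquet", "yellow_taxi_sample.parquet")

# Path 2: read with pandas first and register the record batches -- result looks OK.
df = pd.read_parquet("yellow_taxi_sample.parquet", engine="fastparquet")
batches = pa.Table.from_pandas(df).to_batches()
ctx.register_record_batches("taxi_batches", [batches])

for name in ("taxi_parquet", "taxi_batches"):
    print(ctx.sql(f"SELECT * FROM {name} LIMIT 5").collect())
```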
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expected result:
but got:
Additional context
The sample data is part of the "Year 2009-2015 - 1 billion rows - 107GB" dataset.
yellow_taxi_sample.parquet.zip