ctx.read_parquet and ctx.register_parquet don't load schema metadata #9081
Comments
I wonder whether the same problem still happens if you also set this configuration setting on the session context?
https://arrow.apache.org/datafusion/user-guide/configs.html#configuration-settings
@alamb Thanks for the suggestions. I added the above configuration to my script
use datafusion::prelude::*;

#[tokio::main]
async fn main() {
    // Disable skip_metadata at the session level as well as per read.
    let mut session_config = SessionConfig::new();
    session_config = session_config.set_bool("datafusion.execution.parquet.skip_metadata", false);
    let ctx = SessionContext::new_with_config(session_config);
    println!("Session config {:?}", ctx.copied_config());
    let table = ctx.read_parquet("test.parquet", ParquetReadOptions::default().skip_metadata(false)).await.unwrap();
    println!("Schema {:?}", table.schema());
    println!("Metadata {:?}", table.schema().metadata());
    ctx.register_parquet("t", "test.parquet", ParquetReadOptions::default().skip_metadata(false)).await.unwrap();
    println!("Schema {:?}", ctx.table("t").await.unwrap().schema());
    println!("Metadata {:?}", ctx.table("t").await.unwrap().schema().metadata());
}
and checked that
Hey! Would this make a good first issue? If so, I would like to give it a shot.
Thanks @brayanjuls -- I think it would be a great way to help. Basically the initial task is to debug what is going on and then ideally fix it. Since I don't know what the problem is, I don't really know how complicated it will turn out to be. It would be amazing if you had a chance to give it a look, and I think it would be a great learning opportunity as well.
@l45k @alamb I was able to reproduce the issue. I also checked the unit test that DataFusion uses for this functionality and noticed that it takes a different approach to get the schema with metadata. It seems that the metadata is only available after collecting the DataFrame, so for example the following works:
use datafusion::prelude::*;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    ctx.register_parquet("ta", "test.parquet", ParquetReadOptions::default().skip_metadata(false)).await.unwrap();
    let df = ctx.table("ta").await.unwrap();
    // The metadata is read from the collected record batches, not from the DataFrame schema.
    let batches = df.collect().await.unwrap();
    for batch in &batches {
        println!("Schema {:?}", batch.schema());
        println!("Metadata {:?}", batch.schema().metadata());
    }
}
output:
I am still investigating whether it makes sense to be able to access the metadata before collecting the DataFrame.
That is a great find @brayanjuls -- nice 🕵️ . I wonder if the code that does schema inference is ignoring metadata somehow. This code seems to imply it is handling metadata 🤔
@alamb I debugged this section of the code and it is not ignoring the metadata. The issue happens in the following code when listing the table: the metadata is dropped when the table schema is built. Creating the table schema from the whole file schema instead of from its fields alone (line 551) solved the issue in my local environment. This is the modification I made:
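To illustrate the mechanism described in this comment (this is not the patch itself): a schema rebuilt from its fields alone starts with empty metadata, while one derived from the whole schema keeps it. The sketch below is a simplified, std-only model. `ModelSchema`, `rebuild_from_fields`, `rebuild_from_schema`, `sample_schema`, and the sample field names and metadata key are all hypothetical stand-ins, not the Arrow or DataFusion API.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema: field names plus key/value
// metadata. This is NOT the real arrow `Schema` type, only a model.
#[derive(Debug, Clone, PartialEq)]
struct ModelSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

// Models the problematic pattern: only the fields are copied over,
// so the new schema ends up with empty metadata.
fn rebuild_from_fields(file_schema: &ModelSchema) -> ModelSchema {
    ModelSchema {
        fields: file_schema.fields.clone(),
        metadata: HashMap::new(),
    }
}

// Models the fix described in the comment: start from the whole schema,
// so the metadata travels together with the fields.
fn rebuild_from_schema(file_schema: &ModelSchema) -> ModelSchema {
    file_schema.clone()
}

// Made-up sample data for the illustration.
fn sample_schema() -> ModelSchema {
    let mut metadata = HashMap::new();
    metadata.insert("origin".to_string(), "example".to_string());
    ModelSchema {
        fields: vec!["a".to_string(), "b".to_string()],
        metadata,
    }
}

fn main() {
    let file_schema = sample_schema();
    println!("from fields: {:?}", rebuild_from_fields(&file_schema).metadata);
    println!("from schema: {:?}", rebuild_from_schema(&file_schema).metadata);
}
```

In both cases the fields are identical. Only the metadata differs, which matches the symptom reported in this issue: the schema looks correct but its metadata map is empty.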
Awesome -- thank you @brayanjuls 🙏 -- do you by any chance have time to make a PR with a fix? If not, I think other contributors may be interested and willing to help too
That would be super helpful if you have time.
Yes, I would like to open a PR to fix it.
Describe the bug
I try to load a parquet file with some metadata in its schema. Both ctx.read_parquet and ctx.register_parquet will not return the schema with the metadata, even if ParquetReadOptions::default().skip_metadata(false) is provided.
To Reproduce
To test, I create a table with metadata and write it to parquet using PyArrow:
Output
Next, reading or registering the file with DataFusion:
Output:
Expected behavior
I would expect the metadata to be the same in DataFusion and PyArrow.
Additional context
I use the following DataFusion and PyArrow versions:
datafusion = { version = "35.0.0", features = ["parquet"] }
pyarrow: 15.0.0