-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update parse_protobuf_file_scan_config to remove any partition columns from the file_schema in FileScanConfig #9126
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @bcmcmill -- I don't fully understand the proposal though I did not read through to the ballista issues
@@ -56,4 +56,4 @@ serde_json = { workspace = true, optional = true } | |||
[dev-dependencies] | |||
doc-comment = { workspace = true } | |||
strum = { version = "0.26.1", features = ["derive"] } | |||
tokio = "1.18" | |||
tokio = { version = "1.18", features = ["rt-multi-thread"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error: The default runtime flavor is `multi_thread`, but the `rt-multi-thread` feature is disabled. --> datafusion/proto/examples/logical_plan_serde.rs:22:1 | 22 | #[tokio::main] | ^^^^^^^^^^^^^^ | = note: this error originates in the attribute macro `tokio::main` (in Nightly builds, run with -Z macro-backtrace for more info)
The above error is hit when rt-multi-thread
is absent.
// Currently when the schema for the file is set the partition columns | ||
// are present, which is illegal because they aren't actually in the files. | ||
// This is a workaround to remove them from the schema. | ||
let file_schema = Arc::new(Schema::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't undersrtand why the protobuf code would be doing this filtering -- wouldn't this potentially mask an error in whatever was creating the protobuf?
Perhaps it would be better to apply this filtering prior to serializing the plan?
If this is needed in the protobuf serialization code, can you please add a test to
- Show how this happens and
- Ensure we don't accidentally break this functionality in a future refactor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thanks again @bcmcmill
@@ -561,6 +563,32 @@ fn roundtrip_parquet_exec_with_pruning_predicate() -> Result<()> { | |||
))) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn roundtrip_parquet_exec_with_table_partition_cols() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified this test fails without the code changes in this file. 👍
from proto: ArrowError(SchemaError("Unable to get field named \"part\". Valid fields: [\"col\"]"), None)
thread 'cases::roundtrip_physical_plan::roundtrip_parquet_exec_with_table_partition_cols' panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:111:10:
from proto: ArrowError(SchemaError("Unable to get field named \"part\". Valid fields: [\"col\"]"), None)
stack backtrace:
Thanks again @bcmcmill |
Which issue does this PR close?
Closes apache/datafusion-ballista#747.
Closes apache/datafusion-ballista#862.
Rationale for this change
This change corrects an issue in
arrow-ballista
causing partitioned ListingTables fail when being read.What changes are included in this PR?
A simple change to the
parse_protobuf_file_scan_config
function to filter the table_partition_cols out of the file_schema.Are these changes tested?
All current tests cases pass and the change corrects the issue in
arrow-ballista
Are there any user-facing changes?
No