Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update parse_protobuf_file_scan_config to remove any partition columns from the file_schema in FileScanConfig #9126

Merged
merged 4 commits into from
Feb 6, 2024

Conversation

bcmcmill
Copy link
Contributor

@bcmcmill bcmcmill commented Feb 4, 2024

Which issue does this PR close?

Closes apache/datafusion-ballista#747.
Closes apache/datafusion-ballista#862.

Rationale for this change

This change corrects an issue in arrow-ballista causing partitioned ListingTables fail when being read.

What changes are included in this PR?

A simple change to the parse_protobuf_file_scan_config function to filter the table_partition_cols out of the file_schema.

Are these changes tested?

All current tests cases pass and the change corrects the issue in arrow-ballista

Are there any user-facing changes?

No

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @bcmcmill -- I don't fully understand the proposal though I did not read through to the ballista issues

@@ -56,4 +56,4 @@ serde_json = { workspace = true, optional = true }
[dev-dependencies]
doc-comment = { workspace = true }
strum = { version = "0.26.1", features = ["derive"] }
tokio = "1.18"
tokio = { version = "1.18", features = ["rt-multi-thread"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: The default runtime flavor is `multi_thread`, but the `rt-multi-thread` feature is disabled. --> datafusion/proto/examples/logical_plan_serde.rs:22:1 | 22 | #[tokio::main] | ^^^^^^^^^^^^^^ | = note: this error originates in the attribute macro `tokio::main` (in Nightly builds, run with -Z macro-backtrace for more info)

The above error is hit when rt-multi-thread is absent.

// Currently when the schema for the file is set the partition columns
// are present, which is illegal because they aren't actually in the files.
// This is a workaround to remove them from the schema.
let file_schema = Arc::new(Schema::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't undersrtand why the protobuf code would be doing this filtering -- wouldn't this potentially mask an error in whatever was creating the protobuf?

Perhaps it would be better to apply this filtering prior to serializing the plan?

If this is needed in the protobuf serialization code, can you please add a test to

  1. Show how this happens and
  2. Ensure we don't accidentally break this functionality in a future refactor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are complete.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks again @bcmcmill

@@ -561,6 +563,32 @@ fn roundtrip_parquet_exec_with_pruning_predicate() -> Result<()> {
)))
}

#[tokio::test]
async fn roundtrip_parquet_exec_with_table_partition_cols() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this test fails without the code changes in this file. 👍


from proto: ArrowError(SchemaError("Unable to get field named \"part\". Valid fields: [\"col\"]"), None)
thread 'cases::roundtrip_physical_plan::roundtrip_parquet_exec_with_table_partition_cols' panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:111:10:
from proto: ArrowError(SchemaError("Unable to get field named \"part\". Valid fields: [\"col\"]"), None)
stack backtrace:

@alamb alamb merged commit 2b6ca7b into apache:main Feb 6, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 6, 2024

Thanks again @bcmcmill

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants