-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
validate and adjust Substrait NamedTable schemas (#12223) #12245
Conversation
Substrait plans are not valid without this, and it is generally useful for round trip testing
If the schema registered with DataFusion and the schema as given by the Substrait NamedScan do not have the same names and types, DataFusion should reject it
* added substrait_validation test * extracted useful test utilities The utils::test::TestSchemaCollector::generate_context_from_plan function can be used to dynamically generate a SessionContext from a Substrait plan, which will include the schemas for NamedTables as given in the Substrait plan. This helps us avoid the issue of DataFusion test schemas and Substrait plan schemas not being in sync.
\n TableScan: nation projection=[a, b, c, d, e, f]" | ||
"Projection: nation.n_name\ | ||
\n Filter: contains(nation.n_name, Utf8(\"IA\"))\ | ||
\n TableScan: nation projection=[n_nationkey, n_name, n_regionkey, n_comment]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see in this change how the DataFusion and Substrait plans had different schemas.
|
||
pub(crate) struct TestSchemaCollector { | ||
ctx: SessionContext, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This collector is based on a similar bit of tooling in Isthmus, a Substrait library used for integrating with Apache Calcite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition in producer is great, and I feel like some validation is fine to add for consumer - but I'd prefer it to be more robust (ie. accept plans where the DF schema is a superset of the Substrait schema)
allow cases where the Substrait schema is a subset of the DataFusion schema
substrait_err!( | ||
"Field '{}' is nullable in the Substrait schema but not nullable in the DataFusion schema.", | ||
substrait_field.name() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of what's mentioned in the TODO, this code never fires. I'm opting not to fix that as part of this PR.
@@ -1,2 +1,2 @@ | |||
n_nationkey,n_name,n_regionkey,n_comment | |||
N_NATIONKEY,N_NAME,N_REGIONKEY,N_COMMENT | |||
0,ALGERIA,0, haggle. carefully final deposits detect slyly agai |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Blizzara this is based on what you did in this PR to fix the tests that fail due to case difference in fields name.
.replace_qualifier(table_reference.clone()); | ||
|
||
let t = ctx.table(table_reference.clone()).await?; | ||
let t = ensure_schema_compatability(t, substrait_schema)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should maybe add this for the local file reads below as well (I didn't have it in my branch yet as I didn't need it immediately). Somewhat annoyingly it'll cause the rest of the TPCH tests to fail...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this worth doing as one big swoop in this PR, or would it make sense to do it as a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me to do it separately!
.replace_qualifier(table_reference.clone()); | ||
|
||
let t = ctx.table(table_reference.clone()).await?; | ||
let t = ensure_schema_compatability(t, substrait_schema)?; | ||
let t = t.into_optimized_plan()?; | ||
extract_projection(t, &read.projection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started wondering if the ensure_schema_compatability
can now conflict with extract_projection
- and I think it can, either by failing if DF doesn't optimize the select into a projection, or if DF does, then by overriding the select's projection with the Substrait projection...
I guess a fix would be something like in extract_projection
, if there is an existing scan.projection
, then apply columnIndices
on it first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That did indeed cause problems. It triggered an error of unexpected plan for table
in extract_projection
.
I added some code for this case in d571eb2
(#12245). Is something like this what you had in mind?
I am noticing that the plans generated look a little weird/bad with a lot of redundant projects
"Projection: DATA.a, DATA.b\
\n Projection: DATA.a, DATA.b\
\n Projection: DATA.a, DATA.b, DATA.c\
\n TableScan: DATA projection=[b, a, c]"
but they are at least correct for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is something like this what you had in mind?
What I had in mind was manipulating the scan.projection
directly - kinda like it is alreadydone in extract_projection, we could do it that way also for ensure_schema_compatibility
. That way there wouldn't be additional Projections, and maybe it'd be a bit more efficient if the current setup doesn't push the column-pruning into the scan level (though I'm a bit surprised they don't get optimized anyways).
But I don't think it's necessary - the way you've done it here seems correct, and we (I?) can do the project-mangling as a followup, unless you want to take a stab at it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think project unmangling would be better as a follow-up. Possible as part of #12347 because supporting remaps is going to add yet another layer of Projects 😅
I think this is good by me - @alamb would you (or someone else) be able to do the official review, please? :) Only note I have is that I think this change makes Substrait consumer case-sensitive wrt column names, which it wasn't before. I don't strictly have opinion on whether that's a good or a bad thing. I think for my usecase it would be fine, but dunno about others. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @vbarua and for the review @Blizzara 🙏
I kicked off the CI tests and quickly skimmed the PR . Once they pass I think this PR is ready to go from my perspective
cc @waynexia and @Lordworms
Thanks again everyone |
Closes #12223
Rationale for this change
When DataFusion consumes a Substrait plan, if the schema it has a for a table is incompatible with what is given/expected by the Substrait plan it should reject the plan. A Substrait schema is compatible with a DataFusion schema if the Substrait schema is a subset of the DataFusion schema.
Attempting to execute a plan when DataFusion and Substrait disagree on the schema is unlikely to lead to meaningful results, so in cases where the schemas are not compatible DataFusion should reject the plan.
What changes are included in this PR?
This PR:
DFSchema
instead of aDFSchemaRef
to aid with re-use.Are these changes tested?
Additional tests were added in substrait_validations.rs for the validation functionality.
Existing tests failed because of the validation (correctly) and where updated to test their original functionality.
Are there any user-facing changes?
The added validation could potentially cause calls to
from_substrait_plan
that worked in prior versions to fail. The plans that fail though are unlikely to have been meaningful given that they have major field differences between DataFusion and Substrait.