validate and adjust Substrait NamedTable schemas (#12223) #12245

vbarua · 2024-08-29T23:43:22Z

Rationale for this change

When DataFusion consumes a Substrait plan, it should reject the plan if the schema it has for a table is incompatible with the schema given/expected by the Substrait plan. A Substrait schema is compatible with a DataFusion schema if the Substrait schema is a subset of the DataFusion schema.

Attempting to execute a plan when DataFusion and Substrait disagree on the schema is unlikely to lead to meaningful results, so in cases where the schemas are not compatible DataFusion should reject the plan.
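The subset rule described above can be illustrated with a small, self-contained sketch. This is not the code from this PR: `Field`, `field`, and `check_compatible` are hypothetical stand-ins for the Arrow/DataFusion types, data types are modelled as plain strings, and the sketch assumes field names are compared case-sensitively.

```rust
/// Hypothetical, simplified stand-in for a schema field
/// (not the Arrow `Field` type used by DataFusion).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Convenience constructor for the sketch.
fn field(name: &str, data_type: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
    }
}

/// Returns `Ok(())` when every Substrait field is present in the
/// DataFusion schema with the same type, i.e. when the Substrait schema
/// is a subset of the DataFusion schema. Field order is ignored.
fn check_compatible(substrait: &[Field], datafusion: &[Field]) -> Result<(), String> {
    for s in substrait {
        // Assumption: names are matched exactly (case-sensitively).
        match datafusion.iter().find(|d| d.name == s.name) {
            None => {
                return Err(format!(
                    "Field '{}' is missing from the DataFusion schema",
                    s.name
                ))
            }
            Some(d) if d.data_type != s.data_type => {
                return Err(format!(
                    "Field '{}' has type {} in Substrait but {} in DataFusion",
                    s.name, s.data_type, d.data_type
                ))
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

Under these assumptions, a Substrait schema of `b: Utf8, a: Int64` is accepted against a DataFusion schema of `a: Int64, b: Utf8, c: Float64`, while a schema containing an unknown field, a field with a different type, or a field whose name differs only in case is rejected.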

What changes are included in this PR?

This PR:

Adds a validation to the Substrait plan consumer for NamedScans that rejects Substrait plans when the schema in Substrait is incompatible with that in DataFusion.
Updates existing tests that fail because of the above validation.
Updates the Substrait plan producer to include base schemas for ReadRels, which helps with round trip testing (their absence also caused test failures with the validation).
Updates from_substrait_named_struct to return a DFSchema instead of a DFSchemaRef to aid with re-use.
Makes from_substrait_named_struct public, as it is generally useful and also assists with testing.

Are these changes tested?

Additional tests were added in substrait_validations.rs for the validation functionality.

Existing tests failed because of the validation (correctly) and were updated to test their original functionality.

Are there any user-facing changes?

The added validation could cause calls to from_substrait_plan that worked in prior versions to fail. The plans that fail, though, are unlikely to have been meaningful, given that they have major field differences between DataFusion and Substrait.

* added substrait_validation test
* extracted useful test utilities

The utils::test::TestSchemaCollector::generate_context_from_plan function can be used to dynamically generate a SessionContext from a Substrait plan, which will include the schemas for NamedTables as given in the Substrait plan. This helps us avoid the issue of DataFusion test schemas and Substrait plan schemas not being in sync.
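The idea behind deriving test schemas from the plan itself can be sketched roughly as follows. This is only an illustration and not the `TestSchemaCollector` implementation: `Rel` is a hypothetical miniature plan tree (real Substrait plans are protobuf messages), schemas are reduced to column-name lists, and `collect_schemas` is an invented name.

```rust
use std::collections::BTreeMap;

/// Hypothetical miniature plan tree used only for this sketch.
enum Rel {
    /// A named-table read carrying the schema the plan expects
    /// (column names only, for brevity).
    Read { table: String, columns: Vec<String> },
    Filter { input: Box<Rel> },
    Join { left: Box<Rel>, right: Box<Rel> },
}

/// Recursively visits the tree, recording the schema of each named table.
fn visit(rel: &Rel, out: &mut BTreeMap<String, Vec<String>>) {
    match rel {
        Rel::Read { table, columns } => {
            // Keep the first schema seen for a table name.
            out.entry(table.clone()).or_insert_with(|| columns.clone());
        }
        Rel::Filter { input } => visit(input, out),
        Rel::Join { left, right } => {
            visit(left, out);
            visit(right, out);
        }
    }
}

/// Collects every named table read by the plan together with the schema
/// the plan declares for it. A test harness could then register one empty
/// table per entry, so the test context always matches the plan.
fn collect_schemas(plan: &Rel) -> BTreeMap<String, Vec<String>> {
    let mut out = BTreeMap::new();
    visit(plan, &mut out);
    out
}
```

Because the schemas come from the plan rather than from separately maintained test fixtures, the two cannot drift apart, which is the problem the utility described above is meant to avoid.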

datafusion/substrait/src/logical_plan/consumer.rs

vbarua · 2024-08-29T23:50:53Z

datafusion/substrait/tests/cases/function_test.rs

-            \n    TableScan: nation projection=[a, b, c, d, e, f]"
+            "Projection: nation.n_name\
+            \n  Filter: contains(nation.n_name, Utf8(\"IA\"))\
+            \n    TableScan: nation projection=[n_nationkey, n_name, n_regionkey, n_comment]"


You can see in this change how the DataFusion and Substrait plans had different schemas.

vbarua · 2024-08-29T23:53:32Z

datafusion/substrait/tests/utils.rs

+
+    pub(crate) struct TestSchemaCollector {
+        ctx: SessionContext,
+    }


This collector is based on a similar bit of tooling in Isthmus, a Substrait library used for integrating with Apache Calcite.

Blizzara

The addition in the producer is great, and I feel like some validation is fine to add for the consumer, but I'd prefer it to be more robust (i.e. accept plans where the DF schema is a superset of the Substrait schema).

datafusion/substrait/src/logical_plan/consumer.rs

datafusion/substrait/tests/utils.rs

datafusion/substrait/src/logical_plan/producer.rs

datafusion/substrait/src/logical_plan/consumer.rs

vbarua · 2024-09-04T23:23:55Z

datafusion/substrait/src/logical_plan/consumer.rs

+            substrait_err!(
+                "Field '{}' is nullable in the Substrait schema but not nullable in the DataFusion schema.",
+                substrait_field.name()
+                )


Because of what's mentioned in the TODO, this code never fires. I'm opting not to fix that as part of this PR.

datafusion/substrait/src/logical_plan/consumer.rs

vbarua · 2024-09-04T23:48:59Z

datafusion/substrait/tests/testdata/tpch/nation.csv

@@ -1,2 +1,2 @@
-n_nationkey,n_name,n_regionkey,n_comment
+N_NATIONKEY,N_NAME,N_REGIONKEY,N_COMMENT
 0,ALGERIA,0, haggle. carefully final deposits detect slyly agai


@Blizzara this is based on what you did in this PR to fix the tests that fail due to case differences in field names.

datafusion/substrait/src/logical_plan/consumer.rs

Blizzara · 2024-09-05T14:44:45Z

datafusion/substrait/src/logical_plan/consumer.rs

+                        .replace_qualifier(table_reference.clone());
+
+                let t = ctx.table(table_reference.clone()).await?;
+                let t = ensure_schema_compatability(t, substrait_schema)?;


we should maybe add this for the local file reads below as well (I didn't have it in my branch yet as I didn't need it immediately). Somewhat annoyingly it'll cause the rest of the TPCH tests to fail...

Is this worth doing as one big swoop in this PR, or would it make sense to do it as a followup?

Fine by me to do it separately!

Blizzara · 2024-09-05T14:58:59Z

datafusion/substrait/src/logical_plan/consumer.rs

+                        .replace_qualifier(table_reference.clone());
+
+                let t = ctx.table(table_reference.clone()).await?;
+                let t = ensure_schema_compatability(t, substrait_schema)?;
                let t = t.into_optimized_plan()?;
                extract_projection(t, &read.projection)


I started wondering if the ensure_schema_compatability can now conflict with extract_projection - and I think it can, either by failing if DF doesn't optimize the select into a projection, or if DF does, then by overriding the select's projection with the Substrait projection...

I guess a fix would be something like: in extract_projection, if there is an existing scan.projection, apply the columnIndices on it first.

That did indeed cause problems. It triggered an error of unexpected plan for table in extract_projection.

I added some code for this case in d571eb2 (#12245). Is something like this what you had in mind?

I am noticing that the generated plans look a little weird/bad, with a lot of redundant projects:

"Projection: DATA.a, DATA.b\
\n  Projection: DATA.a, DATA.b\
\n    Projection: DATA.a, DATA.b, DATA.c\
\n      TableScan: DATA projection=[b, a, c]"

but they are at least correct for now.

> Is something like this what you had in mind?

What I had in mind was manipulating the scan.projection directly, kinda like it is already done in extract_projection; we could do it that way for ensure_schema_compatability as well. That way there wouldn't be additional Projections, and maybe it'd be a bit more efficient if the current setup doesn't push the column-pruning into the scan level (though I'm a bit surprised they don't get optimized anyways).

But I don't think it's necessary - the way you've done it here seems correct, and we (I?) can do the project-mangling as a followup, unless you want to take a stab at it :)

I think project unmangling would be better as a follow-up. Possibly as part of #12347, because supporting remaps is going to add yet another layer of Projects 😅
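The alternative discussed in this thread, applying the Substrait column indices on top of an existing scan projection instead of stacking Projection nodes, can be sketched as a composition of index lists. This is a hedged illustration only: `compose_projection` is a hypothetical helper, not a function in DataFusion, and it assumes the existing projection lists table column indices in scan output order.

```rust
/// Composes a Substrait column selection with an existing scan projection.
///
/// * `existing`: table column index for each column the scan currently
///   outputs, or `None` if the scan outputs every table column in order.
/// * `substrait_indices`: positions in the scan's *output* that the
///   Substrait ReadRel projection wants to keep.
///
/// Returns the table column indices for a single combined projection, or
/// `None` if a Substrait index points past the scan's output.
fn compose_projection(
    existing: Option<&[usize]>,
    substrait_indices: &[usize],
) -> Option<Vec<usize>> {
    match existing {
        // No existing projection: the output positions are the table indices.
        None => Some(substrait_indices.to_vec()),
        // Look each requested output position up in the existing projection.
        Some(proj) => substrait_indices
            .iter()
            .map(|&i| proj.get(i).copied())
            .collect(),
    }
}
```

For a table with columns `a, b, c` scanned as `projection=[b, a, c]` (table indices `[1, 0, 2]`), selecting output positions `[1, 0]` yields table indices `[0, 1]`, i.e. `a, b`, without any intermediate Projection node.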

Blizzara · 2024-09-06T19:46:21Z

I think this is good by me - @alamb would you (or someone else) be able to do the official review, please? :)

Only note I have is that I think this change makes Substrait consumer case-sensitive wrt column names, which it wasn't before. I don't strictly have opinion on whether that's a good or a bad thing. I think for my usecase it would be fine, but dunno about others.

alamb

Thank you for the contribution @vbarua and for the review @Blizzara 🙏

I kicked off the CI tests and quickly skimmed the PR. Once they pass, I think this PR is ready to go from my perspective.

cc @waynexia and @Lordworms

alamb · 2024-09-10T10:58:03Z

Thanks again everyone

vbarua added 5 commits August 29, 2024 16:29

fix: producer did not emit base_schema struct field for ReadRel

429cd2b

Substrait plans are not valid without this, and it is generally useful for round trip testing

feat: include field_qualifier param for from_substrait_named_struct

a44a689

feat: verify that Substrait and DataFusion agree on NamedScan schemas

86a7339

If the schema registered with DataFusion and the schema as given by the Substrait NamedScan do not have the same names and types, DataFusion should reject it

feat: expose from_substrait_named_struct

2f71997

github-actions bot added the substrait label Aug 29, 2024

vbarua commented Aug 29, 2024

refactor: remove unused imports

8634d23

vbarua commented Aug 29, 2024

vbarua added 2 commits August 30, 2024 08:07

docs: add missing licenses

5051991

refactor: deal with unused code warnings

f52f01c

Blizzara reviewed Sep 2, 2024

vbarua added 7 commits September 4, 2024 11:17

remove optional qualifier from from_substrait_named_struct

aa063c7

return DFSchema from from_substrait_named_struct

f9319ba

one must imagine clippy happy

444aad6

accidental blah

c2b0cb0

loosen the validation for schemas

0afb845

allow cases where the Substrait schema is a subset of the DataFusion schema

minor doc tweaks

9ba5c4d

update test data to deal with case issues in tests

83529f5

vbarua changed the title ~~validate Substrait NamedScan schemas (#12223)~~ validate Substrait NamedTable schemas (#12223) Sep 4, 2024

vbarua changed the title ~~validate Substrait NamedTable schemas (#12223)~~ validate and adjust Substrait NamedTable schemas (#12223) Sep 4, 2024

vbarua commented Sep 4, 2024

Blizzara reviewed Sep 5, 2024

Blizzara reviewed Sep 5, 2024

fix error message

acb2b47

Blizzara reviewed Sep 5, 2024

Blizzara reviewed Sep 5, 2024

improve readability of field compatability check

386264e

Blizzara reviewed Sep 5, 2024

vbarua and others added 4 commits September 5, 2024 08:51

make TestSchemaCollector more flexible

f6017eb

fix doc typo

185bff5

Co-authored-by: Arttu <[email protected]>

remove unecessary TODO

aeb6c3d

handle ReadRel projection on top of mismatched schema

d571eb2

alamb approved these changes Sep 9, 2024

alamb merged commit 41c5f4e into apache:main Sep 10, 2024
25 checks passed

vbarua deleted the vbarua/substrait/validate-schemas branch November 1, 2024 20:53

Blizzara mentioned this pull request Nov 18, 2024

Substrait plan read relation baseSchema does not include the struct with type information #12244

Open
