-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression in parquet reader's handling of non-null children of null parent #7119
Comments
Running the provided test we get the following output
These two batches are actually logically equal, this can be seen as this assertion passes:
However, I therefore don't think this is a bug. |
If it's not a bug, (a) why does the JSON reader take pains to do it in the expected way; (b) why does |
And (d) does this behavior need to be documented under e.g. https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.is_null with support for it added to https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls? |
Arguably it probably shouldn't, and probably suffers from the same issue as #6510
Because the value is arbitrary and therefore can be NULL, validation necessarily has to be conservative.
When projecting a nested column, one must take into account the validity masks of any parents. This an inherent property of the arrow specification, and the way nested children work - not just for StructArray. Edit: To give another example of where the null masks might diverge, consider calling nullif with a StructArray (or ListArray). This will modify the null mask of the StructArray, but won't recursively recompute new null masks for each of the children.
Those methods only refer to the array in question, so I wouldn't expect it no. If nothing else, a child array doesn't know anything about its parents. That being said, if you wanted to add a section documenting nested nullability, I'm sure that would be well received. |
That's surprising and painfully row-oriented for a column-oriented format -- especially for wide schemas. But if that's the spec that's the spec, I guess.
If JSON reader does have to change, perhaps it can be flagged as a breaking change with a major version bump instead of sneaking out in a point release? |
At least for StructArray, you can compute the combined null masks quite easily and efficiently using
I'll make a note when I file the ticket Edit: Actually the arrow_json reader avoids #6510 by computing null masks for all children of nullable StructArray see here. The parquet reader could do this, in fact I suggested this #6510 (comment), but it is effectively wasted work, and if anything encourages incorrect assumptions. |
For
Maybe also
Anywhere else? |
Given that
I'm not sure about the need for these, given this null buffer is literally just the null buffer for that StructArray, and makes no claims to the contrary. I think what might help is adding a representation section, similar to we have for ListArray as added by @alamb in #7039. This could clearly show that null slots are arbitrary. |
That does seem like a glaring gap, given how nice the
Could you elaborate why it's wasted work? Presumably the column should be read at least once (else why materialize it -- should have pruned the read schema). If whoever consumes a given leaf column anyway has to union the null masks (possibly multiple times, if more than one read), it seems not-worse to just have the writer do it once up front. Especially when the writer is in a better position to do that unioning efficiently as it assembles the columns into structs, vs. readers having to reverse engineer it every time they access each column. Or is the main concern that storing null masks for non-nullable nested columns would consume too much memory?
If the spec (or the implementation) doesn't require valid null masks at every level, then I have to agree with you there. No point computing any null masks unless they are trustworthy -- it just forces both reader and writer to pay the ~same cost because they don't trust each other's work. |
As I look at the implications for our code that projects out nested columns, manually computing "proper" null masks on the read path is going to be a pretty massive headache. Highly complex and requires allocation to store the updated (and temporary) null mask. Does anyone know of an example of a project that is successfully doing this, that could serve as a reference point? |
... but the more I look at the code changes required to handle this on the read path, the more I'm convinced the creator of a |
I think the code should be relatively straightforward
I think adding a kernel to union nulls as described in #6528 (comment) would simplify this down to
|
Row visitor machinery in https://github.com/delta-io/delta-kernel-rs recently started behaving strangely, with row visitors returning
0
or""
for values that should have been NULL. Eventually, we bisected a regression in 53.3, specifically #6524, which attempted to address #6510.tl;dr: When accessing column
a.b
, wherea
is nullable andb
is non-nullable, any row for whicha
is NULL will incorrectly yield a non-NULLb
(default-initialized to e.g.0
or""
).From what I understood, the correct way to handle non-nullable columns with nullable ancestors should have the "non-nullable" column to take null values exactly and only on rows for which some ancestor is also NULL. The JSON reader does this correctly, and the parquet reader used to as well.
The minimal repro below, which compares columns before and after a round trip through parquet, fails with:
Minimal repro test case
The text was updated successfully, but these errors were encountered: