
Regression in parquet reader's handling of non-null children of null parent #7119

Closed
scovich opened this issue Feb 11, 2025 · 12 comments
Labels: question (Further information is requested)

scovich (Contributor) commented Feb 11, 2025

Row visitor machinery in https://github.com/delta-io/delta-kernel-rs recently started behaving strangely, with row visitors returning 0 or "" for values that should have been NULL. Eventually, we bisected the regression to 53.3, specifically #6524, which attempted to address #6510.

tl;dr: When accessing column a.b, where a is nullable and b is non-nullable, any row for which a is NULL will incorrectly yield a non-NULL b (default-initialized to e.g. 0 or "").

From what I understood, the correct way to handle non-nullable columns with nullable ancestors is for the "non-nullable" column to take null values exactly and only on rows for which some ancestor is also NULL. The JSON reader does this correctly, and the parquet reader used to as well.

The minimal repro below, which compares columns before and after a round trip through parquet, fails with:

assertion `left == right` failed
  left: PrimitiveArray<Int32>
[
  null,
]
 right: PrimitiveArray<Int32>
[
  0,
]
Minimal repro test case:
#[test]
fn test_arrow_bug() {
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow_array::cast::AsArray as _;
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
    use std::sync::Arc;

    // Define a nested schema and parse some JSON into it
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "outer",
            DataType::Struct(vec![
                Field::new(
                    "inner",
                    DataType::Int32,
                    false,
                )].into()),
            true,
        ),
    ]));
    let json_string = r#"{"outer": null}"#;
    let batch1 = arrow::json::ReaderBuilder::new(schema.clone())
        .build(json_string.as_bytes())
        .unwrap()
        .next()  
        .unwrap()
        .unwrap();
    println!("Batch 1: {batch1:?}");

    let col1 = batch1.column(0).as_struct().column(0);
    println!("Col1: {col1:?}");

    // Write the batch to a parquet file and read it back 
    let mut buffer = vec![];
    let mut writer = parquet::arrow::ArrowWriter::try_new(&mut buffer, schema.clone(), None)
        .unwrap();
    writer.write(&batch1).unwrap();
    writer.close().unwrap(); // writer must be closed to write footer 
    let batch2 = ParquetRecordBatchReaderBuilder::try_new(bytes::Bytes::from(buffer))
        .unwrap()
        .build()
        .unwrap()
        .next()
        .unwrap()
        .unwrap();
    println!("Batch 2: {batch2:?}");

    let col2 = batch2.column(0).as_struct().column(0);
    println!("Col2: {col2:?}");

    // Verify accuracy of the round trip
    assert_eq!(batch1, batch2);
    assert_eq!(col1, col2);
}
tustvold (Contributor) commented Feb 11, 2025

Running the provided test, we get the following output:

Batch 1: RecordBatch { schema: Schema { fields: [Field { name: "outer", data_type: Struct([Field { name: "inner", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StructArray
-- validity: 
[
  null,
]
[
-- child 0: "inner" (Int32)
PrimitiveArray<Int32>
[
  null,
]
]], row_count: 1 }
Col1: PrimitiveArray<Int32>
[
  null,
]
Batch 2: RecordBatch { schema: Schema { fields: [Field { name: "outer", data_type: Struct([Field { name: "inner", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StructArray
-- validity: 
[
  null,
]
[
-- child 0: "inner" (Int32)
PrimitiveArray<Int32>
[
  0,
]
]], row_count: 1 }
Col2: PrimitiveArray<Int32>
[
  0,
]

These two batches are actually logically equal, as can be seen from the fact that this assertion passes:

assert_eq!(batch1, batch2);

However, assert_eq!(col1, col2); fails because the columns themselves are not equal, and they don't need to be: the value of a child masked by a null in its parent is arbitrary. Parquet doesn't even encode these values at all.

I therefore don't think this is a bug.
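
To make the "arbitrary value" point concrete, here is a sketch of my own (not from the reader code): StructArray::try_new happily accepts any physical child value under a null parent slot, even when the child field is non-nullable.

#[test]
fn null_parent_masks_child() {
    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
    use arrow::buffer::NullBuffer;
    use arrow::datatypes::{DataType, Field};

    // Parent slot 0 is null; the non-nullable child physically stores 42 there
    let field = Arc::new(Field::new("inner", DataType::Int32, false));
    let child: ArrayRef = Arc::new(Int32Array::from(vec![42]));
    let nulls = NullBuffer::from(vec![false]); // false = null slot
    let s = StructArray::try_new(vec![field].into(), vec![child], Some(nulls)).unwrap();
    assert!(s.is_null(0));            // the row is logically null...
    assert!(s.column(0).is_valid(0)); // ...but the child's own mask says "valid"
}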

tustvold added the question label Feb 11, 2025
scovich (Author) commented Feb 11, 2025

If it's not a bug, (a) why does the JSON reader take pains to do it in the expected way; (b) why does RecordBatch take pains to validate nested NULL masks this way; and (c) what would be the "correct" way to project out a nested column for columnar access, e.g. SQL SELECT a.b FROM t? Is it now the user's responsibility to validate the nesting of null masks manually?

scovich (Author) commented Feb 11, 2025

And (d) does this behavior need to be documented under e.g. https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.is_null with support for it added to https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls?

tustvold (Contributor) commented Feb 11, 2025

> why does the JSON reader take pains to do it in the expected way

Arguably it probably shouldn't, and probably suffers from the same issue as #6510

> Why does RecordBatch take pains to validate nested NULL masks this way

Because the value is arbitrary and therefore can be NULL, validation necessarily has to be conservative.

> what would be the "correct" way to project out a nested column for columnar access, e.g. SQL SELECT a.b FROM t? Is it now the user's responsibility to validate the nesting of null masks manually?

When projecting a nested column, one must take into account the validity masks of any parents. This is an inherent property of the arrow specification and of the way nested children work, not just for StructArray.

Edit: To give another example of where the null masks might diverge, consider calling nullif with a StructArray (or ListArray). This will modify the null mask of the StructArray, but won't recursively recompute new null masks for each of the children.
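
As a small sketch of that divergence (my own construction, assuming the current nullif behavior):

#[test]
fn nullif_does_not_recurse() {
    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, BooleanArray, Int32Array, StructArray};
    use arrow::compute::nullif;
    use arrow::datatypes::{DataType, Field};

    // One-row struct with no nulls at any level
    let field = Arc::new(Field::new("inner", DataType::Int32, true));
    let child: ArrayRef = Arc::new(Int32Array::from(vec![1]));
    let s = StructArray::try_new(vec![field].into(), vec![child], None).unwrap();

    // Null out row 0 at the struct level only
    let masked = nullif(&s, &BooleanArray::from(vec![true])).unwrap();
    let masked = masked.as_any().downcast_ref::<StructArray>().unwrap();
    assert!(masked.is_null(0));            // parent mask updated...
    assert!(masked.column(0).is_valid(0)); // ...child mask left untouched
}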

> (d) does this behavior need to be documented under

Those methods only refer to the array in question, so I wouldn't expect so, no. If nothing else, a child array doesn't know anything about its parents.

That being said, if you wanted to add a section documenting nested nullability, I'm sure that would be well received.

scovich (Author) commented Feb 11, 2025

> When projecting a nested column, one must take into account the validity masks of any parents.

That's surprising and painfully row-oriented for a column-oriented format -- especially for wide schemas.

But if that's the spec, that's the spec, I guess.

> why does the JSON reader take pains to do it in the expected way
>
> Arguably it probably shouldn't, and probably suffers from the same issue as #6510

If the JSON reader does have to change, perhaps it can be flagged as a breaking change with a major version bump instead of sneaking out in a point release?

tustvold (Contributor) commented Feb 11, 2025

> That's surprising and painfully row-oriented for a column-oriented format -- especially for wide schemas.

At least for StructArray, you can compute the combined null masks quite easily and efficiently using NullBuffer::union. ListArray is a whole different kettle of fish though... Nested data in arrow is fraught for sure, but at least it isn't dremel... that's 🤯
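
For the StructArray case, a minimal sketch of what that looks like (the helper name is mine, not a library API):

use arrow::array::{Array, StructArray};
use arrow::buffer::NullBuffer;

/// Logical nulls of child `idx`, folding in the parent's own validity
fn child_logical_nulls(parent: &StructArray, idx: usize) -> Option<NullBuffer> {
    // NullBuffer::union returns None only when neither level has any nulls
    NullBuffer::union(parent.nulls(), parent.column(idx).nulls())
}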

> If the JSON reader does have to change, perhaps it can be flagged as a breaking change with a major version bump instead of sneaking out in a point release?

I'll make a note when I file the ticket

Edit: Actually, the arrow_json reader avoids #6510 by computing null masks for all children of a nullable StructArray (see here). The parquet reader could do this, in fact I suggested it in #6510 (comment), but it is effectively wasted work, and if anything encourages incorrect assumptions.

scovich (Author) commented Feb 11, 2025

> if you wanted to add a section documenting nested nullability, I'm sure that would be well received.

For Array::is_null and Array::logical_nulls, something like:

> WARNING: The logical nullability of a nested column is the union of logical nulls for all ancestors, which this method does not cover. See NullBuffer::union.

StructArray::into_parts should also be straightforward, something like:

> WARNING: The returned columns may contain physically non-null but logically invalid values, due to logically null ancestors -- including but not limited to the returned NullBuffer, which itself may also be incomplete for the same reason.

Maybe also StructArray::try_new?

> NOTE: Physically non-null values in child columns can become logically NULL if masked by the provided NullBuffer -- even if they are non-nullable.

Anywhere else?

tustvold (Contributor) commented

Given that Array::is_null refers to logical_nulls, I think adding the below to Array::logical_nulls should be sufficient:

> WARNING: The logical nullability of a nested column is the union of logical nulls for all ancestors, which this method does not cover. See NullBuffer::union.

> StructArray::into_parts should also be straightforward, something like:
> Maybe also StructArray::try_new?

I'm not sure about the need for these, given this null buffer is literally just the null buffer for that StructArray, and makes no claims to the contrary.

I think what might help is adding a representation section, similar to the one we have for ListArray as added by @alamb in #7039. This could clearly show that null slots are arbitrary.

scovich (Author) commented Feb 12, 2025

> I think what might help is adding a representation section, similar to the one we have for ListArray as added by @alamb in #7039. This could clearly show that null slots are arbitrary.

That does seem like a glaring gap, given how nice the ListArray doc is...

> the arrow_json reader avoids #6510 by computing null masks for all children of a nullable StructArray (see here). The parquet reader could do this, in fact I suggested it in #6510 (comment), but it is effectively wasted work

Could you elaborate on why it's wasted work? Presumably the column should be read at least once (else why materialize it -- the read schema should have pruned it). If whoever consumes a given leaf column has to union the null masks anyway (possibly multiple times, if read more than once), it seems not-worse to just have the writer do it once up front. Especially when the writer is in a better position to do that unioning efficiently as it assembles the columns into structs, vs. readers having to reverse engineer it every time they access each column.

Or is the main concern that storing null masks for non-nullable nested columns would consume too much memory?

> if anything [storing pre-unioned null masks] encourages incorrect assumptions

If the spec (or the implementation) doesn't require valid null masks at every level, then I have to agree with you there. No point computing any null masks unless they are trustworthy -- it just forces both reader and writer to pay the ~same cost because they don't trust each other's work.

scovich (Author) commented Feb 12, 2025

As I look at the implications for our code that projects out nested columns, manually computing "proper" null masks on the read path is going to be a pretty massive headache. It's highly complex and requires allocation to store the updated (and temporary) null mask.

Does anyone know of an example of a project that is successfully doing this, that could serve as a reference point?

scovich (Author) commented Feb 12, 2025

> if anything [storing pre-unioned null masks] encourages incorrect assumptions
>
> If the spec (or the implementation) doesn't require valid null masks at every level, then I have to agree with you there. No point computing any null masks unless they are trustworthy -- it just forces both reader and writer to pay the ~same cost because they don't trust each other's work.

... but the more I look at the code changes required to handle this on the read path, the more I'm convinced the creator of a StructArray is in a far better position to solve the problem than readers can ever be. It would be really helpful if the parquet reader at least offered a way to opt in to storing pre-unioned null masks (= library code written once), rather than requiring every reader (= user code replicated many times across many projects) to attempt the unioning and probably get it wrong at least some of the time.

tustvold (Contributor) commented Feb 12, 2025

I think the code should be relatively straightforward:

use std::ops::Not;
use arrow::array::{Array, ArrayRef, BooleanArray, StructArray};
use arrow::compute::nullif;
use arrow::error::ArrowError;

fn project(a: &StructArray, idx: usize) -> Result<ArrayRef, ArrowError> {
    match a.nulls() {
        // Invert the parent's validity so nullif nulls out the masked child slots
        Some(x) => nullif(a.column(idx), &BooleanArray::new(x.inner().not(), None)),
        None => Ok(a.column(idx).clone()),
    }
}

I think adding a kernel to union nulls as described in #6528 (comment) would simplify this down to:

// `mask` is the null-union kernel proposed in #6528; it does not exist in arrow-rs yet
fn project(a: &StructArray, idx: usize) -> Result<ArrayRef, ArrowError> {
    match a.nulls() {
        Some(x) => mask(a.column(idx), x.inner()),
        None => Ok(a.column(idx).clone()),
    }
}
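
Applied to the repro at the top of this issue (a usage sketch, assuming batch2 and col1 from the test above), either variant would make the projected child report the expected logical nulls:

let projected = project(batch2.column(0).as_struct(), 0).unwrap();
// projected is now [null] rather than [0], so logical equality with col1 holds
assert_eq!(col1, &projected);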
