
[FLINK-37245] Prevent masking null values in BinaryRowData #26100

Open · wants to merge 1 commit into master

Conversation

@mxm (Contributor) commented Feb 3, 2025

RowData#createFieldGetter is the go-to method for creating field getters for a given field type and position. The name FieldGetter#getFieldOrNull suggests that null is returned if the field has been set to null, but that is not always the case.

When a non-nullable field of a BinaryRowData has been set to null, a call to FieldGetter#getFieldOrNull returns a non-null value, interpreting whatever bytes back the field as an actual value instead of null.

Example:

  import org.apache.flink.table.data.GenericRowData;
  import org.apache.flink.table.data.RowData;
  import org.apache.flink.table.data.binary.BinaryRowData;
  import org.apache.flink.table.runtime.typeutils.RowDataSerializer;
  import org.apache.flink.table.types.logical.IntType;

  public static void main(String[] args) {
    IntType nullableIntType = new IntType(true);
    IntType nonNullableIntType = new IntType(false);
    RowDataSerializer rowDataSerializer = new RowDataSerializer(
            nullableIntType, nonNullableIntType
    );
    // Both fields are set to null, but only the first is declared nullable.
    BinaryRowData binaryRow = rowDataSerializer.toBinaryRow(GenericRowData.of(null, null));
    RowData.FieldGetter fieldGetter1 = RowData.createFieldGetter(nullableIntType, 0);
    RowData.FieldGetter fieldGetter2 = RowData.createFieldGetter(nonNullableIntType, 1);
    System.out.println(fieldGetter1.getFieldOrNull(binaryRow));
    System.out.println(fieldGetter2.getFieldOrNull(binaryRow));
  }

Output is:

null
0

The expected output would be that the second, non-nullable field also returns null, or raises a NullPointerException directly. That's not the case, because RowData#createFieldGetter only checks for null values (via a call to RowData#isNullAt(pos)) when the type is nullable.

It seems fair to always check for null fields instead of deferring this easy-to-forget check to the caller.
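The always-check behavior argued for here can be sketched as follows. This is a minimal illustration, not Flink's actual implementation: the RowData and FieldGetter interfaces below are reduced stand-ins containing only the members the example needs, and the getter factory is hypothetical.

```java
// Sketch of the proposed behavior for RowData.createFieldGetter: always
// consult the null bit, even when the declared type is non-nullable.
// The nested interfaces are minimal stand-ins, not the real Flink classes.
public class AlwaysCheckNullSketch {

    interface RowData {
        boolean isNullAt(int pos);
        int getInt(int pos);
    }

    interface FieldGetter {
        Object getFieldOrNull(RowData row);
    }

    /** Field getter for an INT field that honors the null bit unconditionally. */
    static FieldGetter createIntFieldGetter(int pos) {
        return row -> row.isNullAt(pos) ? null : row.getInt(pos);
    }

    public static void main(String[] args) {
        // Simulates a BinaryRowData whose null bit is set while the
        // fixed-length region still holds stale bytes (read back as 0).
        RowData nulledRow = new RowData() {
            @Override public boolean isNullAt(int pos) { return true; }
            @Override public int getInt(int pos) { return 0; }
        };
        System.out.println(createIntFieldGetter(1).getFieldOrNull(nulledRow)); // prints "null"
    }
}
```

Because the null check happens inside the getter, callers can no longer observe the stale backing bytes, regardless of the field's declared nullability.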

We needed to work around this bug in apache/iceberg#12049.

@flinkbot (Collaborator) commented Feb 3, 2025

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure — re-run the last Azure build

@davidradl (Contributor) left a comment


@mxm If I have understood this correctly, before the change a non-nullable field containing null would return a value. I am thinking this is correct: if a field is defined as non-nullable, then I would not expect to be able to get a null from it, so it seems reasonable to return a non-null value in this case. After the fix we can get a null from a non-nullable field, which does not seem correct. WDYT?

@mxm (Contributor, Author) commented Feb 6, 2025

I think there are different concerns here: (a) data integrity and (b) data semantics.

(a) Reading a field should always return its correct value. That's why the Flink runtime always checks isNullAt before attempting to read. In Iceberg, we don't always do that, because we rely on the FieldGetters for which this change is intended. If a field is null, it should read null. If a field is 42, it should read 42; but it shouldn't return 42 if its value is null. Unfortunately, this can happen without a null check when reading from BinaryRowData.
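The caller-side discipline described here — check isNullAt before every read — can be illustrated with a short sketch. The RowData interface below is a reduced stand-in with only the members the example needs, not Flink's actual class, and readIntField is a hypothetical helper.

```java
// Illustration of the defensive read pattern the Flink runtime uses:
// consult isNullAt before reading a field, so stale bytes behind a set
// null bit are never interpreted as a value.
public class CallerNullCheckSketch {

    interface RowData {
        boolean isNullAt(int pos);
        int getInt(int pos);
    }

    /** Reads an INT field defensively, checking the null bit first. */
    static Integer readIntField(RowData row, int pos) {
        if (row.isNullAt(pos)) {
            return null; // never touch the backing bytes of a nulled field
        }
        return row.getInt(pos);
    }

    public static void main(String[] args) {
        // Field 1 is nulled; its backing bytes (42) must not leak out.
        RowData row = new RowData() {
            @Override public boolean isNullAt(int pos) { return pos == 1; }
            @Override public int getInt(int pos) { return 42; }
        };
        System.out.println(readIntField(row, 0)); // prints "42"
        System.out.println(readIntField(row, 1)); // prints "null"
    }
}
```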

(b) Whether or not to allow null values for non-null types is a separate concern. This needs to be enforced by the layer that initializes the RowData. In Iceberg, there are actually writers which allow nulled "required" types.

Does that make sense?

@lincoln-lil (Contributor) commented

My intuitive understanding of the problem mentioned in Iceberg is: why not fix the nullability of the field?
RowData, as the basic structure of Flink SQL, follows SQL nullability semantics: a field declared not-null should not be filled with null values. On the other hand, if a nullable field has its null bit set, then its value can be treated as null directly, with no need to read the value in the MemorySegment (which is reused as much as possible in the SQL runtime implementation, saving some operations).
Accurate nullability is important to the SQL system, not only to ensure proper processing on the consumer side (e.g., the field may be used as the primary key of the target table, which requires not-null), but also to give the planner more optimization possibilities (e.g., predicate derivation; also, the runtime code can skip the null-check overhead for non-null fields, etc.).

@mxm (Contributor, Author) commented Feb 7, 2025

My intuitive understanding of the problem mentioned in Iceberg is: why not fix the nullability of the field?

In Iceberg, users can directly provide a stream of RowData, for which the precise schema is only known when writing the stream to the table. Once the table schema is available, Iceberg relies on the Flink FieldGetters to read values from the provided data, but it turns out that the result of FieldGetter#getFieldOrNull cannot be trusted. When a null field returns an arbitrary value, depending on what bytes happen to be stored for that field while the null flag is set, the result is plainly wrong.

Accurate nullability is important to the SQL system, not only to ensure proper processing on the consumer side (e.g., the field may be used as the primary key of the target table, which requires not-null), but also to give the planner more optimization possibilities (e.g., predicate derivation; also, the runtime code can skip the null-check overhead for non-null fields, etc.).

Thanks for highlighting the value of the non-null concept. I'm definitely not trying to get rid of non-null, but we need to be able to trust the values stored in RowData when using FieldGetters. If there are lower levels that don't want that check, that is fine too, they can directly operate on RowData and not use the FieldGetters.

What you say about skipping null checks is not in line with what Flink does. The Flink runtime performs null checks when reading from RowData, regardless of the nullability type. I believe the externally facing API of RowData#createFieldGetter(..) should do that as well.

I think it would be interesting to check how Flink handles user-provided null values for non-null types. I've seen the table.exec.sink.not-null-enforcer option, but it only applies to sinks. That wouldn't help in this case, because the null value would already have been resurrected by a missing isNullAt(..) check.

@lincoln-lil (Contributor) commented

For the first part, my question is: since the nullability info provided by users is not reliable, why not enforce the use of nullable types when using RowData#createFieldGetter? Am I missing something here?

Regarding Flink's current handling, you're right. I checked the code, and during the codegen phase, null checks are always added when accessing fields of Row types (not relying on FieldGetter).

@mxm (Contributor, Author) commented Feb 18, 2025

For the first part, my question is: since the nullability info provided by users is not reliable, why not enforce the use of nullable types when using RowData#createFieldGetter? Am I missing something here?

I understand where you are coming from, but not all Iceberg writers have the concept of non-null / "required" columns. A schema may have certain type definitions, but ultimately the writer decides what can be written.

4 participants