-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-37245] Prevent masking null values in BinaryRowData #26100
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxm If I have understood this correctly before the change a non nullable field containing null would return a value. I am thinking this is correct. If a field is defined as non nullable then I would not expect to be able to get a null from it. It does seem reasonable to return a non null value in this case. After the fix we can get a null from a non-nullable field - this does not seem correct. WDYT?
I think there are different concerns here: (a) data integrity and (b) data semantics. (a) The value of a field should always return its correct value. That's why the Flink runtime always check (b) It is a separate concern, whether or not to allow null values for nonNull types. This needs to be enforced by the layer which initializes the RowData. In Iceberg there are actually writers which allow nulled "required" types. Does that make sense? |
My intuitive understanding of the mentioned problem in Iceberg is why not fix the nullability of the field? |
In Iceberg, users can directly provide a stream of RowData, for which the precise schema is only known when writing the stream to the table. Once the table schema is available, Iceberg relies on the Flink FieldGetters to read values from the provided data, but it turns out that the result of the
Thanks for highlighting the value of the non-null concept. I'm definitely not trying to get rid of non-null, but we need to be able to trust the values stored in RowData when using FieldGetters. If there are lower levels that don't want that check, that is fine too, they can directly operate on RowData and not use the FieldGetters. What you say about skipping null checks is not in line with what Flink does. The Flink runtime performs null checks when reading from RowData, regardless of the nullability type. I believe the externally facing API of I think it would be interesting to check how Flink handles user-provided null values for non-null types. I've seen |
For the 1st part, my question is since the nullability info provided by users are not reliable, why not enforce the use of nullable type when use Regarding Flink's current handling, you're right. I checked the code, and during the codegen phase, null checks are always added when accessing fields of Row types (not relying on fieldgetter). |
I understand where you are coming from, but not all Iceberg writers have the concept of "non-null" / required columns. A schema may have certain type definitions, but ultimately the writer decides what can be written. |
RowData#createFieldGetter
is the go-to method for creating field getters for a given field type and position. The method FieldGetter#getFieldOrNull suggests that null is returned if the field has been nulled. But that is not always the case.When using BinaryRowData with a non-null field, which has been set to null, a call to FieldGetter#getFieldOrNull will return a non-null value, interpreting whatever bytes are backing the field as an actual value instead of null.
Example:
Output is:
The expected output would be that the second non-null field also returns null, or raises a NullPointerException directly. That's not the case because RowData#createFieldGetter only checks for null values (via a call to Rowdata#isNullAt(pos)) when the type is nullable (see
flink/flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/RowData.java
Line 289 in b86fdf9
It seems fair to always check for null fields, instead of deferring this easy to forget check to the caller.
We needed to work around this bug in apache/iceberg#12049.