
All array functions should represent NULL as an element #7142

Open
izveigor opened this issue Jul 30, 2023 · 15 comments
Labels: enhancement (New feature or request)

Comments

@izveigor
Contributor

Is your feature request related to a problem or challenge?

Follow on to #6662
We have an important problem related to using array functions with NULL arguments. As @alamb said in the PR (see below for full details), DataFusion's casting system should be adapted for this case; otherwise it will always throw errors.

The current implementation:

❯ select array_append([1, 2, 3, 4, 5], NULL);
Optimizer rule 'simplify_expressions' failed
caused by
This feature is not implemented: Array_append is not implemented for types 'Int64' and 'Null'.

Should be:

❯ select array_append([1, 2, 3, 4, 5], NULL);
----
[1, 2, 3, 4, 5, NULL]

Describe the solution you'd like

@alamb statement about the issue:

I am not sure about this approach of taking either a ListArray or a NullArray

In the other functions, the way NULL is treated is that the input types are always the same (in this case ListArray) and the values would be null (aka array.is_valid(i) would return false for rows that are null).

Complicating matters is if you type a literal null in sql like:

select array_concat([1,2], null)

That comes to DataFusion as a null literal (with DataType::Null). The coercion / casting logic normally will coerce this to the appropriate type.

For example, here is how I think arithmetic works with null:

select 1 + NULL

Arrives like

ScalarValue::Int32(Some(1)) + ScalarValue::Null

And then the coercion logic will add a cast to Int32:

ScalarValue::Int32(Some(1)) + CAST(ScalarValue::Null, DataType::Int32)

And then the constant folder will collapse this into:

ScalarValue::Int32(Some(1)) + ScalarValue::Int32(None)

So by the time the arithmetic kernel sees it, it only has to deal with arguments of Int32
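
As a rough illustration of that constant-folding step, here is a minimal Rust sketch (the helper name coerce_null_literal is hypothetical; it assumes DataFusion's ScalarValue::try_from(&DataType), which builds a typed null scalar):

use arrow::datatypes::DataType;
use datafusion_common::{Result, ScalarValue};

// Hypothetical helper: what CAST(ScalarValue::Null, DataType::Int32) folds into.
fn coerce_null_literal(value: ScalarValue, target: &DataType) -> Result<ScalarValue> {
    match value {
        // ScalarValue::try_from(&DataType) builds a null scalar of that type,
        // e.g. ScalarValue::Int32(None) for DataType::Int32.
        ScalarValue::Null => ScalarValue::try_from(target),
        other => Ok(other),
    }
}

With this, coerce_null_literal(ScalarValue::Null, &DataType::Int32) yields ScalarValue::Int32(None), which is exactly the constant-folded form the arithmetic kernel sees.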

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Contributor

alamb commented Jul 31, 2023

To be clear, to be consistent with the rest of the datafusion type system, I think we should stop using DataType::Null to represent null constants in array functions and instead move to using DataType::XXX where XXX is the type of the other list elements.

So after type coercion I would expect make_list(1, 2, NULL) to be

make_list(
  ScalarValue::Int64(Some(1)),
  ScalarValue::Int64(Some(2)),
  ScalarValue::Int64(None),  // NOTE this is not ScalarValue::Null
) -> ListArray w/ Int64 value type
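
For reference, a small sketch of what that coerced result looks like at the Arrow level (assuming arrow-rs's ListArray::from_iter_primitive constructor): the NULL becomes a null value inside an Int64 ListArray rather than a separate DataType::Null argument.

use arrow::array::{Array, Int64Array, ListArray};
use arrow::datatypes::{DataType, Int64Type};

fn main() {
    // One list row [1, 2, NULL] where the NULL is ScalarValue::Int64(None)
    let list = ListArray::from_iter_primitive::<Int64Type, _, _>(vec![Some(vec![
        Some(1),
        Some(2),
        None, // the coerced NULL element
    ])]);

    let values = list.values();
    let values = values.as_any().downcast_ref::<Int64Array>().unwrap();
    assert_eq!(values.data_type(), &DataType::Int64); // element type stays Int64
    assert!(!values.is_valid(2)); // array.is_valid(i) is false for the null row
}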

@izveigor
Contributor Author

@alamb, that looks reasonable to me. But what would we do with pure null arrays (see the ticket: #7144)? Should it be ScalarValue::Null or something else (like ScalarValue::Int64(None))?
I also agree that we should check for empty arrays by length (len == 0 -> empty array) and not use NULL checking for these cases (as the current version of DataFusion does: array[NULL] -> empty array).

@alamb
Contributor

alamb commented Jul 31, 2023

Good question @izveigor -- I left a response here #7144 (comment)

@jayzhan211
Contributor

It seems only these two functions have issues with it?

  • array_append
  • array_prepend

@crepererum
Contributor

It seems only these two functions have issues with it?

Apparently array_empty as well.

@jayzhan211
Contributor

What is the expected behavior for array concat with nulls and empty arrays?
I tried to think about the results by converting null to a compatible non-null type.

// convert [null] to [i64], so we get []
select array_concat(make_array(make_array(1, 2), make_array(3, 4)), make_array(null));
----
[[1, 2], [3, 4], []]

// NOTE: DuckDB returns [[1, 2], [3, 4], null]
// convert [[]] to [[i64]], so we get []
select array_concat(make_array(make_array(1, 2), make_array(3, 4)), make_array(make_array()));
----
[[1, 2], [3, 4], []]

// convert null to i64, but since it should be at least [i64], ignore it (follows DuckDB behavior).
query ?
select array_concat(make_array(make_array(1, 2), make_array(3, 4)), null);
----
[[1, 2], [3, 4]]

@jayzhan211
Contributor

jayzhan211 commented Nov 11, 2023

#7995 (comment)

@alamb Should we introduce another Expr::ArrayFunction that differs from other Expr::ScalarFunctions?
I think null handling is unique to ArrayFunction; instead of pattern matching all the array functions in ScalarFunction, maybe we can just introduce a new abstraction?

@alamb
Contributor

alamb commented Nov 13, 2023

@alamb Should we introduce another Expr::ArrayFunction that differs from other Expr::ScalarFunctions?
I think null handling is unique to ArrayFunction; instead of pattern matching all the array functions in ScalarFunction, maybe we can just introduce a new abstraction?

I don't think null handling has to be unique to array functions, and it is a complicated enough topic that I think adding a new abstraction will simply make the code even more complicated.

In this case, I think the intention is to treat a NULL constant as a one element list with a single NULL

I think:

  1. array_append(arg1, arg2, ...) should accept NULL constants (which will probably come from the parser as ScalarValue::Null)
  2. During type coercion, I expect the code to handle coercing ScalarValue::Null to ScalarValue::<ElementType>

So for the example given by @izveigor above:

select array_append([1, 2, 3, 4, 5], NULL);

I would expect this to be parsed like

select array_append(ScalarValue::List(<ListArray of Int32>), ScalarValue::Null);

Then I would expect type coercion to coerce the Null to type Int32 (a null).

select array_append(ScalarValue::List(<ListArray of Int32>), ScalarValue::Int32(None));

And then the function implementation only handles ListArray / child array of the same type (ListArray and Int32Array).
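
A rough sketch of that coercion step (the helper is hypothetical and only mirrors the description above; it assumes ScalarValue::try_from(&DataType) builds a typed null):

use arrow::datatypes::DataType;
use datafusion_common::{DataFusionError, Result, ScalarValue};

// Hypothetical: given the list argument's type, rewrite a bare NULL second
// argument into a typed null of the list's element type.
fn coerce_append_element(list_type: &DataType, element: ScalarValue) -> Result<ScalarValue> {
    let DataType::List(field) = list_type else {
        return Err(DataFusionError::Plan(
            "array_append expects a List as its first argument".to_string(),
        ));
    };
    match element {
        // e.g. for a ListArray of Int32: ScalarValue::Null -> ScalarValue::Int32(None)
        ScalarValue::Null => ScalarValue::try_from(field.data_type()),
        other => Ok(other),
    }
}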

The type coercion for function arguments is here: https://github.com/apache/arrow-datafusion/blob/2185842be22b695cf00e615db68b373f86fd162b/datafusion/expr/src/type_coercion/functions.rs#L33-L72

Does that make sense @jayzhan211 ?

@jayzhan211
Contributor

jayzhan211 commented Nov 14, 2023

@alamb Do you mean we should do the null conversion in coerce_arguments_for_signature instead of coerce_arguments_for_fun? I understand we should convert nulls in the type coercion step, but where should we do it? I previously did this in coerce_arguments_for_fun since I think this conversion is more easily done function-wise than signature-wise.

In the coerce_arguments_for_signature step, if we need to consider null as invalid and convert it to Int32(None), we probably need a better type signature for each function.
https://github.com/apache/arrow-datafusion/blob/a38ac202f8db5bb7e184b57a59875b05990ee7c3/datafusion/expr/src/type_coercion/functions.rs#L74-L117

ArrayAppend has the signature Any(2). We don't have a way to know whether it has a null that we need to convert. Maybe we should review #7580?

@alamb
Contributor

alamb commented Nov 14, 2023

ArrayAppend has the signature Any(2). We don't have a way to know whether it has a null that we need to convert. Maybe we should review #7580?

I haven't had a chance to review #7580 in depth

However, I agree one core problem is that there is no current signature that can express something like "the first argument is a ListArray and the subsequent arguments must have the same type as that list's elements"

Perhaps we can either:

  1. add a TypeSignature variant that expresses this concept somehow
  2. Add a TypeSignature that has custom rules (via a trait or something)

Something like

trait SignatureComputation {
    /// Returns a Vec of all possible valid argument types for this signature.
    fn get_valid_types(
        &self,
        signature: &TypeSignature,
        current_types: &[DataType],
    ) -> Result<Vec<Vec<DataType>>>;
}

And then

enum TypeSignature {
    ...
    Computation(Arc<dyn SignatureComputation>),
    ...
}

Which would allow arbitrarily complex type signatures
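
As a sketch of how one such rule might look under this proposal (everything here is hypothetical and builds on the SignatureComputation trait above; nothing like it exists in DataFusion today):

use arrow::datatypes::DataType;
use datafusion_common::Result;

// "The first argument is a List, every following argument must coerce to that
// list's element type" -- roughly what array_append needs.
struct ListThenElements;

impl SignatureComputation for ListThenElements {
    fn get_valid_types(
        &self,
        _signature: &TypeSignature,
        current_types: &[DataType],
    ) -> Result<Vec<Vec<DataType>>> {
        match current_types.first() {
            Some(DataType::List(field)) => {
                // Accept the list as-is and coerce the rest to its element type.
                let element = field.data_type().clone();
                let mut valid = vec![current_types[0].clone()];
                valid.extend(std::iter::repeat(element).take(current_types.len() - 1));
                Ok(vec![valid])
            }
            // No valid coercion for a non-list first argument.
            _ => Ok(vec![]),
        }
    }
}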

@jayzhan211
Contributor

Note that the array literal select [1, null, 3] is handled very differently from make_array when it comes to nulls. We may need to write a specialized conversion here:
https://github.com/apache/arrow-datafusion/blob/325a3fbe7623d3df0ab64867545c4d93a0c96015/datafusion/sql/src/expr/value.rs#L153-L163

@jayzhan211
Contributor

D select array[[[1]], null, [null], [[3, 4]]];
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ (ARRAY[main.list_value(main.list_value(1)), NULL, main.list_value(NULL), main.list_value(main.list_value(3, 4))]) │
│                                                    int32[][][]                                                    │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [[[1]], NULL, [NULL], [[3, 4]]]                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

@alamb Do we have to deal with nested lists like DuckDB does? If we naively convert null to i32, we get list(list(i64)), i32, list(i32), list(list(i64)). It is hard to tell whether these are null cases or invalid inputs with unaligned dimensions.

If we just find the highest-dimension type and convert null to it, we get list(list(i64)), list(list(i32)), list(list(i32)), list(list(i64)); then we have lost the dimension information for the nulls, so we cannot produce a result like null, [null] as DuckDB does.

I find the current approach hard to work with once we consider dimensions.

@alamb
Contributor

alamb commented Nov 19, 2023

I think it would help in the situation above to keep the types and values separate and I think the logic should be able to handle it.

So I think from the parser [[[1]], null, [null], [[3, 4]]] would be (note that I purposely listed the null literal as a "wildcard" type to avoid confusion, even though the actual type is DataType::Null; it can be cast to any other type):

[[1]]    type = list(list(i64)),  value = 1 element list [1]
null     type = wildcard,         value = null
[null]   type = list(int32),      value = 1 element list of null
[[3, 4]] type = list(list(int32)), value = 1 element list [3, 4]

So the code needs to determine the output type ONLY based on the input types:

list(list(i64))
wildcard
list(int32)
list(list(int32))

Which should be list(list(i64))

And then the coercion code needs to try and cast each value to list(list(i64)), which is possible in this case
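
To make that concrete, here is a hypothetical sketch of the "output type from input types only" step (nothing here exists in DataFusion): DataType::Null is the wildcard and is skipped, and only the dimension part is handled; unifying element types (Int32 vs Int64 -> Int64) and rejecting unreconcilable dimensions is left to the real implementation.

use arrow::datatypes::DataType;

// Nesting depth: i64 -> 0, list(i64) -> 1, list(list(i64)) -> 2, ...
fn list_depth(t: &DataType) -> usize {
    match t {
        DataType::List(field) => 1 + list_depth(field.data_type()),
        _ => 0,
    }
}

// Pick an output type from the input types only, skipping the wildcard
// (DataType::Null) inputs, which can be cast to anything.
fn common_array_type(types: &[DataType]) -> Option<DataType> {
    types
        .iter()
        .filter(|t| !matches!(t, DataType::Null))
        .max_by_key(|t| list_depth(t))
        .cloned()
}

For the four input types above this keeps one of the list(list(..)) types, and the per-value coercion then tries to cast null, [null], and [[3, 4]] to it.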

@jayzhan211
Contributor

jayzhan211 commented Dec 21, 2023

Enhance null and empty array handling for each array function

@Weijun-H
Member

Weijun-H commented Feb 2, 2024

When working on task #9108, I found NULL handling to be quite inconsistent 😥. Some functions follow PostgreSQL while others follow DuckDB. It should be as consistent as possible.
