-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Infer count() aggregation is not null #11256
Infer count() aggregation is not null #11256
Conversation
f346b75
to
2cba7df
Compare
This comment was marked as resolved.
This comment was marked as resolved.
2cba7df
to
37019a5
Compare
f9dadfe
to
fe8fdef
Compare
Added unit and integration tests. Should be ready to review. |
fe8fdef
to
d839797
Compare
d839797
to
0352eda
Compare
@alamb PTAL |
Thank you very much @findepi -- for both this PR and your other recent contributions. I will indeed review this PR, though I may not get to it until tomorrow or Friday |
I think there might be a simpler way to achieve this. We can declare the count aggregation expr as non-nullable like this, and then the datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs Lines 1450 to 1454 in 6e63748
|
@jonahgao thanks! |
0352eda
to
ba27f93
Compare
match func_def { | ||
// TODO: min/max over non-nullable input should be recognized as non-nullable | ||
AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(), | ||
// TODO: UDF should be able to customize nullability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address this TODO we could extend the pattern here
datafusion/datafusion/expr/src/aggregate_function.rs
Lines 84 to 88 in 4bc3228
pub fn return_type( | |
&self, | |
input_expr_types: &[DataType], | |
input_expr_nullable: &[bool], | |
) -> Result<DataType> { |
ie now we have return_type(input_types, nullability) -> return_type
we could have return_type(input_types, nullability) -> (return_type, nullability)
this is an API for builtin functions, we would need similar API for UDAFs
datafusion/datafusion/expr/src/udaf.rs
Line 330 in 5f02c8a
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>; |
Obviously, this would be a breaking change: both method signature and return type would change.
We can avoid the breaking change by making nullability a separate method, but it would be good to first determine the ideal end-state, and only then think how to get there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copied into #11274
ba27f93
to
24f4ff1
Compare
24f4ff1
to
53b34ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks @findepi . Other questions could be resolved through subsequent PRs.
`count([DISTINCT [expr]])` aggregate function never returns null. Infer non-nullness of such aggregate expression. This allows elimination of the HAVING filter for a query such as SELECT ... count(*) AS c FROM ... GROUP BY ... HAVING c IS NOT NULL
53b34ae
to
1a7d541
Compare
thanks @jonahgao for your review! updated. |
* Fix test function name typo * Improve formatting * Infer count() aggregation is not null `count([DISTINCT [expr]])` aggregate function never returns null. Infer non-nullness of such aggregate expression. This allows elimination of the HAVING filter for a query such as SELECT ... count(*) AS c FROM ... GROUP BY ... HAVING c IS NOT NULL
In apache#11256 it started inferring, for logical plans only, that count() aggregation does not return nulls. However, it did not introduce the counterpart inference for physical plans. Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g. `sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "` The fix here is ad hoc. We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result. If done right, that should have affect both logical and physical plans.
In apache#11256 it started inferring, for logical plans only, that count() aggregation does not return nulls. However, it did not introduce the counterpart inference for physical plans. Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g. `sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "` The fix here is ad hoc. We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result. If done right, that should have affect both logical and physical plans.
count([DISTINCT [expr]])
aggregate function never returns null. Infer non-nullness of such aggregate expression. This allows elimination of the HAVING filter for a query such as