Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer count() aggregation is not null #11256

Conversation

findepi
Copy link
Member

@findepi findepi commented Jul 3, 2024

count([DISTINCT [expr]]) aggregate function never returns null. Infer non-nullness of such aggregate expression. This allows elimination of the HAVING filter for a query such as

SELECT ... count(*) AS c
FROM ...
GROUP BY ...
HAVING c IS NOT NULL

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jul 3, 2024
@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from f346b75 to 2cba7df Compare July 3, 2024 16:35
@findepi findepi marked this pull request as draft July 3, 2024 16:35
@findepi

This comment was marked as resolved.

@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from 2cba7df to 37019a5 Compare July 3, 2024 17:49
@findepi findepi marked this pull request as ready for review July 3, 2024 17:49
@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch 3 times, most recently from f9dadfe to fe8fdef Compare July 3, 2024 19:06
@findepi
Copy link
Member Author

findepi commented Jul 3, 2024

Added unit and integration tests. Should be ready to review.

@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from fe8fdef to d839797 Compare July 3, 2024 19:08
@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from d839797 to 0352eda Compare July 3, 2024 19:28
@findepi
Copy link
Member Author

findepi commented Jul 3, 2024

@alamb PTAL

@alamb
Copy link
Contributor

alamb commented Jul 3, 2024

Thank you very much @findepi -- for both this PR and your other recent contributions. I will indeed review this PR, though I may not get to it until tomorrow or Friday

@jonahgao
Copy link
Member

jonahgao commented Jul 4, 2024

I think there might be a simpler way to achieve this.

We can declare the count aggregation expr as non-nullable like this, and then the simplify_expressions rule and the eliminate_filter rule will handle subsequent processing.

Expr::IsNotNull(expr) | Expr::IsNotUnknown(expr)
if !info.nullable(&expr)? =>
{
Transformed::yes(lit(true))
}

@findepi
Copy link
Member Author

findepi commented Jul 4, 2024

@jonahgao thanks!
I think we may later choose to add a similar optimizer to the one that was proposed here for the sake of filter expressions. Let's not add one if not needed today.

@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from 0352eda to ba27f93 Compare July 4, 2024 19:07
@github-actions github-actions bot added logical-expr Logical plan and expressions and removed sqllogictest SQL Logic Tests (.slt) labels Jul 4, 2024
match func_def {
// TODO: min/max over non-nullable input should be recognized as non-nullable
AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(),
// TODO: UDF should be able to customize nullability
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address this TODO we could extend the pattern here

pub fn return_type(
&self,
input_expr_types: &[DataType],
input_expr_nullable: &[bool],
) -> Result<DataType> {

ie now we have return_type(input_types, nullability) -> return_type
we could have return_type(input_types, nullability) -> (return_type, nullability)

this is an API for builtin functions, we would need similar API for UDAFs

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;

Obviously, this would be a breaking change: both method signature and return type would change.
We can avoid the breaking change by making nullability a separate method, but it would be good to first determine the ideal end-state, and only then think how to get there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied into #11274

@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from 24f4ff1 to 53b34ae Compare July 4, 2024 19:55
@github-actions github-actions bot added the core Core DataFusion crate label Jul 4, 2024
Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @findepi . Other questions could be resolved through subsequent PRs.

`count([DISTINCT [expr]])` aggregate function never returns null. Infer
non-nullness of such aggregate expression. This allows elimination of
the HAVING filter for a query such as

    SELECT ... count(*) AS c
    FROM ...
    GROUP BY ...
    HAVING c IS NOT NULL
@findepi findepi force-pushed the findepi/infer-count-aggregation-is-not-null-64b31c branch from 53b34ae to 1a7d541 Compare July 5, 2024 20:15
@findepi
Copy link
Member Author

findepi commented Jul 5, 2024

thanks @jonahgao for your review! updated.

@jonahgao jonahgao merged commit 13cb65e into apache:main Jul 6, 2024
1 check passed
@findepi findepi deleted the findepi/infer-count-aggregation-is-not-null-64b31c branch July 6, 2024 15:17
findepi added a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Fix test function name typo

* Improve formatting

* Infer count() aggregation is not null

`count([DISTINCT [expr]])` aggregate function never returns null. Infer
non-nullness of such aggregate expression. This allows elimination of
the HAVING filter for a query such as

    SELECT ... count(*) AS c
    FROM ...
    GROUP BY ...
    HAVING c IS NOT NULL
vgapeyev added a commit to sdf-labs/arrow-datafusion that referenced this pull request Sep 9, 2024
In apache#11256 it started inferring, for logical plans only,
that count() aggregation does not return nulls.
However, it did not introduce the counterpart inference for physical plans.
Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g.
`sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "`

The fix here is ad hoc.
We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result.  If done right, that should have affect both logical and physical plans.
findepi pushed a commit to findepi/datafusion that referenced this pull request Oct 7, 2024
In apache#11256 it started inferring, for logical plans only,
that count() aggregation does not return nulls.
However, it did not introduce the counterpart inference for physical plans.
Somehow, this discrepancy did not get detected by DF tests, but we run into it with, e.g.
`sdf run -q "SELECT count(*) FROM (VALUES (3, 'c')) as T(c, x) GROUP BY c "`

The fix here is ad hoc.
We should keep an eye on apache#11274 where the goal seems to make UDAFs to communicate nullability of their result.  If done right, that should have affect both logical and physical plans.
findepi added a commit to findepi/datafusion that referenced this pull request Oct 23, 2024
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 23, 2024
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 23, 2024
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 23, 2024
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 23, 2024
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants