Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some aggregates silently ignore IGNORE NULLS and ORDER BY on arguments #9924

Open
alamb opened this issue Apr 3, 2024 · 7 comments
Open
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Apr 3, 2024

Describe the bug

DataFusion now supports providing ordering and nulls information to aggregates, but some aggregates ignore the flags silently

For example

select first_value(column1 ORDER BY column2) FROM (values (1,2), (3,4), (-1,0)) ;
+----------------------+
| FIRST_VALUE(column1) |
+----------------------+
| -1                   |
+----------------------+select first_value(column1 ORDER BY column2) IGNORE NULLS FROM (values (1,2), (3,4), (null,0)) ;
+----------------------+
| FIRST_VALUE(column1) |
+----------------------+
| 1                    |
+----------------------+
1 row in set. Query took 0.002 seconds.

To Reproduce

select count(*) from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+
1 row in set. Query took 0.039 seconds.

❯ select count(*) IGNORE NULLS from (values (1), (null), (2));
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+
1 row in set. Query took 0.001 seconds.

Also, for ordering

select avg(column1 ORDER BY column2) FROM (values (1,2), (3,4), (null,0)) ;
+--------------+
| AVG(column1) |
+--------------+
| 2.0          |
+--------------+
1 row in set. Query took 0.008 seconds.

Expected behavior

I expect

select count(*) IGNORE NULLS from (values (1), (null), (2));

To error with "IGNORE NULLS is not supported

I also expect

select avg(column1 ORDER BY column2) FROM (values (1,2), (3,4), (null,0)) ;

to error with "ORDER BY" not supported for avg

Additional context

@jayzhan211 has some good ideas at #9920 (comment) about how to make checking this easier / harder to miss

@alamb alamb added the bug Something isn't working label Apr 3, 2024
@alamb alamb changed the title Aggregates ignore IGNORE NULLS and ORDER BY on arguments Some aggregates silently ignore IGNORE NULLS and ORDER BY on arguments Apr 3, 2024
@comphead
Copy link
Contributor

comphead commented Apr 3, 2024

@huaxingao this might be interesting

@huaxingao
Copy link
Contributor

@comphead Thanks for pinging me! I will work on this issue.

@mustafasrepo
Copy link
Contributor

I think for AVG, or COUNT type of aggregates we can safely ignore ORDER BY expression. Since operation result is same for any permutation of the input. However, for first_value, nth_value, last_value this is not the case. However, I agree that IGNORE NULLS should raise an error.

@alamb
Copy link
Contributor Author

alamb commented Apr 4, 2024

I think for AVG, or COUNT type of aggregates we can safely ignore ORDER BY expression. Since operation result is same for any permutation of the input.

That is true, though applying an ORDER BY to the argument of COUNT may be much slower 🤔 (as DataFusion would have to sort the input)

However, we could remove ORDER BY that don't change the output as a follow on optimization

@alamb
Copy link
Contributor Author

alamb commented Apr 4, 2024

FWIW @jayzhan211 has a really nice API suggestion here I think https://github.com/apache/arrow-datafusion/pull/9920/files#r1549905825

Specifically, add these two functions to the AggregateUDFImpl

/// Return true if the aggregate supports `IGNORE NULL`s
fn support_ignore_nulls() -> bool  { false }

/// Return true if the aggregate supports and `ORDER BY` to specify its ordering:
/// `SELECT first_value(x ORDER BY y) ... `
fn support_ordering() -> bool { false }

@alamb
Copy link
Contributor Author

alamb commented Apr 4, 2024

I wrote some tests for this feature in another PR that we decided not to use. I put them here #9953 in case that is helpful @huaxingao

@mustafasrepo
Copy link
Contributor

FWIW @jayzhan211 has a really nice API suggestion here I think https://github.com/apache/arrow-datafusion/pull/9920/files#r1549905825

Specifically, add these two functions to the AggregateUDFImpl

/// Return true if the aggregate supports `IGNORE NULL`s
fn support_ignore_nulls() -> bool  { false }

/// Return true if the aggregate supports and `ORDER BY` to specify its ordering:
/// `SELECT first_value(x ORDER BY y) ... `
fn support_ordering() -> bool { false }

I think these APIs are very nice, clear. We can add these supports

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants