Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant Aggregate when DISTINCT & GROUP BY are in the same query #11781

Merged
merged 31 commits into from
Aug 4, 2024

Conversation

mertak-synnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Inefficient planning is produced when a redundant DISTINCT & GROUP BY clause is used together.

Eg.

EXPLAIN SELECT c3 FROM aggregate_test_100 GROUP BY c3 LIMIT 5;
logical_plan
Limit: skip=0, fetch=5
--Aggregate: groupBy=[[aggregate_test_100.c3]], aggr=[[]]
----Aggregate: groupBy=[[aggregate_test_100.c3]], aggr=[[]]
------TableScan: aggregate_test_100 projection=[c3]
physical_plan
GlobalLimitExec: skip=0, fetch=5
--AggregateExec: mode=Final, gby=[c3@0 as c3], aggr=[], lim=[5]
----CoalescePartitionsExec
------AggregateExec: mode=Partial, gby=[c3@0 as c3], aggr=[], lim=[5]
--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
----------AggregateExec: mode=Final, gby=[c3@0 as c3], aggr=[], lim=[5]
------------CoalescePartitionsExec
--------------AggregateExec: mode=Partial, gby=[c3@0 as c3], aggr=[], lim=[5]
----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c3], has_header=true

Consecutive empty aggregates with same group by are unnecessary.

What changes are included in this PR?

It's fixed that if a distinct is the same as a following group by statement, it should not be turned into an Aggregate plan.

Are these changes tested?

Yes

Are there any user-facing changes?

No

metesynnada and others added 29 commits February 22, 2024 09:56
Fix deploying DataFusion site error
…eature/eliminate-aggregate

# Conflicts:
#	.github/workflows/docs.yaml
#	datafusion/optimizer/src/lib.rs
#	datafusion/optimizer/src/optimizer.rs
#	datafusion/optimizer/src/single_distinct_to_groupby.rs
add additional tests for not removing cases
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
@github-actions github-actions bot added the optimizer Optimizer rules label Aug 2, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 2, 2024
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this and it LGTM. Easy win!

@alamb alamb changed the title Feature/eliminate aggregate Remove redundant Aggregate when DISTINCT & GROUP BY Aug 2, 2024
@alamb alamb changed the title Remove redundant Aggregate when DISTINCT & GROUP BY Remove redundant Aggregate when DISTINCT & GROUP BY are in the same query Aug 2, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mertak-synnada and @ozankabak

@@ -4536,19 +4536,14 @@ EXPLAIN SELECT DISTINCT c3 FROM aggregate_test_100 group by c3 limit 5;
logical_plan
01)Limit: skip=0, fetch=5
02)--Aggregate: groupBy=[[aggregate_test_100.c3]], aggr=[[]]
03)----Aggregate: groupBy=[[aggregate_test_100.c3]], aggr=[[]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These certainly look like bettr plans to me.

However, I wonder if these changes now mean the query doesn't fulfill the original intent of the tests

#
# Push limit into distinct group-by aggregation tests
#

Therefore I wonder if we should add another test like

EXPLAIN SELECT DISTINCT c3, min(c1) FROM aggregate_test_100 group by c3 limit 5;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW in my opinion it would be fine to add the other tests as a new PR too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's do it that way

@ozankabak ozankabak merged commit c8e5996 into apache:main Aug 4, 2024
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants