Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support unnest in GROUP BY clause #11469

Merged
merged 8 commits into from
Jul 17, 2024

Conversation

JasonLi-cn
Copy link
Contributor

@JasonLi-cn JasonLi-cn commented Jul 15, 2024

Which issue does this PR close?

Closes #11470

Rationale for this change

support SQLs like:

SELECT unnest(list_type_col) c0, count(1) FROM tab GROUP BY c0;
SELECT unnest(unnest(recursive_list_type_col)) c0, count(1) FROM tab GROUP BY c0;

What changes are included in this PR?

Are these changes tested?

Yes. Has added some test cases into unnest.slt.

Are there any user-facing changes?

Simplified SQL syntax, eg:

SELECT t0.c0, count(1) FROM (SELECT unnest(list_col) c0 FROM tal) t0 GROUP BY t0.c0;

can replace by

SELECT unnest(list_col) c0, count(1) FROM tal GROUP BY c0;

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jul 15, 2024
@jayzhan211
Copy link
Contributor

I wonder why we need to specialize group by expression for unnest, does that mean we need to specialize unnest with filter by too? 🤔

@JasonLi-cn
Copy link
Contributor Author

I wonder why we need to specialize group by expression for unnest, does that mean we need to specialize unnest with filter by too? 🤔

  • Pg supports group by unnest expression;
  • Our user needs this syntax support.

So, I tried to implement it in the community version.

As far as I know, pg does not currently support unnest in filter, so I'm not sure it needs to be implemented.

@JasonLi-cn
Copy link
Contributor Author

Thank you for the new feature, I have some thoughts for an alternative design:

Recursive aggregate functions are generally unsupported: e.g. select sum(sum(v1)) from t1 is not supported in DataFusion/Postgres/DuckDB. For this unnest() case, a recursive call is required, DuckDB uses: https://duckdb.org/docs/sql/query_syntax/unnest.html#recursive-unnest

What do you think about implementing DuckDB syntax, I think it's better for:

  1. Keep the SQL syntax consistent and avoid nested aggregate function
  2. Avoid changing planner code (which might be error-prone), and keep the recursive unnest logic within the aggregate function body

Thank you for your reply.

1 If "keep the recursive unnest logic within the aggregate function body" is implemented, at least the following work needs to be done:
(1) Improve the logic of Expr::Unnest processing in TypeCoercionRewriter;
(2) Implement create_physical_expr function to support Expr::Unnest;
(3) Maybe there are other tasks...
Also, I would like to check whether you are talking about aggregate function or AggregateExec here. Because pg currently does not support aggregation functions that include unnest. (Of course, it is not appropriate to illustrate this issue only by pg here)

2 If 'Recursive Unnest' is handled like DuckDB, how do we implement the unnest(unnest(list_column)) syntax when the user needs it? By optimization rules?

@2010YOUY01
Copy link
Contributor

Thank you for the new feature, I have some thoughts for an alternative design:
Recursive aggregate functions are generally unsupported: e.g. select sum(sum(v1)) from t1 is not supported in DataFusion/Postgres/DuckDB. For this unnest() case, a recursive call is required, DuckDB uses: https://duckdb.org/docs/sql/query_syntax/unnest.html#recursive-unnest
What do you think about implementing DuckDB syntax, I think it's better for:

  1. Keep the SQL syntax consistent and avoid nested aggregate function
  2. Avoid changing planner code (which might be error-prone), and keep the recursive unnest logic within the aggregate function body

Thank you for your reply.

1 If "keep the recursive unnest logic within the aggregate function body" is implemented, at least the following work needs to be done: (1) Improve the logic of Expr::Unnest processing in TypeCoercionRewriter; (2) Implement create_physical_expr function to support Expr::Unnest; (3) Maybe there are other tasks... Also, I would like to check whether you are talking about aggregate function or AggregateExec here. Because pg currently does not support aggregation functions that include unnest. (Of course, it is not appropriate to illustrate this issue only by pg here)

2 If 'Recursive Unnest' is handled like DuckDB, how do we implement the unnest(unnest(list_column)) syntax when the user needs it? By optimization rules?

@JasonLi-cn I mistakenly thought unnest was some aggregate function... please ignore my previous reply, and also sorry for the confusion.

@jayzhan211
Copy link
Contributor

@JasonLi-cn Is it better to move the logic into try_process_unnest?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JasonLi-cn -- I think this PR looks good to me. Thank you for the contribution

I think adding some comments about what try_process_group_by_unnest is doing and why would help improve this code but I don't think it is required prior to merge

c @jonahgao

@JasonLi-cn
Copy link
Contributor Author

Thank you @JasonLi-cn -- I think this PR looks good to me. Thank you for the contribution

I think adding some comments about what try_process_group_by_unnest is doing and why would help improve this code but I don't think it is required prior to merge

c @jonahgao

Thank you @alamb for your review. I have completed the code modification according to your modification suggestions.

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but I think it can be merged as is. It works well. Perhaps we can try to improve it through subsequent PRs.

// Projection: tab.array_col AS unnest(tab.array_col)
// TableScan: tab
// ```
let mut intermediate_plan = input.clone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid this clone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 810ca8b which avoids the cloning in this function


/// Try converting Unnest(Expr) of group by to Unnest/Projection
/// Return the new input and group_by_exprs of Aggregate.
fn try_process_group_by_unnest(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to have some code repetition with function try_process_unnest.

I wonder if there's a better way to handle this, such as planning unnest before aggregation, and then reusing the current group-by planning logic. This seems more intuitive to me. But I'm not sure about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #11498

@@ -297,6 +298,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
input: LogicalPlan,
select_exprs: Vec<Expr>,
) -> Result<LogicalPlan> {
// Try process group by unnest
let input = self.try_process_aggregate_unnest(input)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unnest has already been processed by try_process_aggregate_unnest, does the following logic for handling unnest become redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. We need to verify this logic in the future

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #11498

@alamb alamb changed the title feat: support group by unnest feat: support unnest in GROUP BY clause Jul 16, 2024
@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

I merged up to resolve a conflict as well

@JasonLi-cn
Copy link
Contributor Author

I merged up to resolve a conflict as well

Thank you @alamb for your help to modify the code, I learned a lot from it.

@jonahgao jonahgao merged commit a979f3e into apache:main Jul 17, 2024
24 checks passed
@jonahgao
Copy link
Member

Thanks @JasonLi-cn @alamb

xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* feat: support group by unnest

* pass slt

* refactor: mv process_group_by_unnest into try_process_unnest

* chore: add some documentation comments and tests

* Avoid cloning input

* use consistent field names

---------

Co-authored-by: Andrew Lamb <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* feat: support group by unnest

* pass slt

* refactor: mv process_group_by_unnest into try_process_unnest

* chore: add some documentation comments and tests

* Avoid cloning input

* use consistent field names

---------

Co-authored-by: Andrew Lamb <[email protected]>
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* feat: support group by unnest

* pass slt

* refactor: mv process_group_by_unnest into try_process_unnest

* chore: add some documentation comments and tests

* Avoid cloning input

* use consistent field names

---------

Co-authored-by: Andrew Lamb <[email protected]>
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* feat: support group by unnest

* pass slt

* refactor: mv process_group_by_unnest into try_process_unnest

* chore: add some documentation comments and tests

* Avoid cloning input

* use consistent field names

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feat: Support GROUP BY unnest expr
5 participants