Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve column naming by aliasing with expression name #280

Closed
wants to merge 7 commits into from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented May 6, 2021

Which issue does this PR close?

Closes #279

Rationale for this change

To support optimizations like constant folding the column names need to keep consistent.
This PR also improves the display of column names.

What changes are included in this PR?

  • Wrap unnamed expressions in a alias using the formatted sql expression. This keeps us much closer to the original expression (which you can see in the tests).
  • Don't display alias in logical plan for column expressions where the alias is the same as the column name. This avoids the problem that every formatted logical plan would look like #col AS col
  • TODO: think a bit more about implications. Adding alias means that a query could use the alias in another projection, which is not what we want.

Are there any user-facing changes?

Yes, column names will be different.

@Dandandan Dandandan marked this pull request as ready for review May 7, 2021 10:19
@Dandandan Dandandan requested review from alamb and jorgecarleitao May 7, 2021 10:19
@Dandandan Dandandan marked this pull request as draft May 7, 2021 10:31
@codecov-commenter
Copy link

Codecov Report

Merging #280 (8522633) into master (b8805d4) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #280      +/-   ##
==========================================
- Coverage   76.19%   76.13%   -0.06%     
==========================================
  Files         140      140              
  Lines       23595    23607      +12     
==========================================
- Hits        17978    17974       -4     
- Misses       5617     5633      +16     
Impacted Files Coverage Δ
datafusion/src/execution/context.rs 92.66% <100.00%> (+0.02%) ⬆️
datafusion/src/execution/dataframe_impl.rs 89.10% <100.00%> (+0.21%) ⬆️
datafusion/src/logical_plan/expr.rs 81.68% <100.00%> (-2.72%) ⬇️
datafusion/src/sql/planner.rs 83.65% <100.00%> (+0.02%) ⬆️
datafusion/src/sql/utils.rs 48.12% <0.00%> (-0.63%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8805d4...8522633. Read the comment docs.

@Dandandan
Copy link
Contributor Author

I will do some more research into what's needed for column names, I think we have to avoid reusing alias for this and see if we should support unnamed columns better.

@Dandandan Dandandan closed this May 7, 2021
@alamb
Copy link
Contributor

alamb commented May 7, 2021

This is an interesting idea -- and seems similar to what @wqc200 was working on in apache/arrow#9600

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the concerns about column expressions and what might be bad here. The alternate is probably to keep a "name" field on all Exprs which was the approach in apache/arrow#9600

@@ -1517,7 +1520,7 @@ mod tests {
fn select_no_relation() {
quick_test(
"SELECT 1",
"Projection: Int64(1)\
"Projection: Int64(1) AS 1\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an improvement to me.

@Dandandan
Copy link
Contributor Author

I don't really understand the concerns about column expressions and what might be bad here. The alternate is probably to keep a "name" field on all Exprs which was the approach in apache/arrow#9600

Happy to reconsider! My main concern is that introducing aliases for unnamed expressions will cause other "strange" issues like using the name of the expression, like:
select "1+1" from (select 1+1) t

Which should fail but would return 2?

I think we should make sure unnamed expressions can only be referenced by index, or with * and not by a generated name.

@alamb
Copy link
Contributor

alamb commented May 8, 2021 via email

@houqp
Copy link
Member

houqp commented May 9, 2021

I think we should make sure unnamed expressions can only be referenced by index, or with * and not by a generated name.

+1, I did some research on this as well, postgresql even name most of the unnamed expression ?column? in their output. Users should either use index or manually create aliases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve unnamed expressions
4 participants