-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change compound column field name rules #952
Conversation
Hi @houqp, could you please take a look at this? And according to the PR template, this PR should be labeled as "API change". But I can't add the label myself. |
Thanks for the proposal. imo it is a bit counter-intuitive to write I am curious how this interacts with the dataframe API. E.g. (pseudo-code) df = ctx.create_in_memory_table("table1", batches)?;
df = df.agg(sum(df["c1"]))?;
df.collect() would yield the schema |
Thanks for your question @jorgecarleitao.
Not that I know of for now. There are some other engines' behavior here. This proposal does change a lot. From eliminating all qualifiers to bring all to those composite expressions.
It will be |
Thanks @waynexia for the PR, I am on vacation this week, but I will try to find some time to review it today or tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I defer to @houqp and @jorgecarleitao on this -- they spent quite some time studying and writing up the output field name semantics: https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md which I think we should continue to follow, if possible
cc @jimexist
Our current output field name semantics mostly aligns with what spark has, which strips column qualifiers in all cases. This PR changes the semantics to handle compound and bare column names differently. For bare names, we strip column qualifiers, but not for compound columns. Unlike Spark, relational databases like MySQL, Posgresql and Sqlite all treat compound and bare column names differently. MySQL, Postgresql and Sqlite strip qualifers for bare column names like we do. But MySQL and Sqlite use raw user query input for compound column names. Postgresql on the other hand just uses In all compute engines, users are not expected to reference compound columns by generated names because these names are not guaranteed to be valid sql expressions. Instead, they should always manually alias them. As a result, the output filed name for compound columns are just there for display/debug purpose only. See also @Dandandan 's comment at #280 (comment). @jorgecarleitao as for the counter-intuitive example you mentioned, the current implementation will output field name The proposed new behavior provides better UX compared to Postgresql's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left two minor naming comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I will give some time for @alamb @jorgecarleitao and @Dandandan to review and comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waynexia ! |
Which issue does this PR close?
Closes #.
Rationale for this change
Origin in #792, some discussions here: #792 (comment)
In the current implementation and spec., all field names don't contain the table or relation name. It might be better to hold those qualifiers in compound columns resulting from projection with calculating expr (
select t1 + 1
) or aggr expr (sum(t2)
) etc. This can improve readability by figuring out where a column is from. And also in some parts of inner processing (like the logical plan phase), we use this qualified form to identify a column.What changes are included in this PR?
This PR changes the rule of field name generation. For now, all compound columns' (columns with other outer expressions) field names will be qualified form.
Are there any user-facing changes?
Yes.
Compound columns' field name will now contain table / relation name, like
AVG(table.c1)
.