Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change compound column field name rules #952

Merged
merged 6 commits into from
Aug 31, 2021
Merged

Conversation

waynexia
Copy link
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Origin in #792, some discussions here: #792 (comment)

In the current implementation and spec., all field names don't contain the table or relation name. It might be better to hold those qualifiers in compound columns resulting from projection with calculating expr (select t1 + 1) or aggr expr (sum(t2)) etc. This can improve readability by figuring out where a column is from. And also in some parts of inner processing (like the logical plan phase), we use this qualified form to identify a column.

What changes are included in this PR?

This PR changes the rule of field name generation. For now, all compound columns' (columns with other outer expressions) field names will be qualified form.

Are there any user-facing changes?

Yes.

Compound columns' field name will now contain table / relation name, like AVG(table.c1).

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 28, 2021
@waynexia
Copy link
Member Author

Hi @houqp, could you please take a look at this?

And according to the PR template, this PR should be labeled as "API change". But I can't add the label myself.

@jorgecarleitao
Copy link
Member

Thanks for the proposal.

imo it is a bit counter-intuitive to write SELECT MIN(c1) FROM table1 and get MIN(table.c1) as the schema. Are there other engine doing this?

I am curious how this interacts with the dataframe API. E.g. (pseudo-code)

df = ctx.create_in_memory_table("table1", batches)?;
df = df.agg(sum(df["c1"]))?;
df.collect()

would yield the schema "SUM(table1.c1)" or "SUM(c1)"? Would I need to write df = df.agg(sum(df["table.c1"]))?; instead?

@waynexia
Copy link
Member Author

Thanks for your question @jorgecarleitao.

imo it is a bit counter-intuitive to write SELECT MIN(c1) FROM table1 and get MIN(table.c1) as the schema. Are there other engine doing this?

Not that I know of for now. There are some other engines' behavior here. This proposal does change a lot. From eliminating all qualifiers to bring all to those composite expressions.

would yield the schema "SUM(table1.c1)" or "SUM(c1)"

It will be SUM(table1.c1). The way of interacting with dataframe doesn't change. Only the result field names are different. There is one example in execution/dataframe_impl.rs

@houqp
Copy link
Member

houqp commented Aug 28, 2021

Thanks @waynexia for the PR, I am on vacation this week, but I will try to find some time to review it today or tomorrow.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to @houqp and @jorgecarleitao on this -- they spent quite some time studying and writing up the output field name semantics: https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md which I think we should continue to follow, if possible

cc @jimexist

@houqp
Copy link
Member

houqp commented Aug 30, 2021

Our current output field name semantics mostly aligns with what spark has, which strips column qualifiers in all cases.

This PR changes the semantics to handle compound and bare column names differently. For bare names, we strip column qualifiers, but not for compound columns.

Unlike Spark, relational databases like MySQL, Posgresql and Sqlite all treat compound and bare column names differently. MySQL, Postgresql and Sqlite strip qualifers for bare column names like we do. But MySQL and Sqlite use raw user query input for compound column names. Postgresql on the other hand just uses ?column? for all compound column names.

In all compute engines, users are not expected to reference compound columns by generated names because these names are not guaranteed to be valid sql expressions. Instead, they should always manually alias them. As a result, the output filed name for compound columns are just there for display/debug purpose only. See also @Dandandan 's comment at #280 (comment).

@jorgecarleitao as for the counter-intuitive example you mentioned, the current implementation will output field name SUM(id) column name for query SELECT SUM(t1.id), while the proposed new behavior will output SUM(t1.id) field name for query SELECT SUM(id). So both of them will not use the exact user query input as the output field name. Either way, it should have no impact to how users construct queries.

The proposed new behavior provides better UX compared to Postgresql's ?column? column name. I think it's also an improvement over the current (spark's) behavior because it will produce an unambiguous column name for queries like SELECT t1.id * t2.id FROM t1 JOIN t2 USING id. The current behavior will output id * id, which is not as good as t1.id * t2.id IMO.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left two minor naming comments.

@houqp houqp requested a review from alamb August 30, 2021 06:42
@waynexia
Copy link
Member Author

Thanks @alamb and @houqp for reviewing! I pushed a commit to address the comments.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I will give some time for @alamb @jorgecarleitao and @Dandandan to review and comment.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for clarifying @waynexia and @houqp .

This looks great, the code and tests look solid, and the changes to the spec also look great.

:shipit:

Thanks for this PR, @waynexia =)

@houqp houqp merged commit 7932cb9 into apache:master Aug 31, 2021
@houqp houqp added the enhancement New feature or request label Aug 31, 2021
@waynexia waynexia deleted the qualified-col branch September 1, 2021 13:51
@alamb
Copy link
Contributor

alamb commented Sep 10, 2021

Thanks @waynexia !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants