Change compound column field name rules #952

waynexia · 2021-08-28T09:04:51Z

Which issue does this PR close?

Closes #.

Rationale for this change

Origin in #792, some discussions here: #792 (comment)

In the current implementation and spec., all field names don't contain the table or relation name. It might be better to hold those qualifiers in compound columns resulting from projection with calculating expr (select t1 + 1) or aggr expr (sum(t2)) etc. This can improve readability by figuring out where a column is from. And also in some parts of inner processing (like the logical plan phase), we use this qualified form to identify a column.

What changes are included in this PR?

This PR changes the rule of field name generation. For now, all compound columns' (columns with other outer expressions) field names will be qualified form.

Are there any user-facing changes?

Yes.

Compound columns' field name will now contain table / relation name, like AVG(table.c1).

waynexia · 2021-08-28T09:09:07Z

Hi @houqp, could you please take a look at this?

And according to the PR template, this PR should be labeled as "API change". But I can't add the label myself.

jorgecarleitao · 2021-08-28T09:16:49Z

Thanks for the proposal.

imo it is a bit counter-intuitive to write SELECT MIN(c1) FROM table1 and get MIN(table.c1) as the schema. Are there other engine doing this?

I am curious how this interacts with the dataframe API. E.g. (pseudo-code)

df = ctx.create_in_memory_table("table1", batches)?;
df = df.agg(sum(df["c1"]))?;
df.collect()

would yield the schema "SUM(table1.c1)" or "SUM(c1)"? Would I need to write df = df.agg(sum(df["table.c1"]))?; instead?

waynexia · 2021-08-28T09:50:26Z

Thanks for your question @jorgecarleitao.

imo it is a bit counter-intuitive to write SELECT MIN(c1) FROM table1 and get MIN(table.c1) as the schema. Are there other engine doing this?

Not that I know of for now. There are some other engines' behavior here. This proposal does change a lot. From eliminating all qualifiers to bring all to those composite expressions.

would yield the schema "SUM(table1.c1)" or "SUM(c1)"

It will be SUM(table1.c1). The way of interacting with dataframe doesn't change. Only the result field names are different. There is one example in execution/dataframe_impl.rs

houqp · 2021-08-28T15:01:04Z

Thanks @waynexia for the PR, I am on vacation this week, but I will try to find some time to review it today or tomorrow.

alamb

I defer to @houqp and @jorgecarleitao on this -- they spent quite some time studying and writing up the output field name semantics: https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md which I think we should continue to follow, if possible

cc @jimexist

houqp · 2021-08-30T06:30:51Z

Our current output field name semantics mostly aligns with what spark has, which strips column qualifiers in all cases.

This PR changes the semantics to handle compound and bare column names differently. For bare names, we strip column qualifiers, but not for compound columns.

Unlike Spark, relational databases like MySQL, Posgresql and Sqlite all treat compound and bare column names differently. MySQL, Postgresql and Sqlite strip qualifers for bare column names like we do. But MySQL and Sqlite use raw user query input for compound column names. Postgresql on the other hand just uses ?column? for all compound column names.

In all compute engines, users are not expected to reference compound columns by generated names because these names are not guaranteed to be valid sql expressions. Instead, they should always manually alias them. As a result, the output filed name for compound columns are just there for display/debug purpose only. See also @Dandandan 's comment at #280 (comment).

@jorgecarleitao as for the counter-intuitive example you mentioned, the current implementation will output field name SUM(id) column name for query SELECT SUM(t1.id), while the proposed new behavior will output SUM(t1.id) field name for query SELECT SUM(id). So both of them will not use the exact user query input as the output field name. Either way, it should have no impact to how users construct queries.

The proposed new behavior provides better UX compared to Postgresql's ?column? column name. I think it's also an improvement over the current (spark's) behavior because it will produce an unambiguous column name for queries like SELECT t1.id * t2.id FROM t1 JOIN t2 USING id. The current behavior will output id * id, which is not as good as t1.id * t2.id IMO.

houqp

LGTM, left two minor naming comments.

datafusion/src/physical_plan/planner.rs

waynexia · 2021-08-30T14:22:04Z

Thanks @alamb and @houqp for reviewing! I pushed a commit to address the comments.

houqp

LGTM, I will give some time for @alamb @jorgecarleitao and @Dandandan to review and comment.

jorgecarleitao

Thanks a lot for clarifying @waynexia and @houqp .

This looks great, the code and tests look solid, and the changes to the spec also look great.

Thanks for this PR, @waynexia =)

alamb · 2021-09-10T11:02:42Z

Thanks @waynexia !

waynexia added 5 commits August 28, 2021 11:25

change physical name semantic

8607771

replace expect output in context.rs

97ec34e

replace expect output in sql & dataframe_impl

ccb3d7b

add spec entry

ad6589a

replace expect output in lib doc & planner

7de3a81

github-actions bot added the datafusion Changes in the datafusion crate label Aug 28, 2021

alamb reviewed Aug 29, 2021

View reviewed changes

houqp requested review from Dandandan and jorgecarleitao August 30, 2021 06:36

houqp approved these changes Aug 30, 2021

View reviewed changes

datafusion/src/physical_plan/planner.rs Show resolved Hide resolved

datafusion/src/physical_plan/planner.rs Show resolved Hide resolved

houqp requested a review from alamb August 30, 2021 06:42

Merge branch 'apache:master' into qualified-col

17fced6

houqp approved these changes Aug 30, 2021

View reviewed changes

jorgecarleitao approved these changes Aug 30, 2021

View reviewed changes

houqp merged commit 7932cb9 into apache:master Aug 31, 2021

houqp added the enhancement New feature or request label Aug 31, 2021

waynexia deleted the qualified-col branch September 1, 2021 13:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change compound column field name rules #952

Change compound column field name rules #952

waynexia commented Aug 28, 2021

waynexia commented Aug 28, 2021

jorgecarleitao commented Aug 28, 2021

waynexia commented Aug 28, 2021

houqp commented Aug 28, 2021

alamb left a comment

houqp commented Aug 30, 2021 •

edited

Loading

houqp left a comment

waynexia commented Aug 30, 2021

houqp left a comment

jorgecarleitao left a comment

alamb commented Sep 10, 2021

Change compound column field name rules #952

Change compound column field name rules #952

Conversation

waynexia commented Aug 28, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

waynexia commented Aug 28, 2021

jorgecarleitao commented Aug 28, 2021

waynexia commented Aug 28, 2021

houqp commented Aug 28, 2021

alamb left a comment

Choose a reason for hiding this comment

houqp commented Aug 30, 2021 • edited Loading

houqp left a comment

Choose a reason for hiding this comment

waynexia commented Aug 30, 2021

houqp left a comment

Choose a reason for hiding this comment

jorgecarleitao left a comment

Choose a reason for hiding this comment

alamb commented Sep 10, 2021

houqp commented Aug 30, 2021 •

edited

Loading