Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should DEDUPLICATE, FINAL and ID be keywords? #1382

Closed
samuelcolvin opened this issue Aug 15, 2024 · 11 comments
Closed

Should DEDUPLICATE, FINAL and ID be keywords? #1382

samuelcolvin opened this issue Aug 15, 2024 · 11 comments

Comments

@samuelcolvin
Copy link
Contributor

#1359 added DEDUPLICATE, FINAL and ID as keywords, apparently with no discussion of that change.

This is breaking tests on datafusion, see apache/datafusion#12014 (easy to fix), but more to the point seems like a change of behaviour that might be unwanted for many users.

In particular, id is a very column variable or column name, and starting to mark it as a keyword for all dialects seems mistaken.

What should we do?

@samuelcolvin samuelcolvin changed the title Should DEDUPLICATE, FINAL and ID be keywords Should DEDUPLICATE, FINAL and ID be keywords? Aug 15, 2024
@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

apache/datafusion#12014 shows the difference and it is related to creating SQL (thank you @samuelcolvin )

Previously when converting a sql statement to string, it would result in

 SELECT c.id FROM (SELECT j1.j1_id FROM j1) AS c (id)

However, now the id is quoted for some reason:

SELECT c.\"id\" FROM (SELECT j1.j1_id FROM j1) AS c (id)

Full failure is https://github.com/apache/datafusion/actions/runs/10402816913/job/28808419851?pr=12014

thread 'cases::plan_to_sql::roundtrip_statement_with_dialect' panicked at datafusion/sql/tests/cases/plan_to_sql.rs:435:9:
assertion `left == right` failed
  left: "SELECT c.id FROM (SELECT j1.j1_id FROM j1) AS c (id)"
 right: "SELECT c.\"id\" FROM (SELECT j1.j1_id FROM j1) AS c (id)"
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: core::panicking::assert_failed_inner
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:408:17
   3: core::panicking::assert_failed
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:363:5
   4: roundtrip_statement_with_dialect
             at ./tests/cases/plan_to_sql.rs:435:9
   5: sql_integration::cases::plan_to_sql::roundtrip_statement_with_dialect::{{closure}}
             at ./tests/cases/plan_to_sql.rs:220:42
   6: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

What should we do?

I have two questions:

  1. What are the implications of id being an identifier (one is in the display results above).
  2. What would the alternatives be?

I can try and debug this later today. Thank you @samuelcolvin for preparing the upgrade

@samuelcolvin
Copy link
Contributor Author

It seems that id being an identifier doesn't break usage of it in queries:

> cargo run --manifest-path datafusion-cli/Cargo.toml

> create table foo (id int, name string);
0 row(s) fetched. 
Elapsed 0.016 seconds.

> insert into foo (id, name) values (1, 'samuel');
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched. 
Elapsed 0.012 seconds.

> select id from foo where name='samuel';
+----+
| id |
+----+
| 1  |
+----+
1 row(s) fetched. 
Elapsed 0.008 seconds.

Then again, the same works fine in clickhouse:

macbook-pro-5.local :) create table foo (id Int, name String) ENGINE = Memory;

CREATE TABLE foo
(
    `id` Int,
    `name` String
)
ENGINE = Memory

Query id: 323cda8f-7154-42cd-81d1-1346addb40dd

Ok.

0 rows in set. Elapsed: 0.002 sec. 

macbook-pro-5.local :) insert into foo (id, name) values (1, 'samuel');

INSERT INTO foo (id, name) FORMAT Values

Query id: 753fa6a1-5cce-4c65-94ad-588580f56332

Ok.

1 row in set. Elapsed: 0.003 sec. 

macbook-pro-5.local :) select id from foo where name='samuel';

SELECT id
FROM foo
WHERE name = 'samuel'

Query id: fa9d868d-70e1-4461-8b4e-7370a6d5ade3

┌─id─┐
│  1 │
└────┘

1 row in set. Elapsed: 0.001 sec. 

So I'm not really sure what value adding ID to keywords brings.

@git-hulk can you explain what those identifiers help with Clickhouse syntax?

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

I think ID needs to be added as an keyword in clickhouse to parse certin syntactic constructs and adding new keywords is consistent with adding other new features

Maybe we can just special case the conversion of ast --> SQL string in DataFusion to avoid adding the quote on id:

https://github.com/apache/datafusion/blob/ea2e7ab6885de734243dffc4642e2742206de5b9/datafusion/sql/src/unparser/dialect.rs#L132-L141

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

I will hold releasing 0.50.0 #1349 until we resolve this issue

I don't think it needs any code changes from my perspective

Maybe @iffyio @jmhain or @lovasoa have an opinion they can share

@samuelcolvin
Copy link
Contributor Author

I think ID needs to be added as an keyword in clickhouse to parse certin syntactic constructs and adding new keywords is consistent with adding other new features

Maybe we can just special case the conversion of ast --> SQL string in DataFusion to avoid adding the quote on id:

https://github.com/apache/datafusion/blob/ea2e7ab6885de734243dffc4642e2742206de5b9/datafusion/sql/src/unparser/dialect.rs#L132-L141

👍 on it.

@jmhain
Copy link
Contributor

jmhain commented Aug 15, 2024

IMO the tokenizer should consider the dialect when determining whether some identifier is a keyword (though that's a pretty major change so it's probably unrealistic to do it now).

@samuelcolvin
Copy link
Contributor Author

@jmhain I agree on both points.

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

Ok, so I would say the conclusion to the question posed by this ticket "Should DEDUPLICATE, FINAL and ID be keywords?" is : YES

However, maybe in the future we will contemplate being more sophisticated / targeted and make keywords dialect specific.

@alamb alamb closed this as completed Aug 15, 2024
@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

Thank you everyone for the good discussion

@git-hulk
Copy link
Member

@git-hulk can you explain what those identifiers help with Clickhouse syntax?

@samuelcolvin Those keywords were introduced in PR #1359. For example, the keyword FINAL[1] can be used in the OPTIMIZED TABLE as well as the SELECT query to indicate that ClickHose should fully merge the data before returning the result.

However, maybe in the future we will contemplate being more sophisticated / targeted and make keywords dialect specific.

Agree with what @alamb mentioned, keywords may be different from dialect to dialect, and it's hard to avoid similar issues without making keywords dialect-specific.

[1] https://clickhouse.com/docs/en/sql-reference/statements/select/from#final-modifier

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants