Discussion for a Systematic Configuration in 'Create External Table' Options #8994
Comments
I think this proposal makes a lot of sense to me. Seeing something like

```sql
CREATE EXTERNAL TABLE t(c1 int)
STORED AS CSV
LOCATION 's3://boo/foo.csv'
OPTIONS (
  'aws.credentials.basic.accesskeyid' 'asdasd',
  'aws.credentials.basic.secretkey' 'asdasd',
  'format.csv.scan.datetime_regex' 'asdasd',
  'format.csv.sink.timestamp_format' 'asdasd',
  'format.csv.sink.date_format' 'asdasd'
)
```

seems pretty self-explanatory and consistent with the other options, as you have pointed out. To ease the transition, we could also add some sort of aliases that allow the old keys to keep working too 🤔
cc @devinjdangelo might also have some ideas
Yes, this is a great idea! It would also allow for more uniformity between our session-level and statement-level option naming conventions. E.g. currently we have `set datafusion.execution.parquet.max_row_group_size=1234` vs `COPY table TO 'file.parquet' (MAX_ROW_GROUP_SIZE 1234)`, which could become instead:
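(The original example here did not survive; the following is only a sketch of what a unified statement-level spelling might look like, assuming the dot-separated key convention proposed in this thread — the exact key name is hypothetical.)

```sql
-- Statement-level option using the same dot-separated key
-- as the session-level setting (illustrative key name):
COPY table TO 'file.parquet'
OPTIONS ('format.parquet.max_row_group_size' '1234');
```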
A nice side effect would be that
Why not
I think the global setting serves to let users change the defaults once and have those settings apply to all statements in the session; those defaults can then be overridden per statement if desired.
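Concretely, the interplay between a session default and a per-statement override could look like the following sketch (the statement-level key name is hypothetical, following the dot-separated convention discussed above):

```sql
-- Change the default once for the whole session:
SET datafusion.execution.parquet.max_row_group_size = 8192;

-- Later statements inherit that default, but a single
-- statement can still override it (illustrative key name):
COPY my_table TO 'file.parquet'
OPTIONS ('format.parquet.max_row_group_size' '1048576');
```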
Is your feature request related to a problem or challenge?
Currently, in our implementation of 'Create External Table', the configuration options are not systematically organized, leading to potential confusion and complexity for the users. This is especially evident when we compare our configuration pattern with other systems like Apache Flink and Apache Spark.
For instance, in our current setup, CSV format options and AWS credential settings are configured in the same flat context. This approach lacks the clarity and structure found in similar systems. Examples of more systematic configurations can be seen in:
Although Spark's approach could be perceived as confusing if applied directly to our 'Create External Table' statement, our current table-creation syntax is the one most closely aligned with Spark's.
However, one aspect where our system shines is our session context configuration. We use a more intuitive dot-separated pattern, like `datafusion.execution.parquet.statistics_enabled`. This is more user-friendly and logically structured.
Describe the solution you'd like
I propose we adopt a more structured and systematic approach in defining table options, similar to our session context configuration. For example, instead of the current format:
We could structure it more clearly:
Or even more detailed:
This approach would separate AWS credentials from CSV format options and further delineate options for scanning and sinking, enhancing clarity and ease of use.
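As an illustrative sketch only (the option keys, location, and values here are hypothetical, following the dot-separated naming discussed in the comments), the current flat form and a more structured alternative might look like:

```sql
-- Current flat form: credentials and format options mixed together
CREATE EXTERNAL TABLE t (c1 INT)
STORED AS CSV
LOCATION 's3://bucket/data.csv'
OPTIONS ('access_key_id' 'xyz', 'timestamp_format' '%Y-%m-%d');

-- Structured form: credentials namespaced under 'aws.', format
-- options under 'format.csv.', with scan vs. sink separated
CREATE EXTERNAL TABLE t (c1 INT)
STORED AS CSV
LOCATION 's3://bucket/data.csv'
OPTIONS (
  'aws.credentials.basic.accesskeyid' 'xyz',
  'format.csv.scan.datetime_regex' '...',
  'format.csv.sink.timestamp_format' '%Y-%m-%d'
);
```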
Impact:
Describe alternatives you've considered
No response
Additional context
No response