Discussion for a Systematic Configuration in 'Create External Table' Options #8994

metesynnada · 2024-01-25T11:00:41Z

Is your feature request related to a problem or challenge?

Currently, in our implementation of 'Create External Table', the configuration options are not systematically organized, leading to potential confusion and complexity for the users. This is especially evident when we compare our configuration pattern with other systems like Apache Flink and Apache Spark.

For instance, in our current setup, CSV format options and AWS credential settings are specified in the same context. This approach lacks the clarity and structure found in similar systems. Examples of more systematic configurations can be seen in:

Although Spark’s approach could be perceived as confusing when applied to our 'Create External Table' method, our method is currently more aligned with Spark's approach in terms of table creation.

However, one aspect where our system shines is our session context configuration. It uses a more intuitive dot-separated pattern, like datafusion.execution.parquet.statistics_enabled. This is more user-friendly and logically structured.

Describe the solution you'd like

I propose we adopt a more structured and systematic approach in defining table options, similar to our session context configuration. For example, instead of the current format:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('AWS_ACCESS_KEY_ID' 'asdasd',
         'AWS_SECRET_ACCESS_KEY' 'asdasd',
         'timestamp_format' 'asdasd',
         'date_format' 'asdasd')

We could structure it more clearly:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey' 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

Or even more detailed:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey' 'asdasd',
         'format.csv.scan.datetime_regex' 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

This approach would separate AWS credentials from CSV format options and further delineate options for scanning and sinking, enhancing clarity and ease of use.
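
To picture what the dotted keys buy us, a flat OPTIONS list can be folded into a tree so that each component (credentials, CSV scan, CSV sink) only looks at its own subtree. The Python sketch below is illustrative only: `nest_options` is a hypothetical helper, not DataFusion code, and the key names are simply the ones proposed above.

```python
def nest_options(flat):
    """Fold dot-separated option keys into a nested dict.

    Hypothetical helper for illustration. It assumes no key is both a
    leaf and a parent (e.g. 'a' and 'a.b' together are not supported).
    """
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            # Walk down the tree, creating intermediate levels as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Option keys taken from the proposed CREATE EXTERNAL TABLE statement.
options = {
    "aws.credentials.basic.accesskeyid": "asdasd",
    "aws.credentials.basic.secretkey": "asdasd",
    "format.csv.scan.datetime_regex": "asdasd",
    "format.csv.sink.timestamp_format": "asdasd",
    "format.csv.sink.date_format": "asdasd",
}
nested = nest_options(options)
csv_sink_options = nested["format"]["csv"]["sink"]
```

With this shape, the CSV sink would receive only `timestamp_format` and `date_format`, and never sees the AWS credentials.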

Impact:

User Experience: This change will significantly improve user experience by making configuration more intuitive and easy to understand.
Documentation: Accompanying documentation will be necessary to guide users through the new configuration pattern.
Compatibility: It’s important to note that this change will introduce breaking changes. Thus, a clear migration path needs to be provided for existing users.

Describe alternatives you've considered

No response

Additional context

No response

metesynnada · 2024-01-25T11:01:35Z

PTAL @ozankabak @alamb @andygrove @Dandandan

alamb · 2024-01-25T16:25:20Z

This proposal makes a lot of sense to me.

Seeing something like

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey' 'asdasd',
         'format.csv.scan.datetime_regex' 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

seems pretty self-explanatory and consistent with the other options, as you have pointed out.

To ease the transition, we could also make some sort of aliases that allow the old keys to work too 🤔
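
A rough sketch of how such aliases might work, in Python for brevity. Everything here is an assumption for illustration: the `LEGACY_ALIASES` table and `normalize_keys` are hypothetical, and the old-to-new pairs are just the ones implied by the examples in the issue description.

```python
# Hypothetical alias table: legacy key (lowercased) -> proposed namespaced key.
LEGACY_ALIASES = {
    "aws_access_key_id": "aws.credentials.basic.accesskeyid",
    "aws_secret_access_key": "aws.credentials.basic.secretkey",
    "timestamp_format": "format.csv.sink.timestamp_format",
    "date_format": "format.csv.sink.date_format",
}


def normalize_keys(options):
    """Rewrite legacy keys to their namespaced form; pass other keys through."""
    normalized = {}
    for key, value in options.items():
        new_key = LEGACY_ALIASES.get(key.lower(), key)
        if new_key in normalized:
            # The same option was given under both its old and new name.
            raise ValueError(f"option {new_key!r} given more than once")
        normalized[new_key] = value
    return normalized


normalized = normalize_keys({
    "AWS_ACCESS_KEY_ID": "asdasd",                  # old style
    "format.csv.sink.date_format": "asdasd",        # new style
})
```

Rejecting a statement that supplies both spellings of the same option is one possible policy; letting the new key win silently would be another.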

alamb · 2024-01-25T16:25:31Z

cc @devinjdangelo might also have some ideas

devinjdangelo · 2024-01-26T01:42:03Z

Yes, this is a great idea! It will also allow for more uniformity between our session-level and statement-level option naming conventions. For example, we currently have

set datafusion.execution.parquet.max_row_group_size=1234

vs

COPY table to 'file.parquet' (MAX_ROW_GROUP_SIZE 1234)

could become instead:

COPY table to 'file.parquet' (datafusion.execution.parquet.max_row_group_size 1234)

A nice side effect would be that FileTypeWriterOptions could safely ignore options outside of the relevant namespaces rather than throwing an error. E.g. anything that doesn't match format.* could be ignored.
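
As a minimal sketch of that idea (in Python, not the actual FileTypeWriterOptions code), a writer could select only the keys under its own prefix and skip everything else. `select_namespace` is a hypothetical name, and `format.parquet` is an assumed prefix used only for the example.

```python
def select_namespace(options, prefix):
    """Return the options under `prefix` with the prefix stripped.

    Keys outside the namespace are skipped instead of raising an error.
    Hypothetical helper for illustration only.
    """
    dotted = prefix + "."
    return {
        key[len(dotted):]: value
        for key, value in options.items()
        if key.startswith(dotted)
    }


statement_options = {
    "aws.credentials.basic.accesskeyid": "asdasd",
    "format.parquet.max_row_group_size": "1234",
}
parquet_options = select_namespace(statement_options, "format.parquet")
```

Here the Parquet writer would see only `max_row_group_size`. One trade-off worth noting: silently skipping keys also hides typos in the prefix, so some layer would probably still need to reject namespaces nobody claims.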

metesynnada · 2024-01-26T08:37:48Z

Why not format.parquet.max_row_group_size since you may want to create two tables with different settings? I am not sure why it is a global configuration.

alamb · 2024-01-26T10:49:27Z

> I am not sure why it is a global configuration.

I think the global setting lets users change the defaults once and use those settings for all statements in the session; individual statements can then override them if desired.
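
The layering can be sketched as a simple merge where statement-level options win over session defaults. This is only an illustration in Python: `effective_options` is a hypothetical name, the values are made up, and it assumes both levels use the same key names, which is exactly the point still under discussion above.

```python
def effective_options(session_defaults, statement_options):
    """Start from the session defaults, then apply per-statement overrides."""
    merged = dict(session_defaults)   # copy so the session state is untouched
    merged.update(statement_options)  # statement-level values take precedence
    return merged


# Illustrative values only.
session_defaults = {
    "datafusion.execution.parquet.max_row_group_size": 1234,
    "datafusion.execution.parquet.statistics_enabled": "page",
}
statement_options = {
    "datafusion.execution.parquet.max_row_group_size": 5678,
}
effective = effective_options(session_defaults, statement_options)
```

The statement sees the overridden row group size, while `statistics_enabled` falls back to the session default and the session defaults themselves stay unchanged for later statements.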

metesynnada added the enhancement New feature or request label Jan 25, 2024

metesynnada changed the title ~~Proposal for Systematic Configuration in 'Create External Table' Options~~ Discussion for a Systematic Configuration in 'Create External Table' Options Jan 25, 2024

metesynnada mentioned this issue Feb 28, 2024

Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382

Merged

alamb closed this as completed in #9382 Mar 12, 2024
