Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion for a Systematic Configuration in 'Create External Table' Options #8994

Closed
metesynnada opened this issue Jan 25, 2024 · 6 comments · Fixed by #9382
Closed

Discussion for a Systematic Configuration in 'Create External Table' Options #8994

metesynnada opened this issue Jan 25, 2024 · 6 comments · Fixed by #9382
Labels
enhancement New feature or request

Comments

@metesynnada
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently, in our implementation of 'Create External Table', the configuration options are not systematically organized, leading to potential confusion and complexity for the users. This is especially evident when we compare our configuration pattern with other systems like Apache Flink and Apache Spark.

For instance, in our current setup, wiring CSV format options and AWS credential settings are done in the same context. This approach lacks the clarity and structure found in similar systems. Examples of more systematic configurations can be seen in:

Although Spark’s approach could be perceived as confusing when applied to our 'Create External Table' method, our method is currently more aligned with Spark's approach in terms of table creation.

However, one aspect where our system shines is in our session context configuration. We utilize a more intuitive dot (.) divided pattern, like datafusion.execution.parquet.statistics_enabled. This is more user-friendly and logically structured.

Describe the solution you'd like

I propose we adopt a more structured and systematic approach in defining table options, similar to our session context configuration. For example, instead of the current format:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('AWS_ACCESS_KEY_ID' 'asdasd',
         'AWS_SECRET_ACCESS_KEY', 'asdasd',
         'timestamp_format' 'asdasd',
         'date_format' 'asdasd')

We could structure it more clearly:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey', 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

Or even more detailed:

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey', 'asdasd',
         'format.csv.scan.datetime_regex' 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

This approach would separate AWS credentials from CSV format options and further delineate options for scanning and sinking, enhancing clarity and ease of use.

Impact:

  • User Experience: This change will significantly improve user experience by making configuration more intuitive and easy to understand.
  • Documentation: Accompanying documentation will be necessary to guide users through the new configuration pattern.
  • Compatibility: It’s important to note that this change will introduce breaking changes. Thus, a clear migration path needs to be provided for existing users.

Describe alternatives you've considered

No response

Additional context

No response

@metesynnada metesynnada added the enhancement New feature or request label Jan 25, 2024
@metesynnada metesynnada changed the title Proposal for Systematic Configuration in 'Create External Table' Options Discussion for a Systematic Configuration in 'Create External Table' Options Jan 25, 2024
@metesynnada
Copy link
Contributor Author

PTAL @ozankabak @alamb @andygrove @Dandandan

@alamb
Copy link
Contributor

alamb commented Jan 25, 2024

I think this proposal makes a lot of sense to me

Seeing something like

CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
         'aws.credentials.basic.secretkey', 'asdasd',
         'format.csv.scan.datetime_regex' 'asdasd',
         'format.csv.sink.timestamp_format' 'asdasd',
         'format.csv.sink.date_format' 'asdasd')

Seems pretty self explanatory and consistent with the other options as you have pointed out.

To ease the transition, we could also make some sort of aliases that allow the old keys to work too 🤔

@alamb
Copy link
Contributor

alamb commented Jan 25, 2024

cc @devinjdangelo might also have some ideas

@devinjdangelo
Copy link
Contributor

devinjdangelo commented Jan 26, 2024

Yes, this is a great idea! It will also allow for more uniformity in our session level and statement level option naming conventions. E.g. currently we have

set datafusion.execution.parquet.max_row_group_size=1234

vs

COPY table to 'file.parquet' (MAX_ROW_GROUP_SIZE 1234)

could become instead:

COPY table to 'file.parquet' (datafusion.execution.parquet.max_row_group_size 1234)

A nice side effect would be that FileTypeWriterOptions could safely ignore options outside of relevant name spaces rather than throwing an error. E.g. anything that doesn't match format.* could be ignored.

@metesynnada
Copy link
Contributor Author

Why not format.parquet.max_row_group_size since you may want to create two tables with different settings? I am not sure why it is a global configuration.

@alamb
Copy link
Contributor

alamb commented Jan 26, 2024

I am not sure why it is a global configuration.

I think the global setting serves the let users change the defaults once and use those settings for all statements in the session, which can then be overridden per statement if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants