Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement extensible configuration mechanism #2754

Merged
merged 14 commits into from
Jun 22, 2022

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jun 20, 2022

Which issue does this PR close?

Closes #138

Rationale for this change

We are getting to the point where there are multiple settings we could add to operators to fine-tune performance. Custom operators provided by crates that extend DataFusion may also need this capability.

I propose that we add support for key-value configuration options so that we don't need to plumb through each new configuration setting that we add.

This also makes it easier for users to change configuration options without code changes by accepting command-line arguments or reading configuration files.

Another advantage of this approach is that we can now automatically generate markdown documentation to include in the user guide.

key type default description
datafusion.optimizer.filterNullJoinKeys BOOLEAN false When set to true, the optimizer will insert filters before a join between a nullable and non-nullable column to filter out nulls on the nullable side. This filter can add additional overhead when the file format does not fully support predicate push down.

I filed #2756 as a follow-on to migrate the existing configuration settings to this new approach.

What changes are included in this PR?

  • New structs for defining named configs
  • Add a config to enable/disable an optimizer rule as the first example

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jun 20, 2022
@andygrove andygrove changed the title WIP: Implement extensible configuration mechanism Implement extensible configuration mechanism Jun 21, 2022
@andygrove andygrove marked this pull request as ready for review June 21, 2022 22:27
@andygrove andygrove requested a review from alamb June 21, 2022 22:46
@codecov-commenter
Copy link

codecov-commenter commented Jun 21, 2022

Codecov Report

Merging #2754 (928abee) into master (4153d88) will decrease coverage by 0.01%.
The diff coverage is 80.59%.

@@            Coverage Diff             @@
##           master    #2754      +/-   ##
==========================================
- Coverage   84.95%   84.93%   -0.02%     
==========================================
  Files         271      272       +1     
  Lines       48053    48119      +66     
==========================================
+ Hits        40825    40872      +47     
- Misses       7228     7247      +19     
Impacted Files Coverage Δ
datafusion/core/src/execution/context.rs 78.37% <68.42%> (-0.28%) ⬇️
datafusion/core/src/config.rs 85.41% <85.41%> (ø)
datafusion/optimizer/src/filter_null_join_keys.rs 91.83% <0.00%> (-2.05%) ⬇️
datafusion/expr/src/window_frame.rs 92.43% <0.00%> (-0.85%) ⬇️
datafusion/optimizer/src/utils.rs 33.68% <0.00%> (-0.53%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 73.51% <0.00%> (-0.40%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4153d88...928abee. Read the comment docs.

@andygrove andygrove requested a review from yjshen June 22, 2022 13:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good start 👍

//! DataFusion Configuration Options

use datafusion_common::ScalarValue;
use sqlparser::ast::DataType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using arrow::data_types::DataType for the type would be easier to understand (as that is the DataType used by ScalarValue).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, this was an auto import and I hadn't noticed

Comment on lines 42 to 43
name: &str,
description: &str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: &str,
description: &str,
name: impl Into<String>,
description: impl Into<String>,

And then you can change name.to_string() to name.into() and description.to_string() into description.into()

It is a minor thing, but that way the caller can pass in a String if they have it and avoid the extra copy 🤷

use sqlparser::ast::DataType;
use std::collections::HashMap;

/// Configuration option "datafusion.optimizer.filterNullJoinKeys"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using . for namespacing the configuration makes sense to me

@@ -0,0 +1,187 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this module is in datafusion-core it will not be able to be used in other modules directly. That seems ok for now 🤔 but I haven't fully thought about it

Copy link
Member

@yjshen yjshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Is the next step to have SessionConfig refactored with ConfigDefinition?

@andygrove
Copy link
Member Author

LGTM. Is the next step to have SessionConfig refactored with ConfigDefinition?

Yes. I filed #2756 for converting the existing configs to this new mechanism and after that I plan to do some refactoring so that SessionConfig::to_props just returns the config options map. I also need to look more deeply at how this is used in TaskProperties.

@andygrove andygrove merged commit cbb0517 into apache:master Jun 22, 2022
@andygrove andygrove deleted the extensible-configs branch June 22, 2022 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement extensible configuration mechanism
4 participants