Consolidate remaining parquet config options into ConfigOptions #3885

Closed · wants to merge 3 commits
3 changes: 2 additions & 1 deletion benchmarks/src/bin/tpch.rs
@@ -431,7 +431,8 @@ async fn get_table(
}
"parquet" => {
let path = format!("{}/{}", path, table);
-let format = ParquetFormat::default().with_enable_pruning(true);
+let format = ParquetFormat::new(ctx.config.config_options())
+    .with_enable_pruning(true);

(Arc::new(format), path, DEFAULT_PARQUET_EXTENSION)
}
5 changes: 3 additions & 2 deletions datafusion-examples/examples/flight_server.rs
@@ -67,11 +67,12 @@ impl FlightService for FlightServiceImpl {
) -> Result<Response<SchemaResult>, Status> {
let request = request.into_inner();

-let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()));
+let ctx = SessionContext::new();
+let format = Arc::new(ParquetFormat::new(ctx.config_options()));
+let listing_options = ListingOptions::new(format);
let table_path =
ListingTableUrl::parse(&request.path[0]).map_err(to_tonic_err)?;

-let ctx = SessionContext::new();
let schema = listing_options
.infer_schema(&ctx.state(), &table_path)
.await
2 changes: 1 addition & 1 deletion datafusion-examples/examples/parquet_sql_multiple_files.rs
@@ -32,7 +32,7 @@ async fn main() -> Result<()> {
let testdata = datafusion::test_util::parquet_test_data();

// Configure listing options
-let file_format = ParquetFormat::default().with_enable_pruning(true);
+let file_format = ParquetFormat::new(ctx.config_options());
let listing_options = ListingOptions {
file_extension: FileType::PARQUET.get_ext(),
format: Arc::new(file_format),
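Taken together, the three call sites above converge on the same construction pattern: create the `SessionContext` first, then seed the `ParquetFormat` from its `ConfigOptions`. Below is a minimal sketch of that pattern, assuming the `ParquetFormat::new(config_options)` constructor and `ctx.config_options()` accessor proposed in this PR; the rest follows existing DataFusion APIs.

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::SessionContext;

fn parquet_listing_options() -> ListingOptions {
    // The context must exist first so its ConfigOptions can seed the format.
    let ctx = SessionContext::new();

    // Builder methods such as `with_enable_pruning` still layer per-format
    // overrides on top of the defaults read from ConfigOptions.
    let format = ParquetFormat::new(ctx.config_options()).with_enable_pruning(true);

    ListingOptions::new(Arc::new(format))
}
```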
33 changes: 33 additions & 0 deletions datafusion/core/src/config.rs
@@ -60,6 +60,16 @@ pub const OPT_PARQUET_REORDER_FILTERS: &str =
pub const OPT_PARQUET_ENABLE_PAGE_INDEX: &str =
"datafusion.execution.parquet.enable_page_index";

/// Configuration option "datafusion.execution.parquet.pruning"
pub const OPT_PARQUET_ENABLE_PRUNING: &str = "datafusion.execution.parquet.pruning";

/// Configuration option "datafusion.execution.parquet.skip_metadata"
pub const OPT_PARQUET_SKIP_METADATA: &str = "datafusion.execution.parquet.skip_metadata";

/// Configuration option "datafusion.execution.parquet.metadata_size_hint"
pub const OPT_PARQUET_METADATA_SIZE_HINT: &str =
"datafusion.execution.parquet.metadata_size_hint";

/// Configuration option "datafusion.optimizer.skip_failed_rules"
pub const OPT_OPTIMIZER_SKIP_FAILED_RULES: &str =
"datafusion.optimizer.skip_failed_rules";
@@ -237,6 +247,29 @@ impl BuiltInConfigs {
to reduce the number of rows decoded.",
false,
),
ConfigDefinition::new_bool(
OPT_PARQUET_ENABLE_PRUNING,
"If true, the parquet reader attempts to skip entire row groups based \
on the predicate in the query and the metadata (min/max values) stored in \
the parquet file.",
true,
),
ConfigDefinition::new_bool(
OPT_PARQUET_SKIP_METADATA,
"If true, the parquet reader skip the optional embedded metadata that may be in \
the file Schema. This setting can help avoid schema conflicts when querying \
multiple parquet files with schemas containing compatible types but different metadata.",
true,
),
ConfigDefinition::new(
OPT_PARQUET_METADATA_SIZE_HINT,
"If specified, the parquet reader will try and fetch the last `size_hint` \
bytes of the parquet file optimistically. If not specified, two read are required: \
One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::UInt64,
ScalarValue::UInt64(None),
Contributor:

Do we want to default this to something? Back in the day we would default to reading the last 64k to try to capture the entire footer in a single read, which seems like a sensible default (especially for object storage, where the cost of an additional read is very expensive relative to reading a bit more data in the first read).

Contributor Author:

64K seems reasonable to me.

Contributor Author:

Upon more thought, I would like to propose we do the "consolidate config code" and "change default value" changes in separate PRs (which I think will be easier to see / review once the configuration is consolidated).

),
ConfigDefinition::new_bool(
OPT_OPTIMIZER_SKIP_FAILED_RULES,
"When set to true, the logical plan optimizer will produce warning \
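To make the trade-off discussed in the review thread above concrete, here is a rough, DataFusion-independent sketch of how a metadata size hint changes the footer read pattern. `fetch_range` is a hypothetical stand-in for a single ranged object-store GET; this is an illustration, not the actual reader implementation.

```rust
// Hypothetical stand-in for one ranged object-store read of bytes [start, end).
fn fetch_range(_start: u64, _end: u64) -> Vec<u8> {
    unimplemented!()
}

/// Fetch the parquet metadata. The last 8 bytes of a parquet file are a
/// 4-byte little-endian metadata length followed by the "PAR1" magic, so
/// without a hint the reader needs two round trips.
fn read_metadata(file_len: u64, size_hint: Option<u64>) -> Vec<u8> {
    if let Some(hint) = size_hint {
        // One optimistic read: grab the last `hint` bytes and hope the whole
        // footer (metadata + 8-byte tail) fits inside it.
        let suffix = fetch_range(file_len.saturating_sub(hint), file_len);
        debug_assert!(suffix.len() >= 8);
        let tail = &suffix[suffix.len() - 8..];
        let metadata_len = u32::from_le_bytes(tail[0..4].try_into().unwrap()) as usize;
        if metadata_len + 8 <= suffix.len() {
            let end = suffix.len() - 8;
            return suffix[end - metadata_len..end].to_vec();
        }
        // Hint was too small: fall through to the explicit two-read path.
    }

    // First read: the fixed 8-byte tail, which encodes the metadata length.
    let tail = fetch_range(file_len - 8, file_len);
    let metadata_len = u32::from_le_bytes(tail[0..4].try_into().unwrap()) as u64;
    // Second read: the metadata itself, immediately before the tail.
    fetch_range(file_len - 8 - metadata_len, file_len - 8)
}
```

With a hint on the order of 64 KiB, most files' footers fit in the first read, which is the rationale in the thread above for suggesting it as a default on object stores.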