Skip to content

Expose remaining parquet config options into ConfigOptions (try 2) #4427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 1, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion benchmarks/src/bin/tpch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -396,7 +396,8 @@ async fn get_table(
}
"parquet" => {
let path = format!("{}/{}", path, table);
let format = ParquetFormat::default().with_enable_pruning(true);
let format = ParquetFormat::new(ctx.config_options())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the parquet format now reads defaults from ConfigOptions they need to be passed to the constructor

.with_enable_pruning(Some(true));

(Arc::new(format), path, DEFAULT_PARQUET_EXTENSION)
}
Expand Down
4 changes: 3 additions & 1 deletion datafusion-examples/examples/flight_server.rs
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,9 @@ impl FlightService for FlightServiceImpl {
) -> Result<Response<SchemaResult>, Status> {
let request = request.into_inner();

let listing_options = ListingOptions::new(Arc::new(ParquetFormat::default()));
let config = SessionConfig::new();
let listing_options =
ListingOptions::new(Arc::new(ParquetFormat::new(config.config_options())));
let table_path =
ListingTableUrl::parse(&request.path[0]).map_err(to_tonic_err)?;

Expand Down
3 changes: 2 additions & 1 deletion datafusion-examples/examples/parquet_sql_multiple_files.rs
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,8 @@ async fn main() -> Result<()> {
let testdata = datafusion::test_util::parquet_test_data();

// Configure listing options
let file_format = ParquetFormat::default().with_enable_pruning(true);
let file_format =
ParquetFormat::new(ctx.config_options()).with_enable_pruning(Some(true));
let listing_options = ListingOptions::new(Arc::new(file_format))
.with_file_extension(FileType::PARQUET.get_ext());

Expand Down
39 changes: 39 additions & 0 deletions datafusion/core/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,16 @@ pub const OPT_PARQUET_REORDER_FILTERS: &str =
pub const OPT_PARQUET_ENABLE_PAGE_INDEX: &str =
"datafusion.execution.parquet.enable_page_index";

/// Configuration option "datafusion.execution.parquet.pruning"
pub const OPT_PARQUET_ENABLE_PRUNING: &str = "datafusion.execution.parquet.pruning";

/// Configuration option "datafusion.execution.parquet.skip_metadata"
pub const OPT_PARQUET_SKIP_METADATA: &str = "datafusion.execution.parquet.skip_metadata";

/// Configuration option "datafusion.execution.parquet.metadata_size_hint"
pub const OPT_PARQUET_METADATA_SIZE_HINT: &str =
"datafusion.execution.parquet.metadata_size_hint";

/// Configuration option "datafusion.optimizer.skip_failed_rules"
pub const OPT_OPTIMIZER_SKIP_FAILED_RULES: &str =
"datafusion.optimizer.skip_failed_rules";
Expand Down Expand Up @@ -255,6 +265,29 @@ impl BuiltInConfigs {
to reduce the number of rows decoded.",
false,
),
ConfigDefinition::new_bool(
OPT_PARQUET_ENABLE_PRUNING,
"If true, the parquet reader attempts to skip entire row groups based \
on the predicate in the query and the metadata (min/max values) stored in \
the parquet file.",
true,
),
ConfigDefinition::new_bool(
OPT_PARQUET_SKIP_METADATA,
"If true, the parquet reader skip the optional embedded metadata that may be in \
the file Schema. This setting can help avoid schema conflicts when querying \
multiple parquet files with schemas containing compatible types but different metadata.",
true,
),
ConfigDefinition::new(
OPT_PARQUET_METADATA_SIZE_HINT,
"If specified, the parquet reader will try and fetch the last `size_hint` \
bytes of the parquet file optimistically. If not specified, two read are required: \
One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::UInt64,
ScalarValue::UInt64(None),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned by @thinkharderdev on #3885 (comment) we should probably change this default to something reasonable (like 64K) but I would rather do that in a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #4459 to track

),
ConfigDefinition::new_bool(
OPT_OPTIMIZER_SKIP_FAILED_RULES,
"When set to true, the logical plan optimizer will produce warning \
Expand Down Expand Up @@ -424,6 +457,12 @@ impl ConfigOptions {
get_conf_value!(self, UInt64, key, "u64")
}

/// get a u64 configuration option as a usize
pub fn get_usize(&self, key: &str) -> Option<usize> {
let v = get_conf_value!(self, UInt64, key, "usize");
v.and_then(|v| v.try_into().ok())
}

/// get a string configuration option
pub fn get_string(&self, key: &str) -> Option<String> {
get_conf_value!(self, Utf8, key, "string")
Expand Down
Loading