Description
Is your feature request related to a problem or challenge?
I want people's first impression of DataFusion to be "that is very fast" without having to tune parameters.
DataFusion has many configuration options that control various performance optimizations.
Some of these options trade pure efficiency for faster query execution, at the cost of more than linear resource consumption.
We have benchmarks, such as a TPC-H runner, that typically run with a single core. These are great as performance unit tests in well-controlled environments (they avoid task overhead and other nondeterminism introduced by multi-core execution); however, they don't mimic what users typically run with.
Up to now we have taken a conservative approach and only enabled optimizations by default if they make everything faster. I would like to change our philosophy and optimize for "out of the box" performance.
You can see examples of other systems tuning knobs up for performance:
Here is a recent example from https://hussainsultan.com/posts/unbundled-datafusion/
from datafusion import RuntimeConfig, SessionConfig, SessionContext
runtime = RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(100000000)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(True)
    .with_repartition_aggregations(True)
    .with_repartition_windows(True)
    .with_parquet_pruning(True)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
ctx.register_parquet("orders", "../../../fanniemae-benchmark/sf10/raw/orders.parquet")
Describe the solution you'd like
I would like to change the ConfigOption defaults to optimize performance in the common case rather than avoid performance regressions in all cases.
Specifically that means:
- Repartition always when possible (to increase parallelism by default)
- Push down all parquet filters (e.g. "Enable parquet page level skipping (page index pruning) by default" #4085 and "Enable parquet filter pushdown by default" #3463)
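To make the two bullets above concrete, here is a rough sketch of the settings a user has to turn on by hand today. `PROPOSED_DEFAULTS` and `apply_settings` are hypothetical names made up for illustration. The keys are my reading of DataFusion's documented configuration names and may differ between versions, so treat them as assumptions. The only API relied on is a `set(key, value)` method that returns the config, as used in the example above.

```python
# Hypothetical mapping, not part of DataFusion: settings this proposal would
# turn on by default. Key names are assumed from the DataFusion configuration
# docs and may vary by version.
PROPOSED_DEFAULTS = {
    # Repartition whenever possible to increase parallelism
    "datafusion.optimizer.repartition_joins": "true",
    "datafusion.optimizer.repartition_aggregations": "true",
    "datafusion.optimizer.repartition_windows": "true",
    # Push parquet filters down into the scan and prune with the page index
    "datafusion.execution.parquet.pushdown_filters": "true",
    "datafusion.execution.parquet.enable_page_index": "true",
}


def apply_settings(config, settings):
    """Apply each key/value pair via the config object's set(key, value) method.

    Assumes set() returns a config object (as SessionConfig.set does in the
    example above), so the result is threaded through each call.
    """
    for key, value in settings.items():
        config = config.set(key, value)
    return config
```

With datafusion-python this would presumably be called as `apply_settings(SessionConfig(), PROPOSED_DEFAULTS)`. If the defaults change as proposed, none of this would be needed.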
Describe alternatives you've considered
We can leave the defaults alone and require users to change them.
Additional context
No response