Closed
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We sometimes get questions about support for skipping Parquet row groups based on statistics. It seems that we do not have good documentation around this really cool feature, so we should write something up. We can base it on this response copied from the slack channel.
DataFusion has support for skipping entire row groups using predicates and min and max statistics.
It does not (yet) push the predicates down into the actual scan (e.g. to avoid materializing data that wouldn’t pass the predicate) — instead any row groups that are not pruned are decompressed into RecordBatches and then filtered.
Also, DataFusion will do “projection pushdown” — aka it will read only those columns needed to answer the query.
Describe the solution you'd like
Promote this cool feature in the documentation somewhere (user guide? README?)
Describe alternatives you've considered
None
Additional context
None