Pass `PartitionedFile` into `FileSource` for late file stats based pruning
#16000
Labels: enhancement
Is your feature request related to a problem or challenge?
As we continue to make progress landing dynamic filters, new optimizations become possible.
This one deals with late evaluation of file-level statistics.
In particular, we may have file-level statistics available at planning time (see `datafusion.execution.collect_statistics`; our system does a similar thing in a different way). Before dynamic filters there was no point in re-evaluating these right before scanning a file. Now, however, an operator such as TopK may have pushed down a filter like `ts > '2025-05-08T00:00:00Z'`. In that case we may be able to exclude the entire file based on this filter plus the file-level statistics, and so avoid reading any Parquet metadata, etc.
Concretely, change:
To:
And call it as so from here:
https://github.com/pydantic/datafusion/blob/649851d59cdac80fcae51d66f82f1b47d2aaa3b4/datafusion/datasource/src/file_stream.rs#L129
Then we can implement `PruningStatistics` for `Statistics`, et voilà!
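To make the intent concrete, here is a minimal, self-contained sketch of late file-level pruning. It does not use DataFusion's types or APIs: `ColumnRange`, `FileEntry`, `DynamicFilter`, and `files_to_open` are hypothetical names invented for illustration, and timestamps are modelled as plain `i64` values. The only point is the decision rule: a file can be skipped only when its statistics prove that no row can match the filter, and missing statistics always mean "keep the file".

```rust
/// Hypothetical stand-in for per-file min/max statistics of one column.
/// `None` means the bound is unknown.
#[derive(Debug, Clone, Copy)]
struct ColumnRange {
    min: Option<i64>,
    max: Option<i64>,
}

/// Hypothetical stand-in for a file entry that carries planning-time statistics.
#[derive(Debug, Clone)]
struct FileEntry {
    path: String,
    ts: ColumnRange,
}

/// Hypothetical dynamic filter on the `ts` column, e.g. pushed down by a TopK operator.
#[derive(Debug, Clone, Copy)]
enum DynamicFilter {
    /// `ts > threshold`
    Gt(i64),
    /// `ts < threshold`
    Lt(i64),
}

impl DynamicFilter {
    /// Returns true only if the statistics prove that no row in the file can match.
    /// Unknown bounds are treated conservatively: the file is not skipped.
    fn can_skip(&self, range: &ColumnRange) -> bool {
        match *self {
            // No row satisfies `ts > t` when the file's largest value is <= t.
            DynamicFilter::Gt(t) => matches!(range.max, Some(max) if max <= t),
            // No row satisfies `ts < t` when the file's smallest value is >= t.
            DynamicFilter::Lt(t) => matches!(range.min, Some(min) if min >= t),
        }
    }
}

/// Evaluated right before each file would be opened, using the filter's current value.
fn files_to_open<'a>(files: &'a [FileEntry], filter: &DynamicFilter) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| !filter.can_skip(&f.ts))
        .map(|f| f.path.as_str())
        .collect()
}
```

In the real change the check would presumably run where the file is about to be opened, against whatever the dynamic filter has tightened to by then, which is why the file's statistics need to be reachable from that point.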
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response