- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
Description
Background
We already have an excellent blog post that I think does a great introduction to the importance of optimizing sort order information in queries: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/
Work on dynamic filters has made this even more interesting: https://datafusion.staged.apache.org/blog/2025/09/01/dynamic-filters/. TopK dynamic filters will perform significantly better because we generate a more selective filter earlier on and thus can prune more data. Interestingly, and not touched upon in the article, even without dynamic filters or perfectly sorted inputs TopK operators benefit from partially sorted inputs because they are able to more quickly discard batches as they come in instead of churning the heap, although this effect is probably minor. I think the same sort of thing applies to other sort operators, depending on the algorithms used (see https://www.toptal.com/developers/sorting-algorithms for examples).
Thus there are a couple high level aspects of the current sorting I think should be improved, detailed below.
Exact vs. Partial ordering
A lot of the sort machinery currently focuses on what I'll call "exact" ordering. This is when the data within each file and the files themselves can be ordered in such a way that one can scan the table completely skipping any upstream sorting. There is little done about partial ordering. In particular, I think about the case where the files themselves may not be ordered internally but we can use the min/max stats to order between files to some extent. Two examples that come to mind:
- Timestamp columns. Often if data is appended to files as they come in data will be roughly clustered by time, even if it is not sorted by time.
- Values that can be efficiently zone map indexed: consider something like salary, depending on the data distribution if we wantORDER BY salary DESC LIMIT 10the top 10 outliers might stick out like a sore thumb in the min/max stats -> we can scan files with the largestmax(salary)first -> we can skip more of the other files.
I also think that exact ordering is a subset of partial ordering where you just happen to be able to order amongst the files and you know that the order within the files matches the desired order. There's even cases where files are clustered in such a way that they are not exactly ordered but are statistically much better than randomly ordered, in this case the partial ordering within the files won't let us skip a sort but might make operations more efficient (see point above about some sorts being much faster with partially sorted data).
Order of the Data vs order of the Query
Currently the only way to get some benefit from ordering is to specify a known ordering of the data upfront:
CREATE EXTERNAL TABLE source (
    amount INT NOT NULL,
    price DOUBLE NOT NULL,
    time TIMESTAMP NOT NULL,
    ...
)
STORED AS CSV
WITH ORDER (time ASC)
WITH ORDER (amount ASC, price ASC)
LOCATION '/path/to/FILE_NAME.csv'
OPTIONS ('has_header' 'true');(taken from the blog post)
But what if this order doesn't match the desired order of the query? E.g. if we want ORDER BY price. The physical ordering of the file doesn't help us at all here, it's as good as a random scan. We're much better off using min/max stats on price to re-order the scan so that the file opening is done in ~ the order the query wants, and we don't use the ordering within the files at all.
Inferred order vs Known order
Related to the point above: users currently have to specify the order of a table. I think this could in many cases be inferred from a combination of metadata on each file (Parquet for example has metadata to specify the ordering of a file) and min/max stats (which give an ordering between files). Using this we can reconstruct exact or partial ordering without users having to specify a known order upfront.
Proposal
I propose that we move to a system where:
- The desired ordering of the query is pushed down into the TableScan/TableProvider::scanso that file opens can be optimized to match the desired order of the query, be that an exact order if possible or a partial order if not.
- Our default ListingTableshould use file metadata (min/max stats and any ordering information) to arrange files to best match the queries desired ordering. TheExecutionPlanthatTableProvider::scanshould record if the order is exact so that subsequent physical optimizers can remove unecessary upstream sort operations. The goal should be to produce an exact order if possible and if not fall back to a best effort partial order (which still requires an upstream sort).
- We should infer ordering information from file metadata instead of requiring users to specify a known ordering upfront.
Implementation
Refactor TableProvider::scan into TableProvider::scan_with_args: we need to make API changes to scan() to pass in ordering information. Adding a parameter would be a breaking change. Let's take this opportunity to refactor the method into something we can change over time while minimizing breaking API changes for downstream consumers. #17336
*Push down sorting information into TableScan / TableProvider. This requires some careful evaluation of optimizer rules, i.e. what can we push down sorts through and what can we not. I suggest we start with the obviously okay cases and add more complex ones in the future. #17337
*Connect the two above changes by passing the sort order from TableScan into TableProvider::scan_with_args in our logical -> physical planning step.
*Rewrite our re-partitioning / sorting to first try to create an exact ordering and fall back to partial ordering if not possible, marking the resultant ExecutionPlan with which one it has.
On a per-file-format basis extract ordering information from the files: I think this involves some refactoring to go from e.g. FileFormat::collect_statistics() -> Statistics to FileFormat::collect_metadata() -> StatisticsAndSortInformation as well as some other tbd work.