-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[MAJOR] Equivalence System Overhaul #16217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
remove usage of prefer_existing_sort as default set requirements Hard set soft on AggregateExec and BoundedWindowExec since they have InputOrderMode functionalities
simplify indentation
simplify indentation
remove prefer_existing_sort based test cases
# Conflicts: # datafusion/core/src/datasource/listing/table.rs # datafusion/physical-optimizer/src/enforce_sorting/mod.rs
# Conflicts: # datafusion-examples/examples/custom_file_format.rs # datafusion/core/src/datasource/file_format/csv.rs # datafusion/core/src/datasource/file_format/json.rs # datafusion/core/src/datasource/file_format/mod.rs # datafusion/core/src/datasource/file_format/parquet.rs # datafusion/core/tests/physical_optimizer/enforce_distribution.rs # datafusion/core/tests/physical_optimizer/test_utils.rs # datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs
return alternative on BoundedWindowAggExec
# Conflicts: # datafusion/catalog/src/stream.rs # datafusion/core/tests/physical_optimizer/enforce_distribution.rs # datafusion/physical-plan/src/joins/sort_merge_join.rs # datafusion/physical-plan/src/joins/symmetric_hash_join.rs
simplify sort_pushdown.rs
add requirements compatible test cases
😮 I will put this on my list to review asap |
@wiedld @xudong963 and @suremarc perhaps you are also interested in this PR given your past interest in the equivalence code. Also cc @chenkovsky |
Thanks for the quick response @alamb -- I added some reviewing tips and tricks to the PR body to help you guys out. |
One thing on my mind is that maybe it's better to hold up the PR until DF48 release, then we can have enough time to check and verify the PR after it merges. |
Let's definitely time the merge and DF48 release in a cooperative way. We have two competing realities at play: (1) The PR attracts a lot of conflicts and it is not easy to resolve them without accidents and/or introducing possible regressions, (2) We don't want to dump a large API change on to DF48 without giving prep time to people. Checking out #15771, it seems we can either:
I am fine with both approaches as long as we arrive at a merge point quickly, before or after the release. |
I think this is my preferred choice -- let's get DF 48 out and then give this one the maximum bake time. I think we can be careful of changes we merge to try and minimize the conflicts that this will collect |
Sounds good -- let's go with the first choice (moving fast with the release and merging this right after) unless someone else brings a good argument to the contrary. |
Which issue does this PR close?
UNION
andORDER BY
queries #13748.Rationale for this change
DF's theoretical approach to representing orderings, equivalences and constants is sound, but it suffers from many implementation problems that lead to various issues. Some of these issues include:
None
value for a typeOption<LexOrdering>
, and sometimes with aSome
value with the payload being "empty").EnforceSorting
, which results in missed optimization opportunities. For example, a windowing operator prefers its input to be sorted w.r.t. both PARTITION BY and ORDER BY keys, but if the former is not available, it is still preferable to have just the latter instead of a fully unsorted input.What changes are included in this PR?
ordering_satisfy
checks. Also, the new system caches normalization results for equivalent orderings, which was one of the major sources of excessive computational complexity during sort enforcement.LexOrdering
andLexRequirement
objects do not allow degenerate orderings anymore, whenever you see this object, you can be sure there will be some ordering. Possible absence of an ordering will henceforth be properly represented by anOption<LexOrdering>
.OrderingRequirements
object is introduced to capture ordering preferences of operators.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes -- this is a major API change and will break old code. We spent months looking for ways to (1) break this PR down into smaller ones, (2) yet lump all API changes in one PR -- but couldn't find an effective way to do this.
Therefore, I invite all contributors who are also downstream stakeholders to help by participating in a "pre-scream" test. Please test and report how this PR breaks your code, so we can collectively prepare an upgrade guide for the community at large.
Tips and Tricks for Reviewing
The "meat" of the code is in the following files:
datafusion/physical-expr-common/src/sort_expr.rs
: Definition ofOrderingRequirements
, changes toLexOrdering
and friends.datafusion/physical-plan/src/execution_plan.rs
: API change torequired_input_ordering
.datafusion/physical-expr/src/equivalence
: Re-implementation of the equivalence system.datafusion/functions-aggregate/src
: Effects of the above on aggregate function implementations.The rest of the changes are "mandatory" propagative effects of fixing a fundamental low-level mechanism in a large codebase, and mostly trivial.
@alamb, @andygrove, @berkaysynnada -- feel free to tag others.