[branch-53] perf: skip ensure_distribution rebuild when children are unchanged#58
Merged
Merged
Conversation
…unchanged Cherry-pick of upstream apache#22521 (approved by 2010YOUY01 and alamb, awaiting maintainer merge). Adapted to branch-53's file layout (datafusion/physical-optimizer/src/enforce_distribution.rs vs the upstream ensure_requirements/ subdirectory) and to the older plan.as_any().is::<UnionExec>() API. ensure_distribution was unconditionally calling plan.with_new_children after collecting the (possibly redistributed) children, even when none of the children were actually replaced. For nodes like ProjectionExec that path runs through try_new and recomputes schema, equivalence properties, output ordering, and partitioning, which is pure overhead when the input Arcs are pointer-identical. Routes the non-UnionExec branch through the existing with_new_children_if_necessary helper, which does the Arc::ptr_eq short-circuit and reuses the input plan when no children changed. UnionExec to InterleaveExec special case still runs first because it intentionally produces a new node even when child Arcs are unchanged. Linear: X-2631
There was a problem hiding this comment.
Pull request overview
This PR cherry-picks upstream DataFusion change apache#22521 into branch-53 to reduce overhead in the EnforceDistribution optimizer rule by avoiding unnecessary plan rebuilds when child plans are unchanged.
Changes:
- Update
ensure_distributionto usewith_new_children_if_necessaryfor non-UnionExecnodes, reusing the originalArc<dyn ExecutionPlan>when children are pointer-identical. - Add a regression test asserting
ensure_distributionreturns the exact same planArcwhen no redistribution is required (deepProjectionExecchain,target_partitions = 1).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/physical-optimizer/src/enforce_distribution.rs | Uses with_new_children_if_necessary to skip expensive with_new_children rebuilds when children are unchanged. |
| datafusion/core/tests/physical_optimizer/enforce_distribution.rs | Adds regression coverage ensuring the optimizer reuses the input plan Arc when no redistribution occurs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
xudong963
approved these changes
May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cherry-pick of upstream apache/datafusion#22521 (approved by 2010YOUY01 and alamb, awaiting maintainer merge) into our branch-53 fork. Companion to #57 (branch-52).
ensure_distributionwas callingplan.with_new_children(children_plans)unconditionally after collecting the (possibly redistributed) children, even when none of the children were actually replaced. For nodes likeProjectionExecthat path recomputes schema / equivalence properties / output ordering / partitioning viatry_new— pure overhead when child Arcs are pointer-identical.Routes the non-
UnionExecbranch through the existingwith_new_children_if_necessaryhelper, which does theArc::ptr_eqshort-circuit and reuses the input plan when no children changed.Why pick now
Linear X-2631. Profiling on NY5 prod showed
ProjectionExec::with_new_childrenas the single largest cost insideensure_distribution(1.94s of a 2.87s rule-time slice in a 60s CPU sample).Adaptations from upstream
datafusion/physical-optimizer/src/ensure_requirements/enforce_distribution.rs; branch-53 keeps the flat path. Same logic, different module path.plan.is::<UnionExec>(); branch-53 still hasplan.as_any().is::<UnionExec>(). Adjusted.Affects
Tests
cargo test --release -p datafusion --test core_integration -- physical_optimizer::enforce_distribution— 63/63 pass (62 pre-existing + 1 new regression testensure_distribution_reuses_plan_arc_when_no_redistribution_needed).Micro-benchmark from upstream PR
Plan shape: 30-deep
ProjectionExecstack over a sorted parquet scan, 5000 iterations.