Skip to content

Reuse last projection layer when renaming columns #14684

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

blaginin
Copy link
Contributor

Which issue does this PR close?

Related to #14563

Rationale for this change

Currently, when with_column_renamed is called, a new Projection layer is always created. For example, in the dataframe benchmark, logical plan looks like this:

Zen Browser 2025-02-15 18 38 03

These layers do not affect the query itself (as they are removed by optimize_projections), but they make renaming new columns (and optimization itself) quite slow.

What changes are included in this PR?

I added an optimization for one edge case - when there's already a projection on top and we're adding a new one. This is a common scenario, especially when renaming many columns (as in the dataframe benchmark). Instead of adding a new projection layer on top, we replace the existing one if possible.

Are these changes tested?

Extended test case

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Feb 15, 2025
@blaginin
Copy link
Contributor Author

group                                                main                                    one-layer
with_column_10                                       1.39   764.9±10.37µs        ? ?/sec     1.00    550.0±4.71µs        ? ?/sec
with_column_100                                      3.88  1490.1±22.27ms        ? ?/sec     1.00    384.5±3.24ms        ? ?/sec
with_column_200                                      5.63      36.6±0.43s        ? ?/sec     1.00       6.5±0.63s        ? ?/sec

@Omega359
Copy link
Contributor

Quite a speed improvement! I'll look at the code in a bit

.with_column_renamed("c4", "boom")?;

assert_eq!("\
Projection: aggregate_test_100.c1 AS one, aggregate_test_100.c2 AS two, aggregate_test_100.c3, aggregate_test_100.c2 + aggregate_test_100.c3 AS sum AS total\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AS sum AS total is valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough reading 🙏 AS sum AS total happens because there's an alias over an alias - I believe it's harmless since it just wraps the original expression.

There's a function unalias_nested that can resolve this, but I don't want to call it because it's recursive and expensive. A potentially better option IMO is to reuse the existing alias in alias_qualified and alias if it's already there (similar to this PR). This feels like a separate improvement from this PR though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just worry that another column referenced 'sum' and then it is aliased to 'total' and what the behaviour of that would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't think that's possible? "total" cannot be referenced within the same projection, because for that projection it is not in its input. But to display it in a nicer way, I'll do #14781

SchemaError::FieldNotFound { .. },
_,
)) => {
self.plan = LogicalPlan::Projection(
Copy link
Contributor

@Omega359 Omega359 Feb 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is not found why are we not just returning Ok(self)? I am unsure why we would need to add a projection for the plan in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do that because self was destructed and so I need to put plan back. try_new_with_schema is quite cheap and this shouldn't be happening often anyway

})
.collect();
LogicalPlan::Projection(Projection::try_new(expr, input)?)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot of duplication here. I wonder if there is a way we could reduce that? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very fair! I also dislike repeating qualifier_rename, field_rename twice, but it's happening because they reference plan and must come after its destruction (hence needed twice)

@blaginin blaginin marked this pull request as ready for review February 17, 2025 19:08
@blaginin blaginin requested a review from Omega359 February 19, 2025 14:59
@alamb alamb mentioned this pull request Mar 4, 2025
12 tasks
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the status of this PR?

I took a brief look at this PR and it seems like it may have been superceded by this one from @Omega359

@blaginin
Copy link
Contributor Author

blaginin commented Mar 5, 2025

Hey, I still want to merge this one. #14653 has indeed made the benches much faster, but I think it's still good to make the logical plan smaller if we can

@alamb
Copy link
Contributor

alamb commented Mar 5, 2025

Looks to me like there are a few outstanding items:

  1. Reduce the code repetition if possible
  2. Merge up to resolve conflicts

@blaginin
Copy link
Contributor Author

Want to merge #14781 and #15159 first so making this as a draft

@blaginin blaginin marked this pull request as draft March 17, 2025 17:56
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 18, 2025
@Omega359
Copy link
Contributor

@blaginin I see that #14781 is still draft, this PR stillok otherwise?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) proto Related to proto crate labels May 23, 2025
@blaginin blaginin force-pushed the reuse-projection branch from d56e5f9 to 2fd800c Compare May 23, 2025 16:53
@github-actions github-actions bot removed sqllogictest SQL Logic Tests (.slt) proto Related to proto crate labels May 23, 2025
@blaginin blaginin self-assigned this May 23, 2025
@blaginin blaginin removed the Stale PR has not had any activity for some time label May 23, 2025
@blaginin blaginin marked this pull request as ready for review May 24, 2025 12:08
@blaginin
Copy link
Contributor Author

@blaginin I see that #14781 is still draft, this PR stillok otherwise?

yes, thanks for the reminder - there was some work happening around alias changes with unnest, so I left it stale. But I think this PR can go out now to make the logical plans nicer.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @blaginin

let project_plan = LogicalPlanBuilder::from(self.plan)
.project_with_validation(projection)?
.build()?;
let project_plan = if let LogicalPlan::Projection(Projection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear, the projections are flattened as part of the planning process right? So this PR mostly is about making logical EXPLAIN plans look nicer with with_column_renamed?

If so it seems like a lot of complexity for (just) that feature. Maybe we can switch to showing optimized explain plans or something for people who want to see a simpler version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may have been related to trying to improve planning performance by reducing the number of projections up front so planning doesn't need to do it. Improving the logical plan output is definitely nice though :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants