Dataframe with_column and with_column_renamed performance improvements #14653

Merged: 9 commits into apache:main on Feb 27, 2025

Conversation

@Omega359 (Contributor) commented Feb 13, 2025

Which issue does this PR close?

with_column_10          time:   [3.6602 ms 3.8457 ms 4.2130 ms]
with_column_100         time:   [34.874 ms 35.979 ms 38.018 ms]
with_column_200         time:   [183.49 ms 187.29 ms 191.65 ms]

If there are any dataframe experts here, I would love a review of my assumptions, as noted in #14563 (comment).

My thinking for the changes in my current branch was that any 'new' Expr (a parameter to with_column, with_column_renamed, etc.) would go through normalization, while everything else should already have been normalized; otherwise, how would it be in the DataFrame? What worries me is that I don't know whether that assumption is correct.
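
A minimal sketch of that assumption, using datafusion_expr's normalize_col to resolve only the caller-supplied expression against the plan (the helper name normalize_new_expr_only is illustrative, not code from this PR):

```rust
use datafusion_common::Result;
use datafusion_expr::expr_rewriter::normalize_col;
use datafusion_expr::{Expr, LogicalPlan};

// Illustrative only: normalize just the expression passed in by the caller,
// on the assumption that expressions already inside `plan` were normalized
// when the plan was built.
fn normalize_new_expr_only(new_expr: Expr, plan: &LogicalPlan) -> Result<Expr> {
    // Resolves unqualified column references in `new_expr` against the
    // plan's schema; the plan's own expressions are left untouched.
    normalize_col(new_expr, plan)
}
```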

Rationale for this change

Improve performance for with_column and with_column_renamed dataframe functions.

What changes are included in this PR?

Code

Are these changes tested?

Existing tests

Are there any user-facing changes?

No.

@github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels on Feb 13, 2025
@Omega359 (Contributor Author) commented:

After spending more time reviewing the dataframe and logical plan code, I have a feeling that my assumption is in fact not correct and that a dataframe can indeed have a plan that is not normalized/columnized prior to with_column being called. Joins, windows, and aggregates are possible examples.
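
As a hedged illustration of that concern (not code from this PR; the table and column names are made up), a join-produced plan can reach with_column without ever passing through the builder's project(..) path:

```rust
use datafusion::common::JoinType;
use datafusion::error::Result;
use datafusion::prelude::*;

async fn join_then_with_column(ctx: &SessionContext) -> Result<DataFrame> {
    let left = ctx.table("left_t").await?;
    let right = ctx.table("right_t").await?;
    // The top of the plan is now a Join, not a builder-made Projection, so
    // with_column cannot assume the plan's expressions were already normalized.
    let joined = left.join(right, JoinType::Inner, &["id"], &["id"], None)?;
    joined.with_column("flag", lit(true))
}
```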

@timsaucer self-requested a review on February 13, 2025 at 21:38
@timsaucer (Contributor) commented:

I suspect you're right about that assumption not being correct. I've dug through a bit, but I'd probably need to write up a unit test to verify.

@Omega359 closed this on Feb 16, 2025
@Omega359 reopened this on Feb 16, 2025
@Omega359 (Contributor Author) commented:

I've made some changes locally where I test whether the existing plan is a projection, but I realized that I can't rely on that either: a plan could have been constructed manually, wrapped in a DataFrame, and then had with_column called on it.

For my approach to work, I would need a way to strongly guarantee that the last projection was created via the builder's project(..) function, where normalization/columnization is guaranteed to have happened. I'm not sure right now how to do that.
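
One way to carry that kind of guarantee is to track provenance with a flag that is cleared only on construction paths known to have validated the projection. The sketch below is illustrative (the field and constructor names are made up), not the code this PR ends up with:

```rust
use datafusion_expr::LogicalPlan;

// Illustrative sketch: any externally supplied plan is treated conservatively
// and still gets validated by with_column and friends; only plans built by an
// internal, validated projection path may skip that work.
struct DataFrameSketch {
    plan: LogicalPlan,
    projection_requires_validation: bool,
}

impl DataFrameSketch {
    // Arbitrary, possibly hand-built plan: be conservative.
    fn new(plan: LogicalPlan) -> Self {
        Self {
            plan,
            projection_requires_validation: true,
        }
    }

    // Plan produced by the builder's validated project(..) path: safe to skip.
    fn new_from_validated_projection(plan: LogicalPlan) -> Self {
        Self {
            plan,
            projection_requires_validation: false,
        }
    }
}
```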

@Omega359 marked this pull request as ready for review on February 17, 2025 at 18:54
@Omega359 (Contributor Author) commented:

with_column_10          time:   [6.1112 ms 6.2616 ms 6.4226 ms]
                        change: [+18.276% +23.739% +29.703%] (p = 0.00 < 0.05)
with_column_100         time:   [41.379 ms 54.353 ms 66.683 ms]
                        change: [+18.326% +33.842% +55.521%] (p = 0.00 < 0.05)
with_column_200         time:   [198.59 ms 209.45 ms 224.33 ms]
                        change: [-1.0194% +5.1906% +12.830%] (p = 0.20 > 0.05)
with_column_500         time:   [3.5914 s 3.7758 s 4.0800 s]
                        change: [-13.580% -6.5454% +2.3214%] (p = 0.16 > 0.05)
                        No change in performance detected.

@Omega359 (Contributor Author) commented:

This should be ready for review.

@alamb (Contributor) left a comment:

Thanks @Omega359 -- this makes sense to me

I think it would be nice to avoid a new pub function and add a few more comments, but I also don't think that is required

@@ -68,8 +67,7 @@ fn run(column_count: u32, ctx: Arc<SessionContext>) {
 }

 fn criterion_benchmark(c: &mut Criterion) {
-    // 500 takes far too long right now
-    for column_count in [10, 100, 200 /* 500 */] {
+    for column_count in [10, 100, 200, 500] {
A contributor commented:

🎉
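
For context, a minimal criterion sketch of the parametrized loop this hunk touches; run's signature is taken from the hunk header, but its body and the benchmark setup are assumptions rather than the repository's exact benchmark code:

```rust
use std::sync::Arc;

use criterion::{criterion_group, criterion_main, Criterion};
use datafusion::prelude::SessionContext;

// Stand-in for the real benchmark body, which builds a DataFrame and adds
// `column_count` columns via with_column.
fn run(column_count: u32, ctx: Arc<SessionContext>) {
    let _ = (column_count, ctx);
}

fn criterion_benchmark(c: &mut Criterion) {
    let ctx = Arc::new(SessionContext::new());
    // 500 columns is now fast enough to include directly in the sweep
    for column_count in [10, 100, 200, 500] {
        c.bench_function(&format!("with_column_{column_count}"), |b| {
            b.iter(|| run(column_count, Arc::clone(&ctx)))
        });
    }
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```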

@@ -183,6 +183,8 @@ pub struct DataFrame {
     // Box the (large) SessionState to reduce the size of DataFrame on the stack
     session_state: Box<SessionState>,
     plan: LogicalPlan,
+    // whether we can skip validation for projection ops
A contributor commented:

Could you add some additional comments here about what circumstances permit validation to be skipped?

@Omega359 (Contributor Author) replied:

Updated, please review text when you have a chance.

@alamb merged commit 9fb8eae into apache:main on Feb 27, 2025
24 checks passed
@alamb (Contributor) commented Feb 27, 2025

Thanks again @Omega359
