Dataframe with_column and with_column_renamed performance improvements #14653

Merged: 9 commits into apache:main on Feb 27, 2025

Conversation

@Omega359 (Contributor) commented Feb 13, 2025

Which issue does this PR close?

with_column_10          time:   [3.6602 ms 3.8457 ms 4.2130 ms]
with_column_100         time:   [34.874 ms 35.979 ms 38.018 ms]
with_column_200         time:   [183.49 ms 187.29 ms 191.65 ms]

If there are any dataframe experts here, I would love a review of my assumptions, as noted in #14563 (comment).

My thinking for the changes in my current branch was that any 'new' Expr (a parameter to with_column, with_column_renamed, etc.) would go through normalization, while everything else should already have been normalized; otherwise, how would it be in the DataFrame? What worries me is that I don't know whether that assumption is correct.
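
A minimal sketch of that assumption, using datafusion_expr's normalize_col to resolve only the caller-supplied expression against the plan (the helper name normalize_new_expr_only is illustrative, not code from this PR):

```rust
use datafusion_common::Result;
use datafusion_expr::expr_rewriter::normalize_col;
use datafusion_expr::{Expr, LogicalPlan};

// Illustrative only: normalize just the expression passed in by the caller,
// on the assumption that expressions already inside `plan` were normalized
// when the plan was built.
fn normalize_new_expr_only(new_expr: Expr, plan: &LogicalPlan) -> Result<Expr> {
    // Resolves unqualified column references in `new_expr` against the
    // plan's schema; the plan's own expressions are left untouched.
    normalize_col(new_expr, plan)
}
```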

Rationale for this change

Improve performance for with_column and with_column_renamed dataframe functions.

What changes are included in this PR?

Code

Are these changes tested?

Existing tests

Are there any user-facing changes?

No.

@github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels on Feb 13, 2025
@Omega359 (Contributor Author) commented:

After spending more time reviewing the dataframe and logical plan code, I have a feeling that my assumption is in fact not correct and that a dataframe can indeed have a plan that is not normalized/columnized prior to with_column being called. Joins, windows, and aggregates are possible examples.
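
As a hedged illustration of that concern (not code from this PR; the table and column names are made up), a join-produced plan can reach with_column without ever passing through the builder's project(..) path:

```rust
use datafusion::common::JoinType;
use datafusion::error::Result;
use datafusion::prelude::*;

async fn join_then_with_column(ctx: &SessionContext) -> Result<DataFrame> {
    let left = ctx.table("left_t").await?;
    let right = ctx.table("right_t").await?;
    // The top of the plan is now a Join, not a builder-made Projection, so
    // with_column cannot assume the plan's expressions were already normalized.
    let joined = left.join(right, JoinType::Inner, &["id"], &["id"], None)?;
    joined.with_column("flag", lit(true))
}
```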

@timsaucer self-requested a review on February 13, 2025 at 21:38
@timsaucer (Contributor) commented:

I suspect you're right about that assumption not being correct. I've dug through a bit, but I'd probably need to write up a unit test to verify.

@Omega359 closed this on Feb 16, 2025
@Omega359 reopened this on Feb 16, 2025
@Omega359 (Contributor Author) commented:

I've made some changes locally where I test whether the existing plan is a projection, but I realized that I can't rely on that either: a plan could have been constructed manually, wrapped in a DataFrame, and then had with_column called on it.

For my approach to work, I would need a way to strongly guarantee that the last projection was created via the builder's project(..) function, where normalization/columnization is guaranteed to have happened. I'm not sure right now how to do that.
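
One way to carry that kind of guarantee is to track provenance with a flag that is cleared only on construction paths known to have validated the projection. The sketch below is illustrative (the field and constructor names are made up), not the code this PR ends up with:

```rust
use datafusion_expr::LogicalPlan;

// Illustrative sketch: any externally supplied plan is treated conservatively
// and still gets validated by with_column and friends; only plans built by an
// internal, validated projection path may skip that work.
struct DataFrameSketch {
    plan: LogicalPlan,
    projection_requires_validation: bool,
}

impl DataFrameSketch {
    // Arbitrary, possibly hand-built plan: be conservative.
    fn new(plan: LogicalPlan) -> Self {
        Self {
            plan,
            projection_requires_validation: true,
        }
    }

    // Plan produced by the builder's validated project(..) path: safe to skip.
    fn new_from_validated_projection(plan: LogicalPlan) -> Self {
        Self {
            plan,
            projection_requires_validation: false,
        }
    }
}
```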

@Omega359 marked this pull request as ready for review on February 17, 2025 at 18:54
@Omega359 (Contributor Author) commented:

with_column_10          time:   [6.1112 ms 6.2616 ms 6.4226 ms]
                        change: [+18.276% +23.739% +29.703%] (p = 0.00 < 0.05)
with_column_100         time:   [41.379 ms 54.353 ms 66.683 ms]
                        change: [+18.326% +33.842% +55.521%] (p = 0.00 < 0.05)
with_column_200         time:   [198.59 ms 209.45 ms 224.33 ms]
                        change: [-1.0194% +5.1906% +12.830%] (p = 0.20 > 0.05)
with_column_500         time:   [3.5914 s 3.7758 s 4.0800 s]
                        change: [-13.580% -6.5454% +2.3214%] (p = 0.16 > 0.05)
                        No change in performance detected.

@Omega359 (Contributor Author) commented:

This should be ready for review.

@alamb (Contributor) left a comment:

Thanks @Omega359 -- this makes sense to me

I think it would be nice to avoid a new pub function and add a few more comments, but I also don't think that is required

@@ -68,8 +67,7 @@ fn run(column_count: u32, ctx: Arc<SessionContext>) {
 }

 fn criterion_benchmark(c: &mut Criterion) {
-    // 500 takes far too long right now
-    for column_count in [10, 100, 200 /* 500 */] {
+    for column_count in [10, 100, 200, 500] {
A contributor commented:

🎉
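
For context, a minimal criterion sketch of the parametrized loop this hunk touches; run's signature is taken from the hunk header, but its body and the benchmark setup are assumptions rather than the repository's exact benchmark code:

```rust
use std::sync::Arc;

use criterion::{criterion_group, criterion_main, Criterion};
use datafusion::prelude::SessionContext;

// Stand-in for the real benchmark body, which builds a DataFrame and adds
// `column_count` columns via with_column.
fn run(column_count: u32, ctx: Arc<SessionContext>) {
    let _ = (column_count, ctx);
}

fn criterion_benchmark(c: &mut Criterion) {
    let ctx = Arc::new(SessionContext::new());
    // 500 columns is now fast enough to include directly in the sweep
    for column_count in [10, 100, 200, 500] {
        c.bench_function(&format!("with_column_{column_count}"), |b| {
            b.iter(|| run(column_count, Arc::clone(&ctx)))
        });
    }
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```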

@@ -183,6 +183,8 @@ pub struct DataFrame {
     // Box the (large) SessionState to reduce the size of DataFrame on the stack
     session_state: Box<SessionState>,
     plan: LogicalPlan,
+    // whether we can skip validation for projection ops
A contributor commented:

Could you add some additional comments here about what circumstances permit validation to be skipped?

@Omega359 (Contributor Author) replied:

Updated, please review text when you have a chance.

@alamb merged commit 9fb8eae into apache:main on Feb 27, 2025
24 checks passed
@alamb (Contributor) commented Feb 27, 2025

Thanks again @Omega359
