Perf: Dataframe with_column and with_column_renamed are slow #14563

Closed
Tracked by #14482
Omega359 opened this issue Feb 9, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@Omega359
Contributor

Omega359 commented Feb 9, 2025

Describe the bug

DataFrame functions `.with_column` and `.with_column_renamed` (and possibly others) are slow. This is especially visible in DataFrames with many columns, where a single `.with_column` call can take seconds.

Related: #7698

     Running benches/dataframe.rs (target/release/deps/dataframe-daaaeac4dbc7597d)
Gnuplot not found, using plotters backend
with_column_10          time:   [6.2305 ms 6.2916 ms 6.4057 ms]
                        change: [+15.699% +18.873% +21.723%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking with_column_100: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 13.5s.
with_column_100         time:   [1.3307 s 1.3633 s 1.3996 s]
                        change: [+8.2536% +11.473% +14.825%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking with_column_200: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 235.2s.
with_column_200         time:   [23.350 s 25.011 s 27.300 s]
                        change: [-46.710% -42.652% -37.407%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe

To Reproduce

Just time the function calls. A PR for a benchmark will be coming soon.

Expected behavior

DataFrame function calls should be fast: as fast as all other operations in DataFusion.

Additional context

No response

@Omega359
Contributor Author

If someone would be so kind as to generate a flamegraph for the benchmark, it would be appreciated. I'm unable to do so under WSL2 without doing a serious amount of hackery to my system.

@Omega359
Contributor Author

Omega359 commented Feb 11, 2025

While looking into this issue, I have a question for the database experts who happen to be following it.

The `with_column` code builds a `Vec<Expr>` called `fields` in the DataFrame and then projects all those fields in a new `LogicalPlan`, which is returned in a new `DataFrame`. That's great ... except that the projection seems to be doing far too much work for fields that are completely unrelated to the `Expr` passed into `with_column(self, name, expr)`. I am not an expert here by any means, but I can't see why preexisting expressions in the logical plan would need all the normalization/columnization that is performed, since I would think that had already happened. And if you are calling `with_column` hundreds of times, as I do, it is definitely doing that work over and over again.
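A back-of-the-envelope cost model (a std-only sketch, not DataFusion code; it just counts hypothetical normalization passes) shows why repeated `with_column` calls degrade quadratically if every call re-normalizes all preexisting expressions:

```rust
// Toy cost model: assume each `with_column` call re-normalizes every
// expression already in the projection plus the one new expression.
fn normalizations_for_call(existing_exprs: usize) -> usize {
    existing_exprs + 1
}

// Total normalization passes across `n` consecutive `with_column` calls.
fn total_normalizations(n: usize) -> usize {
    (0..n).map(normalizations_for_call).sum()
}

fn main() {
    // Under this model, 200 calls cost 20_100 passes, versus only 200
    // if each call normalized just its own new expression.
    assert_eq!(total_normalizations(200), 20_100);
    println!("200 calls -> {} passes", total_normalizations(200));
}
```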

This essentially applies to any function that calls `LogicalPlanBuilder.project`, such as `select`, `with_column`, and `with_column_renamed`.

My question is: could we instead have a way to tell the projection 'hey, these expressions are fine, trust me' and only do the work for the expression(s) that are new?

@blaginin
Contributor

There are a lot of `TreeNodeRecursion::visit_sibling` calls... this may be related to #13748?

[image attached]

@blaginin
Contributor

A stack trace may also help:

[image attached]

@blaginin
Contributor

Okay, so I think the issue is that with every `.with_column_renamed` / `.with_column` we add a new projection. That creates a lot of layers, and each time adding a new one is more expensive, because we need to revisit all the existing ones:

[image attached]

I feel like a good start would be to reuse the existing projection if it's already on top. It won't cover all cases, but it will cover the majority (including the one in the benches).

It could be something like this: https://github.com/apache/datafusion/compare/main...blaginin:datafusion:wip-reuse-projection?expand=1. I'll finish the code if that makes sense to you.
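To illustrate the reuse idea on a toy plan (a sketch with a made-up `Plan` enum, not DataFusion's `LogicalPlan`), extending a projection that is already on top keeps the plan shallow instead of stacking one layer per call:

```rust
// Toy plan: either a scan or a projection (a list of column names)
// over an input. Purely illustrative; not DataFusion types.
enum Plan {
    Scan,
    Projection(Vec<String>, Box<Plan>),
}

// Add a column: extend the top projection if one is already there,
// otherwise stack a new projection on top.
fn with_column(plan: Plan, name: &str) -> Plan {
    match plan {
        Plan::Projection(mut cols, input) => {
            cols.push(name.to_string());
            Plan::Projection(cols, input)
        }
        other => Plan::Projection(vec![name.to_string()], Box::new(other)),
    }
}

// Number of projection layers stacked on the scan.
fn depth(plan: &Plan) -> usize {
    match plan {
        Plan::Scan => 0,
        Plan::Projection(_, input) => 1 + depth(input),
    }
}

fn main() {
    let mut plan = Plan::Scan;
    for i in 0..100 {
        plan = with_column(plan, &format!("c{i}"));
    }
    // Without reuse the plan would be 100 projections deep; with reuse
    // it stays at a single projection layer.
    assert_eq!(depth(&plan), 1);
}
```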

@Omega359
Contributor Author

Omega359 commented Feb 11, 2025

Interesting. I tried a somewhat different approach: main...Omega359:arrow-datafusion:with_column_updates

It is much faster and passes all the tests I can find, including my own, but it feels rather hackish to me. Essentially, I'm trying to avoid doing work that I think isn't required (see the comment above), but I don't know whether this is actually correct.

@blaginin
Contributor

I really like that idea, Bruce! I tried to break your branch, but everything seems to work 🙂 I think the issue was that on every rename, we tried to recursively normalize every column in the query, which is very expensive. You could also potentially normalize only your newly added columns and not touch the rest; if you avoid `normalize_col`, it'll be even faster.

I think we can use this issue to make several nice optimizations that will complement each other:

  • The simplest one: do not call expensive operations when a column is already normalized. The most obvious example is this (it gives a ~30% improvement in your benchmarks; maybe I'll find other places as well).
  • Do not normalize columns (your PoC), which will boost the benchmarks further. But it will still require normalization in some cases ((..., true) in your current PR).
  • For the cases that do need normalization, we can improve things by reusing the existing projection (my PoC from yesterday). I think it's overall a good idea to keep the plan simple: we'll spend less time simplifying and executing it later.

What do you think?
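The first bullet above (skip expensive work when a column is already normalized) can be sketched as a cheap early-out. The qualifier check here is a made-up stand-in, not DataFusion's real logic:

```rust
// Toy criterion: treat a column as "normalized" if it already carries
// a relation qualifier (e.g. "t1.a" vs a bare "a"). Stand-in only.
fn is_normalized(col: &str) -> bool {
    col.contains('.')
}

// Normalize a column name, skipping the expensive path when possible.
fn normalize(col: &str, default_relation: &str) -> String {
    if is_normalized(col) {
        return col.to_string(); // cheap early-out: nothing to do
    }
    // Stand-in for the expensive schema walk that resolves the column.
    format!("{default_relation}.{col}")
}

fn main() {
    assert_eq!(normalize("a", "t1"), "t1.a"); // slow path: resolves qualifier
    assert_eq!(normalize("t1.a", "t1"), "t1.a"); // fast path taken
}
```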

@Omega359
Contributor Author

I'll be honest: I'm pretty out of my element with these changes. I don't know what is 'correct behaviour' and what isn't here. My thinking for the changes in my current branch was that any 'new' `Expr` (a parameter to `with_column`, `with_column_renamed`, etc.) would go through normalization; everything else, I would like to think, would already have been normalized, or else how would it be in the DataFrame? What worries me is that I don't know whether that assumption is correct.

I do know that so far my use case is covered: I haven't seen a failure yet, and building up the DataFrame now takes < 5 seconds versus 100-200 seconds before.

@Omega359
Contributor Author

This should now be resolved with the changes from #14653
