Perf: Dataframe with_column and with_column_renamed are slow #14563

Closed
Tracked by #14482
Omega359 opened this issue Feb 9, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@Omega359
Contributor

Omega359 commented Feb 9, 2025

Describe the bug

DataFrame functions `.with_column` and `.with_column_renamed` (and possibly others) are slow. This is especially visible in DataFrames with many columns, where a single `.with_column` call can take seconds.

Related: #7698

     Running benches/dataframe.rs (target/release/deps/dataframe-daaaeac4dbc7597d)
Gnuplot not found, using plotters backend
with_column_10          time:   [6.2305 ms 6.2916 ms 6.4057 ms]
                        change: [+15.699% +18.873% +21.723%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking with_column_100: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 13.5s.
with_column_100         time:   [1.3307 s 1.3633 s 1.3996 s]
                        change: [+8.2536% +11.473% +14.825%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking with_column_200: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 235.2s.
with_column_200         time:   [23.350 s 25.011 s 27.300 s]
                        change: [-46.710% -42.652% -37.407%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe

To Reproduce

Just time the function calls. A PR for a benchmark will be coming soon.

Expected behavior

DataFrame function calls should be fast: as fast as all other operations in DataFusion.

Additional context

No response

@Omega359
Contributor Author

If someone would be so kind as to generate a flamegraph for the benchmark, it would be appreciated. I'm unable to do so under WSL2 without doing a serious amount of hackery to my system.

@Omega359
Contributor Author

Omega359 commented Feb 11, 2025

While looking into this issue, I have a question for the database experts who happen to be following it.

The `with_column` code builds a `Vec<Expr>` called `fields` in the DataFrame and then projects all those fields in a new `LogicalPlan`, which is returned in a new `DataFrame`. That's great ... except that the projection seems to be doing far too much work for fields that are completely unrelated to the `Expr` passed into `with_column(self, name, expr)`. I am not an expert here by any means, but I can't see why preexisting expressions in the logical plan would need all the normalization/columnization that is performed, since I would think that had already happened. And if you are calling `with_column` hundreds of times, as I do, it is definitely doing that work over and over again.
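A back-of-the-envelope cost model (a std-only sketch, not DataFusion code; it just counts hypothetical normalization passes) shows why repeated `with_column` calls degrade quadratically if every call re-normalizes all preexisting expressions:

```rust
// Toy cost model: assume each `with_column` call re-normalizes every
// expression already in the projection plus the one new expression.
fn normalizations_for_call(existing_exprs: usize) -> usize {
    existing_exprs + 1
}

// Total normalization passes across `n` consecutive `with_column` calls.
fn total_normalizations(n: usize) -> usize {
    (0..n).map(normalizations_for_call).sum()
}

fn main() {
    // Under this model, 200 calls cost 20_100 passes, versus only 200
    // if each call normalized just its own new expression.
    assert_eq!(total_normalizations(200), 20_100);
    println!("200 calls -> {} passes", total_normalizations(200));
}
```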

This essentially applies to any function that calls `LogicalPlanBuilder.project`, such as `select`, `with_column`, and `with_column_renamed`.

My question is: could we instead have a way to tell the projection 'hey, these expressions are fine, trust me' and only do the work for the expression(s) that are new?

@blaginin
Contributor

There are a lot of `TreeNodeRecursion::visit_sibling` calls... this may be related to #13748?

[image attached]

@blaginin
Contributor

A stack trace may also help:

[image attached]

@blaginin
Contributor

Okay, so I think the issue is that with every `.with_column_renamed` / `.with_column` we add a new projection. That creates a lot of layers, and each time adding a new one is more expensive, because we need to revisit all the existing ones:

[image attached]

I feel like a good start would be to reuse the existing projection if it's already on top. It won't cover all cases, but it will cover the majority (including the one in the benches).

It could be something like this: https://github.com/apache/datafusion/compare/main...blaginin:datafusion:wip-reuse-projection?expand=1. I'll finish the code if that makes sense to you.
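To illustrate the reuse idea on a toy plan (a sketch with a made-up `Plan` enum, not DataFusion's `LogicalPlan`), extending a projection that is already on top keeps the plan shallow instead of stacking one layer per call:

```rust
// Toy plan: either a scan or a projection (a list of column names)
// over an input. Purely illustrative; not DataFusion types.
enum Plan {
    Scan,
    Projection(Vec<String>, Box<Plan>),
}

// Add a column: extend the top projection if one is already there,
// otherwise stack a new projection on top.
fn with_column(plan: Plan, name: &str) -> Plan {
    match plan {
        Plan::Projection(mut cols, input) => {
            cols.push(name.to_string());
            Plan::Projection(cols, input)
        }
        other => Plan::Projection(vec![name.to_string()], Box::new(other)),
    }
}

// Number of projection layers stacked on the scan.
fn depth(plan: &Plan) -> usize {
    match plan {
        Plan::Scan => 0,
        Plan::Projection(_, input) => 1 + depth(input),
    }
}

fn main() {
    let mut plan = Plan::Scan;
    for i in 0..100 {
        plan = with_column(plan, &format!("c{i}"));
    }
    // Without reuse the plan would be 100 projections deep; with reuse
    // it stays at a single projection layer.
    assert_eq!(depth(&plan), 1);
}
```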

@Omega359
Contributor Author

Omega359 commented Feb 11, 2025

Interesting. I tried a somewhat different approach: main...Omega359:arrow-datafusion:with_column_updates

It is much faster and passes all the tests I can find, including my own, but it feels rather hackish to me. Essentially, I'm trying to avoid doing work that I think isn't required (see the comment above), but I don't know whether this is actually correct.

@blaginin
Contributor

I really like that idea, Bruce! I tried to break your branch, but everything seems to work 🙂 I think the issue was that on every rename, we tried to recursively normalize every column in the query, which is very expensive. You could also potentially normalize only your newly added columns and not touch the rest; if you avoid `normalize_col`, it'll be even faster.

I think we can use this issue to make several nice optimizations that will complement each other:

  • The simplest one: do not call expensive operations when a column is already normalized. The most obvious example is this (it gives a ~30% improvement in your benchmarks; maybe I'll find other places as well).
  • Do not normalize columns (your PoC), which will boost the benchmarks further. But it will still require normalization in some cases ((..., true) in your current PR).
  • For the cases that do need normalization, we can improve things by reusing the existing projection (my PoC from yesterday). I think it's overall a good idea to keep the plan simple: we'll spend less time simplifying and executing it later.

What do you think?
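The first bullet above (skip expensive work when a column is already normalized) can be sketched as a cheap early-out. The qualifier check here is a made-up stand-in, not DataFusion's real logic:

```rust
// Toy criterion: treat a column as "normalized" if it already carries
// a relation qualifier (e.g. "t1.a" vs a bare "a"). Stand-in only.
fn is_normalized(col: &str) -> bool {
    col.contains('.')
}

// Normalize a column name, skipping the expensive path when possible.
fn normalize(col: &str, default_relation: &str) -> String {
    if is_normalized(col) {
        return col.to_string(); // cheap early-out: nothing to do
    }
    // Stand-in for the expensive schema walk that resolves the column.
    format!("{default_relation}.{col}")
}

fn main() {
    assert_eq!(normalize("a", "t1"), "t1.a"); // slow path: resolves qualifier
    assert_eq!(normalize("t1.a", "t1"), "t1.a"); // fast path taken
}
```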

@Omega359
Contributor Author

I'll be honest: I'm pretty out of my element with these changes. I don't know what is 'correct behaviour' and what isn't here. My thinking for the changes in my current branch was that any 'new' `Expr` (a parameter to `with_column`, `with_column_renamed`, etc.) would go through normalization; everything else, I would like to think, would already have been normalized, or else how would it be in the DataFrame? What worries me is that I don't know whether that assumption is correct.

I do know that so far my use case is covered: I haven't seen a failure yet, and building up the DataFrame now takes < 5 seconds versus 100-200 seconds before.

@Omega359
Contributor Author

This should now be resolved with the changes from #14653
