Perf: Dataframe with_column and with_column_renamed are slow #14563
Comments
If someone would be so kind as to generate a flamegraph for the benchmark, it would be appreciated. I'm unable to under WSL2 without doing a serious amount of hackery to my system.
In looking into this issue I have a question for the db experts who happen to be following it. The with_column code builds a projection over the full schema, normalizing every column along the way; this essentially applies to any fn that calls LogicalPlanBuilder.project.

My question is: could we instead have a way to tell the project that 'hey, these expressions are fine, trust me' and only do the work for the expression(s) that are new?
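As a toy illustration of the cost being discussed (toy types only; this is not DataFusion's actual `with_column` or `normalize_col`, and every name below is made up for the sketch): if each call rebuilds the projection and re-normalizes all existing expressions, then building a frame column by column does quadratic work overall.

```rust
// Toy model of the cost under discussion; these are NOT DataFusion's real
// types. The point: if every with_column call rebuilds the projection over
// all existing columns and re-normalizes each expression, adding N columns
// one at a time does O(N^2) normalization work in total.

// Stand-in for an expression; real code would hold an Expr tree.
type Expr = String;

// Stand-in for `normalize_col`, counting how often it runs.
fn normalize(e: &Expr, work: &mut usize) -> Expr {
    *work += 1;
    e.clone()
}

// Mimics with_column: project every existing column plus the new one,
// normalizing each expression from scratch.
fn with_column(cols: &[Expr], new: Expr, work: &mut usize) -> Vec<Expr> {
    let mut out: Vec<Expr> = cols.iter().map(|e| normalize(e, work)).collect();
    out.push(normalize(&new, work));
    out
}

// Total normalizations performed while adding n columns one by one.
fn total_work(n: usize) -> usize {
    let mut work = 0;
    let mut cols: Vec<Expr> = Vec::new();
    for i in 0..n {
        cols = with_column(&cols, format!("c{}", i), &mut work);
    }
    work
}

fn main() {
    // 1 + 2 + ... + 100 = 5050 normalizations for just 100 columns.
    assert_eq!(total_work(100), 5050);
    println!("normalizations for 100 columns: {}", total_work(100));
}
```

Adding N columns one at a time performs N(N+1)/2 normalizations, which is why wide DataFrames can take seconds per call.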
Okay, so I think the issue is that with every call the whole set of expressions gets re-processed (image attached in the original comment).

I feel like a good start would be to reuse the existing projection if it's already on the top. It won't cover all cases, but it will cover the majority (including the one in the benches). It could be something like this: https://github.com/apache/datafusion/compare/main...blaginin:datafusion:wip-reuse-projection?expand=1. I'll finish the code if that makes sense to you?
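A minimal sketch of that "reuse the projection on top" idea, with toy plan types standing in for DataFusion's `LogicalPlan` (real code would also need to check that the new expression does not depend on columns computed by the projection being extended):

```rust
// Toy plan types, not DataFusion's API. If the root of the plan is already
// a Projection, extend it in place instead of stacking another Projection
// on top, so repeated with_column calls keep the plan depth constant.

#[derive(Debug, Clone)]
enum Plan {
    Scan { table: String },
    Projection { exprs: Vec<String>, input: Box<Plan> },
}

fn with_column(plan: Plan, new_expr: String) -> Plan {
    match plan {
        // Root is already a projection: push the new expression into it.
        // (Real code must verify new_expr doesn't reference columns that
        // this same projection computes.)
        Plan::Projection { mut exprs, input } => {
            exprs.push(new_expr);
            Plan::Projection { exprs, input }
        }
        // Otherwise wrap the plan in a fresh projection once.
        other => Plan::Projection {
            exprs: vec![new_expr],
            input: Box::new(other),
        },
    }
}

fn depth(p: &Plan) -> usize {
    match p {
        Plan::Scan { .. } => 1,
        Plan::Projection { input, .. } => 1 + depth(input),
    }
}

fn main() {
    let mut plan = Plan::Scan { table: "t".to_string() };
    for i in 0..1000 {
        plan = with_column(plan, format!("c{}", i));
    }
    // The plan stays 2 nodes deep instead of growing to 1001.
    assert_eq!(depth(&plan), 2);
    println!("plan depth after 1000 with_column calls: {}", depth(&plan));
}
```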
Interesting. I tried a somewhat different approach: main...Omega359:arrow-datafusion:with_column_updates

It is much, much faster and it passes all the tests I can find, including my own, but it feels rather hackish to me. Essentially, I'm trying to avoid doing the work that I think isn't required (see above comment), but I don't know if this is actually correct or not.
I really like that idea, Bruce! I tried to break your branch, but everything seems to work 🙂

I think the issue was that on every rename, we tried to recursively normalize every column for the query, which is very expensive. You could also potentially normalize only your newly added columns and not touch the rest.

I think we can use this issue to make several nice optimizations that will complement each other. What do you think?
I'll be honest: I'm pretty out of my element with these changes. I don't know what is 'correct behaviour' here and what isn't. My thinking for the changes in my current branch was that any 'new' Expr (a parameter to with_column, with_column_renamed, etc.) would go through normalization; everything else, I would like to think, would already have been normalized, or else how would it be in the DataFrame? What worries me is that I don't know whether that assumption is correct.

I do know that so far my use case is covered: I haven't seen a failure yet, and the time it takes to build up a DataFrame is < 5 sec now versus 100-200 seconds before.
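The assumption described above can be sketched as an invariant, again with toy types rather than DataFusion's real `Expr`: everything already stored in the plan is trusted to have been normalized when it was added, so only the newly supplied expression pays the normalization cost.

```rust
// Sketch of the invariant behind the branch (toy types, not DataFusion's
// API): expressions already in the plan were normalized when they were
// added, so only the new expression needs normalizing now.

#[derive(Clone, Debug)]
struct Expr {
    name: String,
    normalized: bool,
}

fn normalize(mut e: Expr) -> Expr {
    // Real code would resolve the column against the schema here.
    e.normalized = true;
    e
}

// Trust existing expressions; normalize only the newcomer.
fn with_column(existing: Vec<Expr>, new: Expr) -> Vec<Expr> {
    debug_assert!(
        existing.iter().all(|e| e.normalized),
        "invariant: everything already in the plan was normalized on entry"
    );
    let mut out = existing;
    out.push(normalize(new));
    out
}

fn main() {
    let mut cols: Vec<Expr> = Vec::new();
    for i in 0..5 {
        cols = with_column(
            cols,
            Expr { name: format!("c{}", i), normalized: false },
        );
    }
    // Each call normalized exactly one expression, yet the invariant holds.
    assert!(cols.iter().all(|e| e.normalized));
    println!("columns: {:?}", cols.iter().map(|e| &e.name).collect::<Vec<_>>());
}
```

The `debug_assert!` is the crux: the optimization is only sound if nothing can enter the plan un-normalized, which is exactly the open question in the comment above.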
This should now be resolved with the changes from #14653 |
Describe the bug
Dataframe functions `.with_column` and `.with_column_renamed` (and possibly others) are slow. One can really see this in DataFrames with many, many columns, where a single `.with_column` call can take seconds.

Related: #7698
To Reproduce
Just time the function calls. A PR for a benchmark will be coming soon.
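A rough, self-contained way to time such calls (the `build_wide_frame` workload below is a hypothetical placeholder; in the real benchmark it would be a loop of `df.with_column(...)` calls) needs nothing beyond `std::time::Instant`:

```rust
// Quick timing sketch for "just time the function calls".
// build_wide_frame is a made-up placeholder workload standing in for
// repeated df.with_column(...) calls.

use std::time::Instant;

fn build_wide_frame(n_cols: usize) -> usize {
    let mut cols: Vec<String> = Vec::new();
    for i in 0..n_cols {
        // Clone every existing column to mimic the per-call re-processing.
        cols = cols.clone();
        cols.push(format!("c{}", i));
    }
    cols.len()
}

fn main() {
    for &n in &[100usize, 1_000] {
        let start = Instant::now();
        let built = build_wide_frame(n);
        println!("{} columns ({} exprs) built in {:?}", n, built, start.elapsed());
    }
}
```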
Expected behavior
DataFrame function calls should be fast, as fast as all other operations in DataFusion.
Additional context
No response