Dataframe with_column and with_column_renamed performance improvements #14653
After spending more time reviewing the dataframe and logical plan code, I now think my assumption is incorrect: a dataframe can have a plan that was not normalized/columnized before with_column is called. Joins, windows, and aggregates are possible examples.
I suspect you're right that the assumption is incorrect. I've dug through the code a bit, but I'd probably need to write a unit test to verify.
I've made some changes locally that check whether the existing plan is a projection, but I realized I can't rely on that alone: the plan could have been built manually, wrapped in a DataFrame, and then had with_column called on it. For my approach to work, I need a strong guarantee that the most recent projection was created via the project(..) function in the builder, where normalization/columnization is guaranteed to have happened. I'm not sure yet how to do that.
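To make the concern above concrete, here is a minimal, self-contained sketch of the "remember who built the projection" idea. It is a toy model, not DataFusion code: `Plan`, `Frame`, `validate_all`, `projection_validated`, and `full_validations` are all hypothetical names, and the duplicate-name check only stands in for the real normalization/columnization work. A frame wrapped around an arbitrary plan starts out untrusted, so the first `with_column` call pays for full validation; later calls on the resulting frame can skip it.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { columns: Vec<String> },
    Projection { exprs: Vec<String>, input: Box<Plan> },
}

impl Plan {
    // Output column names of this plan node.
    fn columns(&self) -> &[String] {
        match self {
            Plan::Scan { columns } => columns,
            Plan::Projection { exprs, .. } => exprs,
        }
    }
}

#[derive(Debug)]
struct Frame {
    plan: Plan,
    // True only when the top-level projection was built by `with_column`,
    // which always validates. Plans wrapped from outside start as false.
    projection_validated: bool,
    // Counts how often the expensive full validation ran (for illustration).
    full_validations: usize,
}

// Stand-in for the expensive normalize/columnize step (an assumption):
// rejects empty and duplicate column names.
fn validate_all(exprs: &[String]) -> Result<(), String> {
    for (i, e) in exprs.iter().enumerate() {
        if e.is_empty() {
            return Err(format!("expression {} has an empty name", i));
        }
        if exprs[..i].contains(e) {
            return Err(format!("duplicate column name: {}", e));
        }
    }
    Ok(())
}

impl Frame {
    // Wrapping an arbitrary plan gives no guarantee about how it was built.
    fn from_plan(plan: Plan) -> Self {
        Frame { plan, projection_validated: false, full_validations: 0 }
    }

    fn with_column(self, name: &str) -> Result<Frame, String> {
        let mut exprs: Vec<String> = self.plan.columns().to_vec();
        if !exprs.iter().any(|c| c == name) {
            exprs.push(name.to_string());
        }
        let mut full_validations = self.full_validations;
        if !self.projection_validated {
            // Untrusted input plan: validate every expression.
            validate_all(&exprs)?;
            full_validations += 1;
        } else if name.is_empty() {
            // Trusted input plan: only the new column needs checking.
            return Err("new column has an empty name".to_string());
        }
        Ok(Frame {
            plan: Plan::Projection { exprs, input: Box::new(self.plan) },
            projection_validated: true,
            full_validations,
        })
    }
}
```

In this sketch, chaining many `with_column` calls on a frame created via `Frame::from_plan` runs `validate_all` once rather than once per call, which is the kind of saving the discussion is after. Whether a real plan can offer the same guarantee is exactly the open question raised above.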
# Conflicts: # datafusion/core/src/dataframe/mod.rs
This should be ready for review.
Thanks @Omega359 -- this makes sense to me.
I think it would be nice to avoid a new pub function and add a few more comments, but I also don't think that is required.
@@ -68,8 +67,7 @@ fn run(column_count: u32, ctx: Arc<SessionContext>) {
 }

 fn criterion_benchmark(c: &mut Criterion) {
-    // 500 takes far too long right now
-    for column_count in [10, 100, 200 /* 500 */] {
+    for column_count in [10, 100, 200, 500] {
🎉
datafusion/core/src/dataframe/mod.rs
Outdated
@@ -183,6 +183,8 @@ pub struct DataFrame {
     // Box the (large) SessionState to reduce the size of DataFrame on the stack
     session_state: Box<SessionState>,
     plan: LogicalPlan,
+    // whether we can skip validation for projection ops
Could you add some additional comments here about what circumstances permit validation to be skipped?
Updated, please review text when you have a chance.
Thanks again @Omega359
Which issue does this PR close?
If there are any dataframe experts here, I would love a review of my assumptions, as noted in #14563 (comment).
Rationale for this change
Improve performance for with_column and with_column_renamed dataframe functions.
What changes are included in this PR?
Code
Are these changes tested?
Existing tests
Are there any user-facing changes?
No.