-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[parquet] Avoid a clone while resolving the read strategy #9056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| None => return RowSelectionStrategy::Selectors, | ||
| }; | ||
|
|
||
| let trimmed = selection.clone().trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this clone shows up in the traces for some of the clickbench queries. I think this is because there are 9000 selectors for these queries
| let total_rows: usize = selectors.iter().map(|s| s.row_count).sum(); | ||
| let selector_count = selectors.len(); | ||
| if selector_count == 0 { | ||
| let non_empty_selector_count = selection.iter().filter(|s| s.row_count > 0).count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good improvement! I didn’t notice this in my benchmarks, likely because the dataset was too small.
Small suggestion:
non_empty_selector_countis too long,effective_countcould sound better.- Using
foldto replace 2 seperate iteration might be better.
let (total_rows, effective_count) = selection.iter()
.fold((0, 0), |(rows, count), s| {
if s.row_count > 0 {
(rows + s.row_count, count + 1)
} else {
(rows, count)
}
});
if effective_count == 0 {
return RowSelectionStrategy::Mask;
}
if total_rows < effective_count.saturating_mul(threshold) {
RowSelectionStrategy::Mask
} else {
RowSelectionStrategy::Selectors
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea -- done in fe10c50
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark arrow_reader_row_filter |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
@hhhizzz do you think this PR is ready to merge? |
Yeah, I think its ready to merge. |
Thank you for the review ❤️ |
Which issue does this PR close?
Rationale for this change
While working on apache/datafusion#19477, and profiling ClickBench q7, I noticed that the RowSelectors was being cloned to resolve the strategy -- for a large number of selections this is expensive and shows up in the traces
We should change the code to avoid cloning the RowSelectors when resolving the strategy.
Changes
Don't clone / allocate while resolving the strategy.
I don't expect this to have a massive impact, but it did show up in the profile
FYI @hhhizzz -- perhaps you could review this PR
Are these changes tested?
Yes by CI
Are there any user-facing changes?
small performance improvement