Add new stats pruning helpers to allow combining partition values in file level stats #16139
…ngStatistics for partition + file level stats pruning
@xudong963 any chance you can review this since you've already approved the same code (with fewer tests!) in the original PR?
Dropped some suggestions and comments.
datafusion/common/src/pruning.rs
Outdated
) -> Self {
    let num_containers = partition_values.len();
    let partition_schema = Arc::new(Schema::new(partition_fields));
    let mut partition_valeus_by_column =
- let mut partition_valeus_by_column =
+ let mut partition_values_by_column =
/// The outer vector represents the containers while the inner
/// vector represents the partition values for each column.
The constructor accepts partition_values as a nested Vec, documented as “outer vector represents the containers while the inner vector represents the partition values for each column.” In the code, however, each inner Vec is treated as the values for one container, and the constructor then transposes them into column-major storage.
The phrasing “inner vector represents the partition values for each column” can be read as “one column’s values across containers,” which is the opposite of what the code expects.
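To make the two layouts concrete, here is a minimal sketch of the row-major to column-major transpose described above. It is not the PR's code: the function name `transpose_partition_values` is hypothetical, and the generic `T` stands in for DataFusion's `ScalarValue` so the example compiles with the standard library alone.

```rust
/// Transpose per-container partition values (row-major input) into
/// per-column storage (column-major output).
///
/// Assumption: `T` is a stand-in for `ScalarValue`; the name of this
/// function is hypothetical and does not come from the PR.
fn transpose_partition_values<T: Clone>(
    partition_values: &[Vec<T>],
    num_columns: usize,
) -> Vec<Vec<T>> {
    let num_containers = partition_values.len();
    // One output vector per partition column, sized for every container.
    let mut by_column: Vec<Vec<T>> = (0..num_columns)
        .map(|_| Vec::with_capacity(num_containers))
        .collect();
    for container in partition_values {
        // Each inner Vec holds ONE container's values, one per partition column.
        assert_eq!(container.len(), num_columns, "one value per partition column");
        for (column_values, value) in by_column.iter_mut().zip(container) {
            column_values.push(value.clone());
        }
    }
    by_column
}
```

With three containers and two partition columns, the input `[["2024", "us"], ["2025", "eu"], ["2025", "us"]]` becomes `[["2024", "2025", "2025"], ["us", "eu", "us"]]`, which is why the doc comment should say that each inner vector belongs to one container.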
datafusion/common/src/pruning.rs
Outdated
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
    let index = self.schema.index_of(column.name()).ok()?;
    if self.statistics.iter().any(|s| {
        s.column_statistics
            .get(index)
            .is_some_and(|stat| stat.min_value.is_exact().unwrap_or(false))
    }) {
        match ScalarValue::iter_to_array(self.statistics.iter().map(|s| {
            s.column_statistics
                .get(index)
                .and_then(|stat| {
                    if let Precision::Exact(min) = &stat.min_value {
                        Some(min.clone())
                    } else {
                        None
                    }
                })
                .unwrap_or(ScalarValue::Null)
        })) {
            Ok(array) => Some(array),
            Err(_) => {
                log::warn!(
                    "Failed to convert min values to array for column {}",
                    column.name()
                );
                None
            }
        }
    } else {
        None
    }
}
Both PrunableStatistics::min_values and max_values walk the same steps:
1. Find the column index in the schema.
2. Check whether any Statistics entry has an “exact” value for that column.
3. Iterate over all Statistics, pulling out the exact values or substituting ScalarValue::Null.
4. Call ScalarValue::iter_to_array(...), and on error log a warning and return None.
By lifting steps (2)–(4) into a helper, we:
- Eliminate duplicate code in each method
- Centralize error handling and logging
- Make future changes (e.g. using a different logging framework) in one place
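One possible shape for such a helper is sketched below. This is an illustration under simplifying assumptions, not the PR's implementation: `Precision` and `ColumnStats` are cut-down stand-ins for the DataFusion types, `i64` replaces `ScalarValue`, `None` replaces `ScalarValue::Null`, and the final `iter_to_array` conversion (with its logging) is left out so the example depends only on the standard library. The helper name `exact_values` is hypothetical.

```rust
/// Simplified stand-in for DataFusion's `Precision<ScalarValue>`.
#[derive(Clone, Debug, PartialEq)]
enum Precision {
    Exact(i64),
    Inexact(i64),
    Absent,
}

/// Simplified stand-in for one column's statistics in one container.
struct ColumnStats {
    min_value: Precision,
    max_value: Precision,
}

/// Hypothetical shared helper covering steps (2) and (3): returns one entry
/// per container, or `None` when no container has an exact value.
fn exact_values(
    stats: &[Vec<ColumnStats>],
    index: usize,
    get: impl Fn(&ColumnStats) -> &Precision,
) -> Option<Vec<Option<i64>>> {
    let values: Vec<Option<i64>> = stats
        .iter()
        .map(|container| match container.get(index).map(|s| get(s)) {
            // Only exact statistics are usable for pruning.
            Some(Precision::Exact(v)) => Some(*v),
            // Inexact, absent, or missing column: the null placeholder.
            _ => None,
        })
        .collect();
    if values.iter().any(Option::is_some) {
        Some(values)
    } else {
        None
    }
}

/// The two public methods then differ only in which field they select.
fn min_values(stats: &[Vec<ColumnStats>], index: usize) -> Option<Vec<Option<i64>>> {
    exact_values(stats, index, |s| &s.min_value)
}

fn max_values(stats: &[Vec<ColumnStats>], index: usize) -> Option<Vec<Option<i64>>> {
    exact_values(stats, index, |s| &s.max_value)
}
```

Passing the field accessor as a closure keeps the error handling in a single function, so a change to how failures are reported would only touch `exact_values`.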
datafusion/common/src/pruning.rs
Outdated
let mut contained = Vec::with_capacity(self.partition_values.len());
for partition_value in partition_values {
    let contained_value = if values.contains(partition_value) {
        Some(true)
    } else {
        Some(false)
    };
    contained.push(contained_value);
}
let array = BooleanArray::from(contained);
Instead of an explicit loop, would a .map(...) chain followed by collect() be simpler?
let array = BooleanArray::from(
    partition_values
        .iter()
        .map(|pv| Some(values.contains(pv)))
        .collect::<Vec<_>>()
);
Benefits:
- Eliminates manual push logic
- More concise: transforms each pv into a boolean directly
- Clearly shows “map input → output” intent
fe0b8f1 to 8b7089f
Thank you @kosiew that was great feedback 😄
A step towards #16014