Add new stats pruning helpers to allow combining partition values in file level stats #16139

Open: wants to merge 3 commits into base: main

adriangb
Contributor

A step towards #16014

adriangb added 2 commits May 21, 2025 08:00
…ngStatistics for partition + file level stats pruning
@github-actions github-actions bot added the common Related to common crate label May 21, 2025
@adriangb
Contributor Author

@xudong963 any chance you can review this since you've already approved the same code (with less tests!) in the original PR?

Contributor

@kosiew kosiew left a comment

Dropped some suggestions and comments.

) -> Self {
    let num_containers = partition_values.len();
    let partition_schema = Arc::new(Schema::new(partition_fields));
    let mut partition_valeus_by_column =

Suggested change
- let mut partition_valeus_by_column =
+ let mut partition_values_by_column =

Comment on lines +160 to +161
/// The outer vector represents the containers while the inner
/// vector represents the partition values for each column.

The constructor accepts partition_values as a nested Vec, documented as “outer vector represents the containers while the inner vector represents the partition values for each column.” In the code, however, each inner Vec is treated as the values for one container, and the constructor then transposes them into column-major storage.

The phrasing “inner vector represents the partition values for each column” can be read as “one column’s values across containers,” which is not what the code does.
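
To make the intended layout concrete, here is a minimal sketch of the transposition using only the standard library. The function name `transpose_partition_values` and the generic element type are hypothetical stand-ins, not the code in this PR; the input is container-major (`rows[i][j]` is column `j` of container `i`) and the output is column-major.

```rust
/// Transpose container-major partition values into column-major storage.
///
/// `rows[i][j]` is the value of partition column `j` for container `i`;
/// the result holds `columns[j][i]`. Returns `None` if any row does not
/// have exactly `num_columns` entries.
/// (Hypothetical helper for illustration, not the PR's implementation.)
fn transpose_partition_values<T: Clone>(
    rows: &[Vec<T>],
    num_columns: usize,
) -> Option<Vec<Vec<T>>> {
    // One output vector per partition column, sized for all containers.
    let mut columns: Vec<Vec<T>> = (0..num_columns)
        .map(|_| Vec::with_capacity(rows.len()))
        .collect();
    for row in rows {
        if row.len() != num_columns {
            return None;
        }
        // Append this container's value to each column in turn.
        for (column, value) in columns.iter_mut().zip(row) {
            column.push(value.clone());
        }
    }
    Some(columns)
}
```

A doc comment along the lines of "the outer vector has one entry per container; each inner vector holds that container's value for every partition column" would match this input shape.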

Comment on lines 264 to 288
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
    let index = self.schema.index_of(column.name()).ok()?;
    if self.statistics.iter().any(|s| {
        s.column_statistics
            .get(index)
            .is_some_and(|stat| stat.min_value.is_exact().unwrap_or(false))
    }) {
        match ScalarValue::iter_to_array(self.statistics.iter().map(|s| {
            s.column_statistics
                .get(index)
                .and_then(|stat| {
                    if let Precision::Exact(min) = &stat.min_value {
                        Some(min.clone())
                    } else {
                        None
                    }
                })
                .unwrap_or(ScalarValue::Null)
        })) {
            Ok(array) => Some(array),
            Err(_) => {
                log::warn!(
                    "Failed to convert min values to array for column {}",
                    column.name()
                );
                None
            }
        }
    } else {
        None
    }
}

Both PrunableStatistics::min_values and max_values walk the same steps:

  1. Find the column index in the schema.
  2. Check whether any Statistics entry has an “exact” value for that column.
  3. Iterate over all Statistics, pulling out the exact values or substituting ScalarValue::Null.
  4. Call ScalarValue::iter_to_array(...) and log or return None on error.

By lifting steps (2)–(4) into a helper, we:

  • Eliminate duplicate code in each method
  • Centralize error handling and logging
  • Make future changes (e.g. using a different logging framework) in one place
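
A rough sketch of what such a helper could look like, written against simplified stand-in types so it compiles on its own. `Precision`, `ColumnStats`, and `exact_values` here are hypothetical and only mirror the shape of the real types; the conversion to an Arrow array and the logging on failure (step 4) are left out, and `i64` stands in for `ScalarValue`.

```rust
/// Simplified stand-in for a statistics precision wrapper (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Stand-in for the per-column statistics of one container (hypothetical).
#[derive(Clone, Debug)]
struct ColumnStats {
    min: Precision<i64>,
    max: Precision<i64>,
}

/// Shared helper: collect the exact value chosen by `select` for column
/// `index` from every container, using `None` where it is missing or not
/// exact. Returns `None` overall when no container has an exact value.
fn exact_values(
    stats: &[Vec<ColumnStats>],
    index: usize,
    select: impl Fn(&ColumnStats) -> &Precision<i64>,
) -> Option<Vec<Option<i64>>> {
    let exact_at = |container: &Vec<ColumnStats>| -> Option<i64> {
        match select(container.get(index)?) {
            Precision::Exact(v) => Some(*v),
            _ => None,
        }
    };
    let values: Vec<Option<i64>> = stats.iter().map(exact_at).collect();
    if values.iter().any(Option::is_some) {
        Some(values)
    } else {
        None
    }
}

/// `min_values` and `max_values` then differ only in the selector.
fn min_values(stats: &[Vec<ColumnStats>], index: usize) -> Option<Vec<Option<i64>>> {
    exact_values(stats, index, |s| &s.min)
}

fn max_values(stats: &[Vec<ColumnStats>], index: usize) -> Option<Vec<Option<i64>>> {
    exact_values(stats, index, |s| &s.max)
}
```

Passing the field selector as a closure keeps the two public methods as one-liners; in the real code the helper would also be the single place that calls ScalarValue::iter_to_array and logs a failure.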

Comment on lines 228 to 237
let mut contained = Vec::with_capacity(self.partition_values.len());
for partition_value in partition_values {
    let contained_value = if values.contains(partition_value) {
        Some(true)
    } else {
        Some(false)
    };
    contained.push(contained_value);
}
let array = BooleanArray::from(contained);

Instead of an explicit loop, would simplifying to a .map(...) chain followed by collect() be better?

let array = BooleanArray::from(
    partition_values
        .iter()
        .map(|pv| Some(values.contains(pv)))
        .collect::<Vec<_>>()
);

Benefits:

  • Eliminates manual push logic
  • More concise: transforms each pv into a boolean directly
  • Clearly shows “map input → output” intent
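
For comparison, a standalone sketch showing that the two forms produce the same result. It assumes plain `String` values and a `HashSet` in place of the real ScalarValue types and stops at the `Vec<Option<bool>>` that would be handed to BooleanArray::from; the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Loop form, mirroring the structure of the code under review.
fn contained_loop(
    partition_values: &[String],
    values: &HashSet<String>,
) -> Vec<Option<bool>> {
    let mut contained = Vec::with_capacity(partition_values.len());
    for pv in partition_values {
        contained.push(Some(values.contains(pv)));
    }
    contained
}

/// Iterator form: one boolean per partition value, in the same order.
fn contained_map(
    partition_values: &[String],
    values: &HashSet<String>,
) -> Vec<Option<bool>> {
    partition_values
        .iter()
        .map(|pv| Some(values.contains(pv)))
        .collect()
}
```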

@adriangb adriangb force-pushed the add-pruning-structs branch from fe0b8f1 to 8b7089f Compare May 23, 2025 13:57
@adriangb
Contributor Author

Thank you @kosiew that was great feedback 😄
