Summary
find_value_index() in src/reader/filter.rs:569 uses iter().position() (linear scan) to find coordinate values during filter pushdown. Since coordinate arrays are sorted (an assumption the codebase already relies on for range filters), this should use binary search for O(log n) lookups instead of O(n).
Current Behavior
All 11 match arms in find_value_index() use linear search:
// filter.rs:572
vals.iter().position(|x| x == v) // Int64
vals.iter().position(|x| (x - v).abs() < f64::EPSILON) // Float64
// ... etc for all type combinations
Similarly, find_filter_range() at line 941 delegates to find_value_index() for equality filters and uses binary_search_lower_bound/binary_search_upper_bound for range filters — but those helper functions also appear to use linear approaches in some paths.
Proposed Change
Replace iter().position() with partition_point() (Rust's built-in binary search):
// For exact integer match:
let idx = vals.partition_point(|x| x < v);
if idx < vals.len() && vals[idx] == *v { Some(idx) } else { None }
// For float match with epsilon:
let idx = vals.partition_point(|x| *x < v - f64::EPSILON);
if idx < vals.len() && (vals[idx] - v).abs() < f64::EPSILON { Some(idx) } else { None }
The compact encoding path (CompactCoord::Arithmetic) already does O(1) lookup — this fix targets the explicit Vec<i64>, Vec<f64>, Vec<f32>, and Vec<i64> (timestamp) paths.
Impact
- Small coordinates (lat=10, lon=10): Negligible difference
- Large coordinates (e.g., hourly time series over 10 years = 87,600 values): 17 comparisons vs 87,600 — ~5000x faster per filter evaluation
- Stacked filters (e.g.,
WHERE time = X AND lat = Y AND lon = Z): Multiplicative improvement
Files to Modify
src/reader/filter.rs — find_value_index() (line 569), verify binary_search_lower_bound and binary_search_upper_bound helpers
Motivation
Inspired by Vortex's search_sorted compute kernel which provides O(log n) lookups on sorted arrays. Since zarr-datafusion already assumes sorted coordinates for range filter correctness, binary search is safe to use here.
Summary
find_value_index()insrc/reader/filter.rs:569usesiter().position()(linear scan) to find coordinate values during filter pushdown. Since coordinate arrays are sorted (an assumption the codebase already relies on for range filters), this should use binary search for O(log n) lookups instead of O(n).Current Behavior
All 11 match arms in
find_value_index()use linear search:Similarly,
find_filter_range()at line 941 delegates tofind_value_index()for equality filters and usesbinary_search_lower_bound/binary_search_upper_boundfor range filters — but those helper functions also appear to use linear approaches in some paths.Proposed Change
Replace
iter().position()withpartition_point()(Rust's built-in binary search):The compact encoding path (
CompactCoord::Arithmetic) already does O(1) lookup — this fix targets the explicitVec<i64>,Vec<f64>,Vec<f32>, andVec<i64>(timestamp) paths.Impact
WHERE time = X AND lat = Y AND lon = Z): Multiplicative improvementFiles to Modify
src/reader/filter.rs—find_value_index()(line 569), verifybinary_search_lower_boundandbinary_search_upper_boundhelpersMotivation
Inspired by Vortex's
search_sortedcompute kernel which provides O(log n) lookups on sorted arrays. Since zarr-datafusion already assumes sorted coordinates for range filter correctness, binary search is safe to use here.