Skip to content

Use binary search for coordinate filter lookups #19

@jayendra13

Description

@jayendra13

Summary

find_value_index() in src/reader/filter.rs:569 uses iter().position() (linear scan) to find coordinate values during filter pushdown. Since coordinate arrays are sorted (an assumption the codebase already relies on for range filters), this should use binary search for O(log n) lookups instead of O(n).

Current Behavior

All 11 match arms in find_value_index() use linear search:

// filter.rs:572
vals.iter().position(|x| x == v)        // Int64
vals.iter().position(|x| (x - v).abs() < f64::EPSILON)  // Float64
// ... etc for all type combinations

Similarly, find_filter_range() at line 941 delegates to find_value_index() for equality filters and uses binary_search_lower_bound/binary_search_upper_bound for range filters — but those helper functions also appear to use linear approaches in some paths.

Proposed Change

Replace iter().position() with partition_point() (Rust's built-in binary search):

// For exact integer match:
let idx = vals.partition_point(|x| x < v);
if idx < vals.len() && vals[idx] == *v { Some(idx) } else { None }

// For float match with epsilon:
let idx = vals.partition_point(|x| *x < v - f64::EPSILON);
if idx < vals.len() && (vals[idx] - v).abs() < f64::EPSILON { Some(idx) } else { None }

The compact encoding path (CompactCoord::Arithmetic) already does O(1) lookup — this fix targets the explicit Vec<i64>, Vec<f64>, Vec<f32>, and Vec<i64> (timestamp) paths.

Impact

  • Small coordinates (lat=10, lon=10): Negligible difference
  • Large coordinates (e.g., hourly time series over 10 years = 87,600 values): 17 comparisons vs 87,600 — ~5000x faster per filter evaluation
  • Stacked filters (e.g., WHERE time = X AND lat = Y AND lon = Z): Multiplicative improvement

Files to Modify

  • src/reader/filter.rsfind_value_index() (line 569), verify binary_search_lower_bound and binary_search_upper_bound helpers

Motivation

Inspired by Vortex's search_sorted compute kernel which provides O(log n) lookups on sorted arrays. Since zarr-datafusion already assumes sorted coordinates for range filter correctness, binary search is safe to use here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions