Summary
CoordValues::slice() in src/reader/coord.rs:356 clones data into new Vecs for all non-compact variants. This causes O(n) memory allocation for each sliced coordinate, even when only a view into the original data is needed.
Current Behavior
// coord.rs:383-386
CoordValues::Int64(v) => CoordValues::Int64(v[start..end].to_vec()),
CoordValues::Float32(v) => CoordValues::Float32(v[start..end].to_vec()),
CoordValues::Float64(v) => CoordValues::Float64(v[start..end].to_vec()),
CoordValues::TimestampMicros(v) => CoordValues::TimestampMicros(v[start..end].to_vec()),
Similarly, create_coord_dictionary_typed() (line ~493) clones vals when building Arrow arrays:
CoordValues::Int64(vals) => {
    let values_array = Int64Array::from(vals.clone()); // full clone
    ...
}
And as_i64_vec() / as_f64_vec() clone entire vectors:
CoordValues::Int64(v) => v.clone(), // coord.rs:394
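The difference between the copies above and a view can be sketched with the standard library alone. The helper names `copy_range` and `view_range` are hypothetical and do not exist in coord.rs; they only isolate what `to_vec()` does compared with borrowing the same range.

```rust
/// Copies `v[start..end]` into a fresh Vec, as the current `slice()` arms do.
/// Hypothetical helper, for illustration only.
fn copy_range(v: &[i64], start: usize, end: usize) -> Vec<i64> {
    // Allocates a new buffer and copies (end - start) elements into it.
    v[start..end].to_vec()
}

/// Borrows the same range without allocating or copying.
/// Hypothetical helper, for illustration only.
fn view_range(v: &[i64], start: usize, end: usize) -> &[i64] {
    &v[start..end]
}
```

A plain borrow like `view_range` ties the result to the lifetime of the source, so it cannot be stored in an owned enum variant as-is. That is why both options below rely on shared ownership of the underlying buffer.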
Proposed Change
Option A: Use Arrow Buffers internally
Replace Vec<T> storage with Arrow Buffer (or ScalarBuffer<T>), which supports zero-copy slicing via offset+length:
pub enum CoordValues {
    Compact { encoding: CompactCoord, is_timestamp: bool },
    Int64(ScalarBuffer<i64>),
    Float32(ScalarBuffer<f32>),
    Float64(ScalarBuffer<f64>),
    TimestampMicros(ScalarBuffer<i64>),
}

impl CoordValues {
    pub fn slice(&self, start: usize, end: usize) -> CoordValues {
        match self {
            CoordValues::Int64(buf) => CoordValues::Int64(buf.slice(start, end - start)),
            // ... zero-copy for all variants
        }
    }
}
ScalarBuffer::slice() returns a view (just adjusts offset+length) with no allocation.
Option B: Use Arc<[T]> with offset/length
Store shared ownership with a view:
pub enum CoordValues {
    Compact { encoding: CompactCoord, is_timestamp: bool },
    Int64 { data: Arc<[i64]>, offset: usize, len: usize },
    // ...
}
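For comparison, slicing under Option B could look roughly like the following. This is a minimal sketch, not a proposed implementation: `SharedSlice` and its methods are hypothetical names, and the data/offset/len fields are factored into one generic struct rather than repeated per variant.

```rust
use std::sync::Arc;

/// Hypothetical view type: shared data plus a window into it.
#[derive(Clone, Debug)]
pub struct SharedSlice<T> {
    data: Arc<[T]>,
    offset: usize,
    len: usize,
}

impl<T> SharedSlice<T> {
    /// Takes ownership of a Vec. Converting Vec<T> into Arc<[T]> moves the
    /// elements into a new allocation once, at creation time.
    pub fn from_vec(v: Vec<T>) -> Self {
        let len = v.len();
        Self { data: Arc::from(v), offset: 0, len }
    }

    /// Returns a sub-view of `start..end`, relative to this view.
    /// Only the reference count and two integers change; no data is copied.
    pub fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len, "slice range out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + start,
            len: end - start,
        }
    }

    /// Borrows the elements covered by this view.
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```

Note that the offsets compose: slicing a view yields a range relative to that view, not to the original buffer. Unlike Option A, turning such a view into an Arrow array would still require a conversion step.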
Recommendation
Option A is preferred because:
- Arrow, which provides ScalarBuffer, is already a dependency
- Direct conversion to Arrow arrays without cloning: Int64Array::new(scalar_buffer, None) is zero-copy
Impact
For a time coordinate with 87,600 hourly values (10 years):
- Current: Each slice() allocates ~700KB, each clone() in dictionary building allocates ~700KB
- Proposed: Zero allocation for slicing, zero-copy Arrow array construction
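The ~700KB figure follows from the element count times the element size. A small sketch of that arithmetic, where `copy_bytes` is a hypothetical helper used only for this estimate:

```rust
use std::mem::size_of;

/// Bytes of payload copied when cloning `n` values of type `T`.
/// Hypothetical helper; ignores allocator overhead.
fn copy_bytes<T>(n: usize) -> usize {
    n * size_of::<T>()
}
```

With 10 * 365 * 24 = 87,600 values, `copy_bytes::<i64>(87_600)` gives 700,800 bytes for an Int64 or TimestampMicros coordinate, and `copy_bytes::<f32>(87_600)` gives half that for Float32.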
Files to Modify
src/reader/coord.rs — Change CoordValues variants from Vec<T> to ScalarBuffer<T>, update slice(), as_i64_vec(), as_f64_vec(), create_coord_dictionary_typed()
src/reader/zarr_reader.rs — Update call sites that construct CoordValues (convert Vec to ScalarBuffer at creation time)
src/reader/filter.rs — Update CoordValuesRef to work with ScalarBuffer slices
Motivation
Inspired by Vortex's SliceArray, which provides zero-copy views without data duplication. Arrow's buffer types already provide this capability; we just need to use them instead of raw Vecs.