Use zero-copy slicing for CoordValues instead of Vec cloning #21

@jayendra13

Summary

CoordValues::slice() in src/reader/coord.rs:356 copies the requested range into a new Vec for every non-compact variant. Each sliced coordinate therefore allocates and copies memory proportional to the slice length, even when a read-only view into the original data would suffice.

Current Behavior

// coord.rs:383-386
CoordValues::Int64(v) => CoordValues::Int64(v[start..end].to_vec()),
CoordValues::Float32(v) => CoordValues::Float32(v[start..end].to_vec()),
CoordValues::Float64(v) => CoordValues::Float64(v[start..end].to_vec()),
CoordValues::TimestampMicros(v) => CoordValues::TimestampMicros(v[start..end].to_vec()),

Similarly, create_coord_dictionary_typed() (line ~493) clones vals when building Arrow arrays:

CoordValues::Int64(vals) => {
    let values_array = Int64Array::from(vals.clone());  // full clone
    ...
}

And as_i64_vec() / as_f64_vec() clone entire vectors:

CoordValues::Int64(v) => v.clone(),  // coord.rs:394
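
To make the cost concrete, here is a minimal std-only sketch of what the `to_vec()` pattern does. The helpers `copy_slice` and `shares_memory` are hypothetical names for illustration and are not part of coord.rs.

```rust
/// Hypothetical stand-in for the Vec-backed variants:
/// copies the requested range into a freshly allocated Vec.
fn copy_slice(values: &[i64], start: usize, end: usize) -> Vec<i64> {
    values[start..end].to_vec()
}

/// Returns true when `part` starts inside the memory occupied by `whole`.
fn shares_memory(whole: &[i64], part: &[i64]) -> bool {
    let range = whole.as_ptr_range();
    let p = part.as_ptr();
    !part.is_empty() && range.start <= p && p < range.end
}
```

A borrowed sub-slice such as `&data[100..200]` passes the `shares_memory` check against `data`, while the result of `copy_slice(&data, 100, 200)` does not, because it lives in its own allocation.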

Proposed Change

Option A: Use Arrow Buffers internally

Replace Vec<T> storage with Arrow Buffer (or ScalarBuffer<T>), which supports zero-copy slicing via offset+length:

pub enum CoordValues {
    Compact { encoding: CompactCoord, is_timestamp: bool },
    Int64(ScalarBuffer<i64>),
    Float32(ScalarBuffer<f32>),
    Float64(ScalarBuffer<f64>),
    TimestampMicros(ScalarBuffer<i64>),
}

impl CoordValues {
    pub fn slice(&self, start: usize, end: usize) -> CoordValues {
        match self {
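            // ScalarBuffer::slice takes (offset, len), not (start, end).
            CoordValues::Int64(buf) => CoordValues::Int64(buf.slice(start, end - start)),
            CoordValues::Float32(buf) => CoordValues::Float32(buf.slice(start, end - start)),
            CoordValues::Float64(buf) => CoordValues::Float64(buf.slice(start, end - start)),
            CoordValues::TimestampMicros(buf) => CoordValues::TimestampMicros(buf.slice(start, end - start)),
            // ... Compact variant handled as before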
        }
    }
}

ScalarBuffer::slice() returns a view that shares the underlying allocation: only the offset and length change, so no coordinate data is copied.

Option B: Use Arc<[T]> with offset/length

Store shared ownership with a view:

pub enum CoordValues {
    Compact { encoding: CompactCoord, is_timestamp: bool },
    Int64 { data: Arc<[i64]>, offset: usize, len: usize },
    // ...
}
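
As a rough illustration of Option B, the view logic could be factored into one generic type rather than repeated per variant. This is a sketch under assumptions: `SharedSlice`, `from_vec`, `slice`, and `as_slice` are hypothetical names and do not exist in the codebase.

```rust
use std::sync::Arc;

/// Hypothetical view type: shared data plus an offset/length window.
pub struct SharedSlice<T> {
    data: Arc<[T]>,
    offset: usize,
    len: usize,
}

impl<T> SharedSlice<T> {
    /// Wraps a Vec. Converting into Arc<[T]> moves the elements into the
    /// shared allocation once; later slices never copy them again.
    pub fn from_vec(values: Vec<T>) -> Self {
        let len = values.len();
        Self { data: Arc::from(values), offset: 0, len }
    }

    /// Returns a view over [start, end), relative to this view.
    /// Only the reference count, offset, and length change.
    pub fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + start,
            len: end - start,
        }
    }

    /// Borrows the elements covered by this view.
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```

Unlike Option A, a view like this still has to be copied into an Arrow buffer when building arrays, which is part of the trade-off weighed in the recommendation.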

Recommendation

Option A is preferred because:

  • The arrow crate, which provides ScalarBuffer, is already a dependency
  • Direct conversion to Arrow arrays without cloning
  • Int64Array::new(scalar_buffer, None) is zero-copy

Impact

For a time coordinate with 87,600 hourly values (10 years):

  • Current: A slice() over the full range allocates ~700KB (87,600 × 8 bytes), and each clone() in dictionary building allocates another ~700KB
  • Proposed: Zero allocation for slicing, zero-copy Arrow array construction
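
The ~700KB figure can be checked with a short calculation. The helper names below are hypothetical, and the year length assumes 365-day years as the 87,600 count implies.

```rust
use std::mem::size_of;

/// Number of hourly samples in `years` 365-day years.
fn hourly_values(years: usize) -> usize {
    years * 365 * 24
}

/// Bytes copied when `len` elements of type T are cloned into a new Vec.
fn copied_bytes<T>(len: usize) -> usize {
    len * size_of::<T>()
}
```

For the i64 and f64 variants this gives 87,600 × 8 = 700,800 bytes per full copy. A Float32 coordinate of the same length would copy half that.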

Files to Modify

  • src/reader/coord.rs — Change CoordValues variants from Vec<T> to ScalarBuffer<T>, update slice(), as_i64_vec(), as_f64_vec(), create_coord_dictionary_typed()
  • src/reader/zarr_reader.rs — Update call sites that construct CoordValues (convert Vec to ScalarBuffer at creation time)
  • src/reader/filter.rs — Update CoordValuesRef to work with ScalarBuffer slices

Motivation

Inspired by Vortex's SliceArray which provides zero-copy views without data duplication. Arrow's buffer types already provide this capability — we just need to use them instead of raw Vecs.
