Cache infer_schema() results per store path #23

@jayendra13

Summary

infer_schema() is called on every query against a ZarrTable, re-reading Zarr metadata files (.zarray, .zattrs, zarr.json, .zmetadata) and re-discovering array structure each time. For remote stores (S3/GCS), this adds multiple HTTP round-trips per query. The results should be cached since Zarr store structure rarely changes during a session.

Current Behavior

In src/datasource/zarr.rs, ZarrTable::try_new() calls infer_schema() once during table registration. However:

  1. Remote store metadata is fetched twice — once during try_new() for schema, and again in ZarrExec::execute() via discover_arrays() for array structure (shapes, chunk sizes, coordinates)
  2. discover_arrays() is called per execution: zarr_exec.rs:336 calls it every time a query runs against VirtualiZarr stores
  3. No cross-table caching — If the same Zarr store is registered under different names or queried via different ZarrTable instances, schema inference runs independently

Cost per infer_schema() call:

| Store Type | Operations | Estimated Latency |
|---|---|---|
| Local (v2) | Read .zarray + .zattrs per array, list directories | ~5-50ms |
| Local (v3) | Read zarr.json per array | ~5-50ms |
| Remote (S3/GCS) | LIST + GET per array (2-4 HTTP calls per array) | ~200-800ms |
| VirtualiZarr | Read + parse .zmetadata JSON | ~10-100ms |

For a store with 10 arrays, remote schema inference can take 2-8 seconds.
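
The 2-8 second figure follows from treating the remote row of the table as a per-array cost and assuming the per-array requests run one after another. A minimal sketch of that arithmetic (the function name is made up for illustration, and real stores may issue requests concurrently):

```rust
/// Rough latency range in milliseconds for remote schema inference,
/// assuming every array costs the same and arrays are fetched sequentially.
/// `per_array_ms` is a (low, high) estimate for one array.
fn remote_inference_ms(num_arrays: u64, per_array_ms: (u64, u64)) -> (u64, u64) {
    (num_arrays * per_array_ms.0, num_arrays * per_array_ms.1)
}
```

With 10 arrays and a (200, 800) ms per-array estimate this gives (2000, 8000) ms, which matches the range quoted above.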

Proposed Change

Add a metadata cache keyed by store path:

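use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::Instant;

// SchemaRef, Statistics, and Result are assumed to come from arrow/DataFusion;
// ArrayInfo, CoordValues, and infer_and_discover() are crate items (imports omitted).

/// Cached Zarr store metadata, keyed by store path
pub struct ZarrMetadataCache {
    // Arc lets callers keep an entry without holding the lock
    entries: RwLock<HashMap<String, Arc<CachedMetadata>>>,
}

pub struct CachedMetadata {
    schema: SchemaRef,
    array_info: Vec<ArrayInfo>,  // shapes, chunk sizes, dtypes
    coord_values: Vec<(String, CoordValues)>,  // pre-loaded coordinate values
    statistics: Statistics,
    cached_at: Instant,
}

impl ZarrMetadataCache {
    /// Get or compute metadata for a store path
    pub async fn get_or_infer(&self, path: &str) -> Result<Arc<CachedMetadata>> {
        // Check cache first; the read guard is released at the end of this statement
        let hit = self.entries.read().unwrap().get(path).cloned();
        if let Some(cached) = hit {
            return Ok(cached);
        }
        // Infer without holding a lock, then cache
        let meta = Arc::new(infer_and_discover(path).await?);
        let mut entries = self.entries.write().unwrap();
        // If a concurrent call already inserted this path, keep that entry
        let entry = entries.entry(path.to_string()).or_insert(meta);
        Ok(Arc::clone(entry))
    }

    /// Invalidate cache for a path (e.g., after data update)
    pub fn invalidate(&self, path: &str) {
        self.entries.write().unwrap().remove(path);
    }
}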

Integration points:

  1. ZarrTable holds an Arc<ZarrMetadataCache> shared across tables
  2. ZarrExec receives pre-cached metadata instead of calling discover_arrays() per execution
  3. CLI session creates one cache instance for the session lifetime
  4. Optional TTL — cache entries expire after configurable duration (default: no expiry within session)
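
Points 1 and 4 can be sketched together with the standard library alone. The block below is a simplified, synchronous stand-in and not the proposed implementation: `TtlCache`, `Entry`, and `get_or_load` are hypothetical names, the value type is generic instead of `CachedMetadata`, and the loader is a plain closure where the real code would await `infer_and_discover()`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

/// One cached value plus the time it was stored (hypothetical helper type).
struct Entry<V> {
    value: Arc<V>,
    cached_at: Instant,
}

/// Path-keyed cache with an optional time-to-live.
pub struct TtlCache<V> {
    ttl: Option<Duration>, // None = entries never expire within the session
    entries: RwLock<HashMap<String, Entry<V>>>,
}

impl<V> TtlCache<V> {
    pub fn new(ttl: Option<Duration>) -> Self {
        Self { ttl, entries: RwLock::new(HashMap::new()) }
    }

    /// Return the cached value for `path`, or run `load` and cache its result.
    pub fn get_or_load<E>(
        &self,
        path: &str,
        load: impl FnOnce() -> Result<V, E>,
    ) -> Result<Arc<V>, E> {
        // Fast path: a fresh entry exists. The read guard is dropped with this block.
        {
            let entries = self.entries.read().unwrap();
            if let Some(entry) = entries.get(path) {
                if self.is_fresh(entry) {
                    return Ok(Arc::clone(&entry.value));
                }
            }
        }
        // Slow path: load without holding any lock, then store the result.
        let value = Arc::new(load()?);
        let entry = Entry { value: Arc::clone(&value), cached_at: Instant::now() };
        self.entries.write().unwrap().insert(path.to_string(), entry);
        Ok(value)
    }

    /// Drop the entry for `path` so the next lookup reloads it.
    pub fn invalidate(&self, path: &str) {
        self.entries.write().unwrap().remove(path);
    }

    fn is_fresh(&self, entry: &Entry<V>) -> bool {
        match self.ttl {
            None => true,
            Some(ttl) => entry.cached_at.elapsed() < ttl,
        }
    }
}
```

Two tables that each hold a clone of the same `Arc<TtlCache<_>>` and ask for the same path trigger only one load between them. Errors from the loader are returned to the caller and nothing is cached, so a failed inference is retried on the next query. Two concurrent misses for the same path may both run the loader in this sketch; whether that needs deduplication is a design decision left open here.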

Impact

  • Remote stores: Eliminates 2-8 seconds of repeated HTTP calls per query
  • Interactive CLI: Schema discovery happens once on CREATE TABLE; subsequent queries reuse the cached metadata
  • Multiple queries: SELECT MIN(time) FROM era5; SELECT MAX(temp) FROM era5; — second query skips all metadata I/O
  • VirtualiZarr: .zmetadata JSON parsed once instead of per-query

Files to Modify

  • src/reader/schema_inference.rs — Add ZarrMetadataCache struct, CachedMetadata type
  • src/datasource/zarr.rs: ZarrTable holds Arc<ZarrMetadataCache>, passes to ZarrExec
  • src/physical_plan/zarr_exec.rs — Accept cached metadata, skip discover_arrays() when cache hit
  • src/bin/zarr_cli/main.rs — Create shared cache for CLI session

Motivation

Inspired by Vortex's CachedVortexMetadata, which avoids re-reading file footers across queries. The pattern is straightforward: metadata is read-heavy and write-rare, making it an ideal caching target. zarr-datafusion already partially caches remote store connections (cached_remote in ZarrExec), but doesn't cache the more expensive metadata discovery step.
