Summary
Zarr store metadata is re-read for every query against a ZarrTable: infer_schema() reads the Zarr metadata files (.zarray, .zattrs, zarr.json, .zmetadata) at registration, and array structure is re-discovered again on each execution. For remote stores (S3/GCS), this adds multiple HTTP round-trips per query. The results should be cached, since Zarr store structure rarely changes during a session.
Current Behavior
In src/datasource/zarr.rs, ZarrTable::try_new() calls infer_schema() once during table registration. However:
- Remote store metadata is fetched twice — once during try_new() for schema, and again in ZarrExec::execute() via discover_arrays() for array structure (shapes, chunk sizes, coordinates)
- discover_arrays() is called per-execution — zarr_exec.rs:336 calls it every time a query runs against VirtualiZarr stores
- No cross-table caching — if the same Zarr store is registered under different names or queried via different ZarrTable instances, schema inference runs independently
Cost per infer_schema() call:
| Store Type | Operations | Estimated Latency |
|---|---|---|
| Local (v2) | Read .zarray + .zattrs per array, list directories | ~5-50ms |
| Local (v3) | Read zarr.json per array | ~5-50ms |
| Remote (S3/GCS) | LIST + GET per array (2-4 HTTP calls per array) | ~200-800ms |
| VirtualiZarr | Read + parse .zmetadata JSON | ~10-100ms |
For a store with 10 arrays, remote schema inference can take 2-8 seconds.
Proposed Change
Add a metadata cache keyed by store path:
```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::Instant;

use datafusion::arrow::datatypes::SchemaRef;
use datafusion::common::{Result, Statistics};

// ArrayInfo, CoordValues, and infer_and_discover() are assumed to come from
// the existing schema-inference code in src/reader/.

/// Cached Zarr store metadata, keyed by store path.
#[derive(Default)]
pub struct ZarrMetadataCache {
    entries: RwLock<HashMap<String, Arc<CachedMetadata>>>,
}

pub struct CachedMetadata {
    pub schema: SchemaRef,
    pub array_info: Vec<ArrayInfo>,               // shapes, chunk sizes, dtypes
    pub coord_values: Vec<(String, CoordValues)>, // pre-loaded coordinate values
    pub statistics: Statistics,
    pub cached_at: Instant,
}

impl ZarrMetadataCache {
    /// Get cached metadata for a store path, inferring it on first access.
    pub async fn get_or_infer(&self, path: &str) -> Result<Arc<CachedMetadata>> {
        // Fast path: the read lock is released before the await below.
        if let Some(cached) = self.entries.read().unwrap().get(path) {
            return Ok(Arc::clone(cached));
        }
        // Slow path: infer schema and discover arrays, then cache the result.
        // Concurrent misses may both run inference; the last write wins, which
        // is harmless because the result is idempotent.
        let meta = Arc::new(infer_and_discover(path).await?);
        self.entries
            .write()
            .unwrap()
            .insert(path.to_string(), Arc::clone(&meta));
        Ok(meta)
    }

    /// Invalidate the cache entry for a path (e.g., after a data update).
    pub fn invalidate(&self, path: &str) {
        self.entries.write().unwrap().remove(path);
    }
}
```
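A minimal usage sketch, assuming the types above; the store path is illustrative:

```rust
// Hypothetical usage inside some async context; the S3 path is an example only.
async fn example() -> Result<()> {
    let cache = Arc::new(ZarrMetadataCache::default());

    // First access reads store metadata and populates the cache.
    let meta = cache.get_or_infer("s3://bucket/era5.zarr").await?;

    // Later queries against the same store return the cached entry: no store I/O.
    let again = cache.get_or_infer("s3://bucket/era5.zarr").await?;
    assert!(Arc::ptr_eq(&meta, &again));

    // After the underlying data changes, drop the entry so it is re-inferred.
    cache.invalidate("s3://bucket/era5.zarr");
    Ok(())
}
```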
Integration points:
- ZarrTable holds an Arc<ZarrMetadataCache> shared across tables (see the sketch after this list)
- ZarrExec receives pre-cached metadata instead of calling discover_arrays() per execution
- CLI session creates one cache instance for the session lifetime
- Optional TTL — cache entries expire after a configurable duration (default: no expiry within a session)
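Taken together, the wiring could look roughly like the sketch below. ZarrTable, ZarrExec, and ZarrMetadataCache are the types named above; the field layout, the with_metadata constructor, the build_exec helper, and the TTL comment are hypothetical:

```rust
// Hypothetical integration sketch, not the crate's actual API.
pub struct ZarrTable {
    path: String,
    cache: Arc<ZarrMetadataCache>, // one instance shared by every table in the session
}

impl ZarrTable {
    pub async fn try_new(path: &str, cache: Arc<ZarrMetadataCache>) -> Result<Self> {
        // The only store-metadata fetch for this path in the session (cache miss).
        cache.get_or_infer(path).await?;
        Ok(Self { path: path.to_string(), cache })
    }

    async fn build_exec(&self) -> Result<ZarrExec> {
        // Every later query hits the cache: no .zarray/.zattrs/zarr.json I/O.
        let meta = self.cache.get_or_infer(&self.path).await?;
        // An optional TTL could be enforced here, e.g. re-infer when
        // meta.cached_at.elapsed() exceeds a configured Duration.
        Ok(ZarrExec::with_metadata(meta)) // hypothetical constructor
    }
}
```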
Impact
- Remote stores: Eliminates 2-8 seconds of repeated HTTP calls per query
- Interactive CLI: Schema discovery happens once on CREATE TABLE; subsequent queries are instant
- Multiple queries: SELECT MIN(time) FROM era5; SELECT MAX(temp) FROM era5; — second query skips all metadata I/O
- VirtualiZarr: .zmetadata JSON parsed once instead of per-query
Files to Modify
- src/reader/schema_inference.rs — Add ZarrMetadataCache struct, CachedMetadata type
- src/datasource/zarr.rs — ZarrTable holds Arc<ZarrMetadataCache>, passes it to ZarrExec
- src/physical_plan/zarr_exec.rs — Accept cached metadata, skip discover_arrays() on a cache hit
- src/bin/zarr_cli/main.rs — Create shared cache for the CLI session
Motivation
Inspired by Vortex's CachedVortexMetadata, which avoids re-reading file footers across queries. The pattern is straightforward: metadata is read-heavy and write-rare, making it an ideal caching target. zarr-datafusion already partially caches remote store connections (cached_remote in ZarrExec), but doesn't cache the more expensive metadata discovery step.