
Feature: Virtual datastores to reduce storage redundancies #88

@observingClouds

Description

Currently, mllam-data-prep mostly reorders and rechunks the input datasets for better performance. If the dataset is large, or training is to be done on different subsets of the available data, keeping several copies of very similar datasets can become challenging and expensive in terms of storage.

Creating reference datasets, i.e. datasets that only contain links/references to the original dataset, can be advantageous because no additional data is copied.
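To illustrate the reference idea independently of any particular tool: a kerchunk-style reference file maps each chunk key to a [url, offset, length] pointer into the original store, so a "subset" only copies those small entries, never the bytes. A minimal sketch with made-up paths and sizes:

```python
import json

# Sketch of a kerchunk-style reference file (paths/sizes are made up):
# metadata keys hold small JSON documents, chunk keys hold
# [url, offset, length] pointers into the original store,
# so no array data is duplicated.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        # chunks of variable "state" live in the original zarr store
        "state/0.0": ["file:///data/example.danra.zarr/state/0.0", 0, 4096],
        "state/0.1": ["file:///data/example.danra.zarr/state/0.1", 0, 4096],
    },
}

with open("index.json", "w") as f:
    json.dump(refs, f)

# A subset dataset is just a smaller set of pointers (plus adjusted
# metadata); the bytes themselves stay in the original store.
subset_refs = {k: v for k, v in refs["refs"].items()
               if k == ".zgroup" or k.startswith("state/0.0")}
print(len(subset_refs))
```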

This issue is meant to document the potential options and start a discussion on potential implementations.

Option 1: Create subset of mllam-data-prep zarr
This option requires the mllam-data-prep zarr to be chunked with a chunk size of 1 along the subsetting dimension, e.g. state_feature, since subsetting across chunk boundaries is not supported. This can be set in the mllam-data-prep input yaml file. After creation, subsets, e.g. for ablation studies, can be created by:
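The required chunking could look like the following in the input config. This is only a sketch; dimension names are examples and the exact key layout should be checked against the mllam-data-prep config schema:

```yaml
output:
  chunking:
    # chunk size 1 along the feature dimension so that subsets
    # can later be taken along chunk boundaries
    state_feature: 1
    time: 256
```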

import xarray as xr
from pathlib import Path

from obstore.store import LocalStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import ZarrParser
from virtualizarr.registry import ObjectStoreRegistry

zarr_store = str(Path.cwd() / "example.danra.zarr")
store = LocalStore(prefix=zarr_store)
registry = ObjectStoreRegistry({f"file://{zarr_store}": store})
parser = ZarrParser()
vds = open_virtual_dataset(url=zarr_store, registry=registry, parser=parser)

# Select a subset of the mllam-data-prep dataset, e.g. reduce the number of
# state features, and write the selection as kerchunk references
vds.isel(state_feature=slice(1, 2)).vz.to_kerchunk("index.json", format="json")

# New subsetted mllam-data-prep dataset containing only references to the
# original mllam-data-prep zarr (i.e. no extra copy of the data)
ds = xr.open_zarr("reference://", storage_options={"fo": "index.json"})

Edited: this seems to require mllam-data-prep zarr files to be written in zarr format 3

Option 2: Create dataset with references to source files
Option 1 requires a full "physical" creation of the largest training dataset, with only the subsets being links. Could a reference dataset be created directly from the source files? And how should variables that are precomputed, like statistics or time_of_day forcings, be handled?
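One possible answer for the precomputed variables, as a sketch rather than a settled design: the kerchunk reference format allows a value to be either a pointer into a source file or inline data (optionally `base64:`-encoded). Virtual references to the source files could therefore be combined with small, precomputed chunks such as statistics written inline. All paths and variable names below are hypothetical:

```python
import base64
import json
import struct

# Precomputed per-feature statistics, to be stored inline in the
# reference file rather than in any source file (values are made up).
stats = [0.5, 1.5]
stats_bytes = struct.pack("<2d", *stats)

refs = {
    "version": 1,
    "refs": {
        # virtual chunk: a [url, offset, length] pointer into a source file
        "state/0.0.0": ["file:///data/danra/t2m_1990.zarr/t2m/0.0", 0, 4096],
        # inline chunk: base64-encoded bytes embedded in the reference file,
        # useful for small precomputed variables like statistics
        "state__mean/0": "base64:" + base64.b64encode(stats_bytes).decode(),
    },
}

with open("combined_index.json", "w") as f:
    json.dump(refs, f)

# Round-trip the inline chunk to show it decodes back to the statistics
decoded = struct.unpack(
    "<2d", base64.b64decode(refs["refs"]["state__mean/0"][len("base64:"):])
)
print(decoded)
```

Whether such a mixed reference set can be produced conveniently (and kept consistent when statistics are recomputed) is exactly the open question this option raises.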
