
Feature: Virtual datastores to reduce storage redundancies #88

@observingClouds

Description

Currently, mllam-data-prep mostly reorders and rechunks the input datasets for better performance. If the dataset is large, or training is to be done on different subsets of the available data, keeping several copies of very similar datasets can become challenging and expensive in terms of storage.

Creating reference datasets, i.e. datasets that only contain links/references to the original dataset, can be advantageous because no additional data is copied.
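To illustrate the reference idea independently of any particular tool: a kerchunk-style reference file maps each chunk key to a [url, offset, length] pointer into the original store, so a "subset" only copies those small entries, never the bytes. A minimal sketch with made-up paths and sizes:

```python
import json

# Sketch of a kerchunk-style reference file (paths/sizes are made up):
# metadata keys hold small JSON documents, chunk keys hold
# [url, offset, length] pointers into the original store,
# so no array data is duplicated.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        # chunks of variable "state" live in the original zarr store
        "state/0.0": ["file:///data/example.danra.zarr/state/0.0", 0, 4096],
        "state/0.1": ["file:///data/example.danra.zarr/state/0.1", 0, 4096],
    },
}

with open("index.json", "w") as f:
    json.dump(refs, f)

# A subset dataset is just a smaller set of pointers (plus adjusted
# metadata); the bytes themselves stay in the original store.
subset_refs = {k: v for k, v in refs["refs"].items()
               if k == ".zgroup" or k.startswith("state/0.0")}
print(len(subset_refs))
```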

This issue is meant to document the potential options and start a discussion on potential implementations.

Option 1: Create subset of mllam-data-prep zarr
This option requires the mllam-data-prep zarr to be chunked with a chunk size of 1 along the subsetting dimension, e.g. state_feature, since subsetting across chunk boundaries is not supported. This can be set in the mllam-data-prep input yaml file. After creation, subsets, e.g. for ablation studies, can be created by:
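The required chunking could look like the following in the input config. This is only a sketch; dimension names are examples and the exact key layout should be checked against the mllam-data-prep config schema:

```yaml
output:
  chunking:
    # chunk size 1 along the feature dimension so that subsets
    # can later be taken along chunk boundaries
    state_feature: 1
    time: 256
```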

import xarray as xr
from pathlib import Path

from obstore.store import LocalStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import ZarrParser
from virtualizarr.registry import ObjectStoreRegistry

zarr_store = str(Path.cwd() / "example.danra.zarr")
store = LocalStore(prefix=zarr_store)
registry = ObjectStoreRegistry({f"file://{zarr_store}": store})
parser = ZarrParser()
vds = open_virtual_dataset(url=zarr_store, registry=registry, parser=parser)

# Select a subset of the mllam-data-prep dataset, e.g. reduce the number of
# state features, and write the selection as kerchunk references
vds.isel(state_feature=slice(1, 2)).vz.to_kerchunk("index.json", format="json")

# New subsetted mllam-data-prep dataset containing only references to the
# original mllam-data-prep zarr (i.e. no extra copy of the data)
ds = xr.open_zarr("reference://", storage_options={"fo": "index.json"})

Edited: this seems to require mllam-data-prep zarr files to be written in zarr format 3

Option 2: Create dataset with references to source files
Option 1 requires a full "physical" creation of the largest training dataset, with only the subsets being links. Could a reference dataset be created directly from the source files? And how should variables that are precomputed, like statistics or time_of_day forcings, be handled?
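One possible answer for the precomputed variables, as a sketch rather than a settled design: the kerchunk reference format allows a value to be either a pointer into a source file or inline data (optionally `base64:`-encoded). Virtual references to the source files could therefore be combined with small, precomputed chunks such as statistics written inline. All paths and variable names below are hypothetical:

```python
import base64
import json
import struct

# Precomputed per-feature statistics, to be stored inline in the
# reference file rather than in any source file (values are made up).
stats = [0.5, 1.5]
stats_bytes = struct.pack("<2d", *stats)

refs = {
    "version": 1,
    "refs": {
        # virtual chunk: a [url, offset, length] pointer into a source file
        "state/0.0.0": ["file:///data/danra/t2m_1990.zarr/t2m/0.0", 0, 4096],
        # inline chunk: base64-encoded bytes embedded in the reference file,
        # useful for small precomputed variables like statistics
        "state__mean/0": "base64:" + base64.b64encode(stats_bytes).decode(),
    },
}

with open("combined_index.json", "w") as f:
    json.dump(refs, f)

# Round-trip the inline chunk to show it decodes back to the statistics
decoded = struct.unpack(
    "<2d", base64.b64decode(refs["refs"]["state__mean/0"][len("base64:"):])
)
print(decoded)
```

Whether such a mixed reference set can be produced conveniently (and kept consistent when statistics are recomputed) is exactly the open question this option raises.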
