Support Direct Loading of NetCDF Time-Series Data Without Conversion #74

Continued from a [Slack discussion](https://ml-lam.slack.com/archives/C07S3JHN2G6/p1736524014591389).

Currently, the NetCDF loaders in `mllam-data-prep` require the data to be stored in a single "long" file (i.e., all timestamps in one data element on disk).

In contrast, my dataset consists of NetCDF files with time-series data, where each file represents a single measurement (`Dims = [x, y, time]`, with `time` always being a single value). Instead of concatenating these files manually, I’m exploring ways to load them directly using a more flexible datastore approach.  

### Proposed Solution  
- Introduce a method to glob NetCDF files in the YAML config, mapping timestamps from filenames to a proper time dimension.  
- Alternatively, improve the existing datastore, or document how to use **Kerchunk** to create a reference-based dataset without redundant copies of the data.  
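
To make the first option concrete, here is a rough sketch of what the globbing step could do under the hood. It is not existing `mllam-data-prep` functionality. The filename pattern (`obs_20240101T0600.nc`), the regex, and the helper names are all hypothetical and would need adjusting to the real files. The only xarray pieces relied on are `open_mfdataset` with its `preprocess` hook and the `ds.encoding["source"]` entry that xarray fills in with the originating path.

```python
import glob
import re
from datetime import datetime
from pathlib import Path

# Hypothetical naming scheme, e.g. "obs_20240101T0600.nc"; adjust to the real files.
TIMESTAMP_RE = re.compile(r"(\d{8}T\d{4})")
TIMESTAMP_FORMAT = "%Y%m%dT%H%M"


def timestamp_from_filename(path):
    """Extract the timestamp encoded in a file name (assumed format above)."""
    match = TIMESTAMP_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"no timestamp found in filename: {path}")
    return datetime.strptime(match.group(1), TIMESTAMP_FORMAT)


def sorted_by_timestamp(paths):
    """Order files chronologically rather than alphabetically."""
    return sorted(paths, key=timestamp_from_filename)


def open_time_series(pattern):
    """Lazily open all files matching `pattern` as one dataset along `time`."""
    # Imported here so the filename helpers work without numpy/xarray installed.
    import numpy as np
    import xarray as xr

    paths = sorted_by_timestamp(glob.glob(pattern))

    def _set_time(ds):
        # xarray records the originating file path in ds.encoding["source"].
        ts = np.datetime64(timestamp_from_filename(ds.encoding["source"]))
        if "time" in ds.dims:
            # Overwrite the length-1 time coordinate with the filename value.
            return ds.assign_coords(time=[ts])
        # No time dimension yet: add one of length 1.
        return ds.expand_dims(time=[ts])

    return xr.open_mfdataset(
        paths, preprocess=_set_time, combine="nested", concat_dim="time"
    )
```

Something like `open_time_series("data/obs_*.nc")` would then return a single lazily loaded dataset with a proper `time` dimension, without writing a concatenated copy to disk.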

### Related Discussions  
- [Kerchunk-based solutions](https://fsspec.github.io/kerchunk/cases.html)
- Existing [notebook on NetCDF-to-Zarr conversion](https://github.com/mllam/mllam-docs/blob/main/ALARO_netCDF_to_zarr/kerchunk.ipynb)
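
For the Kerchunk route, a minimal sketch of the usual two-step pattern (one reference set per file, then a combine along `time`) might look like the following. I have not checked it against the linked notebook. It assumes the files are NetCDF4/HDF5 (NetCDF3 files would need a different Kerchunk scanner), that `x` and `y` are identical across files, and that `kerchunk`, `fsspec`, and `xarray` are installed. The function names are my own.

```python
def build_combined_references(nc_paths):
    """Scan each NetCDF4 file and merge the references along `time`."""
    # Imported lazily so this module can be imported without kerchunk installed.
    import fsspec
    from kerchunk.combine import MultiZarrToZarr
    from kerchunk.hdf import SingleHdf5ToZarr

    single_refs = []
    for path in sorted(str(p) for p in nc_paths):
        with fsspec.open(path, "rb") as f:
            # Records byte ranges of the chunks; the data itself is not copied.
            single_refs.append(SingleHdf5ToZarr(f, path).translate())

    return MultiZarrToZarr(
        single_refs,
        concat_dims=["time"],       # each file contributes one time step
        identical_dims=["x", "y"],  # assumed identical in every file
    ).translate()


def open_reference_dataset(refs):
    """Open the combined references as a lazily loaded xarray dataset."""
    import xarray as xr

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": refs},
        },
    )
```

The resulting reference dictionary can also be saved as JSON and reused, so the original NetCDF files are only scanned once.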

### Next Steps  
- Determine whether a more flexible datastore is needed or whether better documentation (e.g., a tutorial) would suffice.  
- Evaluate performance trade-offs of different loading methods.  
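
For the performance comparison, a small timing harness along these lines could be a starting point. It is only a sketch: the loader callables are placeholders, and each one should force the data to actually be read (for example by calling `.load()` on the dataset), since lazy opening alone would make every method look fast.

```python
import time
from statistics import median


def time_loader(load_fn, repeats=3):
    """Return the median wall-clock seconds over `repeats` calls of load_fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


def compare_loaders(loaders, repeats=3):
    """Time each named loader callable and return {name: median_seconds}."""
    return {name: time_loader(fn, repeats) for name, fn in loaders.items()}
```

Each entry in `loaders` would wrap one candidate, e.g. a concatenated file, a glob-based multi-file open, or a Kerchunk reference dataset.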

Would love input from others working on similar datasets!
