Skip to content

Initial ReGen model release #810

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 20, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Added ReGen score-based data assimilation example

### Changed

### Deprecated
Expand Down
160 changes: 160 additions & 0 deletions examples/weather/regen/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
# Generative Data Assimilation of Sparse Weather Station Observations at Kilometer Scales

Peter Manshausen, Yair Cohen, Jaideep Pathak, Mike Pritchard, Piyush Garg, Morteza
Mardani, Karthik Kashinath, Simon Byrne, Noah Brenowitz

[https://arxiv.org/abs/2406.16947](https://arxiv.org/abs/2406.16947), in submission.

![Method animation showing the denoising process](paper_figures/figures/denoising_process/method_anim/method.gif)

*Constructing an atmospheric state from noise, guided by observations. The noisy state
is denoised with the trained model, the denoised state evaluated with the observation
operator A, and from the difference with the observations supplied, the posterior score
is calculated. This score is used to update the noisy state, and the process is
repeated for N denoising steps.*

## Problem Overview

Data assimilation (DA) is the process of incorporating observational data into the full model
state. For numerical weather forecasting models, data assimilation is used to incorporate
weather station data into the initial model state. This can be a costly process, requiring
multiple evaluations of the computationally-expensive forecast to adjust the model state.

Score-based data assimilation (SDA), proposed by [Rozet and Louppe](https://arxiv.org/abs/2306.10574),
adapts the inference procedure of an unconditional generative diffusion model to
generate model states conditional on observations, without retraining the model. We
apply this to the context of a complex regional km-scale weather model, by training an
unconditional diffusion model on the output of the
[High Resolution Rapid Refresh (HRRR)](https://rapidrefresh.noaa.gov/hrrr/) reanalysis
product. Using SDA, we then incorporate sparse weather station data from the
[Integrated Surface Database (ISD)](https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database)
into the inference phase to produce maps of precipitation and surface winds.

Preliminary skill analysis shows the approach already outperforms a naive baseline of the
High-Resolution Rapid Refresh system itself. By incorporating observations from 40 weather
stations, 10% lower RMSEs on left-out stations are attained. Despite some lingering
imperfections such as insufficiently disperse ensemble DA estimates, we find the results
overall an encouraging proof of concept, and the first at km-scale. It is a ripe time to
explore extensions that combine increasingly ambitious regional state generators with an
increasing set of in situ, ground-based, and satellite remote sensing data streams.

## Project Structure

The project is organized into the following directories:

- `obs/`: Notebooks for ISD data processing and comparison to HRRR data
- `paper_figures/`: Scripts to reproduce the figures from the paper
- `sda/`: Code from the original SDA paper with minor adaptations
- `training/`: Scripts for training the model

## Getting Started

### Installation

Install required packages:

```bash
pip install -r requirements.txt
```

### Configuration Files

Machine-specific configuration files can be created in the `configs` directory,
defining the following variables:

```python
isd_path = "<path to isd data>"
path_to_pretrained = "<path to the pretrained model>"
path_to_model_state = "<path to model state from a training checkpoint>"
path_to_hrrr = "<path to Zarr file containing 2017 HRRR data>"
station_locations = "<path to station_locations_on_grid.nc generated by preprocess_isd.py>"
path_to_isp = "<path to ISD csv data>"
val_station_path = "<path to validation station locations generated by val_stations.py>"
```

See `configs/base.py` for an example configuration. Both `station_locations` and
`val_station_path` are checked into the code for simplicity.

### Data

#### High-Resolution Rapid Refresh (HRRR)

HRRR data is used for training and model evaluation. Data from 2018-2020 is used for training,
and 2017 is used for model evaluation. The model is trained on a 128x128 section of Oklahoma,
offset by (834, 353), using the following channels:

- 10 metre U wind component (`10u`)
- 10 metre V wind component (`10v`)
- Total precipitation (`tp`)

The training and inference scripts requires the data be converted into a Zarr format before
use, with one file per year of data. Example dimensions for the `2017.zarr` are shown below:

```bash
<xarray.Dataset> Size: 3GB
Dimensions: (time: 8760, channel: 6, y: 128, x: 128)
Coordinates:
* channel (channel) object 48B '10u' '10v' 'gust' 'tp' 'sp' 'refc'
* time (time) datetime64[ns] 70kB 2017-01-01T01:00:00 ... 2018-01-01
Dimensions without coordinates: y, x
Data variables:
HRRR (time, channel, y, x) float32 3GB dask.array<chunksize=(1, 6, 128, 128), meta=np.ndarray>
latitude (y, x) float32 66kB dask.array<chunksize=(128, 128), meta=np.ndarray>
```

#### Integrated Surface Database (ISD)

ISD data is used for inference. ISD data can be obtained from the
[NOAA Data Search](https://www.ncei.noaa.gov/access/search/data-search/global-hourly?bbox=37.197,-99.640,33.738,-95.393&pageNum=1&startDate=2017-01-01T23:59:59&endDate=2022-01-01T00:00:00&dataTypes=AA1&dataTypes=WND)
and downloaded in CSV format.

To download multiple stations:

1. Click "+Select All > Proceed to Cart"
2. Enter email
3. Hit submit
4. You will receive an email with a download link

The data then needs to be gridded to the model grid and interpolated to a common time
frequency. This is done using [obs/preprocess_isd.py](obs/preprocess_isd.py).

### Training

Training scripts are in the `training` directory. Configuration is handled via the YAML files
in the `training/config` directory. For example:

```bash
cd training
python3 train_diffusions.py \
--outdir /expts \
--tick 100 \
--config_file ./config/hrrr.yaml \
--config_name unconditional_diffusion_downscaling_a2s_v3_1_oklahoma \
--log_to_wandb True \
--run_id 0
```

With `log_to_wandb=True`, you'll want to specify your `wandb` entity and project in
`train_diffusions.py`.

### Inference

For an example of inferencing on both subsampled HRRR and ISD data, run the
[example_inference.py](example_inference.py) script. It requires the ISD and HRRR data, as
well as the state file of the trained model.

For running a full inference of the model across an entire year, use the `full_inference.py`
script. The output filename is specified within that script.

The code to reproduce the paper figures is in the [paper_figures/](paper_figures/) directory.
To score the output of `full_inference.py`, use the `score_inference.py` script:

```bash
cd paper_figures
python score_inference.py figure_data/scores/<_my_regen_model_name>
python score_inference.py -g truth figure_data/scores/hrrr
```

The aggregated scoring metrics (Fig. 7) can be plotted by specifying the
`scoring_target` in `figure-07/fig_metrics.py`, and running it will generate PDFs of
the figures.
33 changes: 33 additions & 0 deletions examples/weather/regen/configs/base.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

# Mounted under /data
isd_path = "/data/datasets/gridded_isd_oklahoma/2017_corrected.nc"
path_to_pretrained = "/data/datasets/hrrr_oklahoma/hrrr_v3/model_state_dict.pth"
path_to_model_state = "/path/to/my/training-state-000000.pt"
path_to_hrrr = "/data/datasets/hrrr_oklahoma/hrrr_v3/2017.zarr"
path_to_isp = "/data/sda/obs_data/3611864.csv"

# surface obs station data checked into the repo under paper_figures/figure_data/
repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
val_station_path = os.path.join(
repo_root, "paper_figures/evenmore_random_val_stations.npy"
) # generated by paper_figures/val_stations.py
station_locations = os.path.join(
repo_root, "obs/station_locations_on_grid.nc"
) # generated by obs/preprocess_isd.py
Loading