Commit 878fd04

Initial ReGen model release (#810)

* initial regen release
* add readme
* cleanup figures, use existing crps routine
* update changelog

1 parent 53d8288 commit 878fd04


50 files changed: +9271 −0 lines

CHANGELOG.md (+2)
```diff
@@ -10,6 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- Added ReGen score-based data assimilation example
+
 ### Changed
 
 ### Deprecated
```

examples/weather/regen/README.md (+160)

# Generative Data Assimilation of Sparse Weather Station Observations at Kilometer Scales

Peter Manshausen, Yair Cohen, Jaideep Pathak, Mike Pritchard, Piyush Garg, Morteza Mardani, Karthik Kashinath, Simon Byrne, Noah Brenowitz

[https://arxiv.org/abs/2406.16947](https://arxiv.org/abs/2406.16947), in submission.

![Method animation showing the denoising process](paper_figures/figures/denoising_process/method_anim/method.gif)

*Constructing an atmospheric state from noise, guided by observations. The noisy state is denoised with the trained model, the denoised state is evaluated with the observation operator A, and the posterior score is computed from its difference with the supplied observations. This score is used to update the noisy state, and the process is repeated for N denoising steps.*

## Problem Overview

Data assimilation (DA) is the process of incorporating observational data into the full model state. For numerical weather forecasting models, data assimilation is used to incorporate weather station data into the initial model state. This can be a costly process, requiring multiple evaluations of the computationally expensive forecast model to adjust the model state.

Score-based data assimilation (SDA), proposed by [Rozet and Louppe](https://arxiv.org/abs/2306.10574), adapts the inference procedure of an unconditional generative diffusion model to generate model states conditioned on observations, without retraining the model. We apply this in the context of a complex regional km-scale weather model by training an unconditional diffusion model on the output of the [High Resolution Rapid Refresh (HRRR)](https://rapidrefresh.noaa.gov/hrrr/) reanalysis product. Using SDA, we then incorporate sparse weather station data from the [Integrated Surface Database (ISD)](https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database) into the inference phase to produce maps of precipitation and surface winds.

A preliminary skill analysis shows that the approach already outperforms a naive baseline of the High-Resolution Rapid Refresh system itself: by incorporating observations from 40 weather stations, it attains 10% lower RMSEs on left-out stations. Despite some lingering imperfections, such as insufficiently disperse ensemble DA estimates, we find the results overall an encouraging proof of concept, and the first at km-scale. The time is ripe to explore extensions that combine increasingly ambitious regional state generators with a growing set of in situ, ground-based, and satellite remote sensing data streams.
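
For intuition, the guided sampling procedure described in the figure caption can be written in a few lines. The following is a minimal, hypothetical sketch, not the repository's API: `denoiser`, `obs_operator`, and the noise schedule `sigmas` are assumed placeholders, and the update is a plain Euler step on a variance-exploding diffusion.

```python
import torch

def sda_sample(denoiser, obs_operator, y_obs, sigmas, shape, guidance=1.0):
    """Hedged sketch of SDA-style guided sampling (after Rozet & Louppe).

    denoiser(x, sigma) -> estimate of the clean state x0 (Tweedie denoiser)
    obs_operator(x0)   -> simulated observations A(x0)
    y_obs              -> sparse station observations
    sigmas             -> decreasing noise levels, sigmas[0] largest
    """
    x = torch.randn(shape) * sigmas[0]  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x.detach().requires_grad_(True)
        x0 = denoiser(x, sigma)             # denoised estimate of the state
        prior_score = (x0 - x) / sigma**2   # score of the unconditional model
        # Likelihood term: compare A(x0) with the supplied observations
        residual = y_obs - obs_operator(x0)
        log_lik = -0.5 * (residual ** 2).sum()
        lik_score = torch.autograd.grad(log_lik, x)[0]
        posterior_score = prior_score + guidance * lik_score
        # Euler step of the probability-flow ODE toward lower noise
        x = (x + sigma * (sigma - sigma_next) * posterior_score).detach()
    return x
```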

## Project Structure

The project is organized into the following directories:

- `obs/`: Notebooks for ISD data processing and comparison to HRRR data
- `paper_figures/`: Scripts to reproduce the figures from the paper
- `sda/`: Code from the original SDA paper with minor adaptations
- `training/`: Scripts for training the model

## Getting Started

### Installation

Install required packages:

```bash
pip install -r requirements.txt
```

### Configuration Files

Machine-specific configuration files can be created in the `configs` directory, defining the following variables:

```python
isd_path = "<path to ISD data>"
path_to_pretrained = "<path to the pretrained model>"
path_to_model_state = "<path to model state from a training checkpoint>"
path_to_hrrr = "<path to Zarr file containing 2017 HRRR data>"
station_locations = "<path to station_locations_on_grid.nc generated by preprocess_isd.py>"
path_to_isp = "<path to ISD csv data>"
val_station_path = "<path to validation station locations generated by val_stations.py>"
```

See `configs/base.py` for an example configuration. Both `station_locations` and `val_station_path` are checked into the repository for simplicity.
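
Since each configuration is a plain Python module, one simple usage pattern (an assumption, not a documented API) is to import the machine-specific module directly:

```python
# Hypothetical usage sketch: swap `base` for your machine-specific config module
from configs import base as config

print(config.isd_path)       # gridded ISD observations
print(config.path_to_hrrr)   # yearly HRRR Zarr store
```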

### Data

#### High-Resolution Rapid Refresh (HRRR)

HRRR data is used for training and model evaluation. Data from 2018-2020 is used for training, and data from 2017 is used for model evaluation. The model is trained on a 128x128 section of Oklahoma, offset by (834, 353), using the following channels:

- 10 metre U wind component (`10u`)
- 10 metre V wind component (`10v`)
- Total precipitation (`tp`)

The training and inference scripts require the data to be converted into Zarr format before use, with one file per year of data. Example dimensions for `2017.zarr` are shown below:

```bash
<xarray.Dataset> Size: 3GB
Dimensions:   (time: 8760, channel: 6, y: 128, x: 128)
Coordinates:
  * channel   (channel) object 48B '10u' '10v' 'gust' 'tp' 'sp' 'refc'
  * time      (time) datetime64[ns] 70kB 2017-01-01T01:00:00 ... 2018-01-01
Dimensions without coordinates: y, x
Data variables:
    HRRR      (time, channel, y, x) float32 3GB dask.array<chunksize=(1, 6, 128, 128), meta=np.ndarray>
    latitude  (y, x) float32 66kB dask.array<chunksize=(128, 128), meta=np.ndarray>
```
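
A quick way to verify a converted store (a sketch assuming `xarray`, `zarr`, and `dask` are installed; the path is a placeholder):

```python
import xarray as xr

# Open the yearly HRRR Zarr store lazily (dask-backed)
ds = xr.open_zarr("/path/to/2017.zarr")

# Select the channels the model is trained on
fields = ds["HRRR"].sel(channel=["10u", "10v", "tp"])

# Pull a single timestep (the 128x128 Oklahoma crop) into memory
snapshot = fields.isel(time=0).load()
print(snapshot.shape)  # (3, 128, 128)
```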

#### Integrated Surface Database (ISD)

ISD data is used for inference. ISD data can be obtained from the [NOAA Data Search](https://www.ncei.noaa.gov/access/search/data-search/global-hourly?bbox=37.197,-99.640,33.738,-95.393&pageNum=1&startDate=2017-01-01T23:59:59&endDate=2022-01-01T00:00:00&dataTypes=AA1&dataTypes=WND) and downloaded in CSV format.

To download multiple stations:

1. Click "+Select All > Proceed to Cart"
2. Enter email
3. Hit submit
4. You will receive an email with a download link

The data then needs to be gridded to the model grid and interpolated to a common time frequency. This is done using [obs/preprocess_isd.py](obs/preprocess_isd.py).
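
As a rough illustration of that gridding step (a hedged sketch, not the logic of `preprocess_isd.py`; the column names follow the ISD global-hourly CSV format, and the grid files are hypothetical):

```python
import numpy as np
import pandas as pd

# Read the raw ISD global-hourly CSV; DATE parses to per-report timestamps
df = pd.read_csv("3611864.csv", parse_dates=["DATE"])

# Snap each report to the nearest hour so all stations share a time axis
df["time"] = df["DATE"].dt.round("1h")

# Map station lat/lon to the nearest model grid cell
# (grid_lat/grid_lon arrays of shape (128, 128) assumed given)
grid_lat = np.load("grid_lat.npy")
grid_lon = np.load("grid_lon.npy")
d2 = (grid_lat[None] - df["LATITUDE"].values[:, None, None]) ** 2 \
   + (grid_lon[None] - df["LONGITUDE"].values[:, None, None]) ** 2
iy, ix = np.unravel_index(d2.reshape(len(df), -1).argmin(axis=1), grid_lat.shape)
df["y"], df["x"] = iy, ix
```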

### Training

Training scripts are in the `training` directory. Configuration is handled via the YAML files in the `training/config` directory. For example:

```bash
cd training
python3 train_diffusions.py \
    --outdir /expts \
    --tick 100 \
    --config_file ./config/hrrr.yaml \
    --config_name unconditional_diffusion_downscaling_a2s_v3_1_oklahoma \
    --log_to_wandb True \
    --run_id 0
```

With `log_to_wandb=True`, you'll want to specify your `wandb` entity and project in `train_diffusions.py`.

### Inference

For an example of running inference on both subsampled HRRR and ISD data, run the [example_inference.py](example_inference.py) script. It requires the ISD and HRRR data, as well as the state file of the trained model.

To run a full inference of the model across an entire year, use the `full_inference.py` script. The output filename is specified within that script.

The code to reproduce the paper figures is in the [paper_figures/](paper_figures/) directory. To score the output of `full_inference.py`, use the `score_inference.py` script:

```bash
cd paper_figures
python score_inference.py figure_data/scores/<_my_regen_model_name>
python score_inference.py -g truth figure_data/scores/hrrr
```

The aggregated scoring metrics (Fig. 7) can be plotted by specifying the `scoring_target` in `figure-07/fig_metrics.py`; running it will generate PDFs of the figures.
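
The commit message notes that scoring reuses an existing CRPS routine. For reference, here is a self-contained sketch of the standard sample-based CRPS estimator (an illustration, not the repository's implementation):

```python
import numpy as np

def crps_ensemble(ensemble: np.ndarray, obs: float) -> float:
    """Sample-based CRPS for a 1-D ensemble at a single observation point.

    CRPS = E|X - y| - 0.5 * E|X - X'|, estimated from ensemble members.
    Lower is better; it reduces to absolute error for a single member.
    """
    skill = np.abs(ensemble - obs).mean()
    spread = np.abs(ensemble[:, None] - ensemble[None, :]).mean()
    return skill - 0.5 * spread

# Example: a 16-member ensemble scored against one station observation
rng = np.random.default_rng(0)
print(crps_ensemble(rng.normal(0.0, 1.0, 16), 0.3))
```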

examples/weather/regen/configs/base.py (+33)

```python
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

# Mounted under /data
isd_path = "/data/datasets/gridded_isd_oklahoma/2017_corrected.nc"
path_to_pretrained = "/data/datasets/hrrr_oklahoma/hrrr_v3/model_state_dict.pth"
path_to_model_state = "/path/to/my/training-state-000000.pt"
path_to_hrrr = "/data/datasets/hrrr_oklahoma/hrrr_v3/2017.zarr"
path_to_isp = "/data/sda/obs_data/3611864.csv"

# Surface obs station data checked into the repo under paper_figures/figure_data/
repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
val_station_path = os.path.join(
    repo_root, "paper_figures/evenmore_random_val_stations.npy"
)  # generated by paper_figures/val_stations.py
station_locations = os.path.join(
    repo_root, "obs/station_locations_on_grid.nc"
)  # generated by obs/preprocess_isd.py
```
