# Generative Data Assimilation of Sparse Weather Station Observations at Kilometer Scales

Peter Manshausen, Yair Cohen, Jaideep Pathak, Mike Pritchard, Piyush Garg, Morteza
Mardani, Karthik Kashinath, Simon Byrne, Noah Brenowitz

[https://arxiv.org/abs/2406.16947](https://arxiv.org/abs/2406.16947), in submission.

*Constructing an atmospheric state from noise, guided by observations. The noisy state
is denoised with the trained model, the denoised state is evaluated with the observation
operator A, and the posterior score is calculated from the difference with the supplied
observations. This score is used to update the noisy state, and the process is
repeated for N denoising steps.*
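As a toy illustration of this loop (not the actual model or code from this repository: the denoiser, guidance weight, and noise schedule below are all simplified stand-ins), the guided denoising can be sketched as:

```python
import numpy as np

def toy_denoiser(x_noisy, sigma):
    # Stand-in for the trained diffusion model: shrinks the noisy state
    # toward zero (a real model predicts the denoised atmospheric state).
    return x_noisy / (1.0 + sigma**2)

def guided_step(x, sigma, obs, obs_idx, guidance=0.5):
    """One step of the loop: denoise, apply the observation operator A
    (here: sample the state at observed points), and update with a crude
    posterior-score term built from the misfit to the observations."""
    x_denoised = toy_denoiser(x, sigma)
    misfit = np.zeros_like(x)
    misfit[obs_idx] = obs - x_denoised[obs_idx]  # obs - A(denoised state)
    return x_denoised + guidance * misfit

rng = np.random.default_rng(0)
obs_idx = np.array([2, 7, 11])            # sparse "station" locations
obs = np.array([1.0, -0.5, 0.3])          # their observed values

x = rng.normal(size=16)                   # start from pure noise
for sigma in np.linspace(1.0, 0.05, 20):  # N = 20 denoising steps
    x = guided_step(x, sigma, obs, obs_idx)
# x[obs_idx] is now close to obs; unobserved points follow the prior
```

In the real method the misfit gradient is propagated through the denoiser and observation operator; the direct nudge above only conveys the shape of the loop.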
| 15 | + |
## Problem Overview

Data assimilation (DA) is the process of incorporating observational data into the full
model state. For numerical weather forecasting models, data assimilation is used to
incorporate weather station data into the initial model state. This can be a costly
process, requiring multiple evaluations of the computationally expensive forecast model
to adjust the model state.
| 22 | + |
Score-based data assimilation (SDA), proposed by [Rozet and Louppe](https://arxiv.org/abs/2306.10574),
adapts the inference procedure of an unconditional generative diffusion model to
generate model states conditional on observations, without retraining the model. We
apply this in the context of a complex regional km-scale weather model by training an
unconditional diffusion model on the output of the
[High Resolution Rapid Refresh (HRRR)](https://rapidrefresh.noaa.gov/hrrr/) reanalysis
product. Using SDA, we then incorporate sparse weather station data from the
[Integrated Surface Database (ISD)](https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database)
into the inference phase to produce maps of precipitation and surface winds.
| 32 | + |
Preliminary skill analysis shows the approach already outperforms a naive baseline of the
High-Resolution Rapid Refresh system itself: incorporating observations from 40 weather
stations yields 10% lower RMSEs on left-out stations. Despite some lingering
imperfections, such as insufficiently dispersed ensemble DA estimates, we find the
results overall an encouraging proof of concept, and the first at km-scale. It is a ripe
time to explore extensions that combine increasingly ambitious regional state generators
with a growing set of in situ, ground-based, and satellite remote sensing data streams.
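For intuition, a left-out-station RMSE of this kind samples the gridded field at each held-out station's grid cell and compares against the point observations. The sketch below uses synthetic data and hypothetical names; it is not the repository's scoring code:

```python
import numpy as np

def station_rmse(field, station_yx, station_obs):
    """RMSE between a gridded field and point observations, sampling the
    field at each station's (y, x) grid cell."""
    pred = field[station_yx[:, 0], station_yx[:, 1]]
    return float(np.sqrt(np.mean((pred - station_obs) ** 2)))

# Toy example: a 128x128 field scored against three held-out stations.
rng = np.random.default_rng(0)
field = rng.normal(size=(128, 128))
held_out_yx = np.array([[10, 20], [64, 64], [100, 5]])
obs = field[held_out_yx[:, 0], held_out_yx[:, 1]] + 0.1  # offset by 0.1
rmse = station_rmse(field, held_out_yx, obs)  # ~0.1, the injected offset
```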
| 40 | + |
## Project Structure

The project is organized into the following directories:

- `obs/`: Notebooks for ISD data processing and comparison to HRRR data
- `paper_figures/`: Scripts to reproduce the figures from the paper
- `sda/`: Code from the original SDA paper with minor adaptations
- `training/`: Scripts for training the model
## Getting Started

### Installation

Install the required packages:

```bash
pip install -r requirements.txt
```
| 59 | + |
### Configuration Files

Machine-specific configuration files can be created in the `configs` directory,
defining the following variables:

```python
isd_path = "<path to isd data>"
path_to_pretrained = "<path to the pretrained model>"
path_to_model_state = "<path to model state from a training checkpoint>"
path_to_hrrr = "<path to Zarr file containing 2017 HRRR data>"
station_locations = "<path to station_locations_on_grid.nc generated by preprocess_isd.py>"
path_to_isp = "<path to ISD csv data>"
val_station_path = "<path to validation station locations generated by val_stations.py>"
```

See `configs/base.py` for an example configuration. Both the `station_locations` and
`val_station_path` files are checked into the repository for simplicity.
### Data

#### High-Resolution Rapid Refresh (HRRR)

HRRR data is used for training and model evaluation. Data from 2018-2020 is used for
training, and 2017 is used for model evaluation. The model is trained on a 128x128
section of Oklahoma, offset by (834, 353), using the following channels:

- 10 metre U wind component (`10u`)
- 10 metre V wind component (`10v`)
- Total precipitation (`tp`)

The training and inference scripts require the data to be converted into a Zarr format
before use, with one file per year of data. Example dimensions for `2017.zarr` are shown
below:

```
<xarray.Dataset> Size: 3GB
Dimensions:   (time: 8760, channel: 6, y: 128, x: 128)
Coordinates:
  * channel   (channel) object 48B '10u' '10v' 'gust' 'tp' 'sp' 'refc'
  * time      (time) datetime64[ns] 70kB 2017-01-01T01:00:00 ... 2018-01-01
Dimensions without coordinates: y, x
Data variables:
    HRRR      (time, channel, y, x) float32 3GB dask.array<chunksize=(1, 6, 128, 128), meta=np.ndarray>
    latitude  (y, x) float32 66kB dask.array<chunksize=(128, 128), meta=np.ndarray>
```
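Assuming the layout above, the store can be opened with `xarray.open_zarr` and the training channels selected by label. The sketch below builds a small synthetic stand-in with the same structure rather than reading a real file:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same layout as 2017.zarr; the real store
# would be opened with xr.open_zarr(path_to_hrrr) instead.
channels = ["10u", "10v", "gust", "tp", "sp", "refc"]
times = np.datetime64("2017-01-01T01:00") + np.arange(24) * np.timedelta64(1, "h")
ds = xr.Dataset(
    {"HRRR": (("time", "channel", "y", "x"),
              np.zeros((times.size, len(channels), 128, 128), dtype=np.float32))},
    coords={"time": times, "channel": channels},
)

# Select the three channels the model is trained on.
train = ds["HRRR"].sel(channel=["10u", "10v", "tp"])
# train.shape == (24, 3, 128, 128)
```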
| 104 | + |
#### Integrated Surface Database (ISD)

ISD data is used for inference. It can be obtained from the
[NOAA Data Search](https://www.ncei.noaa.gov/access/search/data-search/global-hourly?bbox=37.197,-99.640,33.738,-95.393&pageNum=1&startDate=2017-01-01T23:59:59&endDate=2022-01-01T00:00:00&dataTypes=AA1&dataTypes=WND)
and downloaded in CSV format.

To download multiple stations:

1. Click "+Select All > Proceed to Cart"
2. Enter your email address
3. Hit submit
4. You will receive an email with a download link

The data then needs to be gridded to the model grid and interpolated to a common time
frequency. This is done using [obs/preprocess_isd.py](obs/preprocess_isd.py).
### Training

Training scripts are in the `training` directory. Configuration is handled via the YAML
files in the `training/config` directory. For example:

```bash
cd training
python3 train_diffusions.py \
    --outdir /expts \
    --tick 100 \
    --config_file ./config/hrrr.yaml \
    --config_name unconditional_diffusion_downscaling_a2s_v3_1_oklahoma \
    --log_to_wandb True \
    --run_id 0
```

With `log_to_wandb=True`, you will need to specify your `wandb` entity and project in
`train_diffusions.py`.
| 139 | + |
### Inference

For an example of running inference on both subsampled HRRR and ISD data, run the
[example_inference.py](example_inference.py) script. It requires the ISD and HRRR data,
as well as the state file of the trained model.

To run a full inference of the model across an entire year, use the `full_inference.py`
script. The output filename is specified within that script.

The code to reproduce the paper figures is in the [paper_figures/](paper_figures/)
directory. To score the output of `full_inference.py`, use the `score_inference.py`
script:

```bash
cd paper_figures
python score_inference.py figure_data/scores/<_my_regen_model_name>
python score_inference.py -g truth figure_data/scores/hrrr
```

The aggregated scoring metrics (Fig. 7) can be plotted by specifying the
`scoring_target` in `figure-07/fig_metrics.py`; running it will generate PDFs of
the figures.
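As a pointer to what ensemble metrics of this kind look like, a standard CRPS estimator, which rewards both accuracy and calibrated spread and therefore penalizes under-dispersed ensembles, can be written as follows. This is a generic sketch, not necessarily the estimator used in `score_inference.py`:

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS estimator E|X - y| - 0.5 E|X - X'| for an ensemble `ens`
    and scalar observation `obs` (smaller is better)."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(np.abs(ens - obs))                       # accuracy
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))  # spread
    return term1 - term2

# A sharp, accurate ensemble scores better than a biased one.
rng = np.random.default_rng(0)
good = rng.normal(0.0, 0.5, size=100)    # centered on the observation
biased = rng.normal(2.0, 0.5, size=100)  # offset by 2
# crps_ensemble(good, 0.0) < crps_ensemble(biased, 0.0)
```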