
Commit e5c988c

Parallel Computations (#39)
* improve assignment table functions (#38)
* update assign logging and force dtypes before merging
* new parallel assignment by propagation functions
* map propagate and resolve propagate functions
* cache prop tables, add docstrings and todos
* copy rows in prop for speed and lower memory req
* correct false checks for empty rows and naming in assign by clusters
* correct bugs in df filters, map assign ungauged
* revise gis generating functions for new column names, logging
* increment version number
1 parent af0245d commit e5c988c

22 files changed (+453 / -438 lines)

docs/api/assign.md (+3)

@@ -0,0 +1,3 @@
+# `saber.assign`
+
+::: saber.assign

docs/api/cluster.md (+3)

@@ -0,0 +1,3 @@
+# `saber.cluster`
+
+::: saber.cluster

docs/api/gis.md (+3)

@@ -0,0 +1,3 @@
+# `saber.gis`
+
+::: saber.gis

docs/api/index.md (+7 / -1)

@@ -1 +1,7 @@
-# API Documentation
+# `saber-hbc` API
+
+* [`saber.assign`](assign.md)
+* [`saber.cluster`](cluster.md)
+* [`saber.gis`](gis.md)
+* [`saber.prep`](prep.md)
+* [`saber.validate`](validate.md)

docs/api/prep.md (+3)

@@ -0,0 +1,3 @@
+# `saber.prep`
+
+::: saber.prep

docs/api/validate.md (+3)

@@ -0,0 +1,3 @@
+# `saber.validate`
+
+::: saber.validate

docs/data/discharge-data.md (+41)

@@ -0,0 +1,41 @@
+# Required Hydrological Datasets
+
+1. Hindcast/Retrospective discharge for every stream segment (reporting point) in the model. This is a time series of
+   discharge, e.g. hydrograph, for each stream segment. The data should be saved in parquet format and named
+   `hindcast_series_table.parquet`. The DataFrame should have:
+    1. An index named `datetime` of type `datetime`. Contains the datetime stamp for the simulated values (rows)
+    2. 1 column per stream, column name is the stream's model ID and is type string, containing the discharge for each
+       time step.
+2. Observed discharge data for each gauge. 1 file per gauge named `{gauge_id}.csv`. The DataFrame should have:
+    1. `datetime`: The datetime stamp for the measurements
+    2. A column whose name is the unique `gauge_id` containing the discharge for each time step.
+
+The `hindcast_series_table.parquet` should look like this:
+
+| datetime   | model_id_1 | model_id_2 | model_id_3 | ... |
+|------------|------------|------------|------------|-----|
+| 1985-01-01 | 50         | 50         | 50         | ... |
+| 1985-01-02 | 60         | 60         | 60         | ... |
+| 1985-01-03 | 70         | 70         | 70         | ... |
+| ...        | ...        | ...        | ...        | ... |
+
+Each gauge's csv file should look like this:
+
+| datetime   | discharge |
+|------------|-----------|
+| 1985-01-01 | 50        |
+| 1985-01-02 | 60        |
+| 1985-01-03 | 70        |
+| ...        | ...       |
+
+## Things to check
+
+Be sure that both datasets:
+
+- Are in the same units (e.g. m3/s)
+- Are in the same time zone (e.g. UTC)
+- Are in the same time step (e.g. daily average)
+- Do not contain any non-numeric values (e.g. ICE, none, etc.)
+- Do not contain rows with missing values (e.g. NaN or blank cells)
+- Have been cleaned of any incorrect values (e.g. no negative values)
+- Do not contain any duplicate rows
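
Editor's note: the following is a minimal `pandas` sketch of how these two inputs might be written and sanity-checked. It is illustrative only and not part of the commit; the example model IDs, gauge ID, and the `tables/` output location are assumptions based on the docs above.

```python
import pandas as pd

# Illustrative simulated-discharge DataFrame: a `datetime` index and one column per
# stream, where each column name is the stream's model_id as a string
sim_df = pd.DataFrame(
    {'101': [50.0, 60.0, 70.0], '102': [50.0, 60.0, 70.0]},
    index=pd.to_datetime(['1985-01-01', '1985-01-02', '1985-01-03']),
)
sim_df.index.name = 'datetime'
sim_df.columns = sim_df.columns.astype(str)

# Checks mirroring the "Things to check" list: no missing values, no duplicate rows, no negative flows
assert not sim_df.isna().any().any(), 'missing values present'
assert not sim_df.index.duplicated().any(), 'duplicate timestamps present'
assert (sim_df >= 0).all().all(), 'negative discharge values present'

sim_df.to_parquet('tables/hindcast_series_table.parquet')

# One observed-discharge csv per gauge, named {gauge_id}.csv
gauge_id = 'gauge_01'  # illustrative gauge ID
obs_df = pd.DataFrame({'datetime': pd.to_datetime(['1985-01-01', '1985-01-02']),
                       gauge_id: [48.0, 61.0]})
obs_df.to_csv(f'{gauge_id}.csv', index=False)
```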

docs/data/gis-data.md (+46)

@@ -0,0 +1,46 @@
+# Required GIS Datasets
+
+1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
+   for each feature:
+    - `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
+    - `downstream_model_id`: The ID of the next downstream reach
+    - `strahler_order`: The strahler stream order of each reach
+    - `model_drain_area`: Cumulative upstream drainage area
+    - `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
+    - `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)
+
+2. Points representing the location of each of the river gauging station available with at least the
+   following attributes (columns) for each feature:
+    - `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
+    - `model_id`: The ID of the stream segment which corresponds to that gauge.
+
+The `drain_table.parquet` should look like this:
+
+| downstream_model_id | model_id        | model_area   | strahler_order | x   | y   |
+|---------------------|-----------------|--------------|----------------|-----|-----|
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| ...                 | ...             | ...          | ...            | ... | ... |
+
+The `gauge_table.parquet` should look like this:
+
+| model_id          | gauge_id         | gauge_area   |
+|-------------------|------------------|--------------|
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| ...               | ...              | ...          |
+
+
+## Things to check
+
+Be sure that both datasets:
+
+- Are in the same projected coordinate system
+- Only contain gauges and reaches within the area of interest. Clip/delete anything else.
+
+Other things to consider:
+
+- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
+  visualization purposes.
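
Editor's note: for reference, here is a minimal `geopandas` sketch of building `drain_table.parquet` from a drainage-lines layer with the columns listed above. It is not the package's own `saber.prep` code; the input file name is illustrative, and it assumes the layer's attributes have already been renamed to the documented column names.

```python
import geopandas as gpd
import pandas as pd

# Illustrative input; any vector format readable by geopandas (shapefile, geopackage, ...) works
drain_gdf = gpd.read_file('drainage_lines.gpkg')

# Precompute centroid coordinates so later steps do not need the geometry column
drain_gdf['x'] = drain_gdf.geometry.centroid.x
drain_gdf['y'] = drain_gdf.geometry.centroid.y

# Keep only the documented attribute columns (dropping the geometry) and save the table
cols = ['model_id', 'downstream_model_id', 'strahler_order', 'model_drain_area', 'x', 'y']
pd.DataFrame(drain_gdf[cols]).to_parquet('tables/drain_table.parquet')
```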

docs/data/index.md (+5 / -53)

@@ -1,55 +1,13 @@
 # Required Datasets
 
-## GIS Datasets
+SABER requires [GIS Datasets](./gis-data.md) and [Hydrological Datasets](./discharge-data.md).
 
-1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
-   for each feature:
-    - `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
-    - `downstream_model_id`: The ID of the next downstream reach
-    - `strahler_order`: The strahler stream order of each reach
-    - `model_drain_area`: Cumulative upstream drainage area
-    - `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
-    - `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)
-2. Points representing the location of each of the river gauging station available with at least the
-   following attributes (columns) for each feature:
-    - `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
-    - `model_id`: The ID of the stream segment which corresponds to that gauge.
+These datasets ***need to be prepared independently before using `saber-hbc` functions***. You should organize the datasets in a working
+directory that contains 3 subdirectories, as shown below. SABER will expect your inputs to be in the `tables` directory
+with the correct names and will generate many files to populate the `gis` and `clusters` directories.
 
-Be sure that both datasets:
+Example project directory structure:
 
-- Are in the same projected coordinate system
-- Only contain gauges and reaches within the area of interest. Clip/delete anything else.
-
-Other things to consider:
-
-- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
-  visualization purposes.
-
-## Hydrological Datasets
-
-1. Hindcast/Retrospective/Historical Simulation for every stream segment (reporting point) in the model. This is a time
-   series of discharge (Q) for each stream segment. The data should be in a tabular format that can be read by `pandas`.
-   The data should have two columns:
-    1. `datetime`: The datetime stamp for the measurements
-    2. A column whose name is the unique `model_id` containing the discharge for each time step.
-2. Observed discharge data for each gauge
-    1. `datetime`: The datetime stamp for the measurements
-    2. A column whose name is the unique `gauge_id` containing the discharge for each time step.
-
-Be sure that both datasets:
-
-- Are in the same units (e.g. m3/s)
-- Are in the same time zone (e.g. UTC)
-- Are in the same time step (e.g. daily average)
-- Do not contain any non-numeric values (e.g. ICE, none, etc.)
-- Do not contain rows with missing values (e.g. NaN or blank cells)
-- Have been cleaned of any incorrect values (e.g. no negative values)
-- Do not contain any duplicate rows
-
-## Working Directory
-
-SABER is designed to read and write many files in a working directory.
-
     tables/
         # This directory contains all the input datasets
        drain_table.parquet
@@ -64,9 +22,3 @@ SABER is designed to read and write many files in a working directory.
     gis/
         # this directory contains outputs from the SABER commands
         ...
-
-`drain_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
-`saber.prep.gis_tables()`.
-
-`gauge_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
-`saber.prep.gis_tables()`.
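
Editor's note: a small sketch of creating that layout with `pathlib`. The subdirectory names come from the text above; the working-directory path itself is just an example, not something the commit prescribes.

```python
from pathlib import Path

# Illustrative working directory; use any location you like
workdir = Path('saber_workdir')

# Inputs go in tables/; SABER populates gis/ and clusters/ with its outputs
for subdir in ('tables', 'gis', 'clusters'):
    (workdir / subdir).mkdir(parents=True, exist_ok=True)
```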

docs/requirements.txt (+2 / -1)

@@ -1,2 +1,3 @@
 mkdocs==1.3
-mkdocs-material==8.4
+mkdocs-material==8.4
+mkdocstrings-python==0.7.1

docs/user-guide/data_preparation.md (+18 / -68)

@@ -1,80 +1,30 @@
-# Prepare Spatial Data (scripts not provided)
-This step instructs you to collect 3 gis files and use them to generate 2 tables. All 5 files (3 gis files and 2
-tables) should go in the `gis_inputs` directory
+# Processing Input Data
 
-1. Clip model drainage lines and catchments shapefile to extents of the region of interest.
-   For speed/efficiency, merge their attribute tables and save as a csv.
-    - read drainage line shapefile and with GeoPandas
-    - delete all columns ***except***: NextDownID, COMID, Tot_Drain_, order_
-    - rename the columns:
-        - NextDownID -> downstream_model_id
-        - COMID -> model_id
-        - Tot_Drain -> drainage_area
-        - order_ -> stream_order
-    - compute the x and y coordinates of the centroid of each feature (needs the geometry column)
-    - delete geometry column
-    - save as `drain_table.csv` in the `gis_inputs` directory
+Before following these steps, you should have prepared the required datasets and organized them in a working directory.
+Refer to the [Required Datasets](../data/index.md) page for more information.
 
-Tip to compute the x and y coordinates using geopandas
+***Prereqs:***
 
+1. Create a working directory and subdirectories
+2. Prepare the `drain_table` and `gauge_table` files.
+3. Prepare the `hindcast_series_table` file.
 
-Your table should look like this:
+## Prepare Flow Duration Curve Data
 
-| downstream_model_id | model_id        | model_drain_area | stream_order | x   | y   |
-|---------------------|-----------------|------------------|--------------|-----|-----|
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| ...                 | ...             | ...              | ...          | ... | ... |
-
-1. Prepare a csv of the attribute table of the gauge locations shapefile.
-    - You need the columns:
-        - model_id
-        - gauge_id
-        - drainage_area (if known)
-
-Your table should look like this (column order is irrelevant):
-
-| model_id          | gauge_drain_area | gauge_id         |
-|-------------------|------------------|------------------|
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| ...               | ...              | ...              |
-
-# Prepare Discharge Data
-
-This step instructs you to gather simulated data and observed data. The raw simulated data (netCDF) and raw observed
-data (csvs) should be included in the `data_inputs` folder. You may keep them in another location and provide the path
-as an argument in the functions that need it. These datasets are used to generate several additional csv files which
-are stored in the `data_processed` directory and are used in later steps. The netCDF file may have any name and the
-directory of observed data csvs should be called `obs_csvs`.
-
-Use the dat
-
-1. Create a single large csv of the historical simulation data with a datetime column and 1 column per stream segment labeled by the stream's ID number.
-
-| datetime   | model_id_1 | model_id_2 | model_id_3 |
-|------------|------------|------------|------------|
-| 1979-01-01 | 50         | 50         | 50         |
-| 1979-01-02 | 60         | 60         | 60         |
-| 1979-01-03 | 70         | 70         | 70         |
-| ...        | ...        | ...        | ...        |
-
-2. Process the large simulated discharge csv to create a 2nd csv with the flow duration curve on each segment (script provided).
+Process the `hindcast_series_table` to create a 2nd table with the flow duration curve on each segment.
 
 | p_exceed | model_id_1 | model_id_2 | model_id_3 |
 |----------|------------|------------|------------|
 | 100      | 0          | 0          | 0          |
-| 99       | 10         | 10         | 10         |
-| 98       | 20         | 20         | 20         |
+| 97.5     | 10         | 10         | 10         |
+| 95       | 20         | 20         | 20         |
 | ...      | ...        | ...        | ...        |
 
-3. Process the large historical discharge csv to create a 3rd csv with the monthly averages on each segment (script provided).
+Then process the FDC data to create a 3rd table with scaled/transformed FDC data for each segment.
 
-| month | model_id_1 | model_id_2 | model_id_3 |
-|-------|------------|------------|------------|
-| 1     | 60         | 60         | 60         |
-| 2     | 30         | 30         | 30         |
-| 3     | 70         | 70         | 70         |
-| ...   | ...        | ...        | ...        |
+| model_id | Q100 | Q97.5 | Q95 |
+|----------|------|-------|-----|
+| 1        | 60   | 50    | 40  |
+| 2        | 60   | 50    | 40  |
+| 3        | 60   | 50    | 40  |
+| ...      | ...  | ...   | ... |
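
Editor's note: as a rough illustration of the flow-duration-curve step described above, here is a `pandas`/`numpy` sketch. The exact set of exceedance probabilities and the output file name are assumptions made for this example, not values taken from the `saber` source.

```python
import numpy as np
import pandas as pd

# Simulated discharge: datetime index, one column per model_id
hindcast_df = pd.read_parquet('tables/hindcast_series_table.parquet')

# Illustrative exceedance probabilities; the real workflow chooses its own set
p_exceed = np.array([100, 97.5, 95, 90, 75, 50, 25, 10, 5, 2.5, 0])

# The flow exceeded p% of the time is the (100 - p)th percentile of the record
fdc_df = pd.DataFrame(
    np.nanpercentile(hindcast_df.values, 100 - p_exceed, axis=0),
    index=pd.Index(p_exceed, name='p_exceed'),
    columns=hindcast_df.columns,
)

fdc_df.to_parquet('tables/fdc_table.parquet')  # hypothetical output name
```

The third table (scaled/transformed FDC data per segment) would then be derived from this one, e.g. transposed so each row is a `model_id` with `Q100`, `Q97.5`, `Q95`, ... columns as in the table above.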

docs/user-guide/index.md (+6 / -1)

@@ -1,6 +1,8 @@
 # User Guide
 
-We anticipate the primary usage of `saber-hbc` will be in scripts or workflows that process data in isolated environments,
+While following this guide, you may also want to refer to the [API Documentation](../api).
+
+We anticipate the primary usage of `saber` will be in scripts or workflows that process data in isolated environments,
 such as web servers or interactively in notebooks, rather than using the api in an app. The package's API is designed with
 many modular, compartmentalized functions intending to create flexibility for running specific portions of the SABER process
 or repeating certain parts if workflows fail or parameters need to be adjusted.
@@ -20,3 +22,6 @@ logging.basicConfig(
     format='%(asctime)s: %(name)s - %(message)s'
 )
 ```
+
+## Example Script

docs/user-guide/validation.md (+1 / -1)

@@ -27,4 +27,4 @@ obs_data_dir = '/path/to/obs/data/directory'  # optional - if data not in workdir
 
 saber.validate.sample_gauges(workdir)
 saber.validate.run_series(workdir, drain_shape, obs_data_dir)
-```
+```

mkdocs.yml (+15 / -2)

@@ -7,7 +7,10 @@ repo_url: https://github.com/rileyhales/saber-hbc/
 theme: material
 nav:
   - Home: index.md
-  - Required Datasets: data/index.md
+  - Required Datasets:
+      - Summary: data/index.md
+      - GIS Datasets: data/gis-data.md
+      - Discharge Datasets: data/discharge-data.md
   - User Guide:
       - Using SABER: user-guide/index.md
       - Data Preparation: user-guide/data_preparation.md
@@ -17,5 +20,15 @@ nav:
       - Bias Correction: user-guide/bias_correction.md
       - Validation: user-guide/validation.md
   - Demonstration: demo/index.md
-  - API Docs: api/index.md
+  - API Docs:
+      - API Reference: api/index.md
+      - saber.prep: api/prep.md
+      - saber.cluster: api/cluster.md
+      - saber.assign: api/assign.md
+      - saber.gis: api/gis.md
+      - saber.validate: api/validate.md
   - Cite SABER: cite/index.md
+
+plugins:
+  - search
+  - mkdocstrings

saber/__init__.py (+1 / -1)

@@ -14,5 +14,5 @@
 ]
 
 __author__ = 'Riley C. Hales'
-__version__ = '0.5.0'
+__version__ = '0.6.0'
 __license__ = 'BSD 3 Clause Clear'
