
Commit e5c988c

Parallel Computations (#39)
* improve assignment table functions (#38)
* update assign logging and force dtypes before merging
* new parallel assignment by propagation functions
* map propagate and resolve propagate functions
* cache prop tables, add docstrings and todos
* copy rows in prop for speed and lower memory req
* correct false checks for empty rows and naming in assign by clusters
* correct bugs in df filters, map assign ungauged
* revise gis generating functions for new column names, logging
* increment version number
1 parent af0245d commit e5c988c

22 files changed (+453 / -438 lines)

docs/api/assign.md (+3)

@@ -0,0 +1,3 @@
+# `saber.assign`
+
+::: saber.assign

docs/api/cluster.md (+3)

@@ -0,0 +1,3 @@
+# `saber.cluster`
+
+::: saber.cluster

docs/api/gis.md (+3)

@@ -0,0 +1,3 @@
+# `saber.gis`
+
+::: saber.gis

docs/api/index.md (+7 / -1)

@@ -1 +1,7 @@
-# API Documentation
+# `saber-hbc` API
+
+* [`saber.assign`](assign.md)
+* [`saber.cluster`](cluster.md)
+* [`saber.gis`](gis.md)
+* [`saber.prep`](prep.md)
+* [`saber.validate`](validate.md)

docs/api/prep.md (+3)

@@ -0,0 +1,3 @@
+# `saber.prep`
+
+::: saber.prep

docs/api/validate.md (+3)

@@ -0,0 +1,3 @@
+# `saber.validate`
+
+::: saber.validate

docs/data/discharge-data.md (+41)

@@ -0,0 +1,41 @@
+# Required Hydrological Datasets
+
+1. Hindcast/Retrospective discharge for every stream segment (reporting point) in the model. This is a time series of
+   discharge, e.g. hydrograph, for each stream segment. The data should be saved in parquet format and named
+   `hindcast_series_table.parquet`. The DataFrame should have:
+    1. An index named `datetime` of type `datetime`. Contains the datetime stamp for the simulated values (rows)
+    2. 1 column per stream, column name is the stream's model ID and is type string, containing the discharge for each
+       time step.
+2. Observed discharge data for each gauge. 1 file per gauge named `{gauge_id}.csv`. The DataFrame should have:
+    1. `datetime`: The datetime stamp for the measurements
+    2. A column whose name is the unique `gauge_id` containing the discharge for each time step.
+
+The `hindcast_series_table.parquet` should look like this:
+
+| datetime   | model_id_1 | model_id_2 | model_id_3 | ... |
+|------------|------------|------------|------------|-----|
+| 1985-01-01 | 50         | 50         | 50         | ... |
+| 1985-01-02 | 60         | 60         | 60         | ... |
+| 1985-01-03 | 70         | 70         | 70         | ... |
+| ...        | ...        | ...        | ...        | ... |
+
+Each gauge's csv file should look like this:
+
+| datetime   | discharge |
+|------------|-----------|
+| 1985-01-01 | 50        |
+| 1985-01-02 | 60        |
+| 1985-01-03 | 70        |
+| ...        | ...       |
+
+## Things to check
+
+Be sure that both datasets:
+
+- Are in the same units (e.g. m3/s)
+- Are in the same time zone (e.g. UTC)
+- Are in the same time step (e.g. daily average)
+- Do not contain any non-numeric values (e.g. ICE, none, etc.)
+- Do not contain rows with missing values (e.g. NaN or blank cells)
+- Have been cleaned of any incorrect values (e.g. no negative values)
+- Do not contain any duplicate rows
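
Editor's note: the following is a minimal `pandas` sketch of how these two inputs might be written and sanity-checked. It is illustrative only and not part of the commit; the example model IDs, gauge ID, and the `tables/` output location are assumptions based on the docs above.

```python
import pandas as pd

# Illustrative simulated-discharge DataFrame: a `datetime` index and one column per
# stream, where each column name is the stream's model_id as a string
sim_df = pd.DataFrame(
    {'101': [50.0, 60.0, 70.0], '102': [50.0, 60.0, 70.0]},
    index=pd.to_datetime(['1985-01-01', '1985-01-02', '1985-01-03']),
)
sim_df.index.name = 'datetime'
sim_df.columns = sim_df.columns.astype(str)

# Checks mirroring the "Things to check" list: no missing values, no duplicate rows, no negative flows
assert not sim_df.isna().any().any(), 'missing values present'
assert not sim_df.index.duplicated().any(), 'duplicate timestamps present'
assert (sim_df >= 0).all().all(), 'negative discharge values present'

sim_df.to_parquet('tables/hindcast_series_table.parquet')

# One observed-discharge csv per gauge, named {gauge_id}.csv
gauge_id = 'gauge_01'  # illustrative gauge ID
obs_df = pd.DataFrame({'datetime': pd.to_datetime(['1985-01-01', '1985-01-02']),
                       gauge_id: [48.0, 61.0]})
obs_df.to_csv(f'{gauge_id}.csv', index=False)
```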

docs/data/gis-data.md (+46)

@@ -0,0 +1,46 @@
+# Required GIS Datasets
+
+1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
+   for each feature:
+    - `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
+    - `downstream_model_id`: The ID of the next downstream reach
+    - `strahler_order`: The strahler stream order of each reach
+    - `model_drain_area`: Cumulative upstream drainage area
+    - `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
+    - `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)
+
+2. Points representing the location of each of the river gauging station available with at least the
+   following attributes (columns) for each feature:
+    - `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
+    - `model_id`: The ID of the stream segment which corresponds to that gauge.
+
+The `drain_table.parquet` should look like this:
+
+| downstream_model_id | model_id        | model_area   | strahler_order | x   | y   |
+|---------------------|-----------------|--------------|----------------|-----|-----|
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| unique_stream_#     | unique_stream_# | area in km^2 | stream_order   | ##  | ##  |
+| ...                 | ...             | ...          | ...            | ... | ... |
+
+The `gauge_table.parquet` should look like this:
+
+| model_id          | gauge_id         | gauge_area   |
+|-------------------|------------------|--------------|
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| unique_stream_num | unique_gauge_num | area in km^2 |
+| ...               | ...              | ...          |
+
+
+## Things to check
+
+Be sure that both datasets:
+
+- Are in the same projected coordinate system
+- Only contain gauges and reaches within the area of interest. Clip/delete anything else.
+
+Other things to consider:
+
+- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
+  visualization purposes.
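
Editor's note: for reference, here is a minimal `geopandas` sketch of building `drain_table.parquet` from a drainage-lines layer with the columns listed above. It is not the package's own `saber.prep` code; the input file name is illustrative, and it assumes the layer's attributes have already been renamed to the documented column names.

```python
import geopandas as gpd
import pandas as pd

# Illustrative input; any vector format readable by geopandas (shapefile, geopackage, ...) works
drain_gdf = gpd.read_file('drainage_lines.gpkg')

# Precompute centroid coordinates so later steps do not need the geometry column
drain_gdf['x'] = drain_gdf.geometry.centroid.x
drain_gdf['y'] = drain_gdf.geometry.centroid.y

# Keep only the documented attribute columns (dropping the geometry) and save the table
cols = ['model_id', 'downstream_model_id', 'strahler_order', 'model_drain_area', 'x', 'y']
pd.DataFrame(drain_gdf[cols]).to_parquet('tables/drain_table.parquet')
```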

docs/data/index.md (+5 / -53)

@@ -1,55 +1,13 @@
 # Required Datasets
 
-## GIS Datasets
+SABER requires [GIS Datasets](./gis-data.md) and [Hydrological Datasets](./discharge-data.md).
 
-1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
-   for each feature:
-    - `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
-    - `downstream_model_id`: The ID of the next downstream reach
-    - `strahler_order`: The strahler stream order of each reach
-    - `model_drain_area`: Cumulative upstream drainage area
-    - `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
-    - `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)
-2. Points representing the location of each of the river gauging station available with at least the
-   following attributes (columns) for each feature:
-    - `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
-    - `model_id`: The ID of the stream segment which corresponds to that gauge.
+These datasets ***need to be prepared independently before using `saber-hbc` functions***. You should organize the datasets in a working
+directory that contains 3 subdirectories, as shown below. SABER will expect your inputs to be in the `tables` directory
+with the correct names and will generate many files to populate the `gis` and `clusters` directories.
 
-Be sure that both datasets:
+Example project directory structure:
 
-- Are in the same projected coordinate system
-- Only contain gauges and reaches within the area of interest. Clip/delete anything else.
-
-Other things to consider:
-
-- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
-  visualization purposes.
-
-## Hydrological Datasets
-
-1. Hindcast/Retrospective/Historical Simulation for every stream segment (reporting point) in the model. This is a time
-   series of discharge (Q) for each stream segment. The data should be in a tabular format that can be read by `pandas`.
-   The data should have two columns:
-    1. `datetime`: The datetime stamp for the measurements
-    2. A column whose name is the unique `model_id` containing the discharge for each time step.
-2. Observed discharge data for each gauge
-    1. `datetime`: The datetime stamp for the measurements
-    2. A column whose name is the unique `gauge_id` containing the discharge for each time step.
-
-Be sure that both datasets:
-
-- Are in the same units (e.g. m3/s)
-- Are in the same time zone (e.g. UTC)
-- Are in the same time step (e.g. daily average)
-- Do not contain any non-numeric values (e.g. ICE, none, etc.)
-- Do not contain rows with missing values (e.g. NaN or blank cells)
-- Have been cleaned of any incorrect values (e.g. no negative values)
-- Do not contain any duplicate rows
-
-## Working Directory
-
-SABER is designed to read and write many files in a working directory.
-
     tables/
         # This directory contains all the input datasets
        drain_table.parquet
@@ -64,9 +22,3 @@ SABER is designed to read and write many files in a working directory.
     gis/
         # this directory contains outputs from the SABER commands
         ...
-
-`drain_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
-`saber.prep.gis_tables()`.
-
-`gauge_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
-`saber.prep.gis_tables()`.
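
Editor's note: a small sketch of creating that layout with `pathlib`. The subdirectory names come from the text above; the working-directory path itself is just an example, not something the commit prescribes.

```python
from pathlib import Path

# Illustrative working directory; use any location you like
workdir = Path('saber_workdir')

# Inputs go in tables/; SABER populates gis/ and clusters/ with its outputs
for subdir in ('tables', 'gis', 'clusters'):
    (workdir / subdir).mkdir(parents=True, exist_ok=True)
```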

docs/requirements.txt (+2 / -1)

@@ -1,2 +1,3 @@
 mkdocs==1.3
-mkdocs-material==8.4
+mkdocs-material==8.4
+mkdocstrings-python==0.7.1

docs/user-guide/data_preparation.md (+18 / -68)

@@ -1,80 +1,30 @@
-# Prepare Spatial Data (scripts not provided)
-This step instructs you to collect 3 gis files and use them to generate 2 tables. All 5 files (3 gis files and 2
-tables) should go in the `gis_inputs` directory
+# Processing Input Data
 
-1. Clip model drainage lines and catchments shapefile to extents of the region of interest.
-   For speed/efficiency, merge their attribute tables and save as a csv.
-    - read drainage line shapefile and with GeoPandas
-    - delete all columns ***except***: NextDownID, COMID, Tot_Drain_, order_
-    - rename the columns:
-        - NextDownID -> downstream_model_id
-        - COMID -> model_id
-        - Tot_Drain -> drainage_area
-        - order_ -> stream_order
-    - compute the x and y coordinates of the centroid of each feature (needs the geometry column)
-    - delete geometry column
-    - save as `drain_table.csv` in the `gis_inputs` directory
+Before following these steps, you should have prepared the required datasets and organized them in a working directory.
+Refer to the [Required Datasets](../data/index.md) page for more information.
 
-Tip to compute the x and y coordinates using geopandas
+***Prereqs:***
 
+1. Create a working directory and subdirectories
+2. Prepare the `drain_table` and `gauge_table` files.
+3. Prepare the `hindcast_series_table` file.
 
-Your table should look like this:
+## Prepare Flow Duration Curve Data
 
-| downstream_model_id | model_id        | model_drain_area | stream_order | x   | y   |
-|---------------------|-----------------|------------------|--------------|-----|-----|
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| unique_stream_#     | unique_stream_# | area in km^2     | stream_order | ##  | ##  |
-| ...                 | ...             | ...              | ...          | ... | ... |
-
-1. Prepare a csv of the attribute table of the gauge locations shapefile.
-    - You need the columns:
-        - model_id
-        - gauge_id
-        - drainage_area (if known)
-
-Your table should look like this (column order is irrelevant):
-
-| model_id          | gauge_drain_area | gauge_id         |
-|-------------------|------------------|------------------|
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| unique_stream_num | area in km^2     | unique_gauge_num |
-| ...               | ...              | ...              |
-
-# Prepare Discharge Data
-
-This step instructs you to gather simulated data and observed data. The raw simulated data (netCDF) and raw observed
-data (csvs) should be included in the `data_inputs` folder. You may keep them in another location and provide the path
-as an argument in the functions that need it. These datasets are used to generate several additional csv files which
-are stored in the `data_processed` directory and are used in later steps. The netCDF file may have any name and the
-directory of observed data csvs should be called `obs_csvs`.
-
-Use the dat
-
-1. Create a single large csv of the historical simulation data with a datetime column and 1 column per stream segment labeled by the stream's ID number.
-
-| datetime   | model_id_1 | model_id_2 | model_id_3 |
-|------------|------------|------------|------------|
-| 1979-01-01 | 50         | 50         | 50         |
-| 1979-01-02 | 60         | 60         | 60         |
-| 1979-01-03 | 70         | 70         | 70         |
-| ...        | ...        | ...        | ...        |
-
-2. Process the large simulated discharge csv to create a 2nd csv with the flow duration curve on each segment (script provided).
+Process the `hindcast_series_table` to create a 2nd table with the flow duration curve on each segment.
 
 | p_exceed | model_id_1 | model_id_2 | model_id_3 |
 |----------|------------|------------|------------|
 | 100      | 0          | 0          | 0          |
-| 99       | 10         | 10         | 10         |
-| 98       | 20         | 20         | 20         |
+| 97.5     | 10         | 10         | 10         |
+| 95       | 20         | 20         | 20         |
 | ...      | ...        | ...        | ...        |
 
-3. Process the large historical discharge csv to create a 3rd csv with the monthly averages on each segment (script provided).
+Then process the FDC data to create a 3rd table with scaled/transformed FDC data for each segment.
 
-| month | model_id_1 | model_id_2 | model_id_3 |
-|-------|------------|------------|------------|
-| 1     | 60         | 60         | 60         |
-| 2     | 30         | 30         | 30         |
-| 3     | 70         | 70         | 70         |
-| ...   | ...        | ...        | ...        |
+| model_id | Q100 | Q97.5 | Q95 |
+|----------|------|-------|-----|
+| 1        | 60   | 50    | 40  |
+| 2        | 60   | 50    | 40  |
+| 3        | 60   | 50    | 40  |
+| ...      | ...  | ...   | ... |
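
Editor's note: as a rough illustration of the flow-duration-curve step described above, here is a `pandas`/`numpy` sketch. The exact set of exceedance probabilities and the output file name are assumptions made for this example, not values taken from the `saber` source.

```python
import numpy as np
import pandas as pd

# Simulated discharge: datetime index, one column per model_id
hindcast_df = pd.read_parquet('tables/hindcast_series_table.parquet')

# Illustrative exceedance probabilities; the real workflow chooses its own set
p_exceed = np.array([100, 97.5, 95, 90, 75, 50, 25, 10, 5, 2.5, 0])

# The flow exceeded p% of the time is the (100 - p)th percentile of the record
fdc_df = pd.DataFrame(
    np.nanpercentile(hindcast_df.values, 100 - p_exceed, axis=0),
    index=pd.Index(p_exceed, name='p_exceed'),
    columns=hindcast_df.columns,
)

fdc_df.to_parquet('tables/fdc_table.parquet')  # hypothetical output name
```

The third table (scaled/transformed FDC data per segment) would then be derived from this one, e.g. transposed so each row is a `model_id` with `Q100`, `Q97.5`, `Q95`, ... columns as in the table above.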

docs/user-guide/index.md (+6 / -1)

@@ -1,6 +1,8 @@
 # User Guide
 
-We anticipate the primary usage of `saber-hbc` will be in scripts or workflows that process data in isolated environments,
+While following this guide, you may also want to refer to the [API Documentation](../api).
+
+We anticipate the primary usage of `saber` will be in scripts or workflows that process data in isolated environments,
 such as web servers or interactively in notebooks, rather than using the api in an app. The package's API is designed with
 many modular, compartmentalized functions intending to create flexibility for running specific portions of the SABER process
 or repeating certain parts if workflows fail or parameters need to be adjusted.
@@ -20,3 +22,6 @@ logging.basicConfig(
     format='%(asctime)s: %(name)s - %(message)s'
 )
 ```
+
+## Example Script

docs/user-guide/validation.md (+1 / -1)

@@ -27,4 +27,4 @@ obs_data_dir = '/path/to/obs/data/directory'  # optional - if data not in workdir
 
 saber.validate.sample_gauges(workdir)
 saber.validate.run_series(workdir, drain_shape, obs_data_dir)
-```
+```

mkdocs.yml (+15 / -2)

@@ -7,7 +7,10 @@ repo_url: https://github.com/rileyhales/saber-hbc/
 theme: material
 nav:
   - Home: index.md
-  - Required Datasets: data/index.md
+  - Required Datasets:
+      - Summary: data/index.md
+      - GIS Datasets: data/gis-data.md
+      - Discharge Datasets: data/discharge-data.md
   - User Guide:
       - Using SABER: user-guide/index.md
       - Data Preparation: user-guide/data_preparation.md
@@ -17,5 +20,15 @@ nav:
       - Bias Correction: user-guide/bias_correction.md
       - Validation: user-guide/validation.md
   - Demonstration: demo/index.md
-  - API Docs: api/index.md
+  - API Docs:
+      - API Reference: api/index.md
+      - saber.prep: api/prep.md
+      - saber.cluster: api/cluster.md
+      - saber.assign: api/assign.md
+      - saber.gis: api/gis.md
+      - saber.validate: api/validate.md
   - Cite SABER: cite/index.md
+
+plugins:
+  - search
+  - mkdocstrings

saber/__init__.py (+1 / -1)

@@ -14,5 +14,5 @@
 ]
 
 __author__ = 'Riley C. Hales'
-__version__ = '0.5.0'
+__version__ = '0.6.0'
 __license__ = 'BSD 3 Clause Clear'
