
Conversation

@aufdenkampe

This PR directly addresses issue #21, "Rewrite readWDM.py to read by data group & block."

Before merging, we should test that the constructed HDF5 files work for running the HSP2 model. Indications are that they do not. See respec#40 (comment) for results that @bcous ran from his branch. I got similar results running test10, which I'll commit shortly.

We should merge this PR before we merge PR #34 for the new WDM Class.

ptomasula and others added 19 commits February 19, 2021 14:43
Adds a new function, 'process_groups', which replaces the 'getfloats' function.

This new function processes WDM files by blocks, which consist of a control word (a 32-bit integer) followed by one or more float32 values containing the data for the block. This approach provides support for timeseries records with irregular timestep intervals.
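For illustration, a minimal sketch of reading one such block with the standard struct module; the control-word layout (a count in the low bits) and the helper name are assumptions, not the actual readWDM implementation:

```python
import struct

def _read_block(buf: bytes, pos: int):
    """Read one block: a 32-bit control word, then its float32 data values.
    The low-10-bits count here is a placeholder for the real WDM layout."""
    (control,) = struct.unpack_from("<i", buf, pos)
    nvals = control & 0x3FF  # hypothetical: number of float32 values that follow
    values = struct.unpack_from(f"<{nvals}f", buf, pos + 4)
    return values, pos + 4 * (1 + nvals)
```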
Adds an additional WDM test file, rpo772.wdm. This file contains a single timeseries with an irregular timestep.
My understanding is that when working with numpy arrays, allocating the array size up front is very important for performance, since expanding the array later essentially means copying it into a larger array in a new block of memory. With the block-processing approach there is presently no way to determine the exact size of the resultant array until you read through the groups, and allocating an array of a fixed size can cause files to fail with an IndexError. A quick solution was to implement a chunk-allocation approach which allocates numpy arrays in large 100 MB chunks and only expands an array when processing the next block would exceed the boundaries of the existing array. This solution is far from perfect but at least resolves the IndexErrors. We should consult with Jack to see if there is a way to determine the number of elements prior to processing the timeseries, or alternatively consider switching to lists, since appending to a list is amortized constant time and should perform better than this approach.
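A minimal sketch of the chunk-allocation idea described above; the chunk size and helper name are illustrative, not the exact code in this commit:

```python
import numpy as np

CHUNK = 26_214_400  # float32 values per chunk, roughly 100 MB

def _ensure_capacity(arr: np.ndarray, used: int, needed: int) -> np.ndarray:
    """Grow arr by whole chunks whenever the next block would overflow it,
    copying the already-used portion into the larger array."""
    while used + needed > arr.size:
        bigger = np.empty(arr.size + CHUNK, dtype=arr.dtype)
        bigger[:used] = arr[:used]
        arr = bigger
    return arr
```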
Blocks in a timeseries consist of a minimum of 2 elements: the block control word and one or more float data values. When a block ends, there is either another block, the end of the group, or the end of the record (meaning we'd go to the next record in the chain). Some of our test WDM files have a block that ends on the 511th element of a record. In these example files the 512th element also did not parse to a valid block control word. At this time I'm not sure of the significance of the 512th element when a block ends there, but processing it as a control word causes a series of errors that throw off the accuracy of all subsequent blocks processed. We'll need to confirm the significance of these elements with Jack, but for now I implemented logic to skip to the next record if a block ends on the 511th element of the record.
From @PaulDudaRESPEC, added to the new `docs` directory that we'll want to build out over time. Connects to #20 & #21.
1. Used lists to replace the numpy matrix.
2. Added a loop to iterate over each group, using the ending date as the ending condition.
The rewrite to process WDM files as groups led to the deprecation of the getfloats and adjustNval functions. Additionally, Hua refactored the original process_groups method in a previous commit as process_groups2. The original process_groups method was removed and process_groups2 was renamed to process_groups.
A single leading underscore is one of the methods used in Python (see PEP 8) to denote internal classes and functions. Internal functions come with no guarantee of backward compatibility. We want to update the naming of supporting functions that we do not want the public to interface with.
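For example (illustrative names and signatures only):

```python
def read_wdm(path):           # public: name is part of the supported interface
    return _process_groups(path)

def _process_groups(path):    # internal: leading underscore per PEP 8, may change
    ...
```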
General refactoring to clean up code by removing old comments, plus slight restructuring to increase readability. Also replaced print statements for errors with raised exceptions.
Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch
The constraints of Numba mean that datetime conversion cannot occur within the main block-processing loop (the _process_groups function). This commit replaces the previous use of Python datetime objects with a bit-packing approach in which date components are stored in a single integer whose individual bits can be parsed into the timestep components. Conversion to a datetime object now occurs outside the processing loop, prior to output to HDF5.
The datetime functions added to support Numba in commit e5d64a1 require that the integers input into these functions are 64-bit, or year information will be lost during bit manipulations. The previous implementation left the integer type up to Numba and in some instances could produce an int32 object. This commit makes integer conversion explicitly int64 so that year information is not lost.
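A sketch of the bit-packing idea with explicit int64, under an assumed field layout (the real readWDM bit widths may differ):

```python
import numpy as np

def _pack_date(year, month, day, hour, minute):
    packed = np.int64(year)          # int64 up front so year bits survive the shifts
    packed = (packed << 4) | month   # 4 bits: month
    packed = (packed << 5) | day     # 5 bits: day
    packed = (packed << 5) | hour    # 5 bits: hour
    packed = (packed << 6) | minute  # 6 bits: minute
    return packed

def _unpack_date(packed):
    minute = int(packed) & 0x3F
    hour = (int(packed) >> 6) & 0x1F
    day = (int(packed) >> 11) & 0x1F
    month = (int(packed) >> 16) & 0xF
    year = int(packed) >> 20
    return year, month, day, hour, minute
```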
Even with datetime conversion removed from the group-processing loop, conversion using datetime.datetime() remains slow. After attempting several datetime conversion approaches with pandas, I was still unable to achieve a significant performance boost.

Numba does not support the creation of datetime objects; however, it does support datetime arithmetic. This commit adds a Numba-compatible datetime conversion function which calculates a date's offset from the epoch and adds the appropriate timedelta64 values to return a datetime64 object.
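A sketch of the epoch-offset approach, assuming integer date components; here the jitted helper returns int64 minutes since 1970-01-01 and the datetime64 arithmetic happens on the resulting array outside the loop (helper names are illustrative, not the code in this commit):

```python
import numpy as np
from numba import njit

_DAYS_BEFORE_MONTH = np.array(
    [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334], dtype=np.int64)

@njit
def _minutes_since_epoch(year, month, day, hour, minute):
    y = year - 1
    # days before Jan 1 of `year` in the proleptic Gregorian calendar,
    # shifted so 1970-01-01 is day 0 (719162 days precede it)
    days = y * 365 + y // 4 - y // 100 + y // 400 - 719162
    days += _DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    if month > 2 and year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
        days += 1  # leap day already passed this year
    return (days * 24 + hour) * 60 + minute

# Outside the jitted loop, the int64 offsets convert vectorized:
# index = np.datetime64("1970-01-01T00:00") + offsets.astype("timedelta64[m]")
```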
I missed committing 3 line deletions which remove the old pandas.apply-based datetime conversion approach.
aufdenkampe and others added 3 commits March 30, 2021 07:36
@ptomasula, try running either of these notebooks.
Connects to issue #21 & PR #35
The timeseries produced by the new version of the WDM reader are offset by one datetime step from the previous version.
When the transform is unable to detect the timestep of the input timeseries, there was a bit of code that performed a reindex and fillna using the 'tbase' key of the siminfo dictionary. However, this key does not appear in that dictionary. Additionally, our initial test case, test 10b, does not encounter this bit of code. We'll revisit its necessity during our IO abstraction, but we're commenting it out for now.
aufdenkampe added a commit that referenced this pull request Mar 31, 2021
For testing of #21 and #35.
HDF5 output files were not committed.
aufdenkampe and others added 9 commits April 8, 2021 17:27
for assisted value comparisons.
This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table. The updated WDM reader wrote a value 1 timestep less than the end of the last group in the timeseries. See issue #21.
Adds two lines that were missing from the previous commit, ca50dd0.
Commit c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from the new `readWDM`.
We still need to compare outputs.
Reversing b0edc39 as a step toward advancing #21
This commit assigns the freq attribute of the timeseries generated in the readWDM function. Without the freq assignment, the timeseries were failing to execute in some of the model modules.

Reference: #21 and
#21 (comment)
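A minimal sketch of the idea in pandas (the attribute assignment, not the actual readWDM code):

```python
import pandas as pd

# hypothetical regular hourly series standing in for one read from the WDM file
ts = pd.Series(range(4), index=pd.date_range("2001-01-01", periods=4, freq="H"))
ts.index.freq = pd.infer_freq(ts.index)  # e.g. "H"; downstream modules rely on index.freq
```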
This fixes an invalid block control word parsing issue similar to the one we saw previously. The 'offset' variable keeps track of where in a given record the loop is, and a block must be at least 2 words long. When at the final (512th) element of the record, we must first go to the next record in the timeseries before attempting to process the next block. See line 307.

However, we used Python indexing for the offset variable, which starts at 0. The last index of the record should therefore be 511, not 512.

Reference: #21 (comment)
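A minimal sketch of the boundary check with the 0-based offset (record size and helper name assumed):

```python
RECORD_WORDS = 512  # 32-bit words per WDM record

def _block_must_jump(offset: int) -> bool:
    """A block needs at least 2 words (control word + one value), so if the
    0-based offset is already at the record's last index (511), the next
    block must start on the next record in the chain."""
    return offset >= RECORD_WORDS - 1
```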
Timeseries with an irregular timestep do not conform to the requirements for setting the index.freq attribute, which results in a ValueError. This commit adds a try/except so the reader can handle timeseries with irregular timesteps. However, as of this commit, the model will not be able to run these timeseries; additional effort is needed to handle irregular timesteps.
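A sketch of the guard, assuming a pandas Series ts and a target freq string (illustrative helper, not the committed code):

```python
import pandas as pd

def _assign_freq(ts: pd.Series, freq: str) -> pd.Series:
    try:
        ts.index.freq = freq  # regular timestep: e.g. "H" for hourly
    except ValueError:
        pass  # irregular timestep: leave freq unset; the model can't run these yet
    return ts
```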
Renames internal functions with a prepended underscore to indicate they are private functions, as per PEP 8.
This reverts commit c545150, reversing
changes made to c62adb6.
@ptomasula

Due to issues with merging, this pull request will be abandoned and replaced with an updated pull request on a new branch split off from this one. Additional details can be found under pull request #37.

@ptomasula closed this Apr 28, 2021
aufdenkampe added a commit that referenced this pull request May 13, 2021
Also addresses #35 (git tracking Jupyter notebooks).
The new conda environment substantially improves over the previous version, with a more consolidated HDF5 version (1.10.6) and an upgrade of JupyterLab to v3.
@ptomasula deleted the develop_readWDM branch September 3, 2021 14:17
rburghol pushed a commit to HARPgroup/HSPsquared that referenced this pull request Mar 5, 2025
For testing of LimnoTech#21 and LimnoTech#35.
HDF5 output files were not committed.