
Conversation

@aufdenkampe

This PR directly addresses issue #21, "Rewrite readWDM.py to read by data group & block."

Before merging, we should test that the constructed HDF5 files work for running the HSP2 model. Indications are that they do not. See respec#40 (comment) for results that @bcous ran from his branch. I got similar results running test10, which I'll commit shortly.

We should merge this PR before we merge PR #34 for the new WDM Class.

ptomasula and others added 19 commits February 19, 2021 14:43
Adds a new function, 'process_groups', which replaces the 'getfloats' function.

This new function processes WDM files by blocks, which consist of a control word (a 32-bit integer) followed by one or more float32 values containing the data for the block. This approach provides support for timeseries records with irregular timestep intervals.
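For illustration, a minimal sketch of reading one such block with the standard struct module; the control-word layout (a count in the low bits) and the helper name are assumptions, not the actual readWDM implementation:

```python
import struct

def _read_block(buf: bytes, pos: int):
    """Read one block: a 32-bit control word, then its float32 data values.
    The low-10-bits count here is a placeholder for the real WDM layout."""
    (control,) = struct.unpack_from("<i", buf, pos)
    nvals = control & 0x3FF  # hypothetical: number of float32 values that follow
    values = struct.unpack_from(f"<{nvals}f", buf, pos + 4)
    return values, pos + 4 * (1 + nvals)
```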
Adds an additional WDM test file, rpo772.wdm. This file contains a single timeseries with an irregular timestep.
My understanding is that when working with numpy arrays, allocating the array size up front is very important for performance, since expanding the array later essentially means copying it into a larger array in a new block of memory. With the block-processing approach there is presently no way to determine the exact size of the resultant array until you read through the groups, and allocating an array of a fixed size can cause files to fail with an IndexError. A quick solution was to implement a chunk-allocation approach which allocates numpy arrays in large 100 MB chunks and only expands an array when processing the next block would exceed the boundaries of the existing array. This solution is far from perfect but at least resolves the IndexErrors. We should consult with Jack to see if there is a way to determine the number of elements prior to processing the timeseries, or alternatively consider switching to lists, since appending to a list is amortized constant time and should perform better than this approach.
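A minimal sketch of the chunk-allocation idea described above; the chunk size and helper name are illustrative, not the exact code in this commit:

```python
import numpy as np

CHUNK = 26_214_400  # float32 values per chunk, roughly 100 MB

def _ensure_capacity(arr: np.ndarray, used: int, needed: int) -> np.ndarray:
    """Grow arr by whole chunks whenever the next block would overflow it,
    copying the already-used portion into the larger array."""
    while used + needed > arr.size:
        bigger = np.empty(arr.size + CHUNK, dtype=arr.dtype)
        bigger[:used] = arr[:used]
        arr = bigger
    return arr
```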
Blocks in a timeseries consist of a minimum of 2 elements: the block control word and one or more float data values. When a block ends, there is either another block, the end of the group, or the end of the record (meaning we'd go to the next record in the chain). Some of our test WDM files have a block that ends on the 511th element of a record. In these example files the 512th element also did not parse to a valid block control word. At this time I'm not sure of the significance of the 512th element when a block ends there, but processing it as a control word causes a series of errors that throw off the accuracy of all subsequent blocks processed. We'll need to confirm the significance of these elements with Jack, but for now I implemented logic to skip to the next record if a block ends on the 511th element of the record.
From @PaulDudaRESPEC, added to the new `docs` directory that we'll want to build out over time. Connects to #20 & #21.
1. Used lists to replace the numpy matrix.
2. Added a loop to iterate over each group, using the ending date as the ending condition.
The rewrite to process WDM files as groups led to the deprecation of the getfloats and adjustNval functions. Additionally, Hua refactored the original process_groups method in a previous commit as process_groups2. The original process_groups method was removed and process_groups2 was renamed to process_groups.
A single leading underscore is one of the methods used in Python (see PEP 8) to denote internal classes and functions. Internal functions come with no guarantee of backward compatibility. We want to update the naming of supporting functions that we do not want the public to interface with.
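For example (illustrative names and signatures only):

```python
def read_wdm(path):           # public: name is part of the supported interface
    return _process_groups(path)

def _process_groups(path):    # internal: leading underscore per PEP 8, may change
    ...
```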
General refactoring to clean up code by removing old comments, plus slight restructuring to increase readability. Also replaced print statements for errors with raised exceptions.
Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch
The constraints of Numba mean that datetime conversion cannot occur within the main block-processing loop (the _process_groups function). This commit replaces the previous use of Python datetime objects with a bit-packing approach in which date components are stored in a single integer whose individual bits can be parsed into the timestep components. Conversion to a datetime object now occurs outside the processing loop, prior to output to HDF5.
The datetime functions added to support Numba in commit e5d64a1 require that the integers input into these functions are 64-bit, or year information will be lost during bit manipulations. The previous implementation left the integer type up to Numba and in some instances could produce an int32 object. This commit makes integer conversion explicitly int64 so that year information is not lost.
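A sketch of the bit-packing idea with explicit int64, under an assumed field layout (the real readWDM bit widths may differ):

```python
import numpy as np

def _pack_date(year, month, day, hour, minute):
    packed = np.int64(year)          # int64 up front so year bits survive the shifts
    packed = (packed << 4) | month   # 4 bits: month
    packed = (packed << 5) | day     # 5 bits: day
    packed = (packed << 5) | hour    # 5 bits: hour
    packed = (packed << 6) | minute  # 6 bits: minute
    return packed

def _unpack_date(packed):
    minute = int(packed) & 0x3F
    hour = (int(packed) >> 6) & 0x1F
    day = (int(packed) >> 11) & 0x1F
    month = (int(packed) >> 16) & 0xF
    year = int(packed) >> 20
    return year, month, day, hour, minute
```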
Even with datetime conversion removed from the group-processing loop, conversion using datetime.datetime() remains slow. After attempting several datetime conversion approaches with pandas, I was still unable to achieve a significant performance boost.

Numba does not support the creation of datetime objects; however, it does support datetime arithmetic. This commit adds a Numba-compatible datetime conversion function which calculates a date's offset from the epoch and adds the appropriate timedelta64 values to return a datetime64 object.
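A sketch of the epoch-offset approach, assuming integer date components; here the jitted helper returns int64 minutes since 1970-01-01 and the datetime64 arithmetic happens on the resulting array outside the loop (helper names are illustrative, not the code in this commit):

```python
import numpy as np
from numba import njit

_DAYS_BEFORE_MONTH = np.array(
    [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334], dtype=np.int64)

@njit
def _minutes_since_epoch(year, month, day, hour, minute):
    y = year - 1
    # days before Jan 1 of `year` in the proleptic Gregorian calendar,
    # shifted so 1970-01-01 is day 0 (719162 days precede it)
    days = y * 365 + y // 4 - y // 100 + y // 400 - 719162
    days += _DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    if month > 2 and year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
        days += 1  # leap day already passed this year
    return (days * 24 + hour) * 60 + minute

# Outside the jitted loop, the int64 offsets convert vectorized:
# index = np.datetime64("1970-01-01T00:00") + offsets.astype("timedelta64[m]")
```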
I missed committing 3 line deletions which remove the old pandas.apply-based datetime conversion approach.
aufdenkampe and others added 3 commits March 30, 2021 07:36
@ptomasula, try running either of these notebooks.
Connects to issue #21 & PR #35
The timeseries produced by the new version of the WDM reader are offset by one datetime step from the previous version.
When the transform is unable to detect the timestep of the input timeseries, there was a bit of code that performed a reindex and fillna using the 'tbase' key of the siminfo dictionary. However, this key does not appear in that dictionary. Additionally, our initial test case, test 10b, does not encounter this bit of code. We'll revisit its necessity during our IO abstraction, but we're commenting it out for now.
aufdenkampe added a commit that referenced this pull request Mar 31, 2021
For testing of #21 and #35.
HDF5 output files were not committed.
aufdenkampe and others added 9 commits April 8, 2021 17:27
for assisted value comparisons.
This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table. The updated WDM reader wrote a value 1 timestep less than the end of the last group in the timeseries. See issue #21.
Adds two lines that were missing from the previous commit, ca50dd0.
Commit c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from the new `readWDM`.
We still need to compare outputs.
Reversing b0edc39 as a step toward advancing #21
This commit assigns the freq attribute of the timeseries generated in the readWDM function. Without the freq assignment, the timeseries were failing to execute in some of the model modules.

Reference: #21 and
#21 (comment)
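A minimal sketch of the idea in pandas (the attribute assignment, not the actual readWDM code):

```python
import pandas as pd

# hypothetical regular hourly series standing in for one read from the WDM file
ts = pd.Series(range(4), index=pd.date_range("2001-01-01", periods=4, freq="H"))
ts.index.freq = pd.infer_freq(ts.index)  # e.g. "H"; downstream modules rely on index.freq
```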
This fixes an invalid block control word parsing issue similar to the one we saw previously. The 'offset' variable keeps track of where in a given record the loop is, and a block must be at least 2 words long. When at the final (512th) element of the record, we must first go to the next record in the timeseries before attempting to process the next block. See line 307.

However, we used Python indexing for the offset variable, which starts at 0. The last index of the record should therefore be 511, not 512.

Reference: #21 (comment)
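A minimal sketch of the boundary check with the 0-based offset (record size and helper name assumed):

```python
RECORD_WORDS = 512  # 32-bit words per WDM record

def _block_must_jump(offset: int) -> bool:
    """A block needs at least 2 words (control word + one value), so if the
    0-based offset is already at the record's last index (511), the next
    block must start on the next record in the chain."""
    return offset >= RECORD_WORDS - 1
```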
Timeseries with an irregular timestep do not conform to the requirements for setting the index.freq attribute, which results in a ValueError. This commit adds a try/except so the reader can handle timeseries with irregular timesteps. However, as of this commit, the model will not be able to run these timeseries; additional effort is needed to handle irregular timesteps.
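A sketch of the guard, assuming a pandas Series ts and a target freq string (illustrative helper, not the committed code):

```python
import pandas as pd

def _assign_freq(ts: pd.Series, freq: str) -> pd.Series:
    try:
        ts.index.freq = freq  # regular timestep: e.g. "H" for hourly
    except ValueError:
        pass  # irregular timestep: leave freq unset; the model can't run these yet
    return ts
```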
Renames internal functions with a prepended underscore to indicate they are private functions, as per PEP 8.
This reverts commit c545150, reversing
changes made to c62adb6.
@ptomasula

Due to issues with merging, this pull request will be abandoned and replaced with an updated pull request on a new branch split off from this one. Additional details can be found under pull request #37.

@ptomasula closed this Apr 28, 2021
aufdenkampe added a commit that referenced this pull request May 13, 2021
Also addresses #35 (git tracking Jupyter notebooks).
The new conda environment substantially improves over the previous version, with a more consolidated HDF5 version (1.10.6) and an upgrade of JupyterLab to v3.
@ptomasula deleted the develop_readWDM branch September 3, 2021 14:17
rburghol pushed a commit to HARPgroup/HSPsquared that referenced this pull request Mar 5, 2025
For testing of LimnoTech#21 and LimnoTech#35.
HDF5 output files were not committed.