forked from respec/HSPsquared
Merge develop_readWDM into develop to read time series by block & group
#35
Closed
Conversation
Adds a new function, `process_groups`, which replaces the `getfloats` function. The new function processes WDM files by blocks, each consisting of a control word (a 32-bit integer) followed by one or more float32 values containing the block's data. This approach will provide support for timeseries records with irregular timestep intervals.
Adds an additional WDM test file, rpo772.wdm, which contains a single timeseries with an irregular timestep.
My understanding is that when working with NumPy arrays, allocating the array size up front is very important for performance, since expanding the array later essentially means copying it into a larger array in a new block of memory. With the block processing approach there is presently no way to determine the exact size of the resulting array until you read through the groups, so allocating an array of fixed size can cause files to fail with an IndexError. As a quick solution I implemented a chunked allocation approach, which allocates NumPy arrays in large ~100 MB chunks and only expands an array when processing the next block would exceed the bounds of the existing array. This solution is far from perfect but at least resolves the IndexErrors. We should consult with Jack to see if there is a way to determine the number of elements before processing the timeseries, or alternatively consider switching to lists, since appending to a list is amortized constant time and should perform better than this approach.
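The chunked-growth idea can be sketched as follows. This is a hypothetical helper, not the actual readWDM code; the function name, signature, and chunk size (25e6 float32 elements is roughly 100 MB) are illustrative assumptions.

```python
import numpy as np

def ensure_capacity(arr, used, needed, chunk_elems=25_000_000):
    """Grow `arr` in whole chunks so that `used + needed` elements fit.

    Hypothetical sketch of the chunk-allocation workaround described
    above; names and chunk size do not come from the real readWDM code.
    """
    if used + needed <= arr.size:
        return arr                          # enough room already, no copy
    shortfall = used + needed - arr.size
    grow_by = -(-shortfall // chunk_elems) * chunk_elems  # round up to whole chunks
    grown = np.empty(arr.size + grow_by, dtype=arr.dtype)
    grown[:used] = arr[:used]               # one copy per growth, not per element
    return grown
```

Amortized over a whole file, this copies each element at most a handful of times, versus once per append with naive per-block resizing.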
Blocks in a timeseries consist of a minimum of 2 elements: the block control word and one or more float data values. When a block ends, it is followed by another block, the end of the group, or the end of the record (meaning we go to the next record in the chain). Some of our test WDM files have a block that ends on the 511th element of a record, and in those files the 512th element does not parse to a valid block control word. At this time I'm not sure of the significance of the 512th element when a block ends there, but processing it as a control word causes a series of errors that throw off the accuracy of all subsequent blocks. We'll need to confirm the significance of these elements with Jack, but for now I implemented logic to skip to the next record when the end of a block falls on the 511th element of the record.
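Structurally, walking the blocks in one record looks something like the sketch below. This is not the actual readWDM code: `decode_nvals` is a stand-in for the real control-word decoding, and the loop simply stops once fewer than 2 words remain so the caller can move on to the next record in the chain.

```python
RECORD_WORDS = 512   # a WDM record holds 512 32-bit words

def blocks_in_record(words, decode_nvals):
    """Yield (offset, control_word) for each block in a single record.

    `decode_nvals(control_word)` must return the number of data values
    in the block (the real decoding lives in readWDM). A block needs at
    least 2 words (control word + one value), so once fewer than 2 words
    remain we stop. Hypothetical sketch, not the actual implementation.
    """
    offset = 0
    while offset < RECORD_WORDS - 1:            # room for control word + >= 1 value
        control = words[offset]
        yield offset, control
        offset += 1 + decode_nvals(control)     # skip control word and its data
```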
From @PaulDudaRESPEC, added to new `docs` directory that we'll want to build out over time. Connects to #20 & #21.
.exp & .OUT files from @jlkittle using the WDMRX debugger: https://github.com/respec/FORTRAN/blob/master/lib3.0/BIN/WDMRX.EXE. CSV file from @htaolimno using https://github.com/respec/BASINS/tree/master/atcWdmVb. Connects to #21 & #22.
1. Used lists to replace the numpy matrix.
2. Added a loop to iterate over each group, using the ending date as the ending condition.
The rewrite to process WDM files as groups led to the deprecation of the `getfloats` and `adjustNval` functions. Additionally, Hua refactored the original `process_groups` method in a previous commit as `process_groups2`. The original `process_groups` method was removed and `process_groups2` was renamed to `process_group`.
A single leading underscore is one of the conventions used in Python (see PEP 8) to denote internal classes and functions. Internal functions come with no guarantee of backward compatibility. We want to update the naming of supporting functions that we do not want the public to interface with.
General refactoring to clean up the code by removing old comments, plus slight restructuring to increase readability. Also replaced error print statements with raised exceptions.
Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch
The constraints of Numba meant that datetime conversion could not occur within the main block processing loop (the `_process_groups` function). This commit replaces the previous use of Python datetime objects with a bit-packing approach in which the date components are stored in a single integer whose individual bits can be parsed into the timestep components. Conversion to a datetime object now occurs outside the processing loop, just before output to HDF.
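The bit-packing idea can be illustrated like this. The field widths and layout below are assumptions for the sketch, not the actual readWDM layout: hour in bits 0-4, day in bits 5-9, month in bits 10-13, and the year in the remaining high bits of an int64.

```python
import numpy as np

def pack_date(year, month, day, hour):
    """Pack date components into one int64 (illustrative bit layout)."""
    return (np.int64(year) << 14) | (month << 10) | (day << 5) | hour

def unpack_date(packed):
    """Recover the components by masking out each bit field."""
    return (int(packed >> 14),           # year: high bits
            int((packed >> 10) & 0xF),   # month: 4 bits
            int((packed >> 5) & 0x1F),   # day: 5 bits
            int(packed & 0x1F))          # hour: 5 bits
```

Because the packed value is a plain integer, it can flow through a Numba-compiled loop freely, and the conversion back to a datetime happens once, after the loop.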
…ared into develop_readWDM
The datetime functions added to support Numba in commit e5d64a1 require that the integers input into them be 64-bit, or year information will be lost during bit manipulations. The previous implementation left the integer type up to Numba and in some instances could produce an int32 object. This commit makes the integer conversion explicitly int64 so that year information is not lost.
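A quick demonstration of why the explicit int64 matters. If the year occupies high bits (bit 26 and up, say, an assumed layout for illustration), then `2021 << 26` exceeds `2**31 - 1` and silently wraps in int32 arithmetic, destroying the year.

```python
import numpy as np

year = 2021
packed64 = np.int64(year) << 26             # 64-bit: plenty of headroom
packed32 = np.int32(year) << np.int32(26)   # 32-bit: overflows and wraps

assert int(packed64 >> 26) == year                 # year survives the round trip
assert int(packed32 >> np.int32(26)) != year       # year information is lost
```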
Even with datetime conversions removed from the group processing loop, conversion using datetime.datetime() remained slow. After some attempts at other datetime conversion approaches with pandas, I was still unable to achieve a significant performance boost. Numba does not support the creation of datetime objects, but it does support datetime arithmetic. This commit adds a Numba-compatible datetime conversion function that calculates a date's offset from the epoch and adds the appropriate timedelta64 objects to return a datetime64 object.
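The core of that trick can be sketched as below: build the result purely with datetime64/timedelta64 arithmetic, which Numba supports, instead of constructing `datetime.datetime` objects. The signature is illustrative; the real readWDM function first computes the day offset from the WDM date fields.

```python
import numpy as np

EPOCH = np.datetime64('1970-01-01T00:00:00', 's')

def to_datetime64(days, hours, minutes, seconds):
    """Convert an offset from the Unix epoch into a datetime64 using
    only timedelta64 arithmetic (Numba-friendly). Hypothetical sketch.
    """
    return (EPOCH
            + np.timedelta64(days, 'D')
            + np.timedelta64(hours, 'h')
            + np.timedelta64(minutes, 'm')
            + np.timedelta64(seconds, 's'))
```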
I missed committing 3 line deletions that remove the old pandas.apply-based datetime conversion approach.
Open
@ptomasula, try running either of these notebooks. Connects to issue #21 & PR #35
The timeseries produced by the new version of the WDM reader are offset by one timestep from the previous version.
When the transform is unable to detect the timestep of the input timeseries, there was a bit of code that performed a reindex and fillna using the 'tbase' of the siminfo. However, this key does not appear in that dictionary, and our initial test case (test 10b) does not reach this code. We'll revisit its necessity during our IO abstraction work, but it is commented out for now.
@ptomasula, I found the bug recently described in #21
for assisted value comparisons.
This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table: the updated WDM reader wrote a value one timestep less than the end of the last group in the timeseries. See issue #21.
Adds two lines that were missing from the previous commit (ca50dd0).
Commit c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from new `readWDM`. Still need to compare outputs
This commit assigns the freq attribute of the timeseries generated in the readWDM function. Without the freq assignment, the timeseries were failing to execute in some of the model modules. Reference: #21 and #21 (comment)
This fixes a similar invalid-block-control-word parsing issue to the one we saw previously. The 'offset' variable keeps track of where in a given record the loop is, and a block must be at least 2 words long. When at the final (512th) word of the record, attempting to process the next block requires first moving to the next record in the timeseries; see line 307. However, we've used Python indexing for the offset variable, which starts at 0, so the last index of a record should therefore be 511, not 512. Reference: #21 (comment)
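The corrected boundary check amounts to the following. This is a hypothetical sketch of the off-by-one fix, not the code in readWDM.

```python
RECORD_WORDS = 512   # words per WDM record

def must_advance_record(offset):
    """True when the current record cannot hold another block (a control
    word plus at least one value). With 0-based offsets the last valid
    index is 511, so the comparison uses RECORD_WORDS - 1, not
    RECORD_WORDS. Hypothetical sketch of the off-by-one fix.
    """
    return offset >= RECORD_WORDS - 1
```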
Timeseries with an irregular timestep do not conform to the requirements for setting the index.freq attribute, which results in a ValueError. This commit adds a try/except so the reader can handle timeseries with irregular timesteps. However, as of this commit the model will not be able to run these timeseries; additional effort is needed to handle irregular timesteps.
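The shape of that try/except is roughly as follows. This is a hypothetical helper mirroring the behavior described above, not the actual readWDM code: pandas raises ValueError when the assigned frequency does not conform to the index values.

```python
import pandas as pd

def set_freq(ts, freq):
    """Try to attach `freq` to the timeseries index. An irregular
    timestep makes the assignment raise ValueError, which we catch so
    the reader still returns the data (with freq left unset).
    Hypothetical sketch of the try/except described above.
    """
    try:
        ts.index.freq = freq
    except ValueError:
        pass   # irregular timestep: leave index.freq as None
    return ts
```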
Renames internal functions with a prepended underscore to indicate they are private functions, per PEP 8.
Due to issues merging, this pull request will be abandoned and replaced with an updated pull request on a new branch split off from this one. Additional details can be found under pull request #37.
aufdenkampe added a commit that referenced this pull request on May 13, 2021:
Also addresses #35 (git tracking of Jupyter notebooks). The new conda environment substantially improves over the previous version, with a more consolidated HDF5 version (1.10.6) and an upgrade of JupyterLab to v3.
rburghol pushed a commit to HARPgroup/HSPsquared that referenced this pull request on Mar 5, 2025:
For testing of LimnoTech#21 and LimnoTech#35. HDF5 output files were not committed.
This PR directly addresses issue #21, "Rewrite readWDM.py to read by data group & block."
Before merging, we should test that the constructed HDF5 files work for running the HSP2 model. Indications are that they do not; see respec#40 (comment), which @bcous ran from his branch. I got similar results running test10, which I'll commit shortly.

We should merge this PR before we merge PR #34 for the new WDM Class.