-
Notifications
You must be signed in to change notification settings - Fork 10
Add IMS stationIdentification to output #1654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Please review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you! |
@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know |
@yuanxue2870, test files include both input and output files. please keep consistency with ioda converters input and output filenames. Otherwise, you also need to modify .txt file to do the ioda-converter test. |
Thanks Youlong and Cory for your comments. I have already created the new test input and output files on Hera, also included their paths on this PR. The administrator can grab these new files and replace with the old ones directly (i.e., file names are consistent). Unless there is a public place that I can drop off these two files? |
The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files |
Oh, I see. That is why it is different from other repos... Thanks for your clarification! I will do that. Convert to draft while uploading new test files... |
Please re-review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you! |
|
||
lons = lons.astype('float32') | ||
lats = lats.astype('float32') | ||
stid = stid.astype('int64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stid = stid.astype('int32')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuanxue2870 I recalled that only datetime use the double precision. For station ID, single precision is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Youlong, thanks for your comments. For now, I do not think int32 or int64 would make a difference here as the "stid" is a nine-digit integer, and not being output to the IODA formatted as well. However, int32 will not be enough if we goes to higher resolution (e.g., C1152), where we may need a 11 to 12-digit integer, which will exceed int32 storage capacity. Hence, I would like to keep int64 here. - Thanks,Yuan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, keep it as suggested, @yuanxue2870. I am fine with it but this is the first time I found that people uses int64 for station ID, a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an optimal suggestion, @srherbener, thank you for your deep thoughts. sounds good with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approved this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this update!
@@ -112,6 +118,7 @@ def _read(self): | |||
self.varAttrs[('dateTime', metaDataName)]['_FillValue'] = long_missing_value | |||
self.outdata[('latitude', metaDataName)] = lats | |||
self.outdata[('longitude', metaDataName)] = lons | |||
self.outdata[('stationIdentification', metaDataName)] = strstid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that stid itself, as an integer, is not written to the output file. Rather it is converted to a string and the string value is written out in the stationIdentification value? If so, I think it's okay for stid to be a 64-bit integer.
Pardon my ignorance of this obs type, but I am curious why ~4 billion values is not enough unique station id values. If this obs type is going to have very large numbers of locations, we might want to consider using a numeric station id, or be very careful to use a character array instead of a variable length string, since the variable length string storage and access in the file is going to be very slow. Not for this PR but something to think about for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srherbener, as @ClaraDraper-NOAA would like to filter out some processed snow depth values at some IMS grids, and she asked me to add a staion ID as txxxxyyy, t is tile number for FV3 grid, 1-6, xxxx is the number for the longitude, and yyyy is number of latitude, e.g., C1152, 4608x2406 grids, xxxx from 0001 to 4608, yyyy 0001 to 2406, therefore, it is a 9-digit integer. In general, stid use string for ioda-converter. Are you thinking to use an integer instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srherbener Yes, you are correct. Only strid is written out.
Thanks for your suggestion. IMS is gridded obs, which is different from conventional "in-situ" (i.e., point-scale) obs. We want to give each grid an ID because we want to do some further QCs afterwards. The way we set up each ID is by concatenating its tile number, x-coordinate number, and y-coordinate number -- so we have super big values for stid. For now, int32 is sufficient. By using int64, I am trying to leave some wiggle room for higher resolution (in the long run) which is likely with an increased concatenated ID based on the current code logic.
I am happy to adjust if defining 'stid' as an int64 takes too much memory. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srherbener, In future. you suggest to use an integer. but for now, If it is 9 digit string, do you know how slow it will be for a vector 6x4608x2406?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YoulongXia-NOAA the vector you mention is ~66 million entries. If we are reading that many locations into a single ObsSpace, then a variable length string data type on a variable (eg stationIdentification) in the file could be problematic. However, if it's the case where we need the 66 million values to cover all the possible grid cells, but we only read a subset of those grids (say 10's or 100's of thousands of locations) into a single ObsSpace then we should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For C96, we output a total of 12252 location ID strings on December 01. The total number of land points in C96 is 18320, meaning ~67% of the total land points are being output in the IMS-IODA output. For C1152, we have a total number of land points of 2381853, similarly, say when 67% are being output on December 01 (for example), we will output 1595841 ID strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is not necessarily memory, rather it is runtime. When an hdf5/netcdf file has a large array (vector) of strings, and those strings are the variable length string datatype (which is what stationIdentification is), it can sometimes take a long time to read those strings into memory. When in memory the strings are stored as a vector in C++ and processing (compare, sort, conversion to DateTime objects, etc) those can also be slow, but not as bad as the file I/O.
I think we are in the gray zone for runtime performance with a string vector that contains a million elements. It may still be okay but if so it's probably getting close to where the runtime starts blowing up. It might make sense to:
- Leave stationIdentification as is for now and see how it performs
- If the runtime is too slow, then consider using a new integer variable, say "gridIdentification"
If this sounds like a good approach with everyone, you could merge this PR as is (using stationIdentification) and then do some profiling to decide if a numeric id is warranted (and add that in a subsequent PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you propose sounds good to me, thank you @srherbener for the suggestions and insight! (as always!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Thanks for your help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great. Happy to approve now.
Description
Issue(s) addressed
Resolves #1642
Dependencies
List the other PRs that this PR is dependent on:
Impact
Expected impact on downstream repositories: No significant impacts are expected. More IMS QCs can be done based on this added IMS ID feature. The input and output to the ctest of "test_iodaconv_imsfv3grid_scf" needs to be changed, otherwise, the ctest will fail.
Checklist