Skip to content

Add IMS stationIdentification to output #1654

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

yuanxue2870
Copy link

@yuanxue2870 yuanxue2870 commented May 12, 2025

Description

  1. Revise errsd from 40 to 80 to be consistent with offline workflow. Confirmed IMS errsd value via email exchanges on 05/12/2025.
  2. Based on Youlong's recent PR in Add station IDs with txxxxyyyy for processed IMS snow DA NOAA-EMC/land-SCF_proc#1, now we need to output IMS station IDs (i.e., stationIdentification as a nine-digit string) to the ioda formatted output to be used for follow-up QC.

Issue(s) addressed

Resolves #1642

Dependencies

List the other PRs that this PR is dependent on:

Impact

Expected impact on downstream repositories: No significant impacts are expected. More IMS QCs can be done based on this added IMS ID feature. The input and output to the ctest of "test_iodaconv_imsfv3grid_scf" needs to be changed, otherwise, the ctest will fail.

Checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have run the unit tests before creating the PR

@yuanxue2870
Copy link
Author

Please review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!

@yuanxue2870 yuanxue2870 marked this pull request as ready for review May 12, 2025 15:19
@CoryMartin-NOAA
Copy link
Contributor

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

@YoulongXia-NOAA
Copy link
Contributor

@yuanxue2870, test files include both input and output files. please keep consistency with ioda converters input and output filenames. Otherwise, you also need to modify .txt file to do the ioda-converter test.

@yuanxue2870
Copy link
Author

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

Thanks Youlong and Cory for your comments. I have already created the new test input and output files on Hera, also included their paths on this PR. The administrator can grab these new files and replace with the old ones directly (i.e., file names are consistent). Unless there is a public place that I can drop off these two files?

@CoryMartin-NOAA
Copy link
Contributor

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

@yuanxue2870
Copy link
Author

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

Oh, I see. That is why it is different from other repos... Thanks for your clarification! I will do that. Convert to draft while uploading new test files...

@yuanxue2870 yuanxue2870 marked this pull request as draft May 12, 2025 15:56
@yuanxue2870 yuanxue2870 marked this pull request as ready for review May 12, 2025 16:06
@yuanxue2870
Copy link
Author

Please re-review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!


lons = lons.astype('float32')
lats = lats.astype('float32')
stid = stid.astype('int64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stid = stid.astype('int32')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuanxue2870 I recalled that only datetime use the double precision. For station ID, single precision is sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Youlong, thanks for your comments. For now, I do not think int32 or int64 would make a difference here as the "stid" is a nine-digit integer, and not being output to the IODA formatted as well. However, int32 will not be enough if we goes to higher resolution (e.g., C1152), where we may need a 11 to 12-digit integer, which will exceed int32 storage capacity. Hence, I would like to keep int64 here. - Thanks,Yuan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, keep it as suggested, @yuanxue2870. I am fine with it but this is the first time I found that people uses int64 for station ID, a bit weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an optimal suggestion, @srherbener, thank you for your deep thoughts. sounds good with me.

Copy link
Contributor

@YoulongXia-NOAA YoulongXia-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approved this PR.

Copy link
Collaborator

@srherbener srherbener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this update!

@@ -112,6 +118,7 @@ def _read(self):
self.varAttrs[('dateTime', metaDataName)]['_FillValue'] = long_missing_value
self.outdata[('latitude', metaDataName)] = lats
self.outdata[('longitude', metaDataName)] = lons
self.outdata[('stationIdentification', metaDataName)] = strstid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I understanding correctly that stid itself, as an integer, is not written to the output file. Rather it is converted to a string and the string value is written out in the stationIdentification value? If so, I think it's okay for stid to be a 64-bit integer.

Pardon my ignorance of this obs type, but I am curious why ~4 billion values is not enough unique station id values. If this obs type is going to have very large numbers of locations, we might want to consider using a numeric station id, or be very careful to use a character array instead of a variable length string, since the variable length string storage and access in the file is going to be very slow. Not for this PR but something to think about for the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srherbener, as @ClaraDraper-NOAA would like to filter out some processed snow depth values at some IMS grids, and she asked me to add a staion ID as txxxxyyy, t is tile number for FV3 grid, 1-6, xxxx is the number for the longitude, and yyyy is number of latitude, e.g., C1152, 4608x2406 grids, xxxx from 0001 to 4608, yyyy 0001 to 2406, therefore, it is a 9-digit integer. In general, stid use string for ioda-converter. Are you thinking to use an integer instead?

Copy link
Author

@yuanxue2870 yuanxue2870 May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srherbener Yes, you are correct. Only strid is written out.
Thanks for your suggestion. IMS is gridded obs, which is different from conventional "in-situ" (i.e., point-scale) obs. We want to give each grid an ID because we want to do some further QCs afterwards. The way we set up each ID is by concatenating its tile number, x-coordinate number, and y-coordinate number -- so we have super big values for stid. For now, int32 is sufficient. By using int64, I am trying to leave some wiggle room for higher resolution (in the long run) which is likely with an increased concatenated ID based on the current code logic.
I am happy to adjust if defining 'stid' as an int64 takes too much memory. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srherbener, In future. you suggest to use an integer. but for now, If it is 9 digit string, do you know how slow it will be for a vector 6x4608x2406?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YoulongXia-NOAA the vector you mention is ~66 million entries. If we are reading that many locations into a single ObsSpace, then a variable length string data type on a variable (eg stationIdentification) in the file could be problematic. However, if it's the case where we need the 66 million values to cover all the possible grid cells, but we only read a subset of those grids (say 10's or 100's of thousands of locations) into a single ObsSpace then we should be fine.

Copy link
Author

@yuanxue2870 yuanxue2870 May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For C96, we output a total of 12252 location ID strings on December 01. The total number of land points in C96 is 18320, meaning ~67% of the total land points are being output in the IMS-IODA output. For C1152, we have a total number of land points of 2381853, similarly, say when 67% are being output on December 01 (for example), we will output 1595841 ID strings.

Copy link
Collaborator

@srherbener srherbener May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is not necessarily memory, rather it is runtime. When an hdf5/netcdf file has a large array (vector) of strings, and those strings are the variable length string datatype (which is what stationIdentification is), it can sometimes take a long time to read those strings into memory. When in memory the strings are stored as a vector in C++ and processing (compare, sort, conversion to DateTime objects, etc) those can also be slow, but not as bad as the file I/O.

I think we are in the gray zone for runtime performance with a string vector that contains a million elements. It may still be okay but if so it's probably getting close to where the runtime starts blowing up. It might make sense to:

  1. Leave stationIdentification as is for now and see how it performs
  2. If the runtime is too slow, then consider using a new integer variable, say "gridIdentification"

If this sounds like a good approach with everyone, you could merge this PR as is (using stationIdentification) and then do some profiling to decide if a numeric id is warranted (and add that in a subsequent PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you propose sounds good to me, thank you @srherbener for the suggestions and insight! (as always!)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Thanks for your help!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great. Happy to approve now.

@fcvdb fcvdb added the NWS NOAA National Weather Service label May 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NWS NOAA National Weather Service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add location IDs to IMS IODA files
7 participants