Add IMS stationIdentification to output #1654

yuanxue2870 · 2025-05-12T14:07:41Z

Description

Revise errsd from 40 to 80 to be consistent with offline workflow. Confirmed IMS errsd value via email exchanges on 05/12/2025.
Based on Youlong's recent PR in Add station IDs with txxxxyyyy for processed IMS snow DA NOAA-EMC/land-SCF_proc#1, now we need to output IMS station IDs (i.e., stationIdentification as a nine-digit string) to the ioda formatted output to be used for follow-up QC.

Issue(s) addressed

Resolves #1642

Dependencies

List the other PRs that this PR is dependent on:

Add station IDs with txxxxyyyy for processed IMS snow DA NOAA-EMC/land-SCF_proc#1

Impact

Expected impact on downstream repositories: No significant impacts are expected. More IMS QCs can be done based on this added IMS ID feature. The input and output to the ctest of "test_iodaconv_imsfv3grid_scf" needs to be changed, otherwise, the ctest will fail.

Checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have run the unit tests before creating the PR

yuanxue2870 · 2025-05-12T15:19:43Z

Please review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!

CoryMartin-NOAA · 2025-05-12T15:42:27Z

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

YoulongXia-NOAA · 2025-05-12T15:47:56Z

@yuanxue2870, test files include both input and output files. please keep consistency with ioda converters input and output filenames. Otherwise, you also need to modify .txt file to do the ioda-converter test.

yuanxue2870 · 2025-05-12T15:53:54Z

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

Thanks Youlong and Cory for your comments. I have already created the new test input and output files on Hera, also included their paths on this PR. The administrator can grab these new files and replace with the old ones directly (i.e., file names are consistent). Unless there is a public place that I can drop off these two files?

CoryMartin-NOAA · 2025-05-12T15:55:03Z

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

yuanxue2870 · 2025-05-12T15:56:15Z

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

Oh, I see. That is why it is different from other repos... Thanks for your clarification! I will do that. Convert to draft while uploading new test files...

…da-converters into Feature/add_IMSid

yuanxue2870 · 2025-05-12T16:07:09Z

Please re-review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!

YoulongXia-NOAA · 2025-05-12T17:51:24Z

src/land/imsfv3_scf2ioda.py


        lons = lons.astype('float32')
        lats = lats.astype('float32')
+        stid = stid.astype('int64')


stid = stid.astype('int32')

@yuanxue2870 I recalled that only datetime use the double precision. For station ID, single precision is sufficient.

Hi Youlong, thanks for your comments. For now, I do not think int32 or int64 would make a difference here as the "stid" is a nine-digit integer, and not being output to the IODA formatted as well. However, int32 will not be enough if we goes to higher resolution (e.g., C1152), where we may need a 11 to 12-digit integer, which will exceed int32 storage capacity. Hence, I would like to keep int64 here. - Thanks,Yuan

Okay, keep it as suggested, @yuanxue2870. I am fine with it but this is the first time I found that people uses int64 for station ID, a bit weird.

That is an optimal suggestion, @srherbener, thank you for your deep thoughts. sounds good with me.

YoulongXia-NOAA

I approved this PR.

srherbener

Thanks for this update!

srherbener · 2025-05-12T19:48:21Z

src/land/imsfv3_scf2ioda.py

@@ -112,6 +118,7 @@ def _read(self):
        self.varAttrs[('dateTime', metaDataName)]['_FillValue'] = long_missing_value
        self.outdata[('latitude', metaDataName)] = lats
        self.outdata[('longitude', metaDataName)] = lons
+        self.outdata[('stationIdentification', metaDataName)] = strstid


Am I understanding correctly that stid itself, as an integer, is not written to the output file. Rather it is converted to a string and the string value is written out in the stationIdentification value? If so, I think it's okay for stid to be a 64-bit integer.

Pardon my ignorance of this obs type, but I am curious why ~4 billion values is not enough unique station id values. If this obs type is going to have very large numbers of locations, we might want to consider using a numeric station id, or be very careful to use a character array instead of a variable length string, since the variable length string storage and access in the file is going to be very slow. Not for this PR but something to think about for the future.

@srherbener, as @ClaraDraper-NOAA would like to filter out some processed snow depth values at some IMS grids, and she asked me to add a staion ID as txxxxyyy, t is tile number for FV3 grid, 1-6, xxxx is the number for the longitude, and yyyy is number of latitude, e.g., C1152, 4608x2406 grids, xxxx from 0001 to 4608, yyyy 0001 to 2406, therefore, it is a 9-digit integer. In general, stid use string for ioda-converter. Are you thinking to use an integer instead?

@srherbener Yes, you are correct. Only strid is written out.
Thanks for your suggestion. IMS is gridded obs, which is different from conventional "in-situ" (i.e., point-scale) obs. We want to give each grid an ID because we want to do some further QCs afterwards. The way we set up each ID is by concatenating its tile number, x-coordinate number, and y-coordinate number -- so we have super big values for stid. For now, int32 is sufficient. By using int64, I am trying to leave some wiggle room for higher resolution (in the long run) which is likely with an increased concatenated ID based on the current code logic.
I am happy to adjust if defining 'stid' as an int64 takes too much memory. Thanks!

@srherbener, In future. you suggest to use an integer. but for now, If it is 9 digit string, do you know how slow it will be for a vector 6x4608x2406?

@YoulongXia-NOAA the vector you mention is ~66 million entries. If we are reading that many locations into a single ObsSpace, then a variable length string data type on a variable (eg stationIdentification) in the file could be problematic. However, if it's the case where we need the 66 million values to cover all the possible grid cells, but we only read a subset of those grids (say 10's or 100's of thousands of locations) into a single ObsSpace then we should be fine.

For C96, we output a total of 12252 location ID strings on December 01. The total number of land points in C96 is 18320, meaning ~67% of the total land points are being output in the IMS-IODA output. For C1152, we have a total number of land points of 2381853, similarly, say when 67% are being output on December 01 (for example), we will output 1595841 ID strings.

The issue is not necessarily memory, rather it is runtime. When an hdf5/netcdf file has a large array (vector) of strings, and those strings are the variable length string datatype (which is what stationIdentification is), it can sometimes take a long time to read those strings into memory. When in memory the strings are stored as a vector in C++ and processing (compare, sort, conversion to DateTime objects, etc) those can also be slow, but not as bad as the file I/O.

I think we are in the gray zone for runtime performance with a string vector that contains a million elements. It may still be okay but if so it's probably getting close to where the runtime starts blowing up. It might make sense to:

Leave stationIdentification as is for now and see how it performs

If the runtime is too slow, then consider using a new integer variable, say "gridIdentification"

If this sounds like a good approach with everyone, you could merge this PR as is (using stationIdentification) and then do some profiling to decide if a numeric id is warranted (and add that in a subsequent PR).

What you propose sounds good to me, thank you @srherbener for the suggestions and insight! (as always!)

Sounds good to me. Thanks for your help!

Sounds great. Happy to approve now.

yuanxue2870 and others added 3 commits May 12, 2025 13:47

change errsd from 40 to 80 based on emails on 05/12/2025

f820afb

add IMS stationIdentification as a string to the output

5b3540b

Merge branch 'develop' into Feature/add_IMSid

fb87f97

yuanxue2870 marked this pull request as ready for review May 12, 2025 15:19

yuanxue2870 requested a review from YoulongXia-NOAA May 12, 2025 15:20

yuanxue2870 marked this pull request as draft May 12, 2025 15:56

yuanxue2870 added 3 commits May 12, 2025 15:59

replace old ims test input file

0154166

Merge branch 'Feature/add_IMSid' of https://github.com/yuanxue2870/io…

ec3fbca

…da-converters into Feature/add_IMSid

replace old ims test output file

5db8cbb

yuanxue2870 marked this pull request as ready for review May 12, 2025 16:06

CoryMartin-NOAA approved these changes May 12, 2025

View reviewed changes

CoryMartin-NOAA requested review from jiaruidong2017, ClaraDraper-NOAA and BenjaminRuston May 12, 2025 16:57

jiaruidong2017 approved these changes May 12, 2025

View reviewed changes

YoulongXia-NOAA reviewed May 12, 2025

View reviewed changes

YoulongXia-NOAA approved these changes May 12, 2025

View reviewed changes

CoryMartin-NOAA requested a review from srherbener May 12, 2025 18:20

srherbener reviewed May 12, 2025

View reviewed changes

Merge branch 'develop' into Feature/add_IMSid

f7fa972

srherbener approved these changes May 13, 2025

View reviewed changes

fcvdb added the NWS NOAA National Weather Service label May 22, 2025

Add IMS stationIdentification to output #1654

Are you sure you want to change the base?

Add IMS stationIdentification to output #1654

Uh oh!

Conversation

yuanxue2870 commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Issue(s) addressed

Dependencies

Impact

Checklist

Uh oh!

yuanxue2870 commented May 12, 2025

Uh oh!

CoryMartin-NOAA commented May 12, 2025

Uh oh!

YoulongXia-NOAA commented May 12, 2025

Uh oh!

yuanxue2870 commented May 12, 2025

Uh oh!

CoryMartin-NOAA commented May 12, 2025

Uh oh!

yuanxue2870 commented May 12, 2025

Uh oh!

yuanxue2870 commented May 12, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

YoulongXia-NOAA left a comment

Choose a reason for hiding this comment

Uh oh!

srherbener left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuanxue2870 May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuanxue2870 May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srherbener May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yuanxue2870 commented May 12, 2025 •

edited

Loading

yuanxue2870 May 12, 2025 •

edited

Loading

yuanxue2870 May 13, 2025 •

edited

Loading

srherbener May 13, 2025 •

edited

Loading