
BUFR2NETCDF bug for ADPUPA Prepbufr ---- the total number of data in the output is not consistent between tests with and without MPI #46

Open
emilyhcliu opened this issue Jan 16, 2025 · 7 comments

Comments

@emilyhcliu
Collaborator

emilyhcliu commented Jan 16, 2025

CASE: BUFR2NETCDF conversion for ADPUPA prepbufr, tested with MPI task counts 2, 4, 8, and without MPI.

input BUFR file: the prepbufr file (contains all subsets)
output NetCDF: ADPUPA subset from prepbufr in NetCDF format

Expectation: the output files from runs without MPI and with the various MPI configurations (2, 4, 8) should have the same size and the same number of data.

Symptom:

The number of obs in output:
MPI = 0  ----> 270600
MPI = 1  ----> 270600
MPI = 2 ----> 261786
MPI = 4 ----> 250079
MPI = 8 ---->  239868

We lose more data in the output as we increase the number of MPI tasks.
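
For reference, a quick way to reproduce the counts above is to read the size of the location dimension from each output file. This is a minimal sketch, assuming the netCDF4 Python module and an IODA-style Location dimension; the file names are illustrative, not necessarily the exact test outputs:

```python
# Count the number of locations in each bufr2netcdf output file.
# Assumes the netCDF4 python module and an IODA-style "Location" dimension;
# file names are illustrative.
from netCDF4 import Dataset

files = {
    "no MPI": "gdas.t00z.adpupa_prepbufr.tm00_0.nc",
    "MPI=2":  "gdas.t00z.adpupa_prepbufr.tm00_2.nc",
    "MPI=4":  "gdas.t00z.adpupa_prepbufr.tm00_4.nc",
    "MPI=8":  "gdas.t00z.adpupa_prepbufr.tm00_8.nc",
}

for label, path in files.items():
    with Dataset(path) as nc:
        print(f"{label}: {nc.dimensions['Location'].size} locations")
```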

Test Setup on HERA

ObsForge build: /scratch1/NCEPDEV/da/Emily.Liu/EMC-obsForge/obsForge_adpupa
        - IODA: feature/bufr_in_parallel
        - SPOC: feature/adpupa_prepbufr_NEW


Run directory: /scratch1/NCEPDEV/da/Emily.Liu/EMC-obsForge/run_adpupa_prepbufr
./run_encodeBufr bufr2netcdf 0
./run_encodeBufr bufr2netcdf 1
./run_encodeBufr bufr2netcdf 2

The output directory is ./testoutput/2021080100
   
@PraveenKumar-NOAA

PraveenKumar-NOAA commented Jan 22, 2025

It looks like a group_by field issue, as @nicholasesposito also encountered a similar one in his acft_profiles case: #40.

Looking into the details of the group_by query string for the ADPUPA prepbufr, which is ADPUPA/PRSLEVEL/CAT:

ADPUPA
Dimensioning Sub-paths:
3d */PRSLEVEL
2d int ADPUPA/PRSLEVEL/CAT

When I used the following for the query string:
ADPUPA/PRSLEVEL{1}/CAT
ADPUPA/PRSLEVEL{2}/CAT
ADPUPA/PRSLEVEL{3}/CAT

I got the following results, consistent across the different MPI configurations, but the number of observations/locations was much lower than the original number, 270600.

MPI = 0 239770 gdas.t00z.adpupa_prepbufr.tm00_0.nc ----> 1353
MPI = 2 239770 gdas.t00z.adpupa_prepbufr.tm00_2.nc ----> 1353
MPI = 4 239770 gdas.t00z.adpupa_prepbufr.tm00_4.nc ----> 1353
MPI = 8 239770 gdas.t00z.adpupa_prepbufr.tm00_8.nc ----> 1353
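
For what it's worth, here is a hedged illustration of why the count collapses to roughly one value per station when the replication index is fixed. This is plain Python, not the bufr-query code, and the level counts are made up; it only assumes that group_by produces one location per element of the grouped field:

```python
# Illustrative sketch only (not bufr-query itself).  group_by produces one
# location per element of the grouped field.  With the full path
# ADPUPA/PRSLEVEL/CAT the field has one value per pressure level, so every
# level becomes a location; with a fixed index such as ADPUPA/PRSLEVEL{1}/CAT
# the field has a single value per subset, so each station yields one location.
level_counts = [12, 14, 13]               # made-up level counts for three subsets

locations_full_path = sum(level_counts)    # analogue of ~270600
locations_fixed_index = len(level_counts)  # analogue of 1353 (~ number of stations)

print(locations_full_path, locations_fixed_index)  # 39 3
```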

@emilyhcliu
Collaborator Author


The number 1353 is likely the number of stations.

@rmclaren
Collaborator

I'm aware of the problem; I just haven't had a chance to look at it. I will need to do another round of bug fixing at some point.

@PraveenKumar-NOAA

@emilyhcliu @rmclaren FYI - SCRIPT2NETCDF has a similar issue, i.e., we lose more data in the output as we increase the number of MPI tasks.

I also confirm that there are no such issues with the BUFR_BACKEND and SCRIPT_BACKEND.

@rmclaren
Collaborator

rmclaren commented Feb 11, 2025

So I've figured out what is going on. Basically, this happens because the data is in the form of "jagged" arrays, which means the group_by vector size varies from subset to subset. The way group_by currently works, the data is normalized so that every subset that is read is inflated to the same dimensions, and then group_by is applied. For example:

Data Read Dimensions (per subset)

n, 12
n, 14
n, 13

Inflated (takes max of extra dimensions)

n, 14
n, 14
n, 14

The extra values are filled with missing values.

Then group_by is applied, giving the final dimensions:

n x 14
n x 14
n x 14

In the case where there are multiple MPI processes, the MAX value (e.g. 14) is not guaranteed to be the same on every process, so the first MPI process might have a max of, say, 10, and the second a max of 14. The end result is that the output might not be the same size when computing on multiple nodes.

These extra rows are just filler and don't tend to matter, so you can ignore them. I think a better (more correct) behavior might be to do the grouping before inflating the dimensions. There is probably not much point in adding these extra rows to the data (it just makes the dataset larger).
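
To make the mechanism concrete, here is a minimal sketch of the row-count arithmetic, assuming each process pads its subsets to its own local maximum before group_by flattens them. This is pure Python with made-up level counts, not the actual result-set code, and the per-subset leading dimension n is taken as 1 for simplicity:

```python
# Sketch of the mechanism described above (not the actual result-set code).
# Each subset's group_by field is jagged; before group_by is applied, every
# subset is inflated to the maximum level count seen by that process, with the
# extra slots filled by missing values.  When subsets are split across MPI
# ranks, each rank has its own local max, so the amount of filler (and the
# final row count) changes with the number of ranks.

# made-up per-subset level counts for six ADPUPA-like subsets
subset_levels = [12, 14, 13, 9, 11, 10]


def rows_after_inflation(levels_per_subset):
    """Rows produced once every subset is padded to the local max and flattened."""
    local_max = max(levels_per_subset)
    return local_max * len(levels_per_subset)


# Single process: one global max (14) applies to all six subsets.
print(rows_after_inflation(subset_levels))  # 6 * 14 = 84

# Two MPI ranks: each rank pads to its own local max, so the combined total shrinks.
rank0, rank1 = subset_levels[:3], subset_levels[3:]
print(rows_after_inflation(rank0) + rows_after_inflation(rank1))  # 42 + 33 = 75

# Grouping before inflating (the fix suggested above) would give the same
# count everywhere: one row per real level, with no filler.
print(sum(subset_levels))  # 69
```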

@PraveenKumar-NOAA

@rmclaren thank you for the clarification! Please let me know how to do the grouping before inflating the dimensions.

@rmclaren
Collaborator

@PraveenKumar-NOAA Nothing you can do, I'm afraid... I have to modify the result set class to do that.
