Gaea: Issues running C768 S2SW #3324

Open
JessicaMeixner-NOAA opened this issue Feb 13, 2025 · 12 comments
Labels
bug (Something isn't working), triage (Issues that are triage)

Comments

@JessicaMeixner-NOAA
Contributor

What is wrong?

Running C768 S2SW on Gaea produced the following messages:

/gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/c768t01/COMROOT/c768t01/logs/2019120300/

ariable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
2144: PE 2144: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
 404: PE 404: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3076: PE 3076: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
3772: PE 3772: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
1727: PE 1727: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
1727:
1918: PE 1918: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
1918:
2686: PE 2686: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
2686:
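
A minimal sketch for gauging how widespread the two warning types above are in a log. The function name `summarize_ofi_warnings` is hypothetical and not part of global-workflow; the match strings are copied from the messages above.

```shell
# Count the two kinds of MPICH OFI warnings in a forecast log.
summarize_ofi_warnings() {
  # $1: path to a forecast log file
  prefix='MPICH WARNING: OFI is failing to make progress on posting a'
  # grep -c prints the number of matching lines; "|| true" keeps a zero
  # count (grep exit status 1) from being treated as a failure.
  recv_count=$(grep -c "${prefix} receive" "$1" || true)
  send_count=$(grep -c "${prefix} send" "$1" || true)
  echo "receive warnings (completion queue exhaustion): ${recv_count}"
  echo "send warnings (rendezvous resource exhaustion): ${send_count}"
}
```

It could be called as `summarize_ofi_warnings gfs_fcst_seg0.log` from the log directory.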

In ufs-weather-model the GaeaC6 forecast job has the following environment variables:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L31-L34:

export FI_VERBS_PREFER_XRC=0
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=65536
export FI_CXI_RDZV_THRESHOLD=65536

Adding these to
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/GAEAC6.env#L203-L212

gets the job past this hang.
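
As a rough sketch of that change, the four settings could be grouped and applied only for the forecast step. The function name `set_gaeac6_fcst_network_env` and the `step` variable are assumptions for illustration and are not taken from the actual GAEAC6.env file; the values are the ones quoted from ufs-weather-model above.

```shell
# Hypothetical helper: export the network-related settings used by the
# ufs-weather-model GaeaC6 forecast job.
set_gaeac6_fcst_network_env() {
  export FI_VERBS_PREFER_XRC=0
  export FI_CXI_RX_MATCH_MODE=hybrid
  export COMEX_EAGER_THRESHOLD=65536
  export FI_CXI_RDZV_THRESHOLD=65536
}

# Assumption: "step" names the current job step; apply only to the forecast.
if [ "${step:-}" = "fcst" ]; then
  set_gaeac6_fcst_network_env
fi
```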

What should have happened?

C768 S2SW should work on Gaea C6

What machines are impacted?

All or N/A

What global-workflow hash are you using?

#3289

Steps to reproduce

Use the CI test here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ci/cases/hires/C768_S2SW.yaml

Additional information

There are additional variables in the ufs-weather-model https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L26C1-L28C19

export OMP_NUM_THREADS=@[THRD]
export OMP_STACKSIZE=1024M
export NC_BLKSZ=1M

These should potentially be added as well.

Do you have a proposed solution?

Add variables from ufs-weather-model to gaeac6 environment file for forecast.

@JessicaMeixner-NOAA added the bug and triage labels Feb 13, 2025
@JessicaMeixner-NOAA
Contributor Author

Note: I've gotten the job past the previous error, but initialization is taking about 30 minutes, so I cannot yet confirm whether this is the only fix needed.

@JessicaMeixner-NOAA
Contributor Author

I have run with the following environment variables:

    export OMP_STACKSIZE=1024M
    export NC_BLKSZ=1M
    export FI_VERBS_PREFER_XRC=0
    export FI_CXI_RX_MATCH_MODE=hybrid
    export COMEX_EAGER_THRESHOLD=65536
    export FI_CXI_RDZV_THRESHOLD=65536

But I am now getting a SIGTERM error.
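
To rule out the possibility that some of these settings never reached the job, a small check like the following could be run inside the job script just before the model is launched. This is only a sketch; `report_fcst_env` is a hypothetical name, and `printenv` only sees variables that have been exported.

```shell
# Print each expected variable with its value, or note that it is not set.
report_fcst_env() {
  for name in OMP_STACKSIZE NC_BLKSZ FI_VERBS_PREFER_XRC \
              FI_CXI_RX_MATCH_MODE COMEX_EAGER_THRESHOLD FI_CXI_RDZV_THRESHOLD; do
    # printenv exits non-zero when the variable is absent from the environment
    if value=$(printenv "$name"); then
      echo "${name}=${value}"
    else
      echo "${name} is not set"
    fi
  done
}
```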

@GeorgeVandenberghe-NOAA I believe you have successfully run C768/C1152 on Gaea. Do you have recommended environment variables and/or layouts for the components?

@GeorgeVandenberghe-NOAA

@JessicaMeixner-NOAA
Contributor Author

@GeorgeVandenberghe-NOAA - Thanks, the requested information is below:

source: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/global-workflow/sorc/ufs_model.fd
rundirectory: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/RUNDIRS/c768t01/gfs.2019120300/gfsfcst.2019120300/fcst.1315376
full log file: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/c768t01/COMROOT/c768t01/logs/2019120300/gfs_fcst_seg0.log

I am using ESMF threading. It's not timing out.

Write task info from model_configure:

write_groups:            4
write_tasks_per_group:   120
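
For reference, a small sketch that computes the total number of write tasks implied by these two settings, assuming the total is the product of the two values (4 x 120 = 480 here). The function name `total_write_tasks` is hypothetical, and the file is assumed to use the `key: value` layout shown above.

```shell
# Print write_groups * write_tasks_per_group from a model_configure-style file.
total_write_tasks() {
  # $1: path to the model_configure file
  awk '
    $1 == "write_groups:"          { groups = $2 }
    $1 == "write_tasks_per_group:" { per_group = $2 }
    END { print groups * per_group }
  ' "$1"
}
```

Running `total_write_tasks model_configure` in the run directory would print the product for that configuration.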

@GeorgeVandenberghe-NOAA

@JessicaMeixner-NOAA
Contributor Author

@GeorgeVandenberghe-NOAA which accounts do you have on Gaea? I can re-run in one of those spaces if I have access to it too.

@GeorgeVandenberghe-NOAA

@GeorgeVandenberghe-NOAA

@JessicaMeixner-NOAA
Contributor Author

@GeorgeVandenberghe-NOAA

I ran extra tests with esmf_threading turned off and with the extra environment variable from Hercules, and I am still getting an error.

I moved the log and run directory here: /gpfs/f6/ira-sti/world-shared/Jessica.Meixner/fcst.1315376

If you have a successful configuration for C768 or C1152 on Gaea that you could share, including the environment variables you used, that would be great.

@GeorgeVandenberghe-NOAA

@GeorgeVandenberghe-NOAA

@JessicaMeixner-NOAA
Contributor Author

@GeorgeVandenberghe-NOAA - The fix files are staged for you when using the global-workflow.
