Bug 1535 - Add the NAEFS jobs restart capability for file alerts #26

Open
BoCui-NOAA opened this issue Feb 14, 2025 · 0 comments

BoCui-NOAA commented Feb 14, 2025

The current WCOSS implementation standard requires a restart capability for any job that runs longer than 15 minutes, in order to:

  • save time when recovering from a failure;
  • keep the time at which our end users receive data as consistent as possible;
  • keep the load on dbnet, and on the outgoing networks, to a minimum.

The NAEFS jobs, however, alert a very large number of files within a short runtime, as listed below. We would therefore like a threshold standard for file alerts that also triggers the need for restart capability, rather than relying on the 15-minute runtime alone.

Please add a restart capability for file alerts in the next NAEFS upgrade, so that:

  • when a NAEFS job is rerun after a failure, the scripts check for output data files that already exist from the previous run and do not alert them again;
  • the scripts likewise check for, and skip processing/generating, output data files that already exist from the previous run, especially in the gempak scripts/jobs.
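One way the two requested checks could work is sketched below in Python for illustration only; the NAEFS scripts themselves are not shown in this issue, and a real change would live in those scripts. The names `alert_new_files`, `needs_generation`, the restart log file, and the `send_alert` callable (a stand-in for whatever command actually issues the alert) are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Set


def load_alerted(log_path: Path) -> Set[str]:
    """Return the paths recorded in the restart log by a previous run."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def alert_new_files(files: Iterable[Path], log_path: Path,
                    send_alert: Callable[[Path], None]) -> List[Path]:
    """Alert only files that exist and were not alerted by a previous run.

    `send_alert` is a hypothetical callable standing in for the real
    alert command; `log_path` is a hypothetical per-cycle restart log.
    """
    already = load_alerted(log_path)
    sent: List[Path] = []
    with log_path.open("a") as log:
        for f in files:
            if str(f) in already or not f.exists():
                continue  # alerted earlier, or not produced yet
            send_alert(f)
            # Record immediately, so a failure mid-loop costs at most
            # one repeated alert on the next rerun.
            log.write(f"{f}\n")
            log.flush()
            already.add(str(f))
            sent.append(f)
    return sent


def needs_generation(output: Path) -> bool:
    """True when the output is missing or empty and must be produced."""
    return not output.exists() or output.stat().st_size == 0
```

On a rerun, the job would call `needs_generation` before each processing step and `alert_new_files` with the same log path as the failed run, so only the remaining files are generated and alerted. Treating an empty file as missing is an assumption; a stricter check (size or checksum) may be needed if a failed run can leave truncated output behind.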

NAEFS v7.0 job runtime and file alerts:

job                        runtime (min)   alerts
naefs_gefs_prob_avgspr               1.1     1440
naefs_fnmoc_ens_gempak              10.8     1940
naefs_cmc_ens_gempak                 6.4     2134
naefs_cmc_ens_post                   7.9     4462
naefs_gefs_debias_gempak             8.8     6136
naefs_gefs_debias                  164.3     9172   (rerun ~13 min without wait/sleep)
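To show why a runtime-only rule misses most of these jobs, the figures above can be run through a combined check, sketched here in Python. The 15-minute limit comes from the existing requirement; the alert limit of 1000 is a purely hypothetical placeholder, since this issue does not propose a value.

```python
from typing import NamedTuple

RUNTIME_LIMIT_MIN = 15.0  # existing WCOSS runtime criterion
ALERT_LIMIT = 1000        # hypothetical placeholder, not a proposed standard


class Job(NamedTuple):
    name: str
    runtime_min: float
    alerts: int


# Values copied from the NAEFS v7.0 listing above.
JOBS = [
    Job("naefs_gefs_prob_avgspr", 1.1, 1440),
    Job("naefs_fnmoc_ens_gempak", 10.8, 1940),
    Job("naefs_cmc_ens_gempak", 6.4, 2134),
    Job("naefs_cmc_ens_post", 7.9, 4462),
    Job("naefs_gefs_debias_gempak", 8.8, 6136),
    Job("naefs_gefs_debias", 164.3, 9172),
]


def needs_restart(job: Job,
                  runtime_limit: float = RUNTIME_LIMIT_MIN,
                  alert_limit: int = ALERT_LIMIT) -> bool:
    """A job qualifies if it exceeds either the runtime or the alert limit."""
    return job.runtime_min > runtime_limit or job.alerts > alert_limit


by_runtime_only = [j.name for j in JOBS if j.runtime_min > RUNTIME_LIMIT_MIN]
by_either = [j.name for j in JOBS if needs_restart(j)]

print("runtime rule only:", by_runtime_only)
print("runtime or alert rule:", by_either)
```

Under the runtime criterion alone only naefs_gefs_debias qualifies, even though every job listed alerts more than a thousand files; where the alert threshold should actually sit is the open question of this issue.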
