Add restart capability for RTOFS model job runs more than 15 minutes (bugzilla #1132) #11

Open
DanIredell-NOAA opened this issue Mar 15, 2022 · 4 comments

Comments

@DanIredell-NOAA

http://www2.spa.ncep.noaa.gov/bugzilla/show_bug.cgi?id=1132

Please add restart capability for RTOFS model jobs that run more than 15 minutes at the next upgrade.

Please refer to https://www.nco.ncep.noaa.gov/idsb/implementation_standards/Implementation%20Standards%20v10.2.pdf (page 11, section "Minimize the time it takes to re-run a failed job").

@DanIredell-NOAA

DanIredell-NOAA commented Oct 27, 2023

The following jobs take more than 15 minutes (run times in minutes):

hycom_var - 45
analysis - 30
gzip00 - 17
forecast_step[12] - 107
*_grib2_post* - 25

The analysis and forecast_step[12] jobs already have some restart logic. I will need to test how well it works.

@DanIredell-NOAA

DanIredell-NOAA commented Nov 13, 2023

gzip mods documented here -- #42

@DanIredell-NOAA

Note that the latest implementation standards (v11.0, from January 2022) are here:
https://www.nco.ncep.noaa.gov/idsb/implementation_standards/ImplementationStandards.v11.0.0.pdf

@DanIredell-NOAA

DanIredell-NOAA commented Dec 4, 2023

Made modifications to make the analysis, forecast_step1 and forecast_step2 jobs "restartable". This required a few functionality changes:

  1. Upon job completion, check whether the job failed or succeeded.
    --- If it failed, copy the restart files to the restart directory (GESOUT).
    --- If it succeeded, copy the restart files to COMOUT.
    --- Proceed as normal (copy archives to COMOUT) and exit with the appropriate message.
  2. Upon job start, check whether restart files are in GESOUT.
    --- If they exist, use them as the restart files.
  3. Add the capability to write restart files from both entities (hycom and cice) at a specified frequency.
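
For readers who want to see the control flow of steps 1 and 2 in one place, here is a minimal Python sketch. It is not the code in the ex-scripts, which are shell. The function names, the `restart*` file pattern, and the `default_dir` fallback are all hypothetical; only the GESOUT/COMOUT roles come from the description above.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def find_restart_files(directory: Path, pattern: str = "restart*") -> list[Path]:
    """Return restart files in `directory`, sorted by name (empty if none).

    The "restart*" pattern is an assumption made for this sketch.
    """
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def choose_restart_source(gesout: Path, default_dir: Path) -> Path:
    """Job start: prefer restart files left in GESOUT by a failed run."""
    return gesout if find_restart_files(gesout) else default_dir


def save_restart_files(workdir: Path, succeeded: bool,
                       gesout: Path, comout: Path) -> Path:
    """Job completion: copy restart files to COMOUT on success, GESOUT on failure."""
    target = comout if succeeded else gesout
    target.mkdir(parents=True, exist_ok=True)
    for path in find_restart_files(workdir):
        shutil.copy2(path, target / path.name)
    return target
```

A failed run would call `save_restart_files(workdir, False, gesout, comout)`, and the rerun would then get `gesout` back from `choose_restart_source`.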

This also changed the granularity of the possible restart times. Previously the date was YYYYMMDD; with this capability it is now YYYYMMDDHH. This fundamental change required many small mods to the startdate and enddate macros.
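
To illustrate what the finer granularity allows, the following Python sketch parses both stamp formats and lists candidate restart times at a fixed interval. It is only an illustration under stated assumptions: the function names are hypothetical, reading a legacy YYYYMMDD stamp as hour 00 is an assumption, and the real startdate/enddate handling lives in the shell scripts and Fortran sources listed below.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def parse_stamp(stamp: str) -> datetime:
    """Parse YYYYMMDDHH; a legacy YYYYMMDD stamp is read as hour 00 (assumption)."""
    if len(stamp) == 8:
        return datetime.strptime(stamp, "%Y%m%d")
    if len(stamp) == 10:
        return datetime.strptime(stamp, "%Y%m%d%H")
    raise ValueError(f"expected YYYYMMDD or YYYYMMDDHH, got {stamp!r}")


def restart_stamps(startdate: str, enddate: str, every_hours: int) -> list[str]:
    """List YYYYMMDDHH stamps from startdate to enddate (inclusive) at a fixed interval."""
    if every_hours <= 0:
        raise ValueError("every_hours must be positive")
    current, end = parse_stamp(startdate), parse_stamp(enddate)
    stamps = []
    while current <= end:
        stamps.append(current.strftime("%Y%m%d%H"))
        current += timedelta(hours=every_hours)
    return stamps
```

For example, `restart_stamps("2023120400", "2023120500", 12)` yields three stamps, one every 12 hours, which a day-only date could not represent.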

Major changes were made to the following files:
scripts/exrtofs_glo_analysis.sh
scripts/exrtofs_glo_forecast.sh

Moderate changes were made to the following files:
jobs/JRTOFS_GLO_ANALYSIS
jobs/JRTOFS_GLO_FORECAST_STEP1
jobs/JRTOFS_GLO_FORECAST_STEP2
scripts/exrtofs_glo_incup.sh
ush/rtofs_tmp2com.sh
sorc/rtofs_hycom.fd/src_2.2.99DHMTi-dist2B_relo_cice_v4.0e/hycom/mod_hycom.F (for testing only)
sorc/rtofs_hycom.fd/src_2.2.99DHMTi-dist2B_relo_cice_v4.0e/hycom/restart.f
sorc/rtofs_hycom.fd/src_2.2.99DHMTi-dist2B_relo_cice_v4.0e/source/ice_calendar.F90

Minor changes to the following files:
jobs/JRTOFS_GLO_ANALYSIS_GRIB2_POST
jobs/JRTOFS_GLO_FORECAST_POST
jobs/JRTOFS_GLO_FORECAST_GRIB2_POST
jobs/JRTOFS_GLO_FORECAST_POST_2
jobs/JRTOFS_GLO_FORECAST_STEP1_PRE
jobs/JRTOFS_GLO_FORECAST_STEP2_PRE
jobs/JRTOFS_GLO_INCUP
scripts/exrtofs_glo_analysis_pre.sh
scripts/exrtofs_glo_forecast_pre.sh
scripts/exrtofs_glo_grib2_post.sh
scripts/exrtofs_glo_post.sh
scripts/exrtofs_glo_post_2.sh
ush/rtofs_prestaging.sh
ush/rtofs_runstaging.sh
ush/rtofs_submit.sh
parm/rtofs_glo.navy_0.08.anal.ice_in
parm/rtofs_glo.navy_0.08.fcst.blkdat.input
parm/rtofs_glo.navy_0.08.fcst.ice_in
