Gaea: Issues running C768 S2SW #3324
Comments
Note: I've gotten the job past the previous error, but initialization is taking about 30 minutes, so I cannot yet confirm whether this is the only fix needed. |
I have run with the following environment variables:
export OMP_STACKSIZE=1024M
export NC_BLKSZ=1M
export FI_VERBS_PREFER_XRC=0
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=65536
export FI_CXI_RDZV_THRESHOLD=65536
But I am now getting a SIGTERM error. @GeorgeVandenberghe-NOAA I believe you have successfully run C768/C1152 on Gaea. Do you have recommended environment variables and/or layouts for the components? |
Are you getting anything besides SIGTERM? What is the number of ranks in
each I/O group? Are you running with ESMF threading?
Is it timing out?
Save the CWD, with the executable, and I will look at it; also tell me where
the source is.
--
George W Vandenberghe
*Lynker Technologies at * NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
***@***.***
301-683-3769(work) 3017751547(cell)
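One way to answer the "anything besides SIGTERM" question is to scan the forecast log for other failure strings. The following is a minimal sketch, not something taken from the thread: the function name `scan_log` and the keyword list are assumptions, and the keywords are not an exhaustive set of failure messages.

```shell
#!/bin/sh
# Hypothetical helper: list log lines that hint at a cause other than SIGTERM.
scan_log() {
    logfile=$1
    # -n prints line numbers, -i ignores case, -E enables alternation.
    # The keyword list is an assumption; extend it for your own logs.
    grep -n -i -E 'fatal|error|abort|out of memory|killed|time limit' "$logfile" \
        | grep -v -i 'sigterm'
}
```

It could be called as `scan_log gfs_fcst_seg0.log`; lines that only report the SIGTERM itself are filtered out so that earlier messages stand out.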
|
@GeorgeVandenberghe-NOAA - Thanks, the requested information is below:
source: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/global-workflow/sorc/ufs_model.fd
rundirectory: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/RUNDIRS/c768t01/gfs.2019120300/gfsfcst.2019120300/fcst.1315376
full log file: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/tr68t01/COMROOT/c768t01/logs/2019120300/gfs_fcst_seg0.log
I am using ESMF threading. It's not timing out.
Write task info from model_configure:
write_groups: 4
write_tasks_per_group: 120
|
I get permission denied at the Jessica.Meixner directory. There may be
others in the path that I can't see that also have this block.
|
@GeorgeVandenberghe-NOAA what accounts do you have on Gaea? I can re-run in one of those spaces if I have access to it too |
I think the permission is blocked at the Jessica.Meixner directory, not the
project directory.
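To pin down which directory in a path is blocking access, each component can be checked in turn. This is a rough sketch only: the function name `first_blocked` is made up for illustration, it handles absolute paths only, and it reports what the current user can see, which may differ from what the directory owner sees.

```shell
#!/bin/sh
# Hypothetical helper: walk an absolute path from the root and report the
# first component the current user cannot see or traverse.
first_blocked() {
    target=$1
    current=""
    old_ifs=$IFS
    IFS=/
    set -f            # disable globbing while the path is split
    set -- $target    # split the path into its components
    set +f
    IFS=$old_ifs
    for part in "$@"; do
        [ -z "$part" ] && continue
        current="$current/$part"
        if [ ! -e "$current" ]; then
            echo "missing or not visible: $current"
            return 1
        fi
        # Directories need search (execute) permission to be traversed.
        if [ -d "$current" ] && [ ! -x "$current" ]; then
            echo "no search permission: $current"
            return 1
        fi
    done
    echo "ok: $target"
}
```

Run by the person who gets "permission denied", `first_blocked` on the run directory path would show whether the block is at the user directory or higher up.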
|
*nggps_emc* is my only Gaea project
|
I ran extra tests with esmf_threading turned off and the extra environment variable from Hercules, and I am still getting an error. I moved the log and run directory here: /gpfs/f6/ira-sti/world-shared/Jessica.Meixner/fcst.1315376 If you have a successful configuration for C768 or C1152 on Gaea that you could share, that would be great, including the environment variables you used. |
I captured this and tried running it, and it got as far as
FATAL ERROR: in opening file
456: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/global-workflow/fix/am/global_slmask.t1534.3072.1536.grb
This is blocked at the Jessica.Meixner directory.
The 120 MPI ranks per I/O group is too small. I changed to 480 ranks and
one group rather than four to avoid having to edit ufs.configure.
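For reference, the arithmetic behind that change: four groups of 120 and one group of 480 both come to 480 write ranks, which is presumably why the task counts in ufs.configure did not need editing. A minimal sketch, with a function name invented for illustration:

```shell
#!/bin/sh
# Total write (I/O) ranks implied by a write-component layout,
# i.e. write_groups * write_tasks_per_group from model_configure.
total_write_ranks() {
    n_groups=$1
    tasks_per_group=$2
    echo $(( n_groups * tasks_per_group ))
}

# Original layout:  write_groups: 4, write_tasks_per_group: 120
# Modified layout:  write_groups: 1, write_tasks_per_group: 480
```

Both `total_write_ranks 4 120` and `total_write_ranks 1 480` print 480, so the total rank count for the job is the same under either layout.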
|
Also, where are these fix files on Gaea C6? And on Gaea C5 and
other systems?
|
@GeorgeVandenberghe-NOAA - The fix files are staged for you when using the global-workflow. |
What is wrong?
Running C768 S2SW on Gaea produced the following messages:
/gpfs/f6/ira-sti/scratch/Jessica.Meixner/try01/c768t01/COMROOT/c768t01/logs/2019120300/
In ufs-weather-model the GaeaC6 forecast job has the following environment variables:
https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L31-L34:
Adding these to
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/GAEAC6.env#L203-L212
has gotten past this hang issue.
What should have happened?
C768 S2SW should work on Gaea C6
What machines are impacted?
All or N/A
What global-workflow hash are you using?
#3289
Steps to reproduce
Use the CI test here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ci/cases/hires/C768_S2SW.yaml
Additional information
There are additional variables in the ufs-weather-model: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_gaeac6#L26C1-L28C19
These should potentially be added as well.
Do you have a proposed solution?
Add variables from ufs-weather-model to gaeac6 environment file for forecast.
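As a rough illustration of the proposal, the forecast branch of the environment file could export the variables when the job step is the forecast. This is a sketch under assumptions: the function name `set_fcst_env` and the step names `fcst` and `efcs` are guesses at how the file is organized, and the variable list is the one tried in the comments on this issue, which may not match the ufs-weather-model lines linked above. The thread reports a SIGTERM with these settings, so this is not a confirmed fix.

```shell
#!/bin/sh
# Sketch only: export the variables tried in this thread for forecast steps.
# The function name and step names are assumptions, not the actual
# contents of env/GAEAC6.env.
set_fcst_env() {
    step=$1
    case "$step" in
        fcst|efcs)
            export OMP_STACKSIZE=1024M
            export NC_BLKSZ=1M
            export FI_VERBS_PREFER_XRC=0
            export FI_CXI_RX_MATCH_MODE=hybrid
            export COMEX_EAGER_THRESHOLD=65536
            export FI_CXI_RDZV_THRESHOLD=65536
            ;;
    esac
}
```

Keeping the exports inside the forecast branch leaves other job steps with their existing environment.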