Segmentation Fault upon Restart #14223

Open
kasrash opened this issue Feb 18, 2025 · 14 comments
@kasrash
kasrash commented Feb 18, 2025

Dear FDS community,

I need to use restarts to run multiple large models in an HPC environment with limited wall-clock time. However, the restart is not working as expected. As a test case, I tried the following steps with the attached input files on 32 CPUs:

  1. Change T_END to 2, RESTART=F, and DT_RESTART=2
  2. Let the model run to generate the files
  3. Change T_END to 4 and RESTART=T
  4. Upon running this model, I get the error:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.31.s  00001457EC205420  Unknown               Unknown  Unknown
fds                0000000006B2D408  Unknown               Unknown  Unknown
fds                0000000006B121A0  Unknown               Unknown  Unknown
fds                0000000006AB8148  Unknown               Unknown  Unknown
fds                000000000040A71D  Unknown               Unknown  Unknown
libc-2.31.so       00001457EA2C6083  __libc_start_main     Unknown  Unknown
fds                000000000040A636  Unknown               Unknown  Unknown
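
The steps above can be sketched as the following input-file changes. This is a minimal sketch that shows only the lines touched by the test and assumes each namelist group carries no other parameters; in the attached input files these lines may hold additional settings that should be kept. The second run is assumed to use the same CHID and working directory as the first, so that the restart files written by the first run are found. Only one set of TIME and MISC lines should be present in the file at a time.

```text
First run: stop at 2 s and write restart files every 2 s.
&TIME T_END=2. /
&MISC RESTART=F /
&DUMP DT_RESTART=2. /

Second run: replace the TIME and MISC lines above with these two.
&TIME T_END=4. /
&MISC RESTART=T /
```

The explanatory lines sit outside any namelist group, which FDS ignores when reading the input file.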

I am using the latest pre-compiled FDS executables on an Ubuntu 20.04 HPC system. I have also tried self-compiled executables built with GCC and OpenMPI on a SUSE HPC system, which resulted in the error:

> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> 
> Backtrace for this error:
> 
> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> 
> Backtrace for this error:
> 
> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> 
> Backtrace for this error:
> 
> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> 
> Backtrace for this error:
> #0  0x14a0ec32bd4f in ???
>         at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
> #1  0x721f1b in ???
> #2  0x72baac in ???
> #3  0x72e47b in ???
> #4  0x941edb in ???
> #5  0x403a2c in ???
> #6  0x14a0ec31629c in __libc_start_main
>         at ../csu/libc-start.c:308
> #7  0x403a59 in ???
>         at ../sysdeps/x86_64/start.S:120
> #8  0xffffffffffffffff in ???
> #0  0x154c87e16d4f in ???
>         at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
> #1  0x721f1b in ???
> #2  0x72baac in ???
> #3  0x72e47b in ???
> #4  0x941edb in ???
> #5  0x403a2c in ???
> #6  0x154c87e0129c in __libc_start_main
>         at ../csu/libc-start.c:308
> #7  0x403a59 in ???
>         at ../sysdeps/x86_64/start.S:120
> #8  0xffffffffffffffff in ???
> #0  0x14e954960d4f in ???
>         at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
> #1  0x721f1b in ???
> #2  0x72baac in ???
> #3  0x72e47b in ???
> #4  0x941edb in ???
> #5  0x403a2c in ???
> #6  0x14e95494b29c in __libc_start_main
>         at ../csu/libc-start.c:308
> #7  0x403a59 in ???
>         at ../sysdeps/x86_64/start.S:120
> #8  0xffffffffffffffff in ???
> #0  0x14e72a107d4f in ???
>         at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
> #1  0x721f1b in ???
> #2  0x72baac in ???
> #3  0x72e47b in ???
> #4  0x941edb in ???
> #5  0x403a2c in ???
> #6  0x14e72a0f229c in __libc_start_main
>         at ../csu/libc-start.c:308
> #7  0x403a59 in ???
>         at ../sysdeps/x86_64/start.S:120
> #8  0xffffffffffffffff in ???

I would be grateful for any suggestions about the likely cause and how to solve this issue.

Many thanks in advance!

vegetation_model.txt

InputFile_MeshAdjust32.txt

@marcosvanella
Contributor

I'll try it.

marcosvanella self-assigned this Feb 18, 2025
@marcosvanella
Contributor

@kasrash clean up your input files as much as possible and run your test with the latest nightly bundle:

https://github.com/firemodels/test_bundles/releases/tag/FDS_TEST

I'm getting an error

ERROR(375): OBST OBST-1 is VARIABLE_THICKNESS or HT3D and needs a MATL_ID.

when trying to do the first run to T_END=2.

@kasrash
Author

kasrash commented Feb 18, 2025

Thank you for taking the time to check this and sorry for the messy inputs. I have cleaned them. The attached inputs work with the latest release version (link).

I am not able to run the latest pre-compiled nightly build because it cannot find "libimf.so", so I need a bit of time to figure this out. In the meantime, I am sharing the input files here in case you would like to try them before I resolve the executable error in the nightly build.

Thanks!

InputFile_MeshAdjust32.txt
vegetation_model.txt

@kasrash
Author

kasrash commented Feb 19, 2025

@marcosvanella I was able to test the latest nightly build (gaf6fbf1). Based on the updated User's Guide, I had to change the OBST slightly to meet the new requirements and get past the error you encountered. The restart error still persists, but the output now contains a bit more information. The main error is:

> ERROR(300): N_LAYER_CELLS_MAX should be at least    71 for vegetation (MPI Process: 17, CHID: Ember_Ignition_CaseII_cat)
> 
> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

I have also attached the full log below together with the inputs I used. If it is important, I compiled the build myself using GCC+OpenMPI as I could not use the pre-compiled executables you provided.

outerr.txt
InputFile_MeshAdjust32.txt
vegetation_model.txt

Update: I was able to isolate the source of the error. Setting VARIABLE_THICKNESS=T causes it. Looking at the source code, only one function can raise ERROR(300), and it is related to remeshing the solid phase. It seems that, for some reason, the variable N_LAYER_CELLS_MAX is not assigned during the restart run. I also tried setting it in the input file, with no luck. I would be very grateful if you could provide a solution, as I need VARIABLE_THICKNESS for my models. Even a quick-and-dirty fix for now would be much appreciated, as my time is limited.

Thank you so much in advance!

@marcosvanella
Contributor

@kasrash I was able to reproduce your error at restart:

 Job ID string    : Ember_Ignition_CaseII_cat

forrtl: severe (408): fort: (2): Subscript #1 of the array N_LAYER_CELLS has value 2 which is greater than the upper bound of 1

Image              PC                Routine            Line        Source
fds_impi_intel_li  000000000061E0E1  pack_boundary_one        4012  func.f90
fds_impi_intel_li  0000000000614102  pack_wall                3806  func.f90
fds_impi_intel_li  0000000002AC5F9F  read_restart             3800  dump.f90
fds_impi_intel_li  0000000003176586  fds                       373  main.f90
fds_impi_intel_li  0000000000408D6D  Unknown               Unknown  Unknown
libc.so.6          0000145947A295D0  Unknown               Unknown  Unknown
libc.so.6          0000145947A29680  __libc_start_main     Unknown  Unknown
fds_impi_intel_li  0000000000408C85  Unknown               Unknown  Unknown

we'll take a closer look.

@marcosvanella
Contributor

@kasrash as per the user guide (section 8.4.5), remove all material and extra THICKNESS information from your &SURF line for 'vegetation', so that it reads:

&SURF ID                         = 'vegetation'
      TMP_INNER                  = 600
      TMP_FRONT_INITIAL          = 600
      VARIABLE_THICKNESS         = T
      COLOR                       = "BLACK"
 /

Then try the case. There seems to be an issue restarting VARIABLE_THICKNESS = T cases when the material information/THICKNESS is provided both in the SURF line and in the OBST line. This is related to the internal gridding in the solid object. This small case demonstrates the problem:

&HEAD CHID='test', TITLE='VARIABLE_THICKNESS Restart Test Case' /
&MESH IJK=20,20,20, XB=-0.025,0.025,-0.025,0.025,0.0,0.05 /
&TIME T_END=0.1, WALL_INCREMENT=1 /
&MISC RESTART=F / 

# Materials:
&MATL ID                      = 'CHAR'
      DENSITY                 = 105.
      CONDUCTIVITY            = 0.065
      SPECIFIC_HEAT           = 1.5 /

&SURF ID                         = 'vegetation'
      MATL_ID(1,:)                = 'CHAR'
      THICKNESS                  = 0.00125
      VARIABLE_THICKNESS         = T /

&OBST XB=-0.0127,0.0127,-0.003175,0.003175,0.00125,0.0076, SURF_ID='vegetation', MATL_ID='CHAR', MATL_MASS_FRACTION=1. /

&DUMP DT_RESTART=0.1 /
&TAIL /

Run the case with the debug target to completion at T_END=0.1. Then set T_END=0.2 and RESTART=T. Run the case again and you will see the following error:

forrtl: severe (408): fort: (2): Subscript #1 of the array N_LAYER_CELLS has value 2 which is greater than the upper bound of 1

Image              PC                Routine            Line        Source             
fds_impi_intel_li  000000000061E0E1  pack_boundary_one        4019  func.f90
fds_impi_intel_li  0000000000614102  pack_wall                3808  func.f90
fds_impi_intel_li  0000000002AC68E3  read_restart             3805  dump.f90
fds_impi_intel_li  0000000003176EC6  fds                       373  main.f90
fds_impi_intel_li  0000000000408D6D  Unknown               Unknown  Unknown
libc.so.6          000014FB7EE295D0  Unknown               Unknown  Unknown
libc.so.6          000014FB7EE29680  __libc_start_main     Unknown  Unknown
fds_impi_intel_li  0000000000408C85  Unknown               Unknown  Unknown

If we take out the lines:

      MATL_ID(1,:)                = 'CHAR'
      THICKNESS                  = 0.00125

from the 'vegetation' SURF and redo this exercise, the code restarts correctly.
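
A minimal sketch of that workaround applied to the small test case: the 'vegetation' SURF line keeps only VARIABLE_THICKNESS, and the material is specified on the OBST line alone. The remaining lines of the test case are assumed unchanged.

```text
&SURF ID                 = 'vegetation'
      VARIABLE_THICKNESS = T /

&OBST XB=-0.0127,0.0127,-0.003175,0.003175,0.00125,0.0076, SURF_ID='vegetation', MATL_ID='CHAR', MATL_MASS_FRACTION=1. /
```

With this layout, the solid's thickness and material come only from the obstruction, so the SURF line no longer carries a second, conflicting layer definition into the restart.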

@kasrash
Author

kasrash commented Feb 20, 2025

Thank you so much for your quick reply and for working on this! I will try it. I do wonder, though, what effect these changes will have on the results. I hope they do not change them.

I will update here shortly to either close the issue, or see what else can be done.

@marcosvanella
Contributor

@kasrash See that the TEMP inside the solid and in the surface is set correctly.

@mcgratta check the simple case above.

mcgratta added a commit that referenced this issue Feb 20, 2025
FDS Source: Issue #14223. Check bounds of arrays on restart
@mcgratta
Contributor

I added some logic to check array bounds. However, in this case, you should not have a THICKNESS on a line with VARIABLE_THICKNESS.

@kasrash
Author

kasrash commented Feb 20, 2025

@marcosvanella @mcgratta Thank you again for looking into this. However, something else confusing has happened. Setting MATL_ID in the OBST rather than in the SURF causes the results to change drastically. The reaction is significantly slower when MATL_ID is added to the OBST, and it is not sensitive to the kinetic parameters (A and E). In the nightly build, not defining the material in the OBST causes an error, but it does not cause an error in the release version. Could you please help me understand why my results differ when MATL_ID is defined in the OBST, and why this definition is enforced in the nightly build?

@mcgratta Could you please elaborate on why I should not use THICKNESS together with VARIABLE_THICKNESS?

@mcgratta
Contributor

VARIABLE_THICKNESS means that the thickness of the solid is taken directly from the underlying obstruction to which it is applied. Therefore, specifying a THICKNESS is not needed and can potentially cause a bug.

@mcgratta
Contributor

As for your other question -- create a simple, clean, short input file that demonstrates the problem. I am not sure what you are trying to do and it would take me hours to unravel it all.

@kasrash
Author

kasrash commented Feb 20, 2025

@mcgratta Thank you. I will try to come up with that simple case. One more question: I could not understand what @marcosvanella meant by "See that the TEMP inside the solid and in the surface is set correctly." Could you please elaborate on this, so I can check it before creating the simple test case and taking up more of your time?

Thanks a lot!

@mcgratta
Contributor

I think he just meant that you should make sure the calculation is working before the restart. That is, do your temperatures make sense?
