Can Caliper run on a 384 core x86_64 host? #678

drmichaeltcvx · 2025-08-05T16:26:01Z

drmichaeltcvx
Aug 5, 2025

Hello, I am trying to use GEOS (https://github.com/GEOS-DEV/GEOS) with Caliper enabled on a 384 core AMD MI300C unit. The code core dumps as soon as it starts executing the binary ("geos"). When I don't request Caliper output (no "-t ...") the code runs and terminates OK.

Does Caliper have any limitations on the number of h/w cores on a node?

Thanks
Michael

daboehme · 2025-08-05T20:12:23Z

daboehme
Aug 5, 2025
Maintainer

Hi @drmichaeltcvx, thanks for the report. There are no explicit limits on the number of threads/cores etc. It'll allocate some resources on all threads that have Caliper annotations (maybe 2-3MiB per thread) so that could add up, but it shouldn't bring down the app unless you're already maxing out the available memory. Another thing is that Caliper needs to be initialized on the main thread, so if there's a race where a sub-thread initializes Caliper first it can segfault. A fix is to call cali_init() somewhere early on in main. If possible can you try to get a stack trace and/or see how far it gets with CALI_LOG_VERBOSITY=2 set as an environment variable?
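
To put the per-thread figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 2-3 MiB estimate above applies uniformly and that all 384 hardware threads carry Caliper annotations; the function name is made up for illustration and is not part of Caliper.

```python
MIB_PER_GIB = 1024


def caliper_thread_overhead_mib(n_threads, per_thread_mib=(2, 3)):
    """Return the (low, high) estimated Caliper overhead in MiB.

    per_thread_mib is the rough 2-3 MiB per annotated thread quoted above;
    it is an estimate, not a measured value.
    """
    low, high = per_thread_mib
    return n_threads * low, n_threads * high


# Assumption: every one of the 384 cores runs one annotated thread.
low, high = caliper_thread_overhead_mib(384)
print(f"{low} - {high} MiB ({low / MIB_PER_GIB:.3f} - {high / MIB_PER_GIB:.3f} GiB)")
```

That works out to roughly 0.75 to 1.1 GiB in total, which is why the thread count alone is unlikely to be the problem unless memory is already nearly exhausted.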

0 replies

drmichaeltcvx · 2025-08-06T03:52:31Z

drmichaeltcvx
Aug 6, 2025
Author

Hi, here is a stack trace using options -t runtime-report,max_column_width=200,calc.inclusive,mpi-report. Without these options the code terminates OK.

Thanks!

...
GEOS version: 1.1.0 (feature/paludettomag1/physicsScaling, sha1: 972f3a7d3)
  - c++ compiler: gcc 11.5.0
  - openmp version: 201511
  - MPI version: Open MPI v4.1.7rc1, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.7rc1, repo rev: v4.1.5-176-g6d9519e4c3, Unreleased developer copy
  - HDF5 version: 1.12.1
  - Conduit version: 0.9.2
  - VTK version: 9.4.2
  - RAJA version: 2025.3.0
  - umpire version: 2025.03.0
  - chai version: 2025.3.0
  - adiak version: ..
  - caliper version: ..
  - METIS version: 5.1.0
  - PARAMETIS version: 4.0.0
  - scotch version: 7.0.7
  - superlu_dist version: 6.3.0
  - suitesparse version: 5.7.9
  - hypre version: 2.33.0
  - Python3 version: 3.10.5
Started at 2025-08-06 03:44:23.231506084
Received signal 11: Segmentation fault

** StackTrace of 16 frames **
Frame 0: /lib64/libc.so.6
Frame 1: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 2: update_all_library_gots
Frame 3: gotcha_wrap
Frame 4: gotcha_wrap
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&)
Frame 6: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 7: cali::services::register_configured_services(cali::Caliper*, cali::Channel*)
Frame 8: cali::Caliper::create_channel(char const*, cali::RuntimeConfig const&)
Frame 9: cali::ChannelController::create()
Frame 10: cali::ChannelController::start()
Frame 11: cali::ConfigManager::start()
Frame 12: geos::GeosxState::GeosxState(std::unique_ptr<geos::CommandLineOptions, std::default_delete<geos::CommandLineOptions> >&&)
Frame 13: main
Frame 14: /lib64/libc.so.6
Frame 15: __libc_start_main
Frame 16: _start
=====

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 16 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Received signal 11: Segmentation fault

** StackTrace of 16 frames **
Frame 0: /lib64/libc.so.6
Frame 1: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 2: update_all_library_gots
Frame 3: gotcha_wrap
Frame 4: gotcha_wrap
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&)
Frame 6: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 7: cali::services::register_configured_services(cali::Caliper*, cali::Channel*)
Frame 8: cali::Caliper::create_channel(char const*, cali::RuntimeConfig const&)
Frame 9: cali::ChannelController::create()
Frame 10: cali::ChannelController::start()
Frame 11: cali::ConfigManager::start()
Frame 12: geos::GeosxState::GeosxState(std::unique_ptr<geos::CommandLineOptions, std::default_delete<geos::CommandLineOptions> >&&)
Frame 13: main
Frame 14: /lib64/libc.so.6
Frame 15: __libc_start_main
Frame 16: _start
=====

Received signal 11: Segmentation fault
...

0 replies

drmichaeltcvx · 2025-08-06T19:39:05Z

drmichaeltcvx
Aug 6, 2025
Author

Hello, the target unit is an AMD MI300C with 384 Zen4 cores and 501GiB of HBM3 memory. This will be an exclusive Azure unit (SKU) called HBv5. The core count is insane but it is expected to be a great candidate for certain HPC workloads.

Can you please take a look at what may be triggering the SIGSEGV? Could it be the number of MPI callers within a single node?

I missed running the code with CALI_LOG_VERBOSITY=2 which I'll do next.

0 replies

drmichaeltcvx · 2025-08-06T19:45:27Z

drmichaeltcvx
Aug 6, 2025
Author

Messages with CALI_LOG_VERBOSITY=2

...
Num ranks: 384
Max threads: 1
GEOS version: 1.1.0 (feature/paludettomag1/physicsScaling, sha1: 972f3a7d3)
  - c++ compiler: gcc 11.5.0
  - openmp version: 201511
  - MPI version: Open MPI v4.1.7rc1, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.7rc1, repo rev: v4.1.5-176-g6d9519e4c3, Unreleased developer copy
  - HDF5 version: 1.12.1
  - Conduit version: 0.9.2
  - VTK version: 9.4.2
  - RAJA version: 2025.3.0
  - umpire version: 2025.03.0
  - chai version: 2025.3.0
  - adiak version: ..
  - caliper version: ..
  - METIS version: 5.1.0
  - PARAMETIS version: 4.0.0
  - scotch version: 7.0.7
  - superlu_dist version: 6.3.0
  - suitesparse version: 5.7.9
  - hypre version: 2.33.0
  - Python3 version: 3.10.5
Started at 2025-08-06 19:42:35.706938551
== CALIPER: (0): Available services: adiak_export,adiak_import,aggregate,alloc,cpuinfo,debug,env,event,io,kokkoslookup,kokkostime,loop_monitor,loop_statistics,memstat,mpi,mpiflush,mpireport,pthread,recorder,region_monitor,report,statistics,sysalloc,textlog,timer,timeseries,timestamp,trace,validator
== CALIPER: (0): Initialized
== CALIPER: (0): No manual config specified, disabling default channel
== CALIPER: (0): Creating channel runtime-report
== CALIPER: (0): runtime-report: Registered aggregation service
== CALIPER: (0): runtime-report: event: Using region level 0
== CALIPER: (0): runtime-report: event: Marked attribute comm.region
== CALIPER: (0): runtime-report: event: Marked attribute loop
== CALIPER: (0): runtime-report: event: Marked attribute phase
== CALIPER: (0): runtime-report: event: Marked attribute region
== CALIPER: (0): runtime-report: Registered event trigger service
== CALIPER: (0): runtime-report: mpiwrap: Using GOTCHA wrappers.
Received signal 11: Segmentation fault

** StackTrace of 16 frames **
Frame 0: /lib64/libc.so.6
Frame 1: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 2: update_all_library_gots
Frame 3: gotcha_wrap
Frame 4: gotcha_wrap
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&)
Frame 6: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 7: cali::services::register_configured_services(cali::Caliper*, cali::Channel*)
Frame 8: cali::Caliper::create_channel(char const*, cali::RuntimeConfig const&)
Frame 9: cali::ChannelController::create()
Frame 10: cali::ChannelController::start()
Frame 11: cali::ConfigManager::start()
Frame 12: geos::GeosxState::GeosxState(std::unique_ptr<geos::CommandLineOptions, std::default_delete<geos::CommandLineOptions> >&&)
Frame 13: main
Frame 14: /lib64/libc.so.6
Frame 15: __libc_start_main
Frame 16: _start
=====


...
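
For longer logs, the last Caliper message before the crash can be picked out with a few lines of Python. This is only a sketch: it assumes the "== CALIPER: (N): message" line format seen in this log, and the helper name is made up for illustration.

```python
import re

# Matches lines such as "== CALIPER: (0): Initialized".
CALI_RE = re.compile(r"^== CALIPER: \((\d+)\): (.*)$")


def last_caliper_message(log_text):
    """Return (tag, message) of the last Caliper line before the first
    "Received signal" line, or None if no Caliper line was seen."""
    last = None
    for line in log_text.splitlines():
        if line.startswith("Received signal"):
            break
        match = CALI_RE.match(line)
        if match:
            last = (int(match.group(1)), match.group(2))
    return last


# Abridged excerpt of the log above.
SAMPLE = """\
== CALIPER: (0): Initialized
== CALIPER: (0): Creating channel runtime-report
== CALIPER: (0): runtime-report: Registered event trigger service
== CALIPER: (0): runtime-report: mpiwrap: Using GOTCHA wrappers.
Received signal 11: Segmentation fault
"""

print(last_caliper_message(SAMPLE))
```

On this excerpt the last message is the "mpiwrap: Using GOTCHA wrappers." line, i.e. the run gets as far as setting up the MPI wrappers.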

0 replies

daboehme · 2025-08-07T00:07:23Z

daboehme
Aug 7, 2025
Maintainer

Hi @drmichaeltcvx, thanks for the additional details.

So it looks like Caliper is crashing inside the GOTCHA library when trying to wrap MPI functions. GOTCHA is our function wrapper, similar to LD_PRELOAD. It's low-level stuff so there's a higher chance of things going wrong on new systems. Unfortunately this makes it a bit more difficult to debug as well.
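
For anyone triaging similar reports, that reading of the trace can be done mechanically. Below is a small Python sketch, assuming the "Frame N: symbol" format that GEOS prints; the helper names are hypothetical, and matching on the substring "gotcha" is just a heuristic.

```python
import re

FRAME_RE = re.compile(r"^Frame (\d+): (.+)$")


def parse_frames(trace_text):
    """Return a list of (index, symbol_or_path) from 'Frame N: ...' lines."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def innermost_symbols(frames, count=3):
    """Return the first `count` frames that resolved to a symbol name
    rather than a bare library path."""
    named = [name for _, name in frames if not name.startswith("/")]
    return named[:count]


def crashed_in_gotcha(frames, count=3):
    """Heuristic: do the innermost resolved frames mention GOTCHA?"""
    return any("gotcha" in name.lower() for name in innermost_symbols(frames, count))


# Abridged from the trace posted earlier in this thread (frame 1, an
# unresolved address inside libcaliper.so.2, is left out for brevity).
SAMPLE = """\
Frame 0: /lib64/libc.so.6
Frame 2: update_all_library_gots
Frame 3: gotcha_wrap
Frame 4: gotcha_wrap
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&)
Frame 13: main
"""

frames = parse_frames(SAMPLE)
print(innermost_symbols(frames))
print(crashed_in_gotcha(frames))
```

The innermost resolved frames are the GOTCHA calls, reached from cali::mpiwrap_init, which is what points at the MPI function wrapping step.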

I would really appreciate it if you could help us get to the bottom of this issue. We could use a GOTCHA debug trace from a small example (i.e. just one MPI rank). The cali-query tool may be a good test app for this. It should be in bin/ in your Caliper installation directory. If you run it like so it should print a bunch of GOTCHA debug output and I would expect it to segfault as well:

 GOTCHA_DEBUG=3 cali-query -P "runtime-report(aggregate_across_ranks=true)"

If it doesn't segfault just close it with Ctrl+D or Ctrl+C (it'll be waiting for input from stdin) and that would also be good to know, otherwise please share the log output. Also, what exactly is the Caliper version you're using?

In the meantime there are two possible workarounds:

- You can build Caliper with -DWITH_GOTCHA=Off. In that case Caliper will fall back to the PMPI interface for intercepting MPI functions and your Caliper config should work. It'll add a very small overhead to MPI calls even if Caliper measurements are turned off, so it's not ideal for production builds, but should be fine for benchmarking.
- You can try -t spot as the Caliper config. This will create a .cali file, and you can then run cali-query -T file.cali on that file to get a report that's quite similar to the runtime-report. Unlike runtime-report the spot config doesn't intercept MPI functions by default if it's invoked through the Caliper ConfigManager API, and so it shouldn't run into the segfault issue. The mpi-report won't work unfortunately since that obviously requires intercepting MPI functions.

0 replies

drmichaeltcvx · 2025-08-07T00:14:59Z

drmichaeltcvx
Aug 7, 2025
Author

Thanks for the prompt response! I'll send back the info as soon as I collect it. By the way, in the past our code has crashed at the very end, with Caliper never generating any report. Is it possible to get these reports at earlier stages on request? Thanks! Michael


1 reply

drmichaeltcvx Aug 7, 2025
Author

Here is the log output of GOTCHA_DEBUG=3 cali-query -P "runtime-report(aggregate_across_ranks=true)" on the AMD MI300C unit.

I am also attaching lscpu and lstopo-no-graphics output.

Let me know if you need any more info to address the issue.

thanks

Caliper-call-query_MI300C.log
System_MI300C.log

daboehme · 2025-08-07T00:51:14Z

daboehme
Aug 7, 2025
Maintainer

Great, thanks!

With the ConfigManager API you can call the flush() method essentially whenever you like, so it should be possible to write intermediate reports. Just keep in mind that the flush is effectively an MPI collective operation, so all MPI ranks must participate in it.
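
Since every rank has to take part, the trigger for an intermediate flush should depend only on state that is identical on all ranks (a time-step counter, say) and not on per-rank conditions such as local wall-clock time. Here is a minimal Python sketch of such a trigger; should_flush and the interval are hypothetical names for illustration, and in the real C++ application the flush itself would be the ConfigManager flush() call.

```python
def should_flush(step, interval):
    """Decide from the shared step counter alone, so that every MPI rank
    reaches the same answer at the same step."""
    return interval > 0 and step > 0 and step % interval == 0


def flush_steps(n_steps, interval):
    """List the steps (1-based) at which all ranks would flush together."""
    return [step for step in range(1, n_steps + 1) if should_flush(step, interval)]


# Hypothetical run: 10 time steps, intermediate report every 4 steps.
print(flush_steps(10, 4))
```

A trigger like "flush if more than N seconds have passed on this rank" can fire on some ranks and not others, which would leave the collective waiting on the ranks that skipped it.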

0 replies

drmichaeltcvx · 2025-08-07T14:27:08Z

drmichaeltcvx
Aug 7, 2025
Author

Posted response at #678 (reply in thread)

0 replies

drmichaeltcvx · 2025-08-07T18:00:29Z

drmichaeltcvx
Aug 7, 2025
Author

We are using Caliper-2.12.0.

0 replies

drmichaeltcvx · 2025-08-07T19:38:29Z

drmichaeltcvx
Aug 7, 2025
Author

So Caliper inspects the executable and the shared libs for functions called and instruments them? Can it collect h/w counter values and incorporate these with its profiling reports?

0 replies

drmichaeltcvx · 2025-08-07T21:23:22Z

drmichaeltcvx
Aug 7, 2025
Author

I added a cali_init() call to our initialization routine, but the SIGSEGV at Caliper initialization persists. I am attaching a log file from an actual run with GOTCHA_DEBUG=3. There are several error messages.

I am on Alma9 and using GCC 11.5:

[hpcuser@localhost miket]$ uname -r
5.14.0-427.13.1.el9_4.x86_64
[hpcuser@localhost miket]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-host-pie --enable-host-bind-now --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugs.almalinux.org/ --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --enable-initfini-array --without-isl --enable-multilib --with-linker-hash-style=gnu --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_64=x86-64-v2 --with-arch_32=x86-64 --build=x86_64-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.5.0 20240719 (Red Hat 11.5.0-5) (GCC)

geosx_gotcha_debug_2025-08-07.log

1 reply

daboehme Aug 7, 2025
Maintainer

Okay, thanks. I'll run this by our GOTCHA developers and see what they think.

Interestingly the small cali-query example seemed to run fine as far as I could tell from that log, so maybe it's triggered by some specific library that GEOS uses or somehow the scale after all.

Can you check if it runs with just the spot config, i.e. -t spot?

drmichaeltcvx · 2025-08-07T23:36:53Z

drmichaeltcvx
Aug 7, 2025
Author

Yes, the run with the '-t spot' option terminates OK. Michael


0 replies

drmichaeltcvx · 2025-08-08T15:08:02Z

drmichaeltcvx
Aug 8, 2025
Author

I am using Open MPI from the HPC-X distribution.

0 replies
