Can Caliper run on a 384 core x86_64 host? #678
Replies: 13 comments 2 replies
-
Hi @drmichaeltcvx, thanks for the report. There are no explicit limits on the number of threads/cores etc. It'll allocate some resources on all threads that have Caliper annotations (maybe 2-3MiB per thread) so that could add up, but it shouldn't bring down the app unless you're already maxing out the available memory. Another thing is that Caliper needs to be initialized on the main thread, so if there's a race where a sub-thread initializes Caliper first it can segfault. A fix is to call |
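As a rough check of the per-thread memory figure in the reply above, the arithmetic can be sketched as follows. This assumes one annotated thread per core and takes the 2-3 MiB range at face value; it is an estimate, not a measurement, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of Caliper's per-thread memory overhead.
# The 2-3 MiB per annotated thread range is the rough figure quoted in the
# reply above, not a measured value.

def caliper_overhead_mib(n_threads: int, per_thread_mib: float) -> float:
    """Total overhead in MiB if every thread carries Caliper annotations."""
    return n_threads * per_thread_mib

cores = 384            # assumption: one annotated thread per core
node_memory_gib = 501  # node memory reported elsewhere in this thread

low = caliper_overhead_mib(cores, 2)
high = caliper_overhead_mib(cores, 3)

print(f"estimated overhead: {low:.0f}-{high:.0f} MiB")
print(f"upper bound as share of node memory: {high / 1024 / node_memory_gib:.2%}")
```

Under these assumptions the total is on the order of 1 GiB, which is small next to the node's memory, so memory exhaustion alone looks like an unlikely cause of the crash.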
-
Hi, here is a stack trace using options Thanks!
-
Hello, the target unit is an AMD MI300C with 384 Zen4 cores and 501 GiB of HBM3 memory. This will be an exclusive Azure unit (SKU) called HBv5. The core count is insane, but it is expected to be a great candidate for certain HPC workloads. Can you please take a look at what may be triggering the SIGSEGV? Is it the number of MPI callers within a single node? I missed running the code with
-
Messages with
-
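Hi @drmichaeltcvx, thanks for the additional details. So it looks like Caliper is crashing inside the GOTCHA library when trying to wrap MPI functions. GOTCHA is our function wrapper, similar to LD_PRELOAD. It's low-level stuff so there's a higher chance of things going wrong on new systems. Unfortunately this makes it a bit more difficult to debug as well.
I would really appreciate it if you could help us get to the bottom of this issue. We could use a GOTCHA debug trace from a small example (i.e. just one MPI rank). The cali-query tool may be a good test app for this. It should be in bin/ in your Caliper installation directory. If you run it like so it should print a bunch of GOTCHA debug output and I would expect it to segfault as well:
GOTCHA_DEBUG=3 cali-query -P "runtime-report(aggregate_across_ranks=true)"
If it doesn't segfault just close it with Ctrl+D or Ctrl+C (it'll be waiting for input from stdin) and that would also be good to know, otherwise please share the log output. Also, what exactly is the Caliper version you're using?
In the meantime there are two possible workarounds:
* You can build Caliper with -DWITH_GOTCHA=Off. In that case Caliper will fall back to the PMPI interface for intercepting MPI functions and your Caliper config should work. It'll add a very small overhead to MPI calls even if Caliper measurements are turned off, so it's not ideal for production builds, but should be fine for benchmarking.
* You can try -t spot as the Caliper config. This will create a .cali file, and you can then run cali-query -T file.cali on that file to get a report that's quite similar to the runtime-report. Unlike runtime-report the spot config doesn't intercept MPI functions by default if it's invoked through the Caliper ConfigManager API, and so it shouldn't run into the segfault issue. The mpi-report won't work unfortunately since that obviously requires intercepting MPI functions.
-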
Thanks for the prompt response!
I’ll send back the info as soon as I collect it.
Btw in the past we had our code crash at the very end with Caliper never generating any report. Is it possible to get these reports at earlier stages upon request?
Thanks!
Michael
On Wednesday, August 6, 2025, David Boehme wrote:
Hi @drmichaeltcvx, thanks for the additional details.
So it looks like Caliper is crashing inside the GOTCHA library when trying to wrap MPI functions. GOTCHA is our function wrapper, similar to LD_PRELOAD. It's low-level stuff so there's a higher chance of things going wrong on new systems. Unfortunately this makes it a bit more difficult to debug as well.
I would really appreciate it if you could help us get to the bottom of this issue. We could use a GOTCHA debug trace from a small example (i.e. just one MPI rank). The cali-query tool may be a good test app for this. It should be in bin/ in your Caliper installation directory. If you run it like so it should print a bunch of GOTCHA debug output and I would expect it to segfault as well:
GOTCHA_DEBUG=3 cali-query -P "runtime-report(aggregate_across_ranks=true)"
If it doesn't segfault just close it with Ctrl+D or Ctrl+C (it'll be waiting for input from stdin) and that would also be good to know, otherwise please share the log output. Also, what exactly is the Caliper version you're using?
In the meantime there are two possible workarounds:
* You can build Caliper with -DWITH_GOTCHA=Off. In that case Caliper will fall back to the PMPI interface for intercepting MPI functions and your Caliper config should work. It'll add a very small overhead to MPI calls even if Caliper measurements are turned off, so it's not ideal for production builds, but should be fine for benchmarking.
* You can try -t spot as the Caliper config. This will create a .cali file, and you can then run cali-query -T file.cali on that file to get a report that's quite similar to the runtime-report. Unlike runtime-report the spot config doesn't intercept MPI functions by default if it's invoked through the Caliper ConfigManager API, and so it shouldn't run into the segfault issue. The mpi-report won't work unfortunately since that obviously requires intercepting MPI functions.
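For anyone who wants to script the GOTCHA debug run described above, here is a minimal sketch. It assumes a POSIX system and that cali-query is on PATH; the helper name and log file name are hypothetical. Only the GOTCHA_DEBUG=3 setting and the cali-query arguments come from the message above.

```python
import os
import signal
import subprocess

# Command line from the message above; assumes cali-query is on PATH.
CALI_QUERY_CMD = ["cali-query", "-P", "runtime-report(aggregate_across_ranks=true)"]

def run_with_gotcha_debug(cmd, log_path, level=3):
    """Run cmd with GOTCHA_DEBUG set and capture stdout+stderr in log_path.

    stdin is connected to the null device, so a tool that waits for input
    sees end-of-file immediately (the scripted equivalent of Ctrl+D).
    Returns (returncode, crashed), where crashed is True if the process
    was killed by SIGSEGV (POSIX reports that as a negative return code).
    """
    env = dict(os.environ, GOTCHA_DEBUG=str(level))
    with open(log_path, "w") as log:
        proc = subprocess.run(
            cmd,
            env=env,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    crashed = proc.returncode == -signal.SIGSEGV
    return proc.returncode, crashed
```

Calling `run_with_gotcha_debug(CALI_QUERY_CMD, "gotcha.log")` would then leave the debug trace in gotcha.log and report whether the run ended in a segfault, which covers both outcomes the message asks about.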
-
Great, thanks! With the ConfigManager API you can call the
-
Posted response at #678 (reply in thread)
-
We are using
-
So Caliper inspects the executable and the shared libs for the functions called and instruments them? Can it collect h/w counter values and incorporate these into its profiling reports?
-
I added the I am on Alma9 and using GCC 11.5:
-
Yes, the ‘-t spot’ option terminates OK
Michael
On Thursday, August 7, 2025, David Boehme wrote:
Okay, thanks. I'll run this by our GOTCHA developers and see what they think.
Interestingly the small cali-query example seemed to run fine as far as I could tell from that log, so maybe it's triggered by some specific library that GEOS uses or somehow the scale after all.
Can you check if it runs with just the spot config, i.e. -t spot?
-
I am using OpenMPI from the HPC-X distribution.
-
Hello, I am trying to use GEOS (https://github.com/GEOS-DEV/GEOS) with Caliper enabled on a 384-core AMD MI300C unit. The code core dumps as soon as the binary ("geos") starts executing. When I don't request Caliper output (no "-t ..."), the code proceeds and terminates OK. Does Caliper have any limitations on the number of h/w cores on a node?
Thanks
Michael