
Conversation


@kaimast kaimast commented Sep 17, 2025

This PR is a follow-up to #2813.

Improved Panic Handling Mechanisms

Panic handling, as of now, has a major shortcoming: it does not work well with multi-threading. That is because the panic handler is set per-process, so while one thread executes a snarkVM program, another thread, unrelated to the execution, might panic, and snarkVM may falsely log it as a VM error.

The mechanism proposed in this PR stores backtraces and error messages in a thread-local variable. try_vm_runtime is changed to retrieve the error message from this thread-local variable.
Additionally, there is now a catch_unwind function that wraps around the function of the same name from the standard library and returns the error message and backtrace on error. The latter is useful when printing error messages about panics, for example in snarkOS.
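For illustration, here is a minimal sketch of how the thread-local storage and the catch_unwind wrapper can fit together. The names and details below (install_panic_hook, LAST_PANIC) are illustrative assumptions, not necessarily the exact code in this PR:

```rust
use std::backtrace::Backtrace;
use std::cell::RefCell;
use std::panic::{self, AssertUnwindSafe};

thread_local! {
    // The last panic message and backtrace observed on this thread, if any.
    static LAST_PANIC: RefCell<Option<(String, Backtrace)>> = RefCell::new(None);
}

/// Installs a process-wide hook that records panic information per thread.
pub fn install_panic_hook() {
    panic::set_hook(Box::new(|info| {
        // Extract the panic message from the payload (it is either a &str or a String).
        let message = info
            .payload()
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| info.payload().downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        LAST_PANIC.with(|slot| {
            *slot.borrow_mut() = Some((message, Backtrace::force_capture()));
        });
    }));
}

/// Calls the closure `f` and, if it panics, returns the recorded message and backtrace.
pub fn catch_unwind<R, F: FnMut() -> R>(mut f: F) -> Result<R, (String, Backtrace)> {
    match panic::catch_unwind(AssertUnwindSafe(|| f())) {
        Ok(value) => Ok(value),
        Err(_) => {
            // Retrieve what the hook stored for *this* thread only.
            let recorded = LAST_PANIC.with(|slot| slot.borrow_mut().take());
            Err(recorded.unwrap_or_else(|| ("unknown panic".to_string(), Backtrace::disabled())))
        }
    }
}
```

Because the hook only writes to the current thread's slot, a panic on an unrelated thread never shows up in the result of this catch_unwind.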

Error Logging Utilities

This PR also introduces a new LoggableError trait that is implemented for anyhow::Error. It makes it convenient to log the entire error chain together with a custom error message. There are many places where we log errors in snarkOS, but only the top-level error, not the entire error chain.
In addition to providing more information, this utility also simplifies the error-printing logic and ensures a unified style.
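As a rough illustration, such a trait could look like the following (the method name log_with_message and the use of the tracing crate are assumptions for this sketch; the PR's actual definition may differ):

```rust
use tracing::error;

/// Logs an error together with its entire chain of causes.
pub trait LoggableError {
    fn log_with_message(&self, message: &str);
}

impl LoggableError for anyhow::Error {
    fn log_with_message(&self, message: &str) {
        // `anyhow::Error::chain` yields the error itself followed by all of its sources.
        let chain: Vec<String> = self.chain().map(|cause| cause.to_string()).collect();
        error!("{}: {}", message, chain.join(": "));
    }
}
```

A call like err.log_with_message("failed to process block") would then print the full cause chain in one consistent format instead of just the top-level error.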

Utilities for Task Management

tokio allows propagating panics through JoinHandles. This PR also provides wrappers around tokio's spawn and spawn_blocking functions that contain the boilerplate for propagating such panics.
This replaces logic that previously resided in snarkos-node-bft but will be useful to other crates as well.
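The boilerplate being wrapped is roughly the following (a sketch; the function name join_or_resume_panic is illustrative and not necessarily what the PR uses):

```rust
use tokio::task::{JoinError, JoinHandle};

/// Awaits a spawned task and, if it panicked, resumes the panic on the awaiting task.
pub async fn join_or_resume_panic<T>(handle: JoinHandle<T>) -> Result<T, JoinError> {
    match handle.await {
        Ok(value) => Ok(value),
        Err(err) if err.is_panic() => {
            // Re-raise the panic here instead of silently swallowing it.
            std::panic::resume_unwind(err.into_panic())
        }
        Err(err) => Err(err),
    }
}
```

Keeping this in one shared helper avoids repeating the is_panic/resume_unwind pattern at every await site.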

Notes

  • For panic handling, there is a potential overhead of storing backtraces that might not always be needed. It is my understanding that panics do not happen very frequently in production, but if this understanding is wrong, please let me know.
  • This PR also contains some unrelated additions of error contexts, which helped with debugging testnet issues.
  • Panic handling for "detached" tokio tasks (those without JoinHandles) does not work well (yet). Future improvements to tokio will change this, but we also do not use detached tasks much.

@kaimast kaimast changed the title [Feature] [Feature] Improved Panic Handling Sep 17, 2025
@kaimast kaimast marked this pull request as ready for review September 17, 2025 20:47
}

/// `catch_unwind` calls the given closure `f` and, if `f` panics, returns the panic message and backtrace.
pub fn catch_unwind<R, F: FnMut() -> R>(f: F) -> Result<R, (String, Backtrace)> {
Collaborator

Outside of tests, who calls this particular catch_unwind vs the one from std::panic::catch_unwind?

Collaborator Author

The plan is to use this in the snarkOS CLI. You can see it in this experimental branch.

/// The message and backtrace of the last panic on this thread (if any).
///
/// We store this information here instead of directly processing it in a panic hook, because panic hooks are global whereas this can be processed on a per-thread basis.
/// For example, one thread may execute a program where panics should *not* cause the entire process to terminate, while in another thread there is a panic due to a bug.
Collaborator

This looks like quite a crucial and useful change which will save us a lot of headache in the future, thank you for the proposal. Same question as before, I'm curious to hear about prior art in the Rust world

Collaborator Author

There is ongoing work to atomically update the panic handler (in this RFC).

However, that still does not allow setting a handler per thread. For that, the common suggestion I found is essentially what I implemented.


vicsn commented Sep 21, 2025

and snarkVM may falsely log it as VM error.

Is this the main and only use case for the thread-local panic logic of this PR? I'm trying to better understand whether the refactor is worth it at the moment.

I thought we printed the error in the VM panic handler. Therefore:

  1. if it were to catch a non-VM panic, I'd assume we would clearly observe this in the logs. Clumsy, but no harm done.
  2. if it were to catch a VM panic of the wrong thread, would there be a risk of any of the following:
    a. verification of the wrong transaction being logged as wrong?
    b. verification of the wrong transaction being returned as Err?
    c. the above being non-deterministic, and therefore creating a risk of a fork on the network?

@vicsn vicsn requested a review from ljedrz September 22, 2025 19:14
Err(err) => {
    if err.is_panic() {
        // Resume the panic on the main task
        std::panic::resume_unwind(err.into_panic());
Collaborator

Won't this cause us to shut down the entire application, as opposed to "silently" panicking within a single worker thread? As outlined in the PR description, I don't think this happens often, but it needs to be clear that this makes the node as a whole more susceptible to crashes.

Collaborator Author

Yes, that was also the intention of the previous PR. It doesn't seem like a good idea to continue execution after a panic, especially when they happen in the storage layer.

Collaborator

It doesn't seem like a good idea to continue execution after a panic, especially when they happen in the storage layer.

Two questions for both of you:

  1. If the panic happens within the worker thread, the targeted behaviour should be to catch the panic and log it. Does this PR maintain that?
  2. I asked this before, but before this PR, is there a forking risk if the VM verifies transactions in parallel? Or will a transaction verification panic always bubble up to its own thread's panic handler?

Collaborator

  1. It would now log the panic, but then resume it, shutting everything down (as opposed to just the worker thread).
  2. I'm not sure why this would be a forking risk - verify_transactions, even if executed outside the context of adding a new block, is still ultimately performed again before block insertion.

Collaborator

  1. Alright, so this PR's approach is not what we should be doing.
  2. Txs which are aborted by the parallel verification are not serially verified anymore. One thread's execution should not mess with another thread. Is this an issue currently on mainnet?

Collaborator Author

  1. I very much think we should fail if there is a fatal error. It is a lot easier for the node to restart than to try to recover the state somehow.
    In the current design, it is very rare for the system to encounter a panic (outside of VM execution) and, if it happens, it is an indication that something happened that we did not anticipate. The goal for us should be to have a backtrace from the panic and as much logged information as possible.

  2. In the current design, it seems like different threads set/unset the global panic handler. I am not sure if it causes a problem on mainnet, but the worst that could happen is that some VM panics are logged incorrectly.

@vicsn vicsn Oct 1, 2025

  1. A fundamental issue, besides the many lines of not perfectly audited unwraps, is that the VM has some badly designed halt / or_halt_with logic which panics. That would need to be cleaned up before we let panics be panics. Made an issue: [Bug] Remove clear panic potentials in validator code paths #2941

  2. Alright, @ljedrz do you agree?

@ljedrz ljedrz Oct 1, 2025

  1. I agree with @kaimast - the worst-case scenario is a changed panic message (originating from anywhere in snarkOS/VM) - after all, we only call catch_unwind for code wrapped with try_vm_runtime, and the panic hook only changes the message.

@vicsn vicsn marked this pull request as draft October 8, 2025 17:01