
Conversation


@kaimast kaimast commented Sep 17, 2025

This PR is a follow-up to #2813.

Improved Panic Handling Mechanisms

Panic handling, as of now, has a major shortcoming: it does not work well with multi-threading. That is because the panic handler is set per-process, so while one thread executes a snarkVM program, another thread, unrelated to the execution, might panic, and snarkVM may falsely log it as a VM error.

The mechanism proposed in this PR stores backtraces and error messages in a thread-local variable. try_vm_runtime is changed to retrieve the error message from this thread-local variable.
Additionally, there is now a catch_unwind function that wraps around the function of the same name from the standard library and returns the error message and backtrace on error. The latter is useful when printing error messages about panics, for example in snarkOS.
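For illustration, here is a minimal sketch of how the thread-local storage and the catch_unwind wrapper can fit together. The names and details below (install_panic_hook, LAST_PANIC) are illustrative assumptions, not necessarily the exact code in this PR:

```rust
use std::backtrace::Backtrace;
use std::cell::RefCell;
use std::panic::{self, AssertUnwindSafe};

thread_local! {
    // The last panic message and backtrace observed on this thread, if any.
    static LAST_PANIC: RefCell<Option<(String, Backtrace)>> = RefCell::new(None);
}

/// Installs a process-wide hook that records panic information per thread.
pub fn install_panic_hook() {
    panic::set_hook(Box::new(|info| {
        // Extract the panic message from the payload (it is either a &str or a String).
        let message = info
            .payload()
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| info.payload().downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        LAST_PANIC.with(|slot| {
            *slot.borrow_mut() = Some((message, Backtrace::force_capture()));
        });
    }));
}

/// Calls the closure `f` and, if it panics, returns the recorded message and backtrace.
pub fn catch_unwind<R, F: FnMut() -> R>(mut f: F) -> Result<R, (String, Backtrace)> {
    match panic::catch_unwind(AssertUnwindSafe(|| f())) {
        Ok(value) => Ok(value),
        Err(_) => {
            // Retrieve what the hook stored for *this* thread only.
            let recorded = LAST_PANIC.with(|slot| slot.borrow_mut().take());
            Err(recorded.unwrap_or_else(|| ("unknown panic".to_string(), Backtrace::disabled())))
        }
    }
}
```

Because the hook only writes to the current thread's slot, a panic on an unrelated thread never shows up in the result of this catch_unwind.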

Error Logging Utilities

This PR also introduces a new LoggableError trait that is implemented for anyhow::Error. It makes it convenient to log the entire error chain together with a custom error message. There are many places where we log errors in snarkOS, but only the top-level error, not the entire error chain.
In addition to providing more information, this utility also simplifies the error-printing logic and ensures a unified style.
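As a rough illustration, such a trait could look like the following (the method name log_with_message and the use of the tracing crate are assumptions for this sketch; the PR's actual definition may differ):

```rust
use tracing::error;

/// Logs an error together with its entire chain of causes.
pub trait LoggableError {
    fn log_with_message(&self, message: &str);
}

impl LoggableError for anyhow::Error {
    fn log_with_message(&self, message: &str) {
        // `anyhow::Error::chain` yields the error itself followed by all of its sources.
        let chain: Vec<String> = self.chain().map(|cause| cause.to_string()).collect();
        error!("{}: {}", message, chain.join(": "));
    }
}
```

A call like err.log_with_message("failed to process block") would then print the full cause chain in one consistent format instead of just the top-level error.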

Utilities for Task Management

tokio allows propagating panics through JoinHandles. This PR also provides wrappers around tokio's spawn and spawn_blocking functions that contain the boilerplate for propagating such panics.
This replaces logic that previously resided in snarkos-node-bft but will be useful to other crates as well.
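The boilerplate being wrapped is roughly the following (a sketch; the function name join_or_resume_panic is illustrative and not necessarily what the PR uses):

```rust
use tokio::task::{JoinError, JoinHandle};

/// Awaits a spawned task and, if it panicked, resumes the panic on the awaiting task.
pub async fn join_or_resume_panic<T>(handle: JoinHandle<T>) -> Result<T, JoinError> {
    match handle.await {
        Ok(value) => Ok(value),
        Err(err) if err.is_panic() => {
            // Re-raise the panic here instead of silently swallowing it.
            std::panic::resume_unwind(err.into_panic())
        }
        Err(err) => Err(err),
    }
}
```

Keeping this in one shared helper avoids repeating the is_panic/resume_unwind pattern at every await site.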

Notes

  • For panic handling, there is a potential overhead of storing backtraces that might not always be needed. It is my understanding that panics do not happen very frequently in production, but if this understanding is wrong, please let me know.
  • This PR also contains some unrelated additions of error contexts, which helped with debugging testnet issues.
  • Panic handling for "detached" tokio tasks (those without JoinHandles) does not work well (yet). Future improvements to tokio will change this, but we also do not use detached tasks much.

@kaimast kaimast changed the title [Feature] [Feature] Improved Panic Handling Sep 17, 2025
@kaimast kaimast marked this pull request as ready for review September 17, 2025 20:47
}

/// `catch_unwind` calls the given closure `f` and, if `f` panics, returns the panic message and backtrace.
pub fn catch_unwind<R, F: FnMut() -> R>(f: F) -> Result<R, (String, Backtrace)> {
Collaborator

Outside of tests, who calls this particular catch_unwind vs the one from std::panic::catch_unwind?

Collaborator Author

The plan is to use this in the snarkOS CLI. You can see it in this experimental branch.

/// The message and backtrace of the last panic on this thread (if any).
///
/// We store this information here instead of directly processing it in a panic hook, because panic hooks are global whereas this can be processed on a per-thread basis.
/// For example, one thread may execute a program where panics should *not* cause the entire process to terminate, while in another thread there is a panic due to a bug.
Collaborator

This looks like quite a crucial and useful change which will save us a lot of headache in the future, thank you for the proposal. Same question as before, I'm curious to hear about prior art in the Rust world

Collaborator Author

There is ongoing work to atomically update the panic handler (in this RFC).

However, that still does not allow setting a handler per thread. For that, the common suggestion I found is essentially what I implemented.


vicsn commented Sep 21, 2025

and snarkVM may falsely log it as VM error.

Is this the main and only use case for the thread-local panic logic of this PR? I'm trying to better understand whether the refactor is worth it at the moment.

I thought we printed the error in the VM panic handler. Therefore:

  1. if it were to catch a non-VM panic, I'd assume we would clearly observe this in the logs. Clumsy, but no harm done.
  2. if it were to catch a VM panic of the wrong thread, would there be a risk of any of the following:
    a. verification of the wrong transaction being logged as wrong?
    b. verification of the wrong transaction being returned as Err?
    c. the above being non-deterministic, and therefore creating a risk of a fork on the network?

@vicsn vicsn requested a review from ljedrz September 22, 2025 19:14
Err(err) => {
    if err.is_panic() {
        // Resume the panic on the main task
        std::panic::resume_unwind(err.into_panic());
Collaborator

Won't this cause us to shut down the entire application, as opposed to "silently" panicking within a single worker thread? As outlined in the PR description, I don't think this happens often, but it needs to be clear that this makes the node as a whole more susceptible to crashes.

Collaborator Author

Yes, that was also the intention of the previous PR. It doesn't seem like a good idea to continue execution after a panic, especially when they happen in the storage layer.

Collaborator

It doesn't seem like a good idea to continue execution after a panic, especially when they happen in the storage layer.

Two questions for both of you:

  1. If the panic happens within the worker thread, the targeted behaviour should be to catch the panic and log it. Does this PR maintain that?
  2. I asked this before, but before this PR, is there a forking risk if the VM verifies transactions in parallel? Or will a transaction verification panic always bubble up to its own thread's panic handler?

Collaborator

  1. It would now log the panic, but then resume it, shutting everything down (as opposed to just the worker thread).
  2. I'm not sure why this would be a forking risk - verify_transactions, even if executed outside the context of adding a new block, is still ultimately performed again before block insertion.

Collaborator

  1. Alright, so this PR's approach is not what we should be doing.
  2. Txs which are aborted by the parallel verification are not serially verified anymore. One thread's execution should not mess with another thread. Is this an issue currently on mainnet?

Collaborator Author

  1. I very much think we should fail if there is a fatal error. It is a lot easier for the node to restart than to try to recover the state somehow.
    In the current design, it is very rare for the system to encounter a panic (outside of VM execution) and, if it happens, it is an indication that something happened that we did not anticipate. The goal for us should be to have a backtrace from the panic and as much logged information as possible.

  2. In the current design, it seems like different threads set/unset the global panic handler. I am not sure if it causes a problem on mainnet, but the worst that could happen is that some VM panics are logged incorrectly.

@vicsn vicsn Oct 1, 2025

  1. A fundamental issue, besides the many lines of not perfectly audited unwraps, is that the VM has some badly designed halt / or_halt_with logic which panics. That would need to be cleaned up before we let panics be panics. Made an issue: [Bug] Remove clear panic potentials in validator code paths #2941

  2. Alright, @ljedrz do you agree?

@ljedrz ljedrz Oct 1, 2025

  1. I agree with @kaimast - the worst-case scenario is a changed panic message (originating from anywhere in snarkOS/VM) - after all, we only call catch_unwind for code wrapped with try_vm_runtime, and the panic hook only changes the message.

@vicsn vicsn marked this pull request as draft October 8, 2025 17:01