
Conversation

ljedrz (Collaborator) commented Oct 23, 2025

This is a proposal tackling #2954. It introduces a background thread responsible for the sequential processing of storage-related operations that cannot happen concurrently. This allows us to remove some of the related locks, to introduce more performance optimizations related to block syncing, and, in general, to reason about the storage with more confidence.

During prototyping, the following open questions were resolved:

> Are there any more [ledger operations with potential writes]?

I have found only one more operation not listed in the issue; it is only triggered when a dev-mode node creates a genesis block. That being said, it followed one of the call paths detected previously, so it required no special treatment.

> At which level should these operations be introduced? try_advance_to_next_block is too broad, but atomic_speculate might be too granular (it might be practical to know whether it's the block-checking or the quorum block preparation stage)

In the end, I decided to isolate two SequentialOperations:

  • AddNextBlock
  • AtomicSpeculate

There is also atomic_finalize, which may not be triggered concurrently, but it is only called as part of AddNextBlock, so I decided to just introduce a new safeguard (ensure_sequential_processing) in it.

> How would these operations be represented? A dedicated beefy enum that would be able to hold all the items applicable to the operations of interest, and fed to a channel where we have a shallow clone of the Ledger

It's a new enum (plus some helper objects), and in the end it doesn't seem too bad size-wise; as for the channel, it actually contains a shallow clone of the VM instead, as that's the actual entry point into the storage.
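For illustration only, a minimal sketch of what the enum and its channel payloads might look like; the variant payloads, the stand-in types, and the use of a oneshot response channel are assumptions (the latter suggested by the blocking_recv call discussed further down), not the PR's actual definitions:

```rust
use tokio::sync::oneshot;

// Hypothetical stand-ins for snarkVM's actual types.
struct Block;
struct Transactions;
struct SpeculateOutcome;
type Result<T> = std::result::Result<T, String>;

/// The storage operations that must not be executed concurrently. Each variant
/// carries its inputs plus a one-shot sender for returning the result to the caller.
enum SequentialOperation {
    AddNextBlock(Block, oneshot::Sender<Result<()>>),
    AtomicSpeculate(Transactions, oneshot::Sender<Result<SpeculateOutcome>>),
}
```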

> Where would the channel be spawned? Would it be controlled by snarkOS or snarkVM?

I decided to spawn the dedicated thread when creating the VM; it is owned by it, together with the sender to the channel used for transferring the operations.
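Building on the sketch above, and mirroring the diff excerpt further down (a shallow clone of the VM moved into a dedicated thread), the spawn might look roughly like this; `Vm` and its two methods are placeholders, and the loop body is an assumption:

```rust
use std::{sync::mpsc, thread};

#[derive(Clone)]
struct Vm; // Stand-in; the real VM is cheaply (shallowly) cloneable.

impl Vm {
    fn add_next_block(&self, _block: &Block) -> Result<()> { Ok(()) }
    fn atomic_speculate(&self, _txs: Transactions) -> Result<SpeculateOutcome> { Ok(SpeculateOutcome) }

    /// Spawns the dedicated thread that drains the channel, processing one
    /// operation at a time; the VM keeps the matching `mpsc::Sender`.
    fn spawn_sequential_ops_thread(&self, rx: mpsc::Receiver<SequentialOperation>) -> thread::JoinHandle<()> {
        // The thread owns a shallow clone of the VM.
        let vm = self.clone();
        thread::spawn(move || {
            // Process queued operations strictly one after another.
            while let Ok(op) = rx.recv() {
                match op {
                    SequentialOperation::AddNextBlock(block, response_tx) => {
                        let _ = response_tx.send(vm.add_next_block(&block));
                    }
                    SequentialOperation::AtomicSpeculate(txs, response_tx) => {
                        let _ = response_tx.send(vm.atomic_speculate(txs));
                    }
                }
            }
        })
    }
}
```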

Future/follow-up considerations:

  • while this proposal solves the general issue for the set of operations we currently use, with a small diff and little impact on the existing APIs, it is not foolproof: we would have to manually ensure that any potential future operations (however unlikely) that may not be executed concurrently use this new setup; a better API might be possible, but I haven't come up with one yet
  • the persistent storage could inherit the thread id from the VM, and it could be the one to perform the ensure_sequential_processing check, probably in start_atomic; this would future-proof the setup, as we would immediately know if any new write batch was introduced but not processed sequentially (see the sketch after this list)
  • this new setup could technically allow us to remove multiple locks from the storage-related objects, which would be a performance improvement and a simplification (as fewer checks may be needed now)
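As a rough, purely hypothetical illustration of the second bullet: the storage could record the sequential-ops thread's id at construction and assert it whenever a write batch starts:

```rust
use std::thread::{self, ThreadId};

struct PersistentStorage {
    // Inherited from the VM when the storage is constructed.
    sequential_ops_thread_id: ThreadId,
}

impl PersistentStorage {
    fn start_atomic(&self) {
        // Any future write batch that is not routed through the dedicated
        // sequential-processing thread would immediately fail here.
        assert_eq!(thread::current().id(), self.sequential_ops_thread_id);
        // ... begin the atomic write batch ...
    }
}
```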

Filing this as a draft, as I haven't run all the tests locally yet, and it would require a stress-test run on snarkOS side.

@ljedrz force-pushed the feat/replace_vm_locks_with_channel branch from a3fedfb to 06cc701 on October 23, 2025 at 11:59

```diff
 // Run each test and compare it against its corresponding expectation.
-tests.par_iter().for_each(|test| {
+tests.iter().for_each(|test| {
```
ljedrz (Author):
Note: this doesn't seem to slow down the related CI job at all.

```rust
) -> thread::JoinHandle<()> {
    // Spawn a dedicated thread.
    let vm = self.clone();
    thread::spawn(move || {
```
Collaborator:

Do you know if this code is run in WASM? I am unsure if you can spawn threads there.

ljedrz (Author):

I know of a wasm_thread crate, though I haven't used it yet; an alternative would be to use a blocking task, but then the plumbing would need to be moved to snarkOS. Good point, I'll think about it.

Another commenter:

Thanks for thinking of us, Kai! We don't use the VM object in wasm :). Most wasm usage is strictly for execution or for using individual pieces of tooling in snarkVM, so I wouldn't worry about the wasm target at all if you're dealing with changes that ONLY affect the VM object.

ljedrz (Author):

Indeed, this is only related to the VM methods.

```rust
/// A safeguard used to ensure that the given operation is processed in the thread
/// enforcing sequential processing of operations.
pub fn ensure_sequential_processing(&self) {
    assert_eq!(thread::current().id(), self.sequential_ops_thread.lock().as_ref().unwrap().thread().id());
}
```
Collaborator:

My understanding was that one of the goals of this PR is to avoid assertions like this.
Is there no straightforward way to ensure certain functions are only invoked from within the worker thread by leveraging the type system?

ljedrz (Author):

Indeed, such safeguards are suboptimal, even if more foolproof than before; as I noted in the PR description, I see no simple way of introducing type safety here right now, and a proper solution would most likely be a heavy lift.

```rust
let _ = tx.send(request);

// Wait for the result of the queued operation.
let Ok(response) = response_rx.blocking_recv() else {
```
Collaborator:

This panics when called from an async context. We need to document that somewhere.

ljedrz (Author) commented Oct 31, 2025:

Yes, this is expected; in production conditions we always run these operations within the context of blocking tasks. I'm not sure where to document it, though, other than perhaps in the general storage documentation. That being said, I made a note of this at the callsite for now.
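For illustration, the calling convention described here might look like the following on the async side, reusing the stand-in types from the sketches above; `add_next_block_async` and its error handling are assumptions, not code from this PR:

```rust
use tokio::task;

// Moves the blocking call onto tokio's blocking thread pool; calling an
// operation that uses `blocking_recv` directly on an async worker thread
// would panic.
async fn add_next_block_async(vm: Vm, block: Block) -> Result<()> {
    task::spawn_blocking(move || vm.add_next_block(&block))
        .await
        .map_err(|e| e.to_string())?
}
```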

Collaborator:

I think we would need to document the operations that use it. For example, I had check_next_block panic; that call should have been in a blocking task anyway, but it's still a fairly easy mistake to make.
