Trust Quorum: Handle prepare messages + Alarms #8062

andrewjstone · 2025-04-29T05:09:16Z

This builds on #8052.

Nodes now handle `PrepareMsg`s from coordinators. The coordinator proptest was updated to generate prepares from non-existent, test-only nodes and send them to the coordinator.

Additionally, protocol invariant violations are now detected in a few cases and recorded to the PersistentState. This is for debugging and support purposes. The goal is to test the code well enough that we never actually see an alarm in production.


Instead, we return alarms wherever they can arise. This has a couple of benefits.

* Higher level code can stop accepting requests when it sees an alarm to prevent endless logging and raise an alert to Nexus.
* After support resolves the issue there is not necessarily any reason to manually mutate persistent state, unless that was the cause of the alarm.

This also allows more straightforward/rusty error handling. Persistent state is now only returned in success cases.
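A minimal sketch of the shape described above, not the code in this PR: a step returns new persistent state only on success and an alarm in `Err`, and a higher-level wrapper stops accepting requests once it has seen an alarm. The types are simplified stand-ins, and `commit`, `Handler`, and `handle_commit` are hypothetical names used only for illustration.

```rust
// Hypothetical stand-ins for the trust-quorum types; fields are simplified.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Alarm {
    MissingPrepare { epoch: u64, latest_seen_epoch: Option<u64> },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PersistentState {
    pub last_committed_epoch: Option<u64>,
}

/// Simplified commit step (an assumption, not the real protocol logic):
/// new state on success, `None` for an idempotent request, and an `Alarm`
/// in `Err` when the matching prepare has not been seen.
pub fn commit(
    state: &PersistentState,
    latest_prepared_epoch: Option<u64>,
    epoch: u64,
) -> Result<Option<PersistentState>, Alarm> {
    if state.last_committed_epoch == Some(epoch) {
        return Ok(None);
    }
    if latest_prepared_epoch != Some(epoch) {
        return Err(Alarm::MissingPrepare {
            epoch,
            latest_seen_epoch: latest_prepared_epoch,
        });
    }
    Ok(Some(PersistentState { last_committed_epoch: Some(epoch) }))
}

/// Hypothetical higher-level wrapper that latches the first alarm it sees
/// and refuses all later requests with that same alarm.
pub struct Handler {
    state: PersistentState,
    latest_prepared_epoch: Option<u64>,
    alarm: Option<Alarm>,
}

impl Handler {
    pub fn new(latest_prepared_epoch: Option<u64>) -> Self {
        Handler {
            state: PersistentState { last_committed_epoch: None },
            latest_prepared_epoch,
            alarm: None,
        }
    }

    pub fn handle_commit(&mut self, epoch: u64) -> Result<(), Alarm> {
        // Once an alarm has fired, stop doing protocol work.
        if let Some(alarm) = &self.alarm {
            return Err(alarm.clone());
        }
        match commit(&self.state, self.latest_prepared_epoch, epoch) {
            Ok(Some(new_state)) => {
                // Only the success path ever produces state to persist.
                self.state = new_state;
                Ok(())
            }
            Ok(None) => Ok(()),
            Err(alarm) => {
                self.alarm = Some(alarm.clone());
                Err(alarm)
            }
        }
    }
}
```

Because the alarm lives in the wrapper rather than in the ledger, clearing it in this sketch would not require touching persistent state.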

sunshowers · 2025-05-26T19:14:54Z

trust-quorum/src/alarm.rs

+use omicron_uuid_kinds::RackUuid;
+use serde::{Deserialize, Serialize};
+
+/// A critical invariant violation that should never occur.


Alarm is a nice name for this, going to steal it

sunshowers · 2025-05-26T19:17:29Z

trust-quorum/src/node.rs

+        if latest_prepare.config.epoch < epoch {
+            // We haven't seen this prepare yet, but Nexus thinks we have.
+            // This is essentially the same case as above.
+            let latest_seen_epoch = Some(latest_prepare.config.epoch);
+            let alarm = Alarm::MissingPrepare { epoch, latest_seen_epoch };
+            error!(self.log, "{alarm}");
+            return Err(alarm);
+        }
+
+        if latest_prepare.config.epoch > epoch {
+            // Only commit if we have a `PrepareMsg` and it's the latest


What if it's the same? (It's a no-op, I guess?) Worth noting explicitly in a comment between the blocks for intrepid readers.

sunshowers · 2025-05-26T19:17:59Z

trust-quorum/src/node.rs

+            //
+            // This is a less serious error than other invariant violations
+            // since it can be recovered from. However, it is still worthy of an
+            // alarm, as the most likely case is a  disk/ ledger failure.


Suggested change

-            // alarm, as the most likely case is a  disk/ ledger failure.
+            // alarm, as the most likely case is a disk/ledger failure.

Hmm, so I thought alarms were unrecoverable errors, but this one is recoverable? Do we need a categorization of alarms?

sunshowers · 2025-05-26T19:30:01Z

trust-quorum/src/node.rs

+        if self.persistent_state.last_committed_epoch() == Some(epoch) {
+            info!(


So I was following along fine until now, but I'm having trouble wrapping my head around all the cases here.

I presume that the general invariant here is that last_committed_epoch < latest_prepare.epoch. What are all the cases here? Currently there's:

* epoch to commit < latest prepare (OutOfOrderCommit alarm)
* epoch to commit == latest prepare (good)
* epoch to commit > latest prepare (MissingPrepare alarm)

What if epoch to commit > latest prepare, but also it is an idempotent commit? That case seems fine at first glance but it would result in an alarm here I think?

Would it be worth writing this as an explicit match statement? (Maybe even abstracting a combined prepare/commit epoch comparison out into a function?)
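For illustration, the explicit match could look roughly like the sketch below. It is one possible answer to the question rather than what the PR does: it assumes the idempotent check runs before the epoch comparison, which is a design choice of this sketch. `check_commit`, `CommitCheck`, and the fields of `OutOfOrderCommit` are hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical outcome of a successful check.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitCheck {
    /// The epoch matches the latest prepare and can be committed.
    Proceed,
    /// The epoch was already committed; the request is a no-op.
    AlreadyCommitted,
}

/// Simplified stand-in for the PR's `Alarm`; `OutOfOrderCommit`'s field
/// names are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum Alarm {
    MissingPrepare { epoch: u64, latest_seen_epoch: Option<u64> },
    OutOfOrderCommit { epoch: u64, latest_prepare_epoch: u64 },
}

/// Combined prepare/commit epoch comparison in a single function.
pub fn check_commit(
    last_committed_epoch: Option<u64>,
    latest_prepare_epoch: u64,
    epoch: u64,
) -> Result<CommitCheck, Alarm> {
    // Assumption: idempotent commits are accepted regardless of how the
    // epoch compares with the latest prepare.
    if last_committed_epoch == Some(epoch) {
        return Ok(CommitCheck::AlreadyCommitted);
    }
    match epoch.cmp(&latest_prepare_epoch) {
        // A later prepare has already been seen.
        Ordering::Less => Err(Alarm::OutOfOrderCommit { epoch, latest_prepare_epoch }),
        Ordering::Equal => Ok(CommitCheck::Proceed),
        // Nexus thinks we have a prepare that we have not seen.
        Ordering::Greater => Err(Alarm::MissingPrepare {
            epoch,
            latest_seen_epoch: Some(latest_prepare_epoch),
        }),
    }
}
```

With the comparison in one place, each of the three orderings is forced to have an arm, and the idempotent case is visibly handled before any alarm can be raised.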

sunshowers · 2025-05-26T19:31:09Z

trust-quorum/src/node.rs

+            if msg_last_committed_epoch != last_committed_epoch {
+                // If the msg contains an older last_committed_epoch than what
+                // we have, then out of order commits have occurred, as we know
+                // this prepare is later than what we've seen. This is a critical
+                // protocol invariant that has been violated.
+                //
+                // If the msg contains a newer last_committed_epoch than what
+                // we have, then we have likely missed a commit and are behind
+                // by more than one reconfiguration. The protocol currently does
+                // not allow this. Future protocol implementations may provide a
+                // capability to "jump" configurations.
+                //


worth modelling as separate enum variants similar to MissingPrepare/OutOfOrderCommit above?
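A rough sketch of what separate variants might look like, assuming epochs are plain integers. The variant names `StaleLastCommittedEpoch` and `MissedCommit` and the function `check_last_committed` are invented here for illustration and do not appear in the PR.

```rust
use std::cmp::Ordering;

/// Hypothetical alarm variants, one per direction of the mismatch.
#[derive(Debug, PartialEq, Eq)]
pub enum PrepareAlarm {
    /// The message carries an older last committed epoch than ours:
    /// out of order commits have occurred.
    StaleLastCommittedEpoch { msg: Option<u64>, ours: Option<u64> },
    /// The message carries a newer last committed epoch than ours:
    /// we have likely missed a commit.
    MissedCommit { msg: Option<u64>, ours: Option<u64> },
}

/// Compare the last committed epoch in a prepare with our own.
/// `Option<u64>` is ordered with `None` below every `Some(_)`, so a node
/// that has never committed counts as older than any committed epoch.
pub fn check_last_committed(
    msg: Option<u64>,
    ours: Option<u64>,
) -> Result<(), PrepareAlarm> {
    match msg.cmp(&ours) {
        Ordering::Less => Err(PrepareAlarm::StaleLastCommittedEpoch { msg, ours }),
        Ordering::Equal => Ok(()),
        Ordering::Greater => Err(PrepareAlarm::MissedCommit { msg, ours }),
    }
}
```

Splitting the variants would let support tell the invariant violation apart from the "behind by more than one reconfiguration" case without reading the log message.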

sunshowers · 2025-05-26T19:31:47Z

trust-quorum/src/node.rs

+            // Idempotent request
+            if msg.config == latest_prepare.config {
+                return Ok(None);
+            }


worth noting that just calling Eq is fine here because there's no ancillary local-only data attached to this config -- this is deserialized data.

But also, is it worth checking that the raw bytes are the same, e.g. via a digest? Or do we consider byte sequences that deserialize to the same value to be the same?

sunshowers · 2025-05-26T20:22:49Z

trust-quorum/src/node.rs

+            let coordinating_epoch = cs.reconfigure_msg().epoch();
+            if coordinating_epoch > msg.config.epoch {
+                warn!(self.log, "Received stale prepare while coordinating";
+                    "from" => %from,
+                    "msg_epoch" => %msg.config.epoch,
+                    "epoch" => %cs.reconfigure_msg().epoch()
+                );
+                return Ok(None);
+            }
+            if coordinating_epoch == msg.config.epoch {
+                let alarm = Alarm::DifferentNodesCoordinatingSameEpoch {
+                    epoch: coordinating_epoch,
+                    them: from,
+                    us: self.platform_id.clone(),
+                };
+                error!(self.log, "{alarm}");
+                return Err(alarm);
+            }


apologies, having trouble following this too. again, it seems like there are 3-4 epochs at play here, and it's not clear to me what the relationship is among all of them. Maybe just an ASCII diagram at the top of the file/function would help.
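As a starting point for such a diagram, the two epochs compared in this excerpt can be laid out as a three-way classification. This is only a sketch of the quoted hunk: platform IDs are modelled as strings, `classify` and `PrepareWhileCoordinating` are hypothetical names, and the behaviour when the message epoch is newer is not shown in the excerpt, so that arm simply reports the case.

```rust
use std::cmp::Ordering;

/// Hypothetical classification of a prepare received while coordinating.
#[derive(Debug, PartialEq, Eq)]
pub enum PrepareWhileCoordinating {
    /// Stale prepare: warn and ignore.
    IgnoreStale,
    /// Two nodes are coordinating the same epoch: raise an alarm.
    SameEpochAlarm { epoch: u64, them: String, us: String },
    /// Newer prepare: handling is outside the quoted hunk.
    NewerEpoch,
}

// msg epoch relative to the epoch we are coordinating:
//
//   msg < coordinating  |  msg == coordinating  |  msg > coordinating
//   --------------------+-----------------------+---------------------
//   stale: warn, ignore |  alarm                |  not shown in excerpt
pub fn classify(
    coordinating_epoch: u64,
    msg_epoch: u64,
    from: &str,
    us: &str,
) -> PrepareWhileCoordinating {
    match msg_epoch.cmp(&coordinating_epoch) {
        Ordering::Less => PrepareWhileCoordinating::IgnoreStale,
        Ordering::Equal => PrepareWhileCoordinating::SameEpochAlarm {
            epoch: coordinating_epoch,
            them: from.to_string(),
            us: us.to_string(),
        },
        Ordering::Greater => PrepareWhileCoordinating::NewerEpoch,
    }
}
```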

sunshowers · 2025-05-26T20:25:23Z

trust-quorum/src/validators.rs

+                // to check for an alarm and pull it out. We could also return either
+                // an `Alarm` or a `ReconfigurationError` inside `Result::Err`. This is
+                // probably the best approach, but I'm open to other structures.


Tricky! Yeah I'd probably return an error enum with Alarm and ReconfigurationError as variants, and maybe even consider not implementing std::error::Error or fmt::Display to ensure that people don't accidentally convert it to anyhow::Error.
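A minimal sketch of that suggestion, with simplified stand-in types. `ValidationError`, `InvalidThreshold`, and the helper functions are hypothetical names, and the checks themselves are placeholders rather than the validation in this PR.

```rust
/// Simplified stand-in for the PR's `Alarm`.
#[derive(Debug, PartialEq, Eq)]
pub enum Alarm {
    MissingPrepare { epoch: u64, latest_seen_epoch: Option<u64> },
}

/// Simplified stand-in; the variant is hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum ReconfigurationError {
    InvalidThreshold,
}

/// Combined error. It deliberately derives only `Debug` and does not
/// implement `std::error::Error` or `fmt::Display`.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    Alarm(Alarm),
    Reconfiguration(ReconfigurationError),
}

impl From<Alarm> for ValidationError {
    fn from(alarm: Alarm) -> Self {
        ValidationError::Alarm(alarm)
    }
}

impl From<ReconfigurationError> for ValidationError {
    fn from(err: ReconfigurationError) -> Self {
        ValidationError::Reconfiguration(err)
    }
}

fn check_threshold(threshold: usize, members: usize) -> Result<(), ReconfigurationError> {
    if threshold == 0 || threshold > members {
        return Err(ReconfigurationError::InvalidThreshold);
    }
    Ok(())
}

fn check_prepare(epoch: u64, latest_seen_epoch: Option<u64>) -> Result<(), Alarm> {
    if latest_seen_epoch != Some(epoch) {
        return Err(Alarm::MissingPrepare { epoch, latest_seen_epoch });
    }
    Ok(())
}

/// Both kinds of failure flow through `?` into the one error enum.
pub fn validate(
    threshold: usize,
    members: usize,
    epoch: u64,
    latest_seen_epoch: Option<u64>,
) -> Result<(), ValidationError> {
    check_threshold(threshold, members)?;
    check_prepare(epoch, latest_seen_epoch)?;
    Ok(())
}
```

Callers then have to match on `ValidationError` to decide what to do, so an alarm cannot be silently folded into a generic error. Since `anyhow::Error`'s blanket `From` impl requires `std::error::Error`, a `?` into `anyhow::Result` should fail to compile for this type.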

andrewjstone added 6 commits April 29, 2025 05:01

clippy

a445874

some cleanup

37b0fb6

Some more decisions

f9ec07e

Test alarm conditions

e0c454b

andrewjstone marked this pull request as ready for review May 1, 2025 20:44

andrewjstone requested a review from sunshowers May 1, 2025 20:44

sunshowers reviewed May 26, 2025
