Skip to content

Trust Quorum: Handle prepare messages + Alarms #8062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: ajs/realtq-4
Choose a base branch
from

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Apr 29, 2025

This builds on #8052.

Nodes now handle PrepareMsgs from coordinators. The coordinator proptest was updated to generate prepares from non-existent test only nodes and send them to the coordinator.

Additionally, protocol invariant violations are now detected in a few cases and recorded to the PersistentState. This is for debugging and support purposes. The goal is to test the code well enough that we never actually see an alarm in production.

This builds on #8052.

Node's now handle `PrepareMsg`s from coordinators. The coordinator
proptest was updated to generate prepares from non-existent test only
nodes and send them to the coordinator.

Additionally, protocol invariant violations are now detected in a few
cases and recorded to the `PersistentState`. This is for debugging and
support purposes. The goal is to test the code well enough that we never
actually see an alarm in production.
Instead we return them wherever they can arise. This has a couple
of benefits.

* Higher level code can stop accepting requests when it sees an alarm to
prevent endless logging and raise an alert to Nexus.
* After support resolves the issue there is not necessarily any reason
to manually mutate persistent state, unless that was the cause of the
alarm.

This also allows more straightforward/rusty error handling. Persistent
state is now only returned in success cases.
@andrewjstone andrewjstone marked this pull request as ready for review May 1, 2025 20:44
@andrewjstone andrewjstone requested a review from sunshowers May 1, 2025 20:44
use omicron_uuid_kinds::RackUuid;
use serde::{Deserialize, Serialize};

/// A critical invariant violation that should never occur.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alarm is a nice name for this, going to steal it

Comment on lines +106 to +116
if latest_prepare.config.epoch < epoch {
// We haven't seen this prepare yet, but Nexus thinks we have.
// This is essentially the same case as above.
let latest_seen_epoch = Some(latest_prepare.config.epoch);
let alarm = Alarm::MissingPrepare { epoch, latest_seen_epoch };
error!(self.log, "{alarm}");
return Err(alarm);
}

if latest_prepare.config.epoch > epoch {
// Only commit if we have a `PrepareMsg` and it's the latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if it's the same? (it's a no-op I guess?) worth noting explicitly in a comment for intrepid readers in between the blocks

//
// This is a less serious error than other invariant violations
// since it can be recovered from. However, it is still worthy of an
// alarm, as the most likely case is a disk/ ledger failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// alarm, as the most likely case is a disk/ ledger failure.
// alarm, as the most likely case is a disk/ledger failure.

Hmm, so I thought alarms were unrecoverable errors, but this is recoverable? do we need a categorization of alarms?

Comment on lines +151 to +152
if self.persistent_state.last_committed_epoch() == Some(epoch) {
info!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was following along fine until now, but I'm having trouble wrapping my head around all the cases here.

I presume that the general invariant here is that last_committed_epoch < latest_prepare.epoch. What are all the cases here? Currently there's:

  • epoch to commit < latest prepare (OutOfOrderCommit alarm)
  • epoch to commit == latest prepare (good)
  • epoch to commit > latest prepare (MissingPrepare alarm)

What if epoch to commit > latest prepare, but also it is an idempotent commit? That case seems fine at first glance but it would result in an alarm here I think?

Would it be worth writing this as an explicit match statement? (Maybe even abstracting a combined prepare/commit epoch comparison out into a function?)

Comment on lines +252 to +263
if msg_last_committed_epoch != last_committed_epoch {
// If the msg contains an older last_committed_epoch than what
// we have, then out of order commits have occurred, as we know
// this prepare is later than what we've seen. This is a critical
// protocol invariant that has been violated.
//
// If the msg contains a newer last_committed_epoch than what
// we have, then we have likely missed a commit and are behind
// by more than one reconfiguration. The protocol currently does
// not allow this. Future protocol implementations may provide a
// capability to "jump" configurations.
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth modelling as separate enum variants similar to MissingPrepare/OutOfOrderCommit above?

Comment on lines +242 to +245
// Idempotent request
if msg.config == latest_prepare.config {
return Ok(None);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth noting that just calling Eq is fine here because there's no ancillary local-only data attached to this config -- this is deserialized data.

But also is it worth checking that the raw bytes are the same via e.g. a digest? Or do we consider byte sequences that deserialize to the same value to be the same

Comment on lines +283 to +300
let coordinating_epoch = cs.reconfigure_msg().epoch();
if coordinating_epoch > msg.config.epoch {
warn!(self.log, "Received stale prepare while coordinating";
"from" => %from,
"msg_epoch" => %msg.config.epoch,
"epoch" => %cs.reconfigure_msg().epoch()
);
return Ok(None);
}
if coordinating_epoch == msg.config.epoch {
let alarm = Alarm::DifferentNodesCoordinatingSameEpoch {
epoch: coordinating_epoch,
them: from,
us: self.platform_id.clone(),
};
error!(self.log, "{alarm}");
return Err(alarm);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies, having trouble following this too. again, it seems like there are 3-4 epochs at play here, and it's not clear to me what the relationship is among all of them. Maybe just an ASCII diagram at the top of the file/function would help.

Comment on lines +274 to +276
// to check for an alarm and pull it out. We could also return either
// an `Alarm` or a `ReconfigurationError` inside `Result::Err`. This is
// probably the best approach, but I'm open to other structures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky! Yeah I'd probably return an error enum with Alarm and ReconfigurationError as variants, and maybe even consider not implementing std::error::Error or fmt::Display to ensure that people don't accidentally convert it to anyhow::Error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants