Skip to content

Planner/builder split: Move image source decisions to the planner #8472

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: john/planning-input-fewer-options
Choose a base branch
from

Conversation

jgallagher
Copy link
Contributor

This is a draft because I haven't updated any of the tests yet; I wanted to get some a yay/nay on whether we like this direction before doing that work.

Builds on #8470.

The big changes here are:

  • BlueprintBuilder no longer decides on image sources for zones. All the sled_add_zone_* functions now take an extra image_source argument.
  • is_zone_ready_for_update() is now a method in the planner, not the builder. This is used both when deciding whether a zone can be updated and also when choosing the image source for new zones to be added.
  • can_zone_be_shut_down_safely() is now a separate method in the planner. This is only used when deciding whether a zone can be updated.
  • zone_image_source is now a method on TargetReleaseDescription. It returns an error if we have a TUF repo that doesn't have exactly one matching artifact for a given ZoneKind. In the planner, this bubbles out a few ways:
    • we'll fail planning if we try to add a zone that isn't in the target release TUF repo
    • we'll refuse to update a zone if it isn't in the target release TUF repo (but this won't fail planning)
    • we'll refuse to update Nexus (because it won't be able to tell whether whatever zone type is missing has itself been updated)

* split `is_zone_ready_for_update` into that and `can_zone_be_shut_down_safely`
* move both methods to the planner
* move zone image source method to `TargetReleaseDescription`
* log errors if a TUF repo is missing an artifact for a known zone kind
Comment on lines +1116 to +1119
if !self.can_zone_be_shut_down_safely(kind) {
return false;
}
match self.is_zone_ready_for_update(kind) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's very likely we'll want to propagate "reasons" out from here, beyond just logging.

E.g., with cockroach, if we say "You cannot update because you have underreplicated ranges", that is a planner-decision that's probably worth documenting. Otherwise, the planner will just say "can't update now", which isn't that great - not even really identifying which zone needs work, or why.

TL;DR: We probably want a stronger return type than a boolean here, for both of these conditions? something like

enum ShutdownSafety {
  CanShutdown,
  CannotShutdown {
    reason: String,
  } 
}

enum UpdateReadiness {
  Ready,
  NotReady {
    reason: String,
  } 
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely. I'm inclined to say that should come in with whatever we do to fix #8284? Anything I do here will certainly have to change for whatever the full reporting solution is, so I just did the simplest thing for now.

/// because the underlying disk / sled has been expunged" case. In this
/// case, we have no choice but to reconcile with the fact that the zone is
/// now gone.
fn can_zone_be_shut_down_safely(&self, zone_kind: ZoneKind) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass the whole zone object here? I know we don't need it yet, but it seems possible we'd want to actually answer "can this zone UUID terminate" rather than "can an arbitrary zone of this type" terminate.

(Maybe we don't care, I just think it's worth identifying, we're talking about "ANY zone of a particular kind", not necessarily a specific zone - even though, in the calling context, we are asking about a single specific zone)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. We have the whole zone object, so it seems harmless to pass the whole thing in? Then if we ever need anything about the specific zone, we have it hand. 👍

let mut updateable_zones =
out_of_date_zones.filter(|(_sled_id, zone, _new_image_source)| {
let kind = zone.zone_type.kind();
if !self.can_zone_be_shut_down_safely(kind) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This separation makes sense to me. I was a little worried with the old format, where adding the check for "live_nodes == 5" because that could prevent us from adding a CRDB node when "live_nodes == 4" (which would be bad).

But this structure should let us still make that check, and we only need to validate it in the "update-in-place" case, not necessarily in the "add-new-zone" case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants