-
Notifications
You must be signed in to change notification settings - Fork 45
Planner/builder split: Move image source decisions to the planner #8472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: john/planning-input-fewer-options
Are you sure you want to change the base?
Planner/builder split: Move image source decisions to the planner #8472
Conversation
* split `is_zone_ready_for_update` into that and `can_zone_be_shut_down_safely` * move both methods to the planner * move zone image source method to `TargetReleaseDescription` * log errors if a TUF repo is missing an artifact for a known zone kind
if !self.can_zone_be_shut_down_safely(kind) { | ||
return false; | ||
} | ||
match self.is_zone_ready_for_update(kind) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's very likely we'll want to propagate "reasons" out from here, beyond just logging.
E.g., with cockroach, if we say "You cannot update because you have underreplicated ranges", that is a planner-decision that's probably worth documenting. Otherwise, the planner will just say "can't update now", which isn't that great - not even really identifying which zone needs work, or why.
TL;DR: We probably want a stronger return type than a boolean here, for both of these conditions? something like
enum ShutdownSafety {
CanShutdown,
CannotShutdown {
reason: String,
}
}
enum UpdateReadiness {
Ready,
NotReady {
reason: String,
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely. I'm inclined to say that should come in with whatever we do to fix #8284? Anything I do here will certainly have to change for whatever the full reporting solution is, so I just did the simplest thing for now.
/// because the underlying disk / sled has been expunged" case. In this | ||
/// case, we have no choice but to reconcile with the fact that the zone is | ||
/// now gone. | ||
fn can_zone_be_shut_down_safely(&self, zone_kind: ZoneKind) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pass the whole zone
object here? I know we don't need it yet, but it seems possible we'd want to actually answer "can this zone UUID terminate" rather than "can an arbitrary zone of this type" terminate.
(Maybe we don't care, I just think it's worth identifying, we're talking about "ANY zone of a particular kind", not necessarily a specific zone - even though, in the calling context, we are asking about a single specific zone)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. We have the whole zone object, so it seems harmless to pass the whole thing in? Then if we ever need anything about the specific zone, we have it hand. 👍
let mut updateable_zones = | ||
out_of_date_zones.filter(|(_sled_id, zone, _new_image_source)| { | ||
let kind = zone.zone_type.kind(); | ||
if !self.can_zone_be_shut_down_safely(kind) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This separation makes sense to me. I was a little worried with the old format, where adding the check for "live_nodes == 5" because that could prevent us from adding a CRDB node when "live_nodes == 4" (which would be bad).
But this structure should let us still make that check, and we only need to validate it in the "update-in-place" case, not necessarily in the "add-new-zone" case.
This is a draft because I haven't updated any of the tests yet; I wanted to get some a yay/nay on whether we like this direction before doing that work.
Builds on #8470.
The big changes here are:
BlueprintBuilder
no longer decides on image sources for zones. All thesled_add_zone_*
functions now take an extraimage_source
argument.is_zone_ready_for_update()
is now a method in the planner, not the builder. This is used both when deciding whether a zone can be updated and also when choosing the image source for new zones to be added.can_zone_be_shut_down_safely()
is now a separate method in the planner. This is only used when deciding whether a zone can be updated.zone_image_source
is now a method onTargetReleaseDescription
. It returns an error if we have a TUF repo that doesn't have exactly one matching artifact for a givenZoneKind
. In the planner, this bubbles out a few ways: