Planner/builder split: Move image source decisions to the planner #8472

jgallagher · 2025-06-27T20:09:50Z

This is a draft because I haven't updated any of the tests yet; I wanted to get a yay/nay on whether we like this direction before doing that work.

Builds on #8470.

The big changes here are:

- BlueprintBuilder no longer decides on image sources for zones. All the sled_add_zone_* functions now take an extra image_source argument.
- is_zone_ready_for_update() is now a method in the planner, not the builder. This is used both when deciding whether a zone can be updated and also when choosing the image source for new zones to be added.
- can_zone_be_shut_down_safely() is now a separate method in the planner. This is only used when deciding whether a zone can be updated.
- zone_image_source is now a method on TargetReleaseDescription. It returns an error if we have a TUF repo that doesn't have exactly one matching artifact for a given ZoneKind. In the planner, this bubbles out a few ways:
  - we'll fail planning if we try to add a zone that isn't in the target release TUF repo
  - we'll refuse to update a zone if it isn't in the target release TUF repo (but this won't fail planning)
  - we'll refuse to update Nexus (because it won't be able to tell whether whatever zone type is missing has itself been updated)
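
To make the "exactly one matching artifact" rule concrete, here is a minimal sketch of what such a method could look like. Only the names `ZoneKind`, `TargetReleaseDescription`, and `zone_image_source` come from the description above; the variants, the `Artifact` and `ImageSource` shapes, the error type, and the behavior when there is no TUF repo are all simplified assumptions for illustration, not the real definitions.

```rust
// Hypothetical, simplified stand-ins for the real types. Only the names
// `ZoneKind`, `TargetReleaseDescription`, and `zone_image_source` appear
// in the PR description; everything else is assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneKind {
    Nexus,
    CockroachDb,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ImageSource {
    InstallDataset,
    Artifact { hash: String },
}

#[derive(Debug, Clone)]
pub struct Artifact {
    pub kind: ZoneKind,
    pub hash: String,
}

#[derive(Debug)]
pub enum TargetReleaseDescription {
    // Assumption: no TUF repo has been set as the target release.
    Initial,
    // Assumption: a TUF repo reduced to its list of zone artifacts.
    TufRepo(Vec<Artifact>),
}

#[derive(Debug, PartialEq, Eq)]
pub struct ArtifactCountError {
    pub kind: ZoneKind,
    pub found: usize,
}

impl TargetReleaseDescription {
    /// Returns the image source for `kind`, or an error if a TUF repo is
    /// present but does not contain exactly one matching artifact.
    pub fn zone_image_source(
        &self,
        kind: ZoneKind,
    ) -> Result<ImageSource, ArtifactCountError> {
        match self {
            // Assumed fallback when there is no TUF repo at all.
            Self::Initial => Ok(ImageSource::InstallDataset),
            Self::TufRepo(artifacts) => {
                let matching: Vec<&Artifact> =
                    artifacts.iter().filter(|a| a.kind == kind).collect();
                match matching.as_slice() {
                    [only] => Ok(ImageSource::Artifact {
                        hash: only.hash.clone(),
                    }),
                    // Zero matches or more than one match are both errors.
                    other => Err(ArtifactCountError {
                        kind,
                        found: other.len(),
                    }),
                }
            }
        }
    }
}
```

In this sketch the caller decides what an `Err` means: the add-zone path would propagate it and fail planning, while the update path would log it and skip the zone.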

* split `is_zone_ready_for_update` into that and `can_zone_be_shut_down_safely`
* move both methods to the planner
* move zone image source method to `TargetReleaseDescription`
* log errors if a TUF repo is missing an artifact for a known zone kind

smklein · 2025-06-27T20:38:07Z

nexus/reconfigurator/planning/src/planner.rs

+                if !self.can_zone_be_shut_down_safely(kind) {
+                    return false;
+                }
+                match self.is_zone_ready_for_update(kind) {


I think it's very likely we'll want to propagate "reasons" out from here, beyond just logging.

E.g., with cockroach, if we say "You cannot update because you have underreplicated ranges", that is a planner-decision that's probably worth documenting. Otherwise, the planner will just say "can't update now", which isn't that great - not even really identifying which zone needs work, or why.

TL;DR: We probably want a stronger return type than a boolean here, for both of these conditions? something like

    enum ShutdownSafety {
        CanShutdown,
        CannotShutdown { reason: String },
    }

    enum UpdateReadiness {
        Ready,
        NotReady { reason: String },
    }

Yeah, definitely. I'm inclined to say that should come in with whatever we do to fix #8284? Anything I do here will certainly have to change for whatever the full reporting solution is, so I just did the simplest thing for now.
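
For reference, a rough sketch of how the two proposed enums could carry a reason out of the planner instead of a bare boolean. The enum names and the underreplicated-ranges example come from the comment above; `PlannerState`, its fields, the Nexus rule, and `update_blocker` are hypothetical names made up for this illustration and are not part of this PR or of whatever #8284 ends up doing.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneKind {
    CockroachDb,
    Nexus,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ShutdownSafety {
    CanShutdown,
    CannotShutdown { reason: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum UpdateReadiness {
    Ready,
    NotReady { reason: String },
}

// Hypothetical planner-side inputs, invented for this sketch.
pub struct PlannerState {
    pub underreplicated_ranges: u64,
    pub out_of_date_non_nexus_zones: usize,
}

pub fn can_zone_be_shut_down_safely(
    state: &PlannerState,
    kind: ZoneKind,
) -> ShutdownSafety {
    match kind {
        ZoneKind::CockroachDb if state.underreplicated_ranges > 0 => {
            ShutdownSafety::CannotShutdown {
                reason: format!(
                    "cockroach has {} underreplicated ranges",
                    state.underreplicated_ranges
                ),
            }
        }
        _ => ShutdownSafety::CanShutdown,
    }
}

pub fn is_zone_ready_for_update(
    state: &PlannerState,
    kind: ZoneKind,
) -> UpdateReadiness {
    match kind {
        // Assumed rule for illustration: Nexus waits for everything else.
        ZoneKind::Nexus if state.out_of_date_non_nexus_zones > 0 => {
            UpdateReadiness::NotReady {
                reason: format!(
                    "{} non-Nexus zones are still out of date",
                    state.out_of_date_non_nexus_zones
                ),
            }
        }
        _ => UpdateReadiness::Ready,
    }
}

/// Returns the first reason an update is blocked, or `None` if it can go.
pub fn update_blocker(state: &PlannerState, kind: ZoneKind) -> Option<String> {
    if let ShutdownSafety::CannotShutdown { reason } =
        can_zone_be_shut_down_safely(state, kind)
    {
        return Some(reason);
    }
    match is_zone_ready_for_update(state, kind) {
        UpdateReadiness::Ready => None,
        UpdateReadiness::NotReady { reason } => Some(reason),
    }
}
```

A real reporting solution would probably want structured reasons rather than `String`, and would need to say which zone is blocked; this only shows the shape of the return types.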

smklein · 2025-06-27T20:40:53Z

nexus/reconfigurator/planning/src/planner.rs

+    /// because the underlying disk / sled has been expunged" case. In this
+    /// case, we have no choice but to reconcile with the fact that the zone is
+    /// now gone.
+    fn can_zone_be_shut_down_safely(&self, zone_kind: ZoneKind) -> bool {


Should we pass the whole zone object here? I know we don't need it yet, but it seems possible we'd want to actually answer "can this zone UUID terminate" rather than "can an arbitrary zone of this type terminate".

(Maybe we don't care, I just think it's worth identifying, we're talking about "ANY zone of a particular kind", not necessarily a specific zone - even though, in the calling context, we are asking about a single specific zone)

Good question. We have the whole zone object, so it seems harmless to pass the whole thing in? Then if we ever need anything about the specific zone, we have it on hand. 👍
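
A small sketch of the difference being discussed, assuming a simplified zone type. `Zone`, `Planner`, `cockroach_healthy`, and `pinned_zone_ids` are hypothetical names invented here; the point is only that taking the whole zone lets the method apply a per-zone rule later without another signature change.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneKind {
    CockroachDb,
    InternalDns,
}

// Hypothetical stand-in for the blueprint's zone config.
pub struct Zone {
    pub id: u64,
    pub kind: ZoneKind,
}

// Hypothetical planner state, invented for this sketch.
pub struct Planner {
    pub cockroach_healthy: bool,
    // Zones that must not be shut down individually (made-up example
    // of a per-zone rule).
    pub pinned_zone_ids: HashSet<u64>,
}

impl Planner {
    /// Takes the whole zone rather than just its kind, so both
    /// kind-level and zone-level rules fit behind one signature.
    pub fn can_zone_be_shut_down_safely(&self, zone: &Zone) -> bool {
        // Kind-level rule: all the draft needs today.
        let kind_ok = match zone.kind {
            ZoneKind::CockroachDb => self.cockroach_healthy,
            ZoneKind::InternalDns => true,
        };
        // Zone-level rule: only expressible with the specific zone in hand.
        kind_ok && !self.pinned_zone_ids.contains(&zone.id)
    }
}
```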

smklein · 2025-06-27T20:44:31Z

nexus/reconfigurator/planning/src/planner.rs

+        let mut updateable_zones =
+            out_of_date_zones.filter(|(_sled_id, zone, _new_image_source)| {
+                let kind = zone.zone_type.kind();
+                if !self.can_zone_be_shut_down_safely(kind) {


This separation makes sense to me. I was a little worried about the old format, where adding the check for "live_nodes == 5" could have prevented us from adding a CRDB node when "live_nodes == 4" (which would be bad).

But this structure should let us still make that check, and we only need to validate it in the "update-in-place" case, not necessarily in the "add-new-zone" case.
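
The concern above can be sketched as follows, assuming a target of 5 CockroachDB nodes (taken from the "live_nodes == 5" example) and a deliberately simplified model in which shutdown safety is the only gate. `ClusterStatus`, `Action`, and `is_action_allowed` are hypothetical names for this illustration; the real planner applies other checks on both paths.

```rust
// Assumption from the "live_nodes == 5" example above.
pub const TARGET_COCKROACH_NODES: usize = 5;

// Hypothetical summary of cluster health.
pub struct ClusterStatus {
    pub live_nodes: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    AddZone,
    UpdateInPlace,
}

/// Shutdown-safety check: only meaningful when an existing zone is about
/// to be taken down.
pub fn can_shut_down_cockroach_zone(status: &ClusterStatus) -> bool {
    status.live_nodes == TARGET_COCKROACH_NODES
}

/// With the split, the shutdown-safety check gates updates in place but is
/// never consulted when adding a new zone.
pub fn is_action_allowed(action: Action, status: &ClusterStatus) -> bool {
    match action {
        Action::AddZone => true,
        Action::UpdateInPlace => can_shut_down_cockroach_zone(status),
    }
}
```

With `live_nodes == 4`, this allows adding a fifth node but refuses to update an existing one in place, which is the behavior the old combined check could not express.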

jgallagher added 2 commits June 27, 2025 15:34

blueprint builder: require caller to provide image sources for new zones

c5499e2

jgallagher requested review from sunshowers, davepacheco, plotnick and smklein June 27, 2025 20:09

smklein reviewed Jun 27, 2025

jgallagher mentioned this pull request Jun 27, 2025

Blueprint PlanningInput: Fewer Options #8470

Open
