[wip] [23/n] blueprint planner logic for mupdate overrides #8456


Draft
wants to merge 2 commits into base: sunshowers/spr/main.wip-20n-blueprint-planner-logic-for-mupdate-overrides
Conversation

@sunshowers (Contributor) commented Jun 26, 2025

Start implementing the blueprint planner logic for mupdate overrides.

There's still a lot of work to do:

  • Ensure the logic is sound, especially around errors, missing inventory, etc.
  • Move zone image sources back to artifacts once we know they've been distributed to a sled. (How do we know they've been distributed to a sled?) Done in [wip] [22/n] [reconfigurator-planning] support no-op image source updates #8467.
  • Is the set of reasons for not proceeding with the other steps accurate? It does seem to me that we need to wait until both conditions are met, but I'd like to double-check.
  • Lots more tests.
  • I don't think this can land before the logic to reset the mupdate override field in sled-agent lands. I also think this may have to land simultaneously with the code to redirect to the install dataset within sled-agent.

Depends on:

Created using spr 1.3.6-beta.1
@sunshowers marked this pull request as draft June 26, 2025 05:26
Created using spr 1.3.6-beta.1
Comment on lines +1283 to +1284
let mut sleds_with_override = BTreeSet::new();
for sled_id in self.input.all_sled_ids(SledFilter::InService) {
@sunshowers (Contributor, Author) Jun 26, 2025


Thinking about this, I'm wondering what would happen here if a sled goes away in the middle of this process, disappearing from inventory. In that case the remove_mupdate_override field never gets cleared from the blueprint.

We would do:

  • expunge the sled in the first blueprint
  • when executed, the sled policy will be updated to Expunged in the planning input
  • then next planning cycle it'll no longer be in the InService set

So I think we'll eventually converge -- it'll just take a couple cycles. (A TODO is to add a test for this.)
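A minimal sketch of that convergence argument, using hypothetical stand-in types rather than the planner's real ones (SledPolicy, SledId, and the helper here are assumptions for illustration only):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the planner's types, only to illustrate
// why filtering on InService makes the stale override converge away.
#[derive(Clone, Copy, PartialEq, Eq)]
enum SledPolicy {
    InService,
    Expunged,
}

type SledId = u32;

/// The mupdate-override pass only visits in-service sleds, so a sled
/// expunged in an earlier blueprint drops out of the set on the next
/// planning cycle and its stale override is never acted on again.
fn sleds_with_override_candidates(
    policies: &BTreeMap<SledId, SledPolicy>,
) -> BTreeSet<SledId> {
    policies
        .iter()
        .filter(|(_, policy)| **policy == SledPolicy::InService)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let mut policies =
        BTreeMap::from([(1, SledPolicy::InService), (2, SledPolicy::InService)]);
    assert_eq!(sleds_with_override_candidates(&policies).len(), 2);

    // Cycle N: sled 2 disappears from inventory and gets expunged.
    // Cycle N+1: the planning input reflects the Expunged policy, so
    // sled 2 is no longer in the InService set.
    policies.insert(2, SledPolicy::Expunged);
    assert_eq!(sleds_with_override_candidates(&policies), BTreeSet::from([1]));
}
```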

Comment on lines +758 to +761
let old_image_source = self.zones.set_zone_image_source(
&zone_id,
BlueprintZoneImageSource::InstallDataset,
)?;
@sunshowers (Contributor, Author) Jun 26, 2025


RFD 556 says:

> Wherever the planner uses the target release, it is instead ignored if its generation number is not greater than min_release_generation (if set).

As discussed in Tuesday's watercooler, it's a bit more complex than that -- what we want is to use the install dataset only on sleds that have been mupdated, since on other sleds the install dataset may be woefully out of date.

I think I want to make the claim that this code may actually be sufficient as it stands. I don't think we need to do any redirects other than this one (which is admittedly edge-triggered), as long as we prevent new zones from being set up at all while the system is recovering from the mupdate.
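A rough sketch of that claim, with made-up types standing in for the real blueprint structures (ImageSource and the per-sled flag are assumptions, not the planner's API): the redirect only fires on sleds known to have been mupdated, and everything else is left untouched.

```rust
// Hypothetical stand-ins used only for this sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: u64 },
}

/// Edge-triggered redirect: only a sled whose inventory reports a
/// mupdate override gets its zones pointed at the install dataset.
/// On any other sled the install dataset may be arbitrarily stale,
/// so its zone sources are left alone.
fn redirect_zones_for_sled(sled_was_mupdated: bool, zone_sources: &mut [ImageSource]) {
    if !sled_was_mupdated {
        return;
    }
    for source in zone_sources.iter_mut() {
        *source = ImageSource::InstallDataset;
    }
}

fn main() {
    let mut mupdated = vec![ImageSource::Artifact { hash: 0xabc }];
    redirect_zones_for_sled(true, &mut mupdated);
    assert_eq!(mupdated, vec![ImageSource::InstallDataset]);

    let mut untouched = vec![ImageSource::Artifact { hash: 0xdef }];
    redirect_zones_for_sled(false, &mut untouched);
    assert_eq!(untouched, vec![ImageSource::Artifact { hash: 0xdef }]);
}
```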

// override that was set in the above branch. We can remove the
// override from the blueprint.
self.set_remove_mupdate_override(None);
// TODO: change zone sources from InstallDataset to Artifact
Contributor


I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)

Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".

@sunshowers (Contributor, Author) Jun 26, 2025


> I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)

Good question -- we do this one zone at a time currently, and I guess this would be an opportunity to do a bulk replace. (But why not always do a level-triggered bulk replace?)

> Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".

Yeah, this is reasonable.
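A sketch of that "don't bounce the zone if the bits didn't change" idea, assuming (purely for illustration) that sled-agent could compare the old and new image sources by artifact hash; the real types may carry the hash differently.

```rust
// Hypothetical model: both variants carry the artifact hash so the
// comparison is possible; this is not sled-agent's actual API.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ZoneImage {
    InstallDataset { hash: [u8; 32] },
    Artifact { hash: [u8; 32] },
}

impl ZoneImage {
    fn hash(&self) -> &[u8; 32] {
        match self {
            ZoneImage::InstallDataset { hash } | ZoneImage::Artifact { hash } => hash,
        }
    }
}

/// Switching from "install dataset with hash X" to "artifact with
/// hash X" is only a bookkeeping change: the zone is already running
/// the right bits, so there's no need to restart it.
fn needs_zone_bounce(old: &ZoneImage, new: &ZoneImage) -> bool {
    old.hash() != new.hash()
}

fn main() {
    let h = [0u8; 32];
    assert!(!needs_zone_bounce(
        &ZoneImage::InstallDataset { hash: h },
        &ZoneImage::Artifact { hash: h },
    ));
}
```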

);
}

// TODO: Do the same for RoT/SP/host OS.
Contributor


I think this will be as simple as:

  • clear any PendingMgsUpdates for this sled
  • change the host phase 2 in the OmicronSledConfig to the equivalent of InstallDataset for zones (this doesn't exist yet but will be coming soon)

What I'm less sure about is what happens if there are PendingMgsUpdates in the current target blueprint concurrently with a mupdate happening to that sled. Maybe wicket and Nexus end up dueling? If the mupdate completes and changes the contents of any of the target slots, Nexus's prechecks should start failing; but if the mupdate happens not to change the target slots, maybe the prechecks still pass and Nexus starts trying to update it again as soon as it comes online?
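A sketch of the two steps listed above, with made-up types; the "host phase 2 equivalent of InstallDataset" named here doesn't exist yet, exactly as the comment notes, so both the field and variant names are assumptions.

```rust
// Hypothetical stand-ins; the strings in pending_mgs_updates are just
// labels for queued RoT/SP/host OS updates.
#[derive(Clone, Copy)]
enum HostPhase2 {
    /// Hypothetical "keep whatever the mupdate wrote" source, the
    /// host-OS analogue of InstallDataset for zones.
    CurrentContents,
    Artifact { hash: u64 },
}

struct SledConfigSketch {
    pending_mgs_updates: Vec<String>,
    host_phase_2: HostPhase2,
}

/// For a sled recovering from a mupdate: drop any queued MGS-driven
/// updates and point host phase 2 at the contents the mupdate
/// installed, until the system has recovered.
fn handle_mupdated_sled(config: &mut SledConfigSketch) {
    config.pending_mgs_updates.clear();
    config.host_phase_2 = HostPhase2::CurrentContents;
}

fn main() {
    let mut config = SledConfigSketch {
        pending_mgs_updates: vec!["rot-update".to_string()],
        host_phase_2: HostPhase2::Artifact { hash: 0x1234 },
    };
    handle_mupdated_sled(&mut config);
    assert!(config.pending_mgs_updates.is_empty());
}
```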

// If do_plan_mupdate_override returns Waiting, we don't plan *any*
// additional steps until the system has recovered.
self.do_plan_add()?;
self.do_plan_decommission()?;
Contributor


I think we could still decommission things if there's a mupdate override in place? This only acts on sleds or disks that an operator has explicitly told us are gone, and is basically a followup to do_plan_expunge(). (Maybe this step should be ordered before do_sled_add() anyway? I don't think there are any dependencies between them...)
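A sketch of that ordering question, with the step names taken from the snippet above but the gating structure assumed: decommission runs unconditionally, while add is skipped whenever the mupdate-override pass says to wait.

```rust
// Hypothetical gate returned by the mupdate-override pass.
#[derive(Clone, Copy, PartialEq, Eq)]
enum MupdateOverrideStatus {
    NoOverrides,
    Waiting,
}

/// Decommission only acts on sleds/disks an operator already declared
/// gone, so it can run regardless; add (and later steps) wait until no
/// sled is still recovering from a mupdate override.
fn plan_steps(status: MupdateOverrideStatus) -> Vec<&'static str> {
    let mut steps = vec!["do_plan_expunge", "do_plan_decommission"];
    if status == MupdateOverrideStatus::NoOverrides {
        steps.push("do_plan_add");
    }
    steps
}

fn main() {
    assert_eq!(
        plan_steps(MupdateOverrideStatus::Waiting),
        vec!["do_plan_expunge", "do_plan_decommission"]
    );
}
```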

// generation table -- one of the invariants of the target
// release generation is that it only moves forward.
//
// In this case we warn but set the value.
Contributor


This doesn't seem right; we should probably bail out of planning entirely in this case, right? This seems like an "I don't know what's going on in the world" kind of thing that in a simpler system we'd assert on?
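A sketch of bailing out instead of warning, with a made-up error type; the real planner's error handling and generation types will differ.

```rust
// Hypothetical error type for this sketch.
#[derive(Debug)]
struct PlanningError(String);

/// The target release generation is invariantly monotonic. If the
/// value we're about to set would move it backwards, our view of the
/// world is wrong, so stop planning entirely rather than warning and
/// pressing on.
fn check_target_release_generation(current: u64, proposed: u64) -> Result<(), PlanningError> {
    if proposed < current {
        return Err(PlanningError(format!(
            "target release generation went backwards: {current} -> {proposed}"
        )));
    }
    Ok(())
}

fn main() {
    assert!(check_target_release_generation(5, 4).is_err());
    assert!(check_target_release_generation(5, 5).is_ok());
}
```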

@sunshowers changed the title from "[wip] [20/n] blueprint planner logic for mupdate overrides" to "[wip] [23/n] blueprint planner logic for mupdate overrides" on Jun 27, 2025
sunshowers added a commit that referenced this pull request Jun 27, 2025
In #8456 we'll block the `do_plan_add` step on whether the system is
currently recovering from a mupdate override. But there's no reason to
block the `do_plan_decommission` step on that.

This is easiest expressed by moving decommission to before add.