[sled-agent-config-reconciler] Flesh out zone start/stop #8140


Merged: 5 commits into main on May 14, 2025

Conversation

jgallagher (Contributor) commented May 12, 2025

Adds an internal OmicronZones type that is analogous to the ZoneMap in the current service manager. There are a couple of behavioral differences between this and what ServiceManager does:

  • Starting and stopping zones are broken up into separate methods (so the reconciler can do disk/dataset work in between stopping zones and starting zones)
  • ServiceManager currently tries once to shut down a zone and logs any errors, but otherwise just pretends it succeeds. (Discussion on sled-agent: Should PUT /omicron-zones fail if it cannot remove a zone? #6776, which largely ends with "we should do a config reconciler" 😂.) This attempts to remember failures to shut a zone down (or clean up after it) and retry them later. I'm not sure all of this is right, or whether it might even introduce some new kind of failure mode (e.g., a zone getting stuck in some partially shut down state that we previously would have ignored?).

There are some minor shuffling-deck-chairs changes to other crates to support some of the testing here. As before, a decent chunk of the code here (maybe 40% or so?) is tests.
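A minimal sketch of the retry bookkeeping described above, with hypothetical names (the real `OmicronZones` and zone-state types in `sled-agent-config-reconciler` are richer): instead of logging a failed shutdown and forgetting about it, the state map remembers the failure so the next reconciler pass can retry.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified zone states; illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum ZoneState {
    Running,
    PartiallyShutDown { err: String },
}

struct OmicronZones {
    zones: BTreeMap<String, ZoneState>,
}

impl OmicronZones {
    fn new() -> Self {
        Self { zones: BTreeMap::new() }
    }

    fn mark_running(&mut self, zone: &str) {
        self.zones.insert(zone.to_string(), ZoneState::Running);
    }

    // Record a shutdown failure instead of pretending it succeeded, so a
    // later reconciler pass can retry the cleanup.
    fn record_shutdown_failure(&mut self, zone: &str, err: String) {
        self.zones
            .insert(zone.to_string(), ZoneState::PartiallyShutDown { err });
    }

    // Zones whose shutdown (or cleanup) still needs a retry.
    fn zones_needing_retry(&self) -> Vec<&str> {
        self.zones
            .iter()
            .filter(|(_, s)| !matches!(s, ZoneState::Running))
            .map(|(name, _)| name.as_str())
            .collect()
    }
}

fn main() {
    let mut zones = OmicronZones::new();
    zones.mark_running("ntp");
    zones.record_shutdown_failure("crucible", "halt failed".to_string());
    assert_eq!(zones.zones_needing_retry(), vec!["crucible"]);
}
```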

Comment on lines +324 to +334
```rust
// XXXNTP - This could be replaced with a direct connection to the
// daemon using a patched version of the chrony_candm crate to allow
// a custom server socket path. From the GZ, it should be possible to
// connect to the UNIX socket at
// format!("{}/var/run/chrony/chronyd.sock", ntp_zone.root())

let stdout = running_ntp_zone
    .run_cmd(&["/usr/bin/chronyc", "-c", "tracking"])
    .map_err(TimeSyncError::ExecuteChronyc)?;

let v: Vec<&str> = stdout.split(',').collect();
```
Contributor:

Would it make sense to have an ntp_admin server, sort of like the ones we have for the clickhouse, keeper, and cockroachdb zones? It'd be nice to be able to make an API call instead of zlogin-ing to the zone and running a command, no?

jgallagher (Author):

There's also the commented-out suggestion that we could run chronyc directly in the gz but point it to the socket in the NTP zone. This code is just moved as-is from sled-agent and I'd rather not change it as part of this PR, but yeah we should do better here.

I do think we'll definitely want an ntp_admin server when we get around to reworking RSS (e.g., to address #4587) - it'd be nice for wicket to expose some of the NTP debugging things we do by hand today (e.g., ping, resolving DNS), and that would probably be easiest via an ntp_admin-like thing.
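For illustration, here is how the comma-split in the snippet above might be used to pull a field out of `chronyc -c tracking` CSV output. The sample line and field index are assumptions based on chrony's CSV format, not code taken from sled-agent.

```rust
// Parse the stratum out of `chronyc -c tracking` CSV output.
// ASSUMPTION: stratum is the third comma-separated field; check chrony's
// documentation before relying on this index.
fn parse_stratum(stdout: &str) -> Option<u8> {
    let v: Vec<&str> = stdout.split(',').collect();
    v.get(2)?.trim().parse().ok()
}

fn main() {
    // Hypothetical sample line, truncated.
    let sample = "A9FEA97B,169.254.169.123,3,1715700000.0,0.000012";
    assert_eq!(parse_stratum(sample), Some(3));
    // Malformed output yields None rather than panicking.
    assert_eq!(parse_stratum("a,b"), None);
}
```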

andrewjstone (Contributor) left a comment:

Great work here, John. There are a lot of tricky bits and unknowns, but this feels like the right direction going forward.

```rust
    }
}

impl ZoneFacilities for FakeZoneFacilities {
```
Contributor:

This trait-based fake impl setup is pretty nice compared to what we had before.

The tests are really easy to read and reason about. In fact I read them before the code and that worked well.
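As a sketch of why the trait-based fake reads so well, assuming a one-method `ZoneFacilities` trait (the real trait has more methods and different signatures): the fake records calls and can be told to fail, so tests become plain setup/act/assert.

```rust
use std::cell::RefCell;

trait ZoneFacilities {
    fn halt_zone(&self, name: &str) -> Result<(), String>;
}

struct RealZoneFacilities;

impl ZoneFacilities for RealZoneFacilities {
    fn halt_zone(&self, _name: &str) -> Result<(), String> {
        // In production this would call into zone management; stubbed here.
        Ok(())
    }
}

// The fake records calls and can be told to fail on demand.
#[derive(Default)]
struct FakeZoneFacilities {
    halted: RefCell<Vec<String>>,
    fail_next: RefCell<bool>,
}

impl ZoneFacilities for FakeZoneFacilities {
    fn halt_zone(&self, name: &str) -> Result<(), String> {
        if *self.fail_next.borrow() {
            return Err(format!("failed to halt {name}"));
        }
        self.halted.borrow_mut().push(name.to_string());
        Ok(())
    }
}

fn main() {
    assert!(RealZoneFacilities.halt_zone("ntp").is_ok());

    let fake = FakeZoneFacilities::default();
    fake.halt_zone("ntp").unwrap();
    assert_eq!(fake.halted.borrow().len(), 1);
    assert_eq!(fake.halted.borrow()[0], "ntp");

    *fake.fail_next.borrow_mut() = true;
    assert!(fake.halt_zone("crucible").is_err());
}
```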


```rust
//! Module for Omicron zones.
//!
//! There is no separate tokio task here; our parent reconciler task owns this
```

```rust
self.shut_down_zones_if_needed_impl(
    desired_zones,
    sled_agent_facilities,
    &RealZoneFacilities,
```
Contributor:

I'm a big fan of this pattern, and use it in the new trust quorum code. The underlying impl can take any parameter implementing the trait, but production code is forced at compile time to use the real version, rather than passing it in, which could lead to mistakes.
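The pattern boils down to a short sketch (the method names match the snippet above, but the signatures and types here are invented for illustration): the public method hardwires `RealZoneFacilities`, while the generic `_impl` method is what unit tests exercise with a fake.

```rust
trait ZoneFacilities {
    fn halt_zone(&self, name: &str) -> Result<(), String>;
}

struct RealZoneFacilities;
impl ZoneFacilities for RealZoneFacilities {
    fn halt_zone(&self, _name: &str) -> Result<(), String> {
        Ok(())
    }
}

struct Reconciler;

impl Reconciler {
    // Production entry point: callers cannot accidentally supply a fake,
    // because the real facilities are baked in right here.
    pub fn shut_down_zones_if_needed(&mut self, zone: &str) -> Result<(), String> {
        self.shut_down_zones_if_needed_impl(zone, &RealZoneFacilities)
    }

    // Generic impl: unit tests call this directly with a fake.
    fn shut_down_zones_if_needed_impl<T: ZoneFacilities>(
        &mut self,
        zone: &str,
        facilities: &T,
    ) -> Result<(), String> {
        facilities.halt_zone(zone)
    }
}

fn main() {
    let mut reconciler = Reconciler;
    assert!(reconciler.shut_down_zones_if_needed("ntp").is_ok());
}
```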

```rust
        }

        // This is a zone we've tried to start before; try again!
        Some(ZoneState::FailedToStart(err)) => {
```
Contributor:

I was slightly confused by the difference between the behavior here and that in PartiallyShutDown below. However, it actually makes sense: if we fail to shut down a zone in FailedToStart inside shut_down_zones_if_needed_impl, the state transitions to ZoneState::PartiallyShutDown. There's never an instance where we try to start a zone, fail, try to shut it down, fail, and then try to start it again while partially failed, which alleviates my initial concern.

It may be worth adding a note about this behavior, and also specifying somewhere that start_zones_if_needed and shut_down_zones_if_needed cannot be called concurrently. I think we'd run into issues if they did, and I don't believe running them concurrently is your intent.

jgallagher (Author):

> It may be worth adding a note about this behavior

Yeah, can do; this is not the easiest thing to follow (and I'm still not 100% sure it's correct in all the ways we could fail).

> specifying somewhere that start_zones_if_needed and shut_down_zones_if_needed cannot be called concurrently

Both of these take &mut self so can't be called concurrently! Do you think it should still be noted explicitly?

Contributor:

> Both of these take &mut self so can't be called concurrently! Do you think it should still be noted explicitly?

Oh, no that makes sense. No need to document explicitly.
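The `&mut self` point can be shown in a toy example (the type and fields here are made up): because both methods take an exclusive borrow, the compiler rejects any attempt to run them concurrently on the same value.

```rust
struct Zones {
    running: u32,
}

impl Zones {
    fn start_zones_if_needed(&mut self) {
        self.running += 1;
    }

    fn shut_down_zones_if_needed(&mut self) {
        self.running = self.running.saturating_sub(1);
    }
}

fn main() {
    let mut z = Zones { running: 0 };
    z.start_zones_if_needed();
    assert_eq!(z.running, 1);
    // Holding two overlapping `&mut z` borrows, e.g. to drive both methods
    // from concurrent futures, would fail to compile; sequential calls are
    // the only option.
    z.shut_down_zones_if_needed();
    assert_eq!(z.running, 0);
}
```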

jgallagher (Author):

> It may be worth adding a note about this behavior

Yeah, can do; this is not the easiest thing to follow (and I'm still not 100% sure it's correct in all the ways we could fail).

Added in b2583a7

jgallagher merged commit 6a56f00 into main on May 14, 2025
17 checks passed
jgallagher deleted the john/sled-agent-config-reconciler-zones branch on May 14, 2025 at 17:21