[sled-agent-config-reconciler] Flesh out zone start/stop #8140
Conversation
// XXXNTP - This could be replaced with a direct connection to the
// daemon using a patched version of the chrony_candm crate to allow
// a custom server socket path. From the GZ, it should be possible to
// connect to the UNIX socket at
// format!("{}/var/run/chrony/chronyd.sock", ntp_zone.root())

let stdout = running_ntp_zone
    .run_cmd(&["/usr/bin/chronyc", "-c", "tracking"])
    .map_err(TimeSyncError::ExecuteChronyc)?;

let v: Vec<&str> = stdout.split(',').collect();
Would it make sense to have an `ntp_admin` server, sort of like the ones we have for the clickhouse, keeper, and cockroachdb zones? It'd be nice to be able to make an API call instead of zlogin-ing into the zone and running a command, no?
There's also the commented-out suggestion that we could run chronyc directly in the gz but point it to the socket in the NTP zone. This code is just moved as-is from sled-agent and I'd rather not change it as part of this PR, but yeah we should do better here.
I do think we'll definitely want an `ntp_admin` server when we get around to reworking RSS (e.g., to address #4587) - it'd be nice for wicket to expose some of the NTP debugging things we do by hand today (e.g., ping, resolving DNS), and that would probably be easiest via an `ntp_admin`-like thing.
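(For context on the `chronyc -c tracking` call in the diff above: the command emits a single comma-separated line, and the moved code splits it into fields. Below is a minimal, illustrative sketch of how such a line might be interpreted; the field positions and thresholds are assumptions for the example, not necessarily what sled-agent actually checks.)

```rust
/// Illustrative-only interpretation of `chronyc -c tracking` CSV output.
/// ASSUMPTION: field 4 is the current system-time offset in seconds and field
/// 9 is the estimated skew; the 0.05 thresholds are placeholders, not the
/// values sled-agent actually uses.
fn is_synchronized(tracking_csv: &str) -> Option<bool> {
    let fields: Vec<&str> = tracking_csv.trim().split(',').collect();
    if fields.len() <= 9 {
        return None;
    }
    let correction: f64 = fields[4].parse().ok()?;
    let skew: f64 = fields[9].parse().ok()?;
    // Treat an all-zero report as "not yet synced"; otherwise require the
    // offset and skew to be within the (made-up) bounds.
    Some((correction != 0.0 || skew != 0.0) && correction.abs() <= 0.05 && skew <= 0.05)
}

fn main() {
    // A line shaped like chronyc's CSV tracking output (values fabricated).
    let line = "A9FEA97B,169.254.254.1,3,1716400000.0,0.000123,0.000050,\
                0.000200,-1.5,0.01,0.02,0.003,0.004,64.2,Normal";
    println!("synced = {:?}", is_synchronized(line));
}
```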
Great work here, John. There are a lot of tricky bits and unknowns, but this feels like the right direction moving forward.
}
}

impl ZoneFacilities for FakeZoneFacilities {
This trait-based fake impl setup is pretty nice compared to what we had before.
The tests are really easy to read and reason about. In fact, I read them before the code, and that worked well.
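(As a rough illustration of the shape being praised here, a minimal trait-plus-fake sketch follows. The method name and the failure-injection knob are hypothetical, not the PR's actual `ZoneFacilities` API.)

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the trait in the PR; the real ZoneFacilities API
// has different methods and error types.
trait ZoneFacilities {
    fn halt_zone(&self, zone_name: &str) -> Result<(), String>;
}

// Production implementation: would actually halt the zone (via zoneadm, etc.).
struct RealZoneFacilities;

impl ZoneFacilities for RealZoneFacilities {
    fn halt_zone(&self, zone_name: &str) -> Result<(), String> {
        println!("halting {zone_name}");
        Ok(())
    }
}

// Test fake: lets a test queue up a failure for the next halt attempt.
#[derive(Default)]
struct FakeZoneFacilities {
    next_halt_response: RefCell<Option<Result<(), String>>>,
}

impl FakeZoneFacilities {
    fn push_halt_response(&self, response: Result<(), String>) {
        *self.next_halt_response.borrow_mut() = Some(response);
    }
}

impl ZoneFacilities for FakeZoneFacilities {
    fn halt_zone(&self, _zone_name: &str) -> Result<(), String> {
        // Return the queued response, or succeed by default.
        self.next_halt_response.borrow_mut().take().unwrap_or(Ok(()))
    }
}

fn main() {
    let fake = FakeZoneFacilities::default();
    fake.push_halt_response(Err("injected failure".to_string()));
    assert!(fake.halt_zone("oxz_ntp").is_err());
    assert!(fake.halt_zone("oxz_ntp").is_ok());
}
```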
//! Module for Omicron zones.
//!
//! There is no separate tokio task here; our parent reconciler task owns this
✅
self.shut_down_zones_if_needed_impl(
    desired_zones,
    sled_agent_facilities,
    &RealZoneFacilities,
I'm a big fan of this pattern, and use it in the new trust quorum code. The underlying impl can take any parameter of the correct trait, but the production code is forced at compile time to use the real version, rather than having callers pass it in, which could lead to mistakes.
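(A minimal sketch of that pattern, with illustrative names and a deliberately simplified shutdown path; the real `OmicronZones` and `ZoneFacilities` types in the PR carry much more state and error handling.)

```rust
use std::collections::BTreeMap;

// Illustrative trait; the real ZoneFacilities has more methods.
trait ZoneFacilities {
    fn halt_zone(&self, zone_name: &str) -> Result<(), String>;
}

struct RealZoneFacilities;

impl ZoneFacilities for RealZoneFacilities {
    fn halt_zone(&self, _zone_name: &str) -> Result<(), String> {
        // Production code would actually halt the zone here.
        Ok(())
    }
}

// Illustrative stand-in for the PR's OmicronZones type.
#[derive(Default)]
struct OmicronZones {
    running: BTreeMap<String, ()>,
}

impl OmicronZones {
    // Public API: hard-codes RealZoneFacilities, so production callers can't
    // accidentally supply a fake.
    pub fn shut_down_zones_if_needed(&mut self, desired: &[&str]) {
        self.shut_down_zones_if_needed_impl(desired, &RealZoneFacilities)
    }

    // Generic impl: unit tests call this directly with a fake.
    fn shut_down_zones_if_needed_impl<T: ZoneFacilities>(
        &mut self,
        desired: &[&str],
        zone_facilities: &T,
    ) {
        // Halt any running zone that's no longer in the desired set, keeping
        // zones whose halt failed; real error/retry handling is elided.
        self.running.retain(|name, _| {
            desired.contains(&name.as_str())
                || zone_facilities.halt_zone(name).is_err()
        });
    }
}

fn main() {
    let mut zones = OmicronZones::default();
    zones.running.insert("oxz_ntp".to_string(), ());
    zones.shut_down_zones_if_needed(&[]);
    assert!(zones.running.is_empty());
}
```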
}

// This is a zone we've tried to start before; try again!
Some(ZoneState::FailedToStart(err)) => {
I was slightly confused by the difference between the behavior here and that in `PartiallyShutDown` below. However, it actually makes sense: if we fail to shut down a zone in `FailedToStart` inside `shut_down_zones_if_needed_impl`, then the state transitions to `ZoneState::PartiallyShutDown`. There's never an instance where we try to start a zone, fail, try to shut it down, fail, and then try to start it again while partially failed, which alleviates my initial concern.
It may be worth adding a note about this behavior and also specifying somewhere that `start_zones_if_needed` and `shut_down_zones_if_needed` cannot be called concurrently. I think we'd run into issues if they were, and I don't believe running them concurrently is your intent.
> It may be worth adding a note about this behavior

Yeah, can do; this is not the easiest thing to follow (and I'm still not 100% sure it's correct in all the ways we could fail).

> specifying somewhere that `start_zones_if_needed` and `shut_down_zones_if_needed` cannot be called concurrently

Both of these take `&mut self`, so they can't be called concurrently! Do you think it should still be noted explicitly?
> Both of these take `&mut self`, so they can't be called concurrently! Do you think it should still be noted explicitly?

Oh, no, that makes sense. No need to document it explicitly.
> It may be worth adding a note about this behavior

> Yeah, can do; this is not the easiest thing to follow (and I'm still not 100% sure it's correct in all the ways we could fail).

Added in b2583a7.
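(To summarize the thread above, here is a sketch of the zone lifecycle it describes. The variant names come from the discussion; the payloads and the transition notes are inferred for illustration rather than taken from the PR's actual code.)

```rust
// Sketch of the zone lifecycle discussed above. The variant payloads and the
// transition descriptions are assumptions, not the PR's actual definitions.
#[allow(dead_code)]
enum ZoneState {
    Running,
    // Starting the zone failed; `start_zones_if_needed` will retry it.
    FailedToStart(String),
    // Shutting the zone down (or cleaning up after it) failed partway;
    // `shut_down_zones_if_needed` will retry the remaining cleanup.
    PartiallyShutDown { error: String },
}

fn describe_next_step(state: &ZoneState, still_desired: bool) -> &'static str {
    match (state, still_desired) {
        (ZoneState::Running, true) => "leave running",
        // A shutdown failure here would move the zone to PartiallyShutDown.
        (ZoneState::Running, false) => "shut down",
        // Per the thread: a FailedToStart zone whose cleanup fails becomes
        // PartiallyShutDown, so we never retry a start while a shutdown is
        // half-finished.
        (ZoneState::FailedToStart(_), true) => "retry start",
        (ZoneState::FailedToStart(_), false) => "clean up and forget",
        (ZoneState::PartiallyShutDown { .. }, _) => "retry shutdown/cleanup",
    }
}

fn main() {
    let state = ZoneState::FailedToStart("vnic creation failed".to_string());
    assert_eq!(describe_next_step(&state, true), "retry start");
}
```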
Adds an internal `OmicronZones` type that is analogous to the `ZoneMap` in the current service manager. There are a couple of behavioral differences between this and what `ServiceManager` does. (One of them touches on "Should `PUT /omicron-zones` fail if it cannot remove a zone?" #6776, which largely ends with "we should do a config reconciler" 😂.) This attempts to remember failures to shut a zone down (or clean up after it) and retry them later. I'm not sure all of this is right, or whether it might even introduce some new kind of failure mode (e.g., a zone getting stuck in some partially shut down state that we previously would have ignored?).

There are some minor shuffling-deck-chairs changes to other crates to support some of the testing here. As before, a decent chunk of the code here (maybe 40% or so?) is tests.
fail if it cannot remove a zone? #6776, which largely ends with "we should do a config reconciler" 😂.) This attempts to remember failures to shut a zone down (or clean up after it) and retry them later. I'm not sure all of this is right, or whether it might even introduce some new kind of failure mode (e.g., a zone getting stuck in some partially shut down state that we previously would have ignored?).There are some minor shuffling-deck-chairs changes to other crates to support some of the testing here. As before, a decent chunk of the code here (maybe 40% or so?) is tests.