-
Notifications
You must be signed in to change notification settings - Fork 43
[sled-agent] Integrate config-reconciler #8064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -34,14 +34,6 @@ enum SledAgentCommands { | |||
#[clap(subcommand)] | |||
Zones(ZoneCommands), | |||
|
|||
/// print information about zpools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you expecting that inventory will supplant this info? Or are you planning on replacing this access to the sled agent later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting that inventory would supplant this. (I think maybe it already has, in practice? I definitely only look at inventory when I'm curious about zpools; I don't think I've ever used these omdb subcommands.)
// TODO this inventory needs to be more robust for the reconciler; | ||
// this is nontrivial so coming in a subsequent PR. For now just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit, link to PR, so this comment doesn't get lost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR isn't open yet, so hard to link. 😅 However, I put a stub in RSS on this branch where we need the updated inventory:
omicron/sled-agent/src/rack_setup/service.rs
Lines 426 to 434 in abd7542
// Wait until the config reconciler on the target sled has successfully | |
// reconciled the config at `generation`. | |
async fn wait_for_config_reconciliation_on_sled( | |
&self, | |
_sled_address: SocketAddrV6, | |
_generation: Generation, | |
) -> Result<(), SetupServiceError> { | |
unimplemented!("needs updated inventory") | |
} |
which guarantees this PR can't pass the helios/deploy
check until the inventory work is done.
} | ||
|
||
#[cfg(all(test, target_os = "illumos"))] | ||
mod illumos_tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're deleting a lot of tests here - do we really not care about these anymore, or do we have coverage elsewehre?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little of column a, little of column b:
- All the methods these tests are exercising have themselves been deleted, so we're forced to remove these tests too.
- The functionality we've removed here will live in the config-reconciler now, but the tests won't port over 1-to-1 because the mechanics are changing (e.g., ledgering is largely independent from starting zones now, which is very different from how
ensure_all_omicron_zones_persistent()
worked). The reconciler will have its own tests for the new mechanics that cover this same functionality.
// is valid against the current ledgered sled config. | ||
if let Some(config_reconciler) = config_reconciler { | ||
config_reconciler | ||
.validate_artifact_config(new_config.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about deadlocks here as pointed out in #8063
sled-agent/src/long_running_tasks.rs
Outdated
} else { | ||
TimeSyncConfig::Normal | ||
}; | ||
let mut config_reconciler = ConfigReconcilerHandle::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I miss where we call spawn_reconciliation_task
, or is it not here yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't happen this early in the process; it happens during SledAgent::new()
(after we've gotten our config from RSS or the ledger):
omicron/sled-agent/src/sled_agent.rs
Lines 600 to 610 in abd7542
// Start reconciling against our ledgered sled config. | |
config_reconciler.spawn_reconciliation_task( | |
etherstub_vnic, | |
ReconcilerFacilities { | |
service_manager: services.clone(), | |
metrics_queue: metrics_manager.request_queue(), | |
zone_bundler: long_running_task_handles.zone_bundler.clone(), | |
}, | |
SledAgentArtifactStoreWrapper(Arc::clone(&artifact_store)), | |
&parent_log, | |
); |
This is somewhat extracted from #8064, but can be landed independently and will make some of the followup sled-agent-config-reconciler PRs a little cleaner. We don't yet ledger `OmicronSledConfig`s to disk, so we're free to fiddle with the details of its fields without worrying about backwards compatibility. Fixes #7774.
abd7542
to
2574c5c
Compare
2574c5c
to
a057195
Compare
This PR integrates the new
sled-agent-config-reconciler
crate withsled-agent
. It will not currently pass tests due to the reconciler not being completely implemented, but I'd like to get any feedback on this integration work itself (particularly as it pertains to the API ofsled-agent-config-reconciler
). See the description of #8063 for more context.There are a couple serious warts with this PR:
StorageManager
(because its functionality is being absorbed intosled-agent-config-reconciler
); however, the storage manager also has a rich set of test support. This PR leaves a couple sled-agent submodules using that test support (support-bundle/storage and zone-bundle). In the long run I think it'd be better to rework these (if there are no remaining production uses ofStorageManager
), but for now I think this is... okay? Feedback welcome.