Skip to content

Commit 2a97eed

Browse files
authored
sled-agent: Fix races when starting switch zone in a4x2 (#9297)
Yesterday @internet-diglett and I were looking at some weird a4x2 failures where sled-agent successfully started the switch zone but failed to configure uplinks within it. This PR fixes a race condition and a subsequent logic bug which together were causing that failure. I'm not sure if it's possible to hit this race condition in real hardware. I tried going over a real sled-agent startup log to figure out if we just happened to start up the switch zone "fast enough", or if something in the real startup path was (implicitly) blocked on that setup being done. I _think_ it's the latter but don't have great confidence in that; this is based on comparing timestamps of logs and things that appear backed up waiting on mutexes held during the whole switch zone setup process. All of this is pretty gnarly; we have multiple issues discussing the need for some rework here anyway, but this is yet another spot fix to unblock active work. --- Race condition: If we get our underlay info while we're still starting up the switch zone, we don't inform the task doing that startup about it, and therefore it doesn't attempt to configure uplinks. In the "swapping out the request" path, we actually had `Some(underlay_info)`, but were discarding it: it's not stored in `request` or `new_request` - we only passed it as an argument to `start_switch_zone`. This is fixed by moving the `underlay_info` into `request` instead of passing it as function argument. Now when we swap out the request, the task running to perform initialization has access to the `underlay_info` and will attempt to configure uplinks. --- Logic bug: Once we fixed the above, we saw the "ensure switch zone uplinks" worker stop after a single attempt as though it was told to because inside of `try_initialize_switch_zone()` itself, the last thing it does before returning is change the state from `::Initializing` to `::Running`, _with no `worker` task_: https://github.com/oxidecomputer/omicron/blob/d743754bb4be24228e9e042ce5262c242d4fd079/sled-agent/src/services.rs#L4006-L4010 This causes `exit_tx` to be dropped, which causes `ensure_switch_zone_uplinks_configured_loop()` to bail out after a single attempt, as we see in the logs above. This is fixed by moving the worker task into the `::Running` state instead of dropping it. (The `::Running` state can have a non-`None` worker if we reconfigure the switch zone, so the supporting code already expects this to be present sometimes, and knows to stop the task when appropriate.) This bug was _mostly_ introduced by above (not fully correct!) change to fix the race condition: prior to that change, the `::Initializing` state never had the `underlay_info` in it anyway, so `ensure_switch_zone_uplinks_configured_loop()` wouldn't have even been called.
1 parent 9031662 commit 2a97eed

File tree

1 file changed

+51
-39
lines changed

1 file changed

+51
-39
lines changed

sled-agent/src/services.rs

Lines changed: 51 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -443,6 +443,7 @@ struct SwitchZoneConfig {
443443
id: Uuid,
444444
addresses: Vec<Ipv6Addr>,
445445
services: Vec<SwitchService>,
446+
underlay_info: Option<UnderlayInfo>,
446447
}
447448

448449
/// Describes one of several services that may be deployed in a switch zone
@@ -2444,7 +2445,7 @@ impl ServiceManager {
24442445
bootstrap_name_and_address: Option<(String, Ipv6Addr)>,
24452446
device_names: &[String],
24462447
) -> Result<RunningZone, Error> {
2447-
let SwitchZoneConfig { id, services, addresses } = config;
2448+
let SwitchZoneConfig { id, services, addresses, .. } = config;
24482449

24492450
let disabled_dns_client_service =
24502451
ServiceBuilder::new("network/dns/client")
@@ -3265,16 +3266,14 @@ impl ServiceManager {
32653266
};
32663267
addresses.push(Ipv6Addr::LOCALHOST);
32673268

3268-
let request =
3269-
SwitchZoneConfig { id: Uuid::new_v4(), addresses, services };
3270-
3271-
self.ensure_switch_zone(
3272-
Some(request),
3273-
filesystems,
3274-
data_links,
3269+
let request = SwitchZoneConfig {
3270+
id: Uuid::new_v4(),
3271+
addresses,
3272+
services,
32753273
underlay_info,
3276-
)
3277-
.await?;
3274+
};
3275+
3276+
self.ensure_switch_zone(Some(request), filesystems, data_links).await?;
32783277

32793278
Ok(())
32803279
}
@@ -3494,8 +3493,6 @@ impl ServiceManager {
34943493
vec![],
34953494
// data_links=
34963495
vec![],
3497-
// underlay_info=
3498-
None,
34993496
)
35003497
.await
35013498
}
@@ -3514,7 +3511,6 @@ impl ServiceManager {
35143511
request: SwitchZoneConfig,
35153512
filesystems: Vec<zone::Fs>,
35163513
data_links: Vec<String>,
3517-
underlay_info: Option<UnderlayInfo>,
35183514
) {
35193515
let (exit_tx, exit_rx) = oneshot::channel();
35203516
*zone = SwitchZoneState::Initializing {
@@ -3524,8 +3520,7 @@ impl ServiceManager {
35243520
worker: Some(Task {
35253521
exit_tx,
35263522
initializer: tokio::task::spawn(async move {
3527-
self.initialize_switch_zone_loop(underlay_info, exit_rx)
3528-
.await
3523+
self.initialize_switch_zone_loop(exit_rx).await
35293524
}),
35303525
}),
35313526
};
@@ -3537,7 +3532,6 @@ impl ServiceManager {
35373532
request: Option<SwitchZoneConfig>,
35383533
filesystems: Vec<zone::Fs>,
35393534
data_links: Vec<String>,
3540-
underlay_info: Option<UnderlayInfo>,
35413535
) -> Result<(), Error> {
35423536
let log = &self.inner.log;
35433537

@@ -3552,7 +3546,6 @@ impl ServiceManager {
35523546
request,
35533547
filesystems,
35543548
data_links,
3555-
underlay_info,
35563549
);
35573550
}
35583551
(
@@ -3918,7 +3911,7 @@ impl ServiceManager {
39183911

39193912
// We also need to ensure any uplinks are configured. Spawn a
39203913
// task that goes into an infinite retry loop until it succeeds.
3921-
if let Some(underlay_info) = underlay_info {
3914+
if let Some(underlay_info) = request.underlay_info.clone() {
39223915
if let Some(old_worker) = worker.take() {
39233916
old_worker.stop().await;
39243917
}
@@ -3978,15 +3971,15 @@ impl ServiceManager {
39783971
async fn try_initialize_switch_zone(
39793972
&self,
39803973
sled_zone: &mut SwitchZoneState,
3981-
) -> Result<(), Error> {
3974+
) -> Result<Option<UnderlayInfo>, Error> {
39823975
let SwitchZoneState::Initializing {
39833976
request,
39843977
filesystems,
39853978
data_links,
3986-
..
3987-
} = &*sled_zone
3979+
worker,
3980+
} = sled_zone
39883981
else {
3989-
return Ok(());
3982+
return Ok(None);
39903983
};
39913984

39923985
// The switch zone must use the ramdisk in order to receive requests
@@ -4003,19 +3996,28 @@ impl ServiceManager {
40033996
let zone = self
40043997
.initialize_zone(zone_args, zone_root_path, filesystems, data_links)
40053998
.await?;
3999+
let underlay_info = request.underlay_info.clone();
4000+
4001+
// Even though we've initialized the zone, the `worker` task may still
4002+
// be running to configure uplinks. If we drop `worker` now it will
4003+
// cause that task to exit before it gets a chance to do so. This is all
4004+
// very unsatisfying and needs some serious rework:
4005+
// https://github.com/oxidecomputer/omicron/issues/8970 and
4006+
// https://github.com/oxidecomputer/omicron/issues/9182 are strongly
4007+
// related.
4008+
let worker = worker.take();
40064009
*sled_zone = SwitchZoneState::Running {
40074010
request: request.clone(),
40084011
zone: Box::new(zone),
4009-
worker: None,
4012+
worker,
40104013
};
4011-
Ok(())
4014+
Ok(underlay_info)
40124015
}
40134016

40144017
// Body of a tokio task responsible for running until the switch zone is
40154018
// inititalized, or it has been told to stop.
40164019
async fn initialize_switch_zone_loop(
40174020
&self,
4018-
underlay_info: Option<UnderlayInfo>,
40194021
mut exit_rx: oneshot::Receiver<()>,
40204022
) {
40214023
// We don't really expect failures trying to initialize the switch zone
@@ -4025,13 +4027,25 @@ impl ServiceManager {
40254027

40264028
// First, go into a loop to bring up the switch zone; retry until we
40274029
// succeed or are told to give up via `exit_rx`.
4028-
loop {
4030+
let underlay_info = loop {
40294031
{
40304032
let mut sled_zone = self.inner.switch_zone.lock().await;
40314033
match self.try_initialize_switch_zone(&mut sled_zone).await {
4032-
Ok(()) => {
4033-
info!(self.inner.log, "initialized switch zone");
4034-
break;
4034+
Ok(None) => {
4035+
info!(
4036+
self.inner.log,
4037+
"initialized switch zone \
4038+
(no underlay info available yet)",
4039+
);
4040+
return;
4041+
}
4042+
Ok(Some(underlay_info)) => {
4043+
info!(
4044+
self.inner.log,
4045+
"initialized switch zone (underlay info \
4046+
available: will attempt uplink configuration)",
4047+
);
4048+
break underlay_info;
40354049
}
40364050
Err(e) => {
40374051
warn!(
@@ -4060,17 +4074,15 @@ impl ServiceManager {
40604074
continue;
40614075
}
40624076
};
4063-
}
4077+
};
40644078

4065-
// Then, if we have underlay info, go into a loop trying to configure
4066-
// our uplinks. As above, retry until we succeed or are told to stop.
4067-
if let Some(underlay_info) = underlay_info {
4068-
self.ensure_switch_zone_uplinks_configured_loop(
4069-
&underlay_info,
4070-
exit_rx,
4071-
)
4072-
.await;
4073-
}
4079+
// Then go into a loop trying to configure our uplinks. As above, retry
4080+
// until we succeed or are told to stop.
4081+
self.ensure_switch_zone_uplinks_configured_loop(
4082+
&underlay_info,
4083+
exit_rx,
4084+
)
4085+
.await;
40744086
}
40754087
}
40764088

0 commit comments

Comments
 (0)