sled-agent: Fix races when starting switch zone in a4x2 (#9297)
Yesterday @internet-diglett and I were looking at some weird a4x2
failures where sled-agent successfully started the switch zone but
failed to configure uplinks within it. This PR fixes a race condition
and a subsequent logic bug which together were causing that failure.
I'm not sure if it's possible to hit this race condition on real
hardware. I tried going over a real sled-agent startup log to figure out
if we just happened to start up the switch zone "fast enough", or if
something in the real startup path was (implicitly) blocked on that
setup being done. I _think_ it's the latter but don't have great
confidence in that; this is based on comparing timestamps of logs and
things that appear backed up waiting on mutexes held during the whole
switch zone setup process. All of this is pretty gnarly; we have
multiple issues discussing the need for some rework here anyway, but
this is yet another spot fix to unblock active work.
---
Race condition: If we get our underlay info while we're still starting
up the switch zone, we don't inform the task doing that startup about
it, and therefore it doesn't attempt to configure uplinks.
In the "swapping out the request" path, we actually had
`Some(underlay_info)`, but were discarding it: it's not stored in
`request` or `new_request` - we only passed it as an argument to
`start_switch_zone`. This is fixed by moving the
`underlay_info` into `request` instead of passing it as a function
argument. Now when we swap out the request, the task running to perform
initialization has access to the `underlay_info` and will attempt to
configure uplinks.
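
A minimal sketch of the shape of this fix, not the actual sled-agent code. The types `UnderlayInfo`, `Request`, and `Initializing`, and the helpers `swap_request` and `underlay_seen_by_worker`, are hypothetical stand-ins; only the idea of carrying `underlay_info` inside the request comes from the description above.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the real underlay information.
#[derive(Clone, Debug, PartialEq)]
struct UnderlayInfo {
    ip: String,
}

// The request now carries `underlay_info` itself, rather than having it
// passed alongside as a separate function argument.
#[derive(Clone, Debug)]
struct Request {
    underlay_info: Option<UnderlayInfo>,
}

// Hypothetical shared state that the initialization task reads on each retry.
struct Initializing {
    request: Request,
}

// Swapping out the request replaces everything the worker can see,
// including any underlay info that arrived while startup was in progress.
fn swap_request(state: &Mutex<Initializing>, new_request: Request) {
    state.lock().unwrap().request = new_request;
}

// What the initialization task observes when it next consults the request.
fn underlay_seen_by_worker(state: &Mutex<Initializing>) -> Option<UnderlayInfo> {
    state.lock().unwrap().request.underlay_info.clone()
}
```

In this sketch, a worker that started with `underlay_info: None` sees `Some(..)` the next time it reads the request after `swap_request` runs, which is the property the fix relies on.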
---
Logic bug: Once we fixed the above, we saw the "ensure switch zone
uplinks" worker stop after a single attempt, as though it had been told
to stop. The cause is inside `try_initialize_switch_zone()` itself: the
last thing it does before returning is change the state from
`::Initializing` to `::Running`, _with no `worker` task_:
https://github.com/oxidecomputer/omicron/blob/d743754bb4be24228e9e042ce5262c242d4fd079/sled-agent/src/services.rs#L4006-L4010
This causes `exit_tx` to be dropped, which causes
`ensure_switch_zone_uplinks_configured_loop()` to bail out after a
single attempt, as we see in the logs above. This is fixed by moving the worker task
into the `::Running` state instead of dropping it. (The `::Running`
state can have a non-`None` worker if we reconfigure the switch zone, so
the supporting code already expects this to be present sometimes, and
knows to stop the task when appropriate.)
This bug was _mostly_ introduced by the above (not fully correct!) change to
fix the race condition: prior to that change, the `::Initializing` state
never had the `underlay_info` in it anyway, so
`ensure_switch_zone_uplinks_configured_loop()` wouldn't have even been
called.
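
To illustrate why dropping the worker stops the loop, here is a small self-contained sketch. It assumes a standard-library `mpsc` channel in place of whatever channel type sled-agent actually uses, and `Worker`, `ZoneState`, `retry_loop`, `finish_init_buggy`, and `finish_init_fixed` are hypothetical names, not the real code.

```rust
use std::sync::mpsc::{Receiver, Sender, TryRecvError};

// Hypothetical worker handle; dropping it drops the exit sender.
struct Worker {
    exit_tx: Sender<()>,
}

enum ZoneState {
    Initializing { worker: Option<Worker> },
    Running { worker: Option<Worker> },
}

// Retry until told to exit, until the sender disappears, or until
// `max_attempts` is reached. Returns the number of attempts made.
// Every attempt is treated as a failure to keep the sketch short.
fn retry_loop(exit_rx: &Receiver<()>, max_attempts: u32) -> u32 {
    let mut attempts = 0;
    while attempts < max_attempts {
        attempts += 1;
        match exit_rx.try_recv() {
            // No exit signal and the sender is still alive: keep retrying.
            Err(TryRecvError::Empty) => continue,
            // Explicit exit, or the sender was dropped: stop.
            Ok(()) | Err(TryRecvError::Disconnected) => break,
        }
    }
    attempts
}

// Buggy transition: the worker (and its `exit_tx`) is dropped on return.
fn finish_init_buggy(state: ZoneState) -> ZoneState {
    match state {
        ZoneState::Initializing { .. } => ZoneState::Running { worker: None },
        other => other,
    }
}

// Fixed transition: the worker moves into the `Running` state.
fn finish_init_fixed(state: ZoneState) -> ZoneState {
    match state {
        ZoneState::Initializing { worker } => ZoneState::Running { worker },
        other => other,
    }
}
```

With the buggy transition, the receiver observes a disconnected channel on its first check and the loop exits after one attempt. With the fixed transition, the sender stays alive for as long as the `Running` state holds the worker, so the loop keeps retrying.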