- 
                Notifications
    You must be signed in to change notification settings 
- Fork 59
Description
This test failed on a CI run on #9204. I don't think I can easily get back to the GitHub check run but I'm pretty sure I understand the failure mode and don't think it's very useful.
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01K7FXA5DVJCXPSM67VKV9CPE9/WasQVIxzulxDUuqWi81qWnXbx6WoCdSHvIw0Ai9rnNvEJEsV/01K7HTRY88BYFMVKWQ0NBEASCB#S7623
Excerpt from the log showing the failure:
thread 'integration_tests::crucible_replacements::test_delete_volume_region_replacement_state_replacement_done' panicked at nexus/tests/integration_tests/crucible_replacements.rs:486:14:
request transitioned to expected state: TimedOut(260.002921832s)
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
   1: core::panicking::panic_fmt
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
   2: core::result::unwrap_failed
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/result.rs:1761:5
   3: core::result::Result<T,E>::expect
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/result.rs:1119:23
   4: {async_fn#0}
             at ./tests/integration_tests/crucible_replacements.rs:486:14
   5: {async_fn#0}
             at ./tests/integration_tests/crucible_replacements.rs:632:14
   6: {async_fn#0}
             at ./tests/integration_tests/crucible_replacements.rs:721:59
...
specifically what happened here is we timed out on the wait_for_request_state in crucible_replacements.rs:
        pub async fn transition_request_to_replacement_done(&self) {
            // Run the "region replacement driver" task
            run_region_replacement_driver(&self.lockstep_client).await;
            // The activation above could only have started the associated saga,
            // so wait until the request is in the expected end state.
            self.wait_for_request_state(
                ExpectedEndState(RegionReplacementState::ReplacementDone),
                ExpectedIntermediateState(RegionReplacementState::Driving),
                ExpectedStartState(RegionReplacementState::Running),
            )
            .await;
        }at least in the flake I looked at, it looks like run_region_replacement_driver started the region-replacement-drive background task, which then completed, all before attach_request_volume_to_pantry() actually finished attaching the volume. this all happened in about 30 seconds in CI (vs ~19 for the test to pass locally) so we can probably turn the timeouts here way down. 170.61 MiB of logs is a bit daunting 😁
speaking of logs, some highlights to substantiate the above! from test_delete_volume_region_replacement_state_replacement_done.23223.0.log, I've snipped out some uninteresting logs and fields for the important parts:
// here's the client poking `region_replacement_driver`
{
  "time": "2025-10-14T18:23:49.037532706Z",
  "component": "lockstep client test context",
  "uri": "http://127.0.0.1:50237/bgtasks/activate",
  "method": "POST"
} 
{
  "msg": "activating",
  "time": "2025-10-14T18:23:49.046667097Z",
  "background_task": "region_replacement_driver",
  "reason": "Signaled"
}
{
  "msg": "client received response",
  "time": "2025-10-14T18:23:49.049599289Z",
  "component": "lockstep client test context",
  "status": "204"
}
// but, the saga 6ed0008b-5bff-4319-bace-d4b25f1be38b is still running! just after starting the background task we get these events:
{ 
  "msg": "saga log event",
  "time": "2025-10-14T18:23:49.092456907Z",
  "new_state": "N007 succeeded"
} 
{
  "msg": "saga log event",
  "time": "2025-10-14T18:23:49.099399259Z",
  "new_state": "N009 started"
} 
{ 
  "msg": "saga log event",
  "time": "2025-10-14T18:23:49.105368933Z",
  "new_state": "N009 succeeded"
}
// 3ms after saga node 9 succeeded, our activation of `region_replacement_driver` finishes, having done nothing
{ 
  "msg": "activation complete",
  "time": "2025-10-14T18:23:49.108287537Z",
  "background_task": "region_replacement_driver",
  "status": "{\"drive_invoked_ok\":[],\"errors\":[],\"finish_invoked_ok\":[]}",
  "elapsed": "61.56795ms"
} 
  
// and a bit later the saga is recorded as finished
{ 
  "msg": "update for saga cached state",
  "time": "2025-10-14T18:23:49.110297346Z",
  "component": "ServerContext",
  "new_state": "Done",
  "saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b"
}
{
  "msg": "saga finished",
  "saga_name": "region-replacement-drive",
  "saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b",
  "component": "ServerContext",
  "result": "success"
}
so, what's saga 6ed0008b-5bff-4319-bace-d4b25f1be38b? it was created at the previous POST /bgtasks/activate, and we can see a whiff of where thing go wrong back there:
// the test client has poked `region_replacement_driver`
{
  "msg": "request completed",
  "time": "2025-10-14T18:23:48.321817102Z",
  "uri": "/bgtasks/activate",
  "method": "POST",
  "req_id": "05ad8cc2-fbf7-49cd-8d2b-50465baa53e8",
  "component": "dropshot_lockstep",
  "latency_us": 11768,
  "response_code": 204
} 
// and now we're checking on that background task immediately after...
{ 
  "msg": "client request",
  "time": "2025-10-14T18:23:48.323094575Z",
  "component": "lockstep client test context",
  "body": "Body { inner: BoxBody }",
  "uri": "http://127.0.0.1:50237/bgtasks/view/region_replacement_driver",
  "method": "GET"
} 
{ 
  "msg": "incoming request",
  "time": "2025-10-14T18:23:48.324482248Z",
  "uri": "/bgtasks/view/region_replacement_driver",
  "method": "GET",
  "req_id": "809595e0-c453-4070-9a64-1da836c9d18c",
  "component": "dropshot_lockstep"
} 
{ 
  "msg": "request completed",
  "time": "2025-10-14T18:23:48.344281891Z",
  "uri": "/bgtasks/view/region_replacement_driver",
  "method": "GET",
  "req_id": "809595e0-c453-4070-9a64-1da836c9d18c",
  "component": "dropshot_lockstep",
  "latency_us": 19327,
  "response_code": 200
} 
{ 
  "msg": "client received response",
  "time": "2025-10-14T18:23:48.353154336Z",
  "component": "lockstep client test context",
  "status": "200"
} 
// but we only got to setting up a `region-replacement-drive`
{ 
  "msg": "tracking newly created saga",
  "time": "2025-10-14T18:23:48.392037289Z",
  "saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b",
  "component": "ServerContext"
}
// ... 50ms after the client response:
{
  "msg": "client request",
  "time": "2025-10-14T18:23:48.403317052Z",
  "component": "lockstep client test context",
  "uri": "http://127.0.0.1:50237/bgtasks/view/region_replacement_driver",
  "method": "GET"
}
the GET /bgtasks/view/region_replacement_driver in the middle is because this all happens in attach_request_volume_to_pantry. first we check ExpectedStartState(RegionReplacementState::Running), which the test client sees before the saga is created. this is the same state as we'd see when the saga is done, and the helper tries to account for that:
            // Additionally, assert that the drive saga recorded that it sent
            // the attachment request to the simulated pantry.
            //
            // If `wait_for_request_state` has the same expected start and end
            // state (as it does above), it's possible to exit that function
            // having not yet started the saga yet, and this requires an
            // additional `wait_for_condition` to wait for the expected recorded
            // step.
            let most_recent_step = wait_for_condition(
                || {
                    let datastore = self.datastore.clone();
                    let opctx = self.opctx();
                    let replacement_request_id = self.replacement_request_id;
                    async move {
                        match datastore
                            .current_region_replacement_request_step(
                                &opctx,
                                replacement_request_id,
                            )
                            .await
                            .unwrap()
                        {
                            Some(step) => Ok(step),
                            None => {
                                // The saga either has not started yet or is
                                // still running - see the comment before this
                                // check for more info.
                                Err(CondCheckError::<()>::NotYet)
                            }
                        }
                    }
                },
                &std::time::Duration::from_millis(50),
                &std::time::Duration::from_secs(30),
            )but I think what has happened in these flakes is that we raced with the background task and the client got here before the region replacement was in Driving. we try to spin on current_region_replacement_request_step() as a barrier, but the barrier is only that a region replacement is in Driving, not that it's back in Running like wait_for_request_state would look for.
then the test client can zooms onward prematurely with the rest of the test and in this case pokes region_replacement_driver early enough that there's no work for it to do, and we time out waiting for a state transition that'll never happen.
I think instead of Some(step) -> Ok(step) above, we should look for something like
Some(step) => {
    let replacement_state = datastore.get_region_replacement_request_by_id(...)
        .await
        .unwrap();
    if replacement_state ==  RegionReplacementState::Running {
        Ok(step)
    } else {
        Err(CondCheckError::<()>::NotYet)
    }
}which should check both that we hit Driving at some point and then got back to Running before getting out of attach_request_volume_to_pantry.