Description
This test failed on a CI run on #9204. I don't think I can easily get back to the GitHub check run, but I'm pretty sure I understand the failure mode, so I don't think the check run itself is very useful here.
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01K7FXA5DVJCXPSM67VKV9CPE9/WasQVIxzulxDUuqWi81qWnXbx6WoCdSHvIw0Ai9rnNvEJEsV/01K7HTRY88BYFMVKWQ0NBEASCB#S7623
Excerpt from the log showing the failure:
thread 'integration_tests::crucible_replacements::test_delete_volume_region_replacement_state_replacement_done' panicked at nexus/tests/integration_tests/crucible_replacements.rs:486:14:
request transitioned to expected state: TimedOut(260.002921832s)
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
1: core::panicking::panic_fmt
at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
2: core::result::unwrap_failed
at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/result.rs:1761:5
3: core::result::Result<T,E>::expect
at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/result.rs:1119:23
4: {async_fn#0}
at ./tests/integration_tests/crucible_replacements.rs:486:14
5: {async_fn#0}
at ./tests/integration_tests/crucible_replacements.rs:632:14
6: {async_fn#0}
at ./tests/integration_tests/crucible_replacements.rs:721:59
...
specifically, what happened here is that we timed out in the wait_for_request_state call in crucible_replacements.rs:
pub async fn transition_request_to_replacement_done(&self) {
    // Run the "region replacement driver" task
    run_region_replacement_driver(&self.lockstep_client).await;

    // The activation above could only have started the associated saga,
    // so wait until the request is in the expected end state.

    self.wait_for_request_state(
        ExpectedEndState(RegionReplacementState::ReplacementDone),
        ExpectedIntermediateState(RegionReplacementState::Driving),
        ExpectedStartState(RegionReplacementState::Running),
    )
    .await;
}

at least in the flake I looked at, it looks like run_region_replacement_driver activated the region_replacement_driver background task, which then ran and completed, all before attach_request_volume_to_pantry() had actually finished attaching the volume. this all happened in about 30 seconds in CI (vs ~19 seconds for the test to pass locally), so we can probably turn the timeouts here way down. 170.61 MiB of logs is a bit daunting 😁
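for reference, the TimedOut(260.002921832s) in the panic is just what one of these polling helpers returns when it gives up: wait_for_condition (the same omicron_test_utils poll helper used in the snippet further down) hands back Err(poll::Error::TimedOut(elapsed)) once the condition has returned NotYet for the whole poll_max, and the test's .expect("request transitioned to expected state") turns that into the panic. here's a minimal, self-contained sketch of that pattern; the condition body is a stand-in that never succeeds, not the real datastore check:

// Sketch only: shows how a condition that never becomes true produces the
// TimedOut(..) panic seen in the backtrace. Assumes the usual
// omicron_test_utils::dev::poll helpers; the closure body is a placeholder.
use omicron_test_utils::dev::poll::{self, wait_for_condition, CondCheckError};
use std::time::Duration;

async fn demo_timed_out_panic() {
    let result: Result<(), poll::Error<()>> = wait_for_condition(
        || async {
            // Stand-in for "has the request reached ReplacementDone yet?" In
            // the flake the drive saga never does the work, so a real check
            // would stay NotYet until the timeout, just like this one.
            Err::<(), _>(CondCheckError::<()>::NotYet)
        },
        &Duration::from_millis(50), // poll interval
        &Duration::from_secs(1),    // poll_max (the real test waits far longer)
    )
    .await;

    // The analogous expect() in the real test is at crucible_replacements.rs:486.
    result.expect("request transitioned to expected state");
}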
speaking of logs, some highlights to substantiate the above! from test_delete_volume_region_replacement_state_replacement_done.23223.0.log, I've snipped out some uninteresting logs and fields for the important parts:
// here's the client poking `region_replacement_driver`
{
"time": "2025-10-14T18:23:49.037532706Z",
"component": "lockstep client test context",
"uri": "http://127.0.0.1:50237/bgtasks/activate",
"method": "POST"
}
{
"msg": "activating",
"time": "2025-10-14T18:23:49.046667097Z",
"background_task": "region_replacement_driver",
"reason": "Signaled"
}
{
"msg": "client received response",
"time": "2025-10-14T18:23:49.049599289Z",
"component": "lockstep client test context",
"status": "204"
}
// but, the saga 6ed0008b-5bff-4319-bace-d4b25f1be38b is still running! just after starting the background task we get these events:
{
"msg": "saga log event",
"time": "2025-10-14T18:23:49.092456907Z",
"new_state": "N007 succeeded"
}
{
"msg": "saga log event",
"time": "2025-10-14T18:23:49.099399259Z",
"new_state": "N009 started"
}
{
"msg": "saga log event",
"time": "2025-10-14T18:23:49.105368933Z",
"new_state": "N009 succeeded"
}
// 3ms after saga node 9 succeeded, our activation of `region_replacement_driver` finishes, having done nothing
{
"msg": "activation complete",
"time": "2025-10-14T18:23:49.108287537Z",
"background_task": "region_replacement_driver",
"status": "{\"drive_invoked_ok\":[],\"errors\":[],\"finish_invoked_ok\":[]}",
"elapsed": "61.56795ms"
}
// and a bit later the saga is recorded as finished
{
"msg": "update for saga cached state",
"time": "2025-10-14T18:23:49.110297346Z",
"component": "ServerContext",
"new_state": "Done",
"saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b"
}
{
"msg": "saga finished",
"saga_name": "region-replacement-drive",
"saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b",
"component": "ServerContext",
"result": "success"
}
so, what's saga 6ed0008b-5bff-4319-bace-d4b25f1be38b? it was created by the previous POST /bgtasks/activate, and we can see a whiff of where things go wrong back there:
// the test client has poked `region_replacement_driver`
{
"msg": "request completed",
"time": "2025-10-14T18:23:48.321817102Z",
"uri": "/bgtasks/activate",
"method": "POST",
"req_id": "05ad8cc2-fbf7-49cd-8d2b-50465baa53e8",
"component": "dropshot_lockstep",
"latency_us": 11768,
"response_code": 204
}
// and now we're checking on that background task immediately after...
{
"msg": "client request",
"time": "2025-10-14T18:23:48.323094575Z",
"component": "lockstep client test context",
"body": "Body { inner: BoxBody }",
"uri": "http://127.0.0.1:50237/bgtasks/view/region_replacement_driver",
"method": "GET"
}
{
"msg": "incoming request",
"time": "2025-10-14T18:23:48.324482248Z",
"uri": "/bgtasks/view/region_replacement_driver",
"method": "GET",
"req_id": "809595e0-c453-4070-9a64-1da836c9d18c",
"component": "dropshot_lockstep"
}
{
"msg": "request completed",
"time": "2025-10-14T18:23:48.344281891Z",
"uri": "/bgtasks/view/region_replacement_driver",
"method": "GET",
"req_id": "809595e0-c453-4070-9a64-1da836c9d18c",
"component": "dropshot_lockstep",
"latency_us": 19327,
"response_code": 200
}
{
"msg": "client received response",
"time": "2025-10-14T18:23:48.353154336Z",
"component": "lockstep client test context",
"status": "200"
}
// but we only got as far as setting up a `region-replacement-drive` saga
{
"msg": "tracking newly created saga",
"time": "2025-10-14T18:23:48.392037289Z",
"saga_id": "6ed0008b-5bff-4319-bace-d4b25f1be38b",
"component": "ServerContext"
}
// ... 50ms after the client response:
{
"msg": "client request",
"time": "2025-10-14T18:23:48.403317052Z",
"component": "lockstep client test context",
"uri": "http://127.0.0.1:50237/bgtasks/view/region_replacement_driver",
"method": "GET"
}
the GET /bgtasks/view/region_replacement_driver in the middle is because this all happens in attach_request_volume_to_pantry. first we check ExpectedStartState(RegionReplacementState::Running), which the test client sees before the saga is created. this is the same state as we'd see when the saga is done, and the helper tries to account for that:
    // Additionally, assert that the drive saga recorded that it sent
    // the attachment request to the simulated pantry.
    //
    // If `wait_for_request_state` has the same expected start and end
    // state (as it does above), it's possible to exit that function
    // having not yet started the saga yet, and this requires an
    // additional `wait_for_condition` to wait for the expected recorded
    // step.

    let most_recent_step = wait_for_condition(
        || {
            let datastore = self.datastore.clone();
            let opctx = self.opctx();
            let replacement_request_id = self.replacement_request_id;

            async move {
                match datastore
                    .current_region_replacement_request_step(
                        &opctx,
                        replacement_request_id,
                    )
                    .await
                    .unwrap()
                {
                    Some(step) => Ok(step),

                    None => {
                        // The saga either has not started yet or is
                        // still running - see the comment before this
                        // check for more info.
                        Err(CondCheckError::<()>::NotYet)
                    }
                }
            }
        },
        &std::time::Duration::from_millis(50),
        &std::time::Duration::from_secs(30),
    )

but I think what has happened in these flakes is that we raced with the background task and the client got here before the region replacement was in Driving. we try to spin on current_region_replacement_request_step() as a barrier, but the barrier is only that a region replacement is in Driving, not that it's back in Running like wait_for_request_state would look for.
then the test client zooms onward prematurely with the rest of the test, and in this case pokes region_replacement_driver early enough that there's no work for it to do, and we time out waiting for a state transition that'll never happen.
I think instead of Some(step) => Ok(step) above, we should look for something like
Some(step) => {
    let replacement_state = datastore.get_region_replacement_request_by_id(...)
        .await
        .unwrap();

    if replacement_state == RegionReplacementState::Running {
        Ok(step)
    } else {
        Err(CondCheckError::<()>::NotYet)
    }
}

which should check both that we hit Driving at some point and then got back to Running before getting out of attach_request_volume_to_pantry.
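to make that concrete, here's roughly what the whole wait_for_condition would look like with that extra check folded in. this is a sketch only: get_region_replacement_request_by_id, the arguments I've filled in for it, and the .replacement_state field access are assumptions about the datastore API, not verified against the code.

let most_recent_step = wait_for_condition(
    || {
        let datastore = self.datastore.clone();
        let opctx = self.opctx();
        let replacement_request_id = self.replacement_request_id;

        async move {
            match datastore
                .current_region_replacement_request_step(
                    &opctx,
                    replacement_request_id,
                )
                .await
                .unwrap()
            {
                Some(step) => {
                    // A step was recorded, so the drive saga did run; now
                    // also require that the request has settled back into
                    // Running before letting the test move on.
                    //
                    // NOTE: the accessor name, its arguments, and the
                    // `replacement_state` field are assumptions for this
                    // sketch.
                    let request = datastore
                        .get_region_replacement_request_by_id(
                            &opctx,
                            replacement_request_id,
                        )
                        .await
                        .unwrap();

                    if request.replacement_state
                        == RegionReplacementState::Running
                    {
                        Ok(step)
                    } else {
                        Err(CondCheckError::<()>::NotYet)
                    }
                }

                None => {
                    // The saga either has not started yet or is still
                    // running.
                    Err(CondCheckError::<()>::NotYet)
                }
            }
        }
    },
    &std::time::Duration::from_millis(50),
    &std::time::Duration::from_secs(30),
)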