lib: Crucible backend must ensure incomplete writes are not lost during migration #884

When a guest writes to a Crucible-backed disk, the Crucible upstairs owned by the disk's backend may acknowledge the write (allowing it to complete from the guest's point of view) before it is actually submitted to any Crucible downstairs servers. This creates a small window during migration where a write can be lost:

1. The guest writes some data to a Crucible-backed disk device.
2. Propolis's device layer sends this to the Crucible backend, which relays it to the Crucible upstairs (see `block::crucible::process_request`).
3. Crucible upstairs acknowledges the write and completes the request immediately, but doesn't immediately submit anything to the downstairs.
4. This completion bubbles back up through the device to the guest; the device now believes this work is completed.
5. The VM begins to migrate.
6. The pause phase of live migration tells disk devices to quiesce and run down all active block requests. This completes trivially because, as far as the device is concerned, the still-outstanding write was completed in step 4.
7. Migration completes; the migration target's Crucible backends activate and supersede the source's backends.
8. The source upstairs finally submits the write that it deferred in step 3. This fails because a newer upstairs activated first.
9. The guest reads the data it wrote and finds that it's missing!
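To make the sequence concrete, here is a toy model of it. Everything in it is a hypothetical stand-in (`ToyUpstairs`, `ToyDownstairs`, the generation-number check, the `lost_write_scenario` helper); it is not Crucible's or Propolis's real API, and it only assumes that a downstairs rejects writes from an upstairs that a newer one has superseded.

```rust
use std::collections::{HashMap, VecDeque};

/// Toy downstairs: only the newest activated upstairs may write.
#[derive(Default)]
struct ToyDownstairs {
    active_generation: u64,
    blocks: HashMap<u64, Vec<u8>>,
}

impl ToyDownstairs {
    /// A newer upstairs activating supersedes every older one.
    fn activate(&mut self, generation: u64) {
        self.active_generation = self.active_generation.max(generation);
    }

    fn write(&mut self, generation: u64, lba: u64, data: Vec<u8>) -> Result<(), String> {
        if generation < self.active_generation {
            return Err(format!(
                "generation {} superseded by {}",
                generation, self.active_generation
            ));
        }
        self.blocks.insert(lba, data);
        Ok(())
    }
}

/// Toy upstairs that acknowledges writes before submitting them.
struct ToyUpstairs {
    generation: u64,
    deferred: VecDeque<(u64, Vec<u8>)>,
}

impl ToyUpstairs {
    fn new(generation: u64) -> Self {
        ToyUpstairs { generation, deferred: VecDeque::new() }
    }

    /// Step 3: returning at all is the "ack"; nothing reaches the downstairs yet.
    fn write(&mut self, lba: u64, data: Vec<u8>) {
        self.deferred.push_back((lba, data));
    }

    /// Step 8: submit everything that was deferred; returns how many writes failed.
    fn submit_deferred(&mut self, downstairs: &mut ToyDownstairs) -> usize {
        let mut failed = 0;
        while let Some((lba, data)) = self.deferred.pop_front() {
            if downstairs.write(self.generation, lba, data).is_err() {
                failed += 1;
            }
        }
        failed
    }
}

/// Runs steps 1-9 and returns (requests outstanding at pause, failed
/// deferred writes, what the guest reads back afterwards).
fn lost_write_scenario() -> (usize, usize, Option<Vec<u8>>) {
    let mut downstairs = ToyDownstairs::default();
    let mut source = ToyUpstairs::new(1);
    downstairs.activate(1);
    let mut device_outstanding = 0usize;

    // Steps 1-4: the guest writes, the upstairs acks, the device retires the request.
    device_outstanding += 1;
    source.write(7, b"guest data".to_vec());
    device_outstanding -= 1;

    // Steps 5-6: pause only looks at the device's outstanding requests.
    let outstanding_at_pause = device_outstanding;

    // Step 7: the migration target's upstairs activates.
    downstairs.activate(2);

    // Step 8: the source finally submits its deferred write.
    let failed = source.submit_deferred(&mut downstairs);

    // Step 9: the guest reads the block back.
    (outstanding_at_pause, failed, downstairs.blocks.get(&7).cloned())
}
```

In this model the pause phase sees zero outstanding requests even though one acknowledged write has not reached the downstairs, which is exactly the gap the callout needs to close.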

Crucible backends should get a callout during live migration (through the `BlockBackend` or `Lifecycle` traits) that allows them to ensure this can't occur.

---

The most robust way to fix this is to implement state transfer for Crucible backends. Roughly:

- Crucible backends `impl Lifecycle` and have a `pause` implementation.
- On a request to pause, the backend stops submitting new work to its Crucible upstairs, and possibly also asks that upstairs to quiesce. The backend does not report that it has finished pausing until its upstairs is fully quiesced.
- The backend then transfers its unsubmitted work, plus any Crucible upstairs state, in the device state transfer phase to be loaded by the target.
- The target restores this state prior to resuming. When it does resume, the buffered work is allowed to try to finish through the new upstairs.
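A minimal sketch of the shape of that flow, under heavy assumptions: `ToyBackend`, `PendingWrite`, and `ExportedState` are invented for illustration, the `submitted` vector stands in for "handed to the upstairs", and the sketch skips the hard part (waiting for the real upstairs to quiesce and serializing its state). It does not use Propolis's actual `Lifecycle` trait.

```rust
/// A write the backend has accepted but not yet handed to its upstairs.
#[derive(Clone, Debug, PartialEq)]
struct PendingWrite {
    offset: u64,
    data: Vec<u8>,
}

/// Hypothetical payload carried in the device state transfer phase.
#[derive(Clone, Debug, PartialEq)]
struct ExportedState {
    pending: Vec<PendingWrite>,
}

#[derive(Default)]
struct ToyBackend {
    paused: bool,
    pending: Vec<PendingWrite>,
    /// Stand-in for work that has been submitted to the upstairs.
    submitted: Vec<PendingWrite>,
}

impl ToyBackend {
    fn accept(&mut self, write: PendingWrite) {
        self.pending.push(write);
    }

    /// Hands pending work to the upstairs, unless the backend is paused.
    fn submit_pending(&mut self) {
        if !self.paused {
            self.submitted.append(&mut self.pending);
        }
    }

    /// On pause, stop submitting new work. A real implementation would
    /// also wait here until the upstairs reports that it is quiesced.
    fn pause(&mut self) {
        self.paused = true;
    }

    /// Export unsubmitted work; only meaningful once paused.
    fn export(&self) -> Result<ExportedState, &'static str> {
        if !self.paused {
            return Err("export requires a paused backend");
        }
        Ok(ExportedState { pending: self.pending.clone() })
    }

    /// The target restores the state before it resumes.
    fn import(state: ExportedState) -> Self {
        ToyBackend { paused: true, pending: state.pending, submitted: Vec::new() }
    }

    /// On resume, buffered work gets to finish through the new upstairs.
    fn resume(&mut self) {
        self.paused = false;
        self.submit_pending();
    }
}
```

The point of the sketch is the ordering: nothing buffered on the source is dropped, and nothing is submitted on the target until `resume`.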

This will probably require significant work on the Crucible side. On the Propolis end, it might also require us to revisit how block backends and devices pause. (IIUC, today block devices accept new work once they're paused, but don't submit it to their backends; their `paused` futures then do not return until the backends have acknowledged all previously-outstanding work. That outstanding work won't complete in this model, so we'd need to revisit that contract.)

In Matrix, we discussed using flushes as a sort of minimum-viable workaround: at some point between steps 6 and 7,[^1] the backend should send a flush request to the upstairs. If this is acknowledged quickly (note that the VM must be pausing or paused at this point!), migration can proceed; otherwise, migration should fail, and the source should resume execution.
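As a rough sketch of that gate (not real Propolis or Crucible code): the flush acknowledgement is modeled as a message on a `std::sync::mpsc` channel, and `PauseError`, `MigrationDecision`, `wait_for_flush`, and `flush_gate` are all hypothetical names. How the real backend would issue the flush and what deadline counts as "quickly" are left open.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PauseError {
    /// The upstairs did not acknowledge the flush before the deadline.
    FlushTimedOut,
    /// The acknowledgement channel closed without an ack.
    UpstairsGone,
}

#[derive(Debug, PartialEq)]
enum MigrationDecision {
    Proceed,
    AbortAndResumeSource,
}

/// Blocks until the flush is acknowledged or `deadline` elapses.
/// `ack` is a stand-in for however the upstairs reports flush completion.
fn wait_for_flush(ack: &Receiver<()>, deadline: Duration) -> Result<(), PauseError> {
    match ack.recv_timeout(deadline) {
        Ok(()) => Ok(()),
        Err(RecvTimeoutError::Timeout) => Err(PauseError::FlushTimedOut),
        Err(RecvTimeoutError::Disconnected) => Err(PauseError::UpstairsGone),
    }
}

/// Any flush failure fails the migration, so the source keeps running
/// with its own upstairs and can still submit the deferred writes.
fn flush_gate(ack: &Receiver<()>, deadline: Duration) -> MigrationDecision {
    match wait_for_flush(ack, deadline) {
        Ok(()) => MigrationDecision::Proceed,
        Err(_) => MigrationDecision::AbortAndResumeSource,
    }
}
```

Whichever phase ends up hosting this check, it has to be able to return the error to the migration state machine, which is why the choice of phase matters.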

[^1]: The pause phase is a natural choice, but pause is currently infallible, so we'd have to tweak it to make it fallible. The state export phase is another option; that phase is already fallible, but it's squicky to have an export callback that doesn't actually export anything.
