lib: Crucible backend must ensure incomplete writes are not lost during migration #884

@gjcolombo

Description

When a guest writes to a Crucible-backed disk, the Crucible upstairs owned by the disk's backend may acknowledge the write (allowing it to complete from the guest's point of view) before it is actually submitted to any Crucible downstairs servers. This creates a small window during migration where a write can be lost:

  1. The guest writes some data to a Crucible-backed disk device.
  2. Propolis's device layer sends this to the Crucible backend, which relays it to the Crucible upstairs (see block::crucible::process_request).
  3. Crucible upstairs acknowledges the write and completes the request immediately, but doesn't immediately submit anything to the downstairs.
  4. This completion bubbles back up through the device to the guest; the device now believes this work is completed.
  5. The VM begins to migrate.
  6. The pause phase of live migration tells disk devices to quiesce and run down all active block requests. This completes trivially because, as far as the device is concerned, the still-outstanding write was completed in step 4.
  7. Migration completes; the migration target's Crucible backends activate and supersede the source's backends.
  8. The source upstairs finally submits the write that it deferred in step 3. This fails because a newer upstairs activated first.
  9. The guest reads the data it wrote and finds that it's missing!
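
The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions: `Downstairs`, `Upstairs`, the generation check, and `simulate` are hypothetical stand-ins invented for illustration, not Crucible's or Propolis's real types or APIs. The model assumes the downstairs rejects writes from any upstairs older than the most recently activated one, which is how step 8 fails here.

```rust
use std::collections::{HashMap, VecDeque};

/// Toy downstairs (hypothetical): accepts writes only from the most recently
/// activated upstairs generation.
#[derive(Default)]
struct Downstairs {
    active_generation: u64,
    blocks: HashMap<u64, u8>,
}

impl Downstairs {
    fn activate(&mut self, generation: u64) {
        self.active_generation = self.active_generation.max(generation);
    }

    fn write(&mut self, generation: u64, lba: u64, val: u8) -> Result<(), &'static str> {
        if generation < self.active_generation {
            return Err("superseded by a newer upstairs");
        }
        self.blocks.insert(lba, val);
        Ok(())
    }
}

/// Toy upstairs (hypothetical): acknowledges writes before submitting them.
struct Upstairs {
    generation: u64,
    deferred: VecDeque<(u64, u8)>,
}

impl Upstairs {
    fn new(generation: u64) -> Self {
        Upstairs { generation, deferred: VecDeque::new() }
    }

    /// Steps 2-4: buffer the write and return, which the caller treats as an ack.
    fn write(&mut self, lba: u64, val: u8) {
        self.deferred.push_back((lba, val));
    }

    /// Step 8 (or an explicit flush): push buffered writes to the downstairs.
    /// Returns how many writes the downstairs rejected.
    fn submit_deferred(&mut self, ds: &mut Downstairs) -> usize {
        let mut rejected = 0;
        while let Some((lba, val)) = self.deferred.pop_front() {
            if ds.write(self.generation, lba, val).is_err() {
                rejected += 1;
            }
        }
        rejected
    }
}

/// Runs steps 1-9 and returns what the guest reads back from block 7.
fn simulate(flush_before_handoff: bool) -> Option<u8> {
    let mut ds = Downstairs::default();
    let mut source = Upstairs::new(1);
    ds.activate(source.generation);

    source.write(7, 0xAB); // steps 1-4: the guest sees the write as complete

    if flush_before_handoff {
        source.submit_deferred(&mut ds); // drain while the source is still active
    }

    let target = Upstairs::new(2); // step 7: the target activates and supersedes
    ds.activate(target.generation);

    source.submit_deferred(&mut ds); // step 8: the late submission is rejected

    ds.blocks.get(&7).copied() // step 9: the guest reads the block back
}
```

In this model, `simulate(false)` returns `None` (the write is lost), while `simulate(true)` returns `Some(0xAB)` because the deferred write reaches the downstairs before the target activates.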

Crucible backends should get a callout during live migration (through the BlockBackend or Lifecycle traits) that allows them to ensure this can't occur.


The most robust way to fix this is to implement state transfer for Crucible backends. This would look something like this:

  • Crucible backends impl Lifecycle and have a pause implementation.
  • On a request to pause, the backend stops submitting new work to its Crucible upstairs, and possibly also asks that upstairs to quiesce. The backend does not report that it has finished pausing until its upstairs is fully quiesced.
  • The backend then transfers its unsubmitted work, plus any Crucible upstairs state, in the device state transfer phase to be loaded by the target.
  • The target restores this state prior to resuming. When it does resume the buffered work is allowed to try to finish through the new upstairs.
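
As a rough illustration of the bullets above, here is a minimal sketch of a backend that stops submitting on pause, exports its unsubmitted work, and replays it on the target. Everything in it (`ToyBackend`, `PendingWrite`, `BackendState`, and the `pause`/`export`/`import`/`resume` methods) is hypothetical and is not the real `Lifecycle` or `BlockBackend` interface; it also omits quiescing the upstairs and transferring any upstairs state.

```rust
use std::collections::VecDeque;

/// A write the backend has accepted but not yet handed to its upstairs.
#[derive(Clone, Debug, PartialEq)]
struct PendingWrite {
    offset: u64,
    data: Vec<u8>,
}

/// Hypothetical payload sent during the device state transfer phase.
#[derive(Clone, Debug, PartialEq)]
struct BackendState {
    unsubmitted: Vec<PendingWrite>,
}

#[derive(Default)]
struct ToyBackend {
    paused: bool,
    queue: VecDeque<PendingWrite>,
    /// Stand-in for "work handed to the upstairs".
    submitted: Vec<PendingWrite>,
}

impl ToyBackend {
    fn enqueue(&mut self, w: PendingWrite) {
        self.queue.push_back(w);
    }

    /// Hands queued work to the (toy) upstairs unless paused.
    /// Returns the number of writes submitted.
    fn pump(&mut self) -> usize {
        if self.paused {
            return 0;
        }
        let n = self.queue.len();
        self.submitted.extend(self.queue.drain(..));
        n
    }

    /// On pause, stop submitting new work to the upstairs.
    fn pause(&mut self) {
        self.paused = true;
    }

    /// Export whatever is still unsubmitted so the target can load it.
    fn export(&mut self) -> BackendState {
        assert!(self.paused, "export is only valid while paused");
        BackendState { unsubmitted: self.queue.drain(..).collect() }
    }

    /// The target restores the state before resuming.
    fn import(state: BackendState) -> Self {
        ToyBackend {
            paused: true,
            queue: state.unsubmitted.into(),
            submitted: Vec::new(),
        }
    }

    /// After resume, buffered work may finish through the new upstairs.
    fn resume(&mut self) {
        self.paused = false;
    }
}
```

With this shape, a write enqueued on a paused source is never submitted there; it travels in `BackendState` and is submitted by the target's first `pump()` after `resume()`.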

This will probably require significant work on the Crucible side. On the Propolis end, it might also require us to revisit how block backends and devices pause. (IIUC, today block devices accept new work once they're paused, but don't submit it to their backends; their paused futures then do not return until the backends have acknowledged all previously-outstanding work. That outstanding work won't complete in this model, so we'd need to revisit that contract.)

In Matrix, we discussed using flushes as a sort of minimum-viable workaround: at some point between steps 6 and 7,[1] the backend should send a flush request to the upstairs. If this is acknowledged quickly (note that the VM must be pausing or paused at this point!), migration can proceed; otherwise, migration should fail, and the source should resume execution.
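
The flush workaround boils down to "wait for a flush acknowledgement with a deadline, and fail the migration if it doesn't arrive." A minimal sketch of that decision follows, using only the standard library. The names `start_flush`, `flush_before_handoff`, and `PauseOutcome` are hypothetical; the real flush would go through the Crucible upstairs, which is simulated here by a thread that acknowledges after a delay.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PauseOutcome {
    /// The flush was acknowledged in time; migration can continue.
    Proceed,
    /// No acknowledgement in time; fail the migration and resume the source.
    AbortAndResume,
}

/// Hypothetical stand-in for sending a flush to the upstairs. The returned
/// channel receives a message once the simulated flush is acknowledged.
fn start_flush(ack_delay: Duration) -> Receiver<()> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(ack_delay);
        // The receiver may already have given up; ignore send errors.
        let _ = tx.send(());
    });
    rx
}

/// Waits for the flush acknowledgement, but only up to `deadline`.
fn flush_before_handoff(ack: Receiver<()>, deadline: Duration) -> PauseOutcome {
    match ack.recv_timeout(deadline) {
        Ok(()) => PauseOutcome::Proceed,
        // A timeout or a dropped sender both mean the flush was not confirmed.
        Err(_) => PauseOutcome::AbortAndResume,
    }
}
```

Because the outcome can be `AbortAndResume`, whichever phase hosts this check has to be able to report failure.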

Footnotes

  1. The pause phase is a natural choice, but pause is currently infallible, so we'd have to tweak it to make it fallible. The state export phase is another option; that phase is already fallible, but it's squicky to have an export callback that doesn't actually export anything.

Metadata

    Labels

    migration: Issues related to live migration.
    storage: Related to storage devices/backends.
