
Add heal_orphaned_dtl import option for crash recovery during topology changes #18172

@nmarasoiu

Description


Describe the feature you would like to see added to OpenZFS

Add a targeted recovery option (heal_orphaned_dtl) for pool import that detects and clears orphaned DTL (dirty time log) entries on hole/missing vdevs after a crash during topology changes.

TL;DR: when a system crashes mid-detach, orphaned DTL entries can remain on hole vdevs, triggering a phantom resilver (no target device, scans the entire pool). Recovery currently requires disabling all validation (spa_load_verify_*=0), which is a blunt instrument. A targeted healing option would be safer and more user-friendly.

How will this feature improve OpenZFS?

Current situation after crash during vdev detach:

  • Pool import fails validation or triggers a phantom resilver
  • zpool status shows "resilver in progress" but NO device carries the "(resilvering)" marker
  • The resilver scans the entire pool (13.8T in my case) with 0% progress forever
  • zpool scrub -s returns EBUSY
  • Only workarounds: disable all validation OR let the resilver hammer the drives for hours

With this feature:

  • Pool import detects orphaned DTL on holes automatically
  • Clear error message directs user to: zpool import -o heal_orphaned_dtl=on poolname
  • User explicitly consents to healing action
  • Recovery is logged to pool history
  • No phantom resilver triggered

Why not just fix the crash atomicity? Crashes during topology changes are inherently difficult to make fully atomic: hardware failures, kernel panics, and D-state deadlocks can interrupt any transaction. The existing spa_load_verify_* tunables acknowledge this reality. This feature adds a targeted recovery path rather than requiring users to disable all validation.

Additional context

Environment:

  • OpenZFS: zfs-2.2.2-0ubuntu9.4
  • Kernel: 6.8.0-90-generic
  • Pool: dRAID1 + special vdevs + log vdev, ~14TB

What happened:

Mass detach on 2026-01-26 (5 devices in ~1 second):

2026-01-26.19:29:08 [txg:2675586] detach wwn-...-part1
2026-01-26.19:29:08 [txg:2675593] detach wwn-...-part2
2026-01-26.19:29:08 [txg:2675600] detach wwn-...-part3
2026-01-26.19:29:08 [txg:2675607] detach wwn-...-part4
2026-01-26.19:29:09 [txg:2675614] detach wwn-...-part5

System crash (USB drive issues → kernel hang → sysrq-b)

Pool import on 2026-01-29 triggers phantom resilver

Current pool state (6 holes from detached devices):
hole_array[0]: 2
hole_array[1]: 5
hole_array[2]: 7
hole_array[3]: 9
hole_array[4]: 10
hole_array[5]: 11

Phantom resilver (no target):
scan: resilver in progress since Thu Jan 29 06:21:39 2026
0B / 13.8T scanned, 0B / 13.8T issued
0B resilvered, 0.00% done, no estimated completion time

Current workarounds required:

To import at all:

options zfs spa_load_verify_data=0
options zfs spa_load_verify_metadata=0

To prevent system meltdown from phantom resilver:

options zfs zfs_scan_suspend_progress=1
options zfs zfs_vdev_scrub_max_active=0

Proposed implementation: PR at https://github.com/nmarasoiu/zfs/tree/fix/orphaned-dtl-healing

Adds vdev_dtl_check_orphaned(), called during import, which (rough sketch after this list):

  1. Detects DTL objects on hole/missing vdevs
  2. By default: fails import with helpful error message
  3. With -o heal_orphaned_dtl=on: clears orphaned DTL, logs to history, continues
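For concreteness, here is a minimal sketch of the shape such a check could take. It is not the code in the branch above: it assumes the 2.2-era OpenZFS internals I'm running (vdev_ishole, vdev_missing_ops, the vdev_dtl[] range trees, range_tree_is_empty(), range_tree_vacate(), spa_history_log_internal(), cmn_err()), the heal flag, error code, and message text are placeholders for however the import option gets plumbed through, and a real patch would also need proper locking and to free the on-disk DTL space map.

static int
vdev_dtl_check_orphaned(vdev_t *vd, boolean_t heal)
{
    spa_t *spa = vd->vdev_spa;

    /* Walk the whole vdev tree depth-first. */
    for (uint64_t c = 0; c < vd->vdev_children; c++) {
        int err = vdev_dtl_check_orphaned(vd->vdev_child[c], heal);
        if (err != 0)
            return (err);
    }

    /* Only detached (hole) or missing vdevs can carry an orphaned DTL. */
    if (!vd->vdev_ishole && vd->vdev_ops != &vdev_missing_ops)
        return (0);

    /* A fuller check would inspect the on-disk DTL object, not just the in-core tree. */
    if (range_tree_is_empty(vd->vdev_dtl[DTL_MISSING]))
        return (0);

    if (!heal) {
        /* Default: refuse the import and point at the new option. */
        cmn_err(CE_WARN, "ZFS: vdev %llu is a hole but still has DTL "
            "entries; re-import with -o heal_orphaned_dtl=on",
            (u_longlong_t)vd->vdev_id);
        return (SET_ERROR(EINVAL));
    }

    /* Explicit consent: drop the stale in-core DTL and record it. */
    range_tree_vacate(vd->vdev_dtl[DTL_MISSING], NULL, NULL);
    spa_history_log_internal(spa, "heal_orphaned_dtl", NULL,
        "cleared orphaned DTL on hole vdev %llu",
        (u_longlong_t)vd->vdev_id);
    return (0);
}

The refuse-and-explain default branch is what would produce the clear error message described above; actually clearing anything stays behind the explicit -o heal_orphaned_dtl=on opt-in.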

Happy to refine the PR based on feedback. The alternative for me is recreating the entire pool from scratch, which works but feels like a sledgehammer for what could be a scalpel fix.
