fix: retry mkfs on next reconciliation if interrupted #481

Open
LoneExile wants to merge 1 commit into LINBIT:master from LoneExile:fix/mkfs-retry-on-interrupted-provisioning


@LoneExile commented Feb 15, 2026

Problem

When mkfs is interrupted during initial volume provisioning (e.g., timeout on large volumes, or transient failure while the satellite stays running), the DRBD device is left without a filesystem. The volume appears successfully provisioned but gets stuck in a permanent FailedMount loop because fsck fails on the unformatted device (exit code 8: "Bad magic number in super-block").

This does not self-heal because two one-shot gate flags — checkFileSystem and createPrimary — are cleared before their respective mkfs blocks run. If mkfs throws, both flags are already false and mkfs is never retried on subsequent reconciliations.

| Scenario | checkFileSystem | createPrimary | Self-heals? |
| --- | --- | --- | --- |
| Node reboot (satellite restarts) | Resets to true (in-memory) | Controller re-sends via StltRscDfnApiCallHandler | Yes |
| mkfs timeout (satellite running) | Already false | Already false | No, stuck forever |
| mkfs failure (satellite running) | Already false | Already false | No, stuck forever |
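
To make the scenarios concrete, here is a minimal, self-contained model of the checkFileSystem gate only. `SatelliteModel`, `reconcile`, and `restart` are hypothetical names for illustration, not LINSTOR code; the re-sending of createPrimary by the controller is not modeled.

```java
// Simplified model of the pre-fix behavior; all names here are hypothetical.
class SatelliteModel {
    // In-memory only: starts as true whenever the object is constructed.
    private boolean checkFileSystem = true;
    int mkfsAttempts = 0;

    // Pre-fix ordering: the gate is cleared before mkfs runs.
    void reconcile(boolean mkfsFails) {
        if (!checkFileSystem) {
            return; // gate closed, the mkfs block is skipped
        }
        checkFileSystem = false;
        mkfsAttempts++;
        if (mkfsFails) {
            throw new IllegalStateException("mkfs interrupted");
        }
    }

    // A restart discards in-memory state, so the gate is open again.
    static SatelliteModel restart() {
        return new SatelliteModel();
    }
}
```

In this model, a failed `reconcile(true)` followed by any number of further `reconcile(false)` calls never attempts mkfs again, while `SatelliteModel.restart()` yields an instance that does.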

Root Cause

Flag 1: checkFileSystem in MkfsUtils.makeFileSystemOnMarked()

disableCheckFileSystem() is called at line 136 of MkfsUtils.java, before the mkfs loop begins. If any mkfs call throws StorageException (timeout or failure), the flag is already cleared and the next reconciliation skips the entire block.

  • checkFileSystem is a plain boolean field in AbsRscData.java (line 59), initialized to true in the constructor (line 92)
  • Not persisted to database — resets to true on satellite restart

Flag 2: createPrimary in DrbdLayer.condInitialOrSkipSync()

rsc.unsetCreatePrimary() is called at line 1773 of DrbdLayer.java, before both setResourceUpToDate() and the mkfs block. After clearing:

  • PROP_PRIMARY_SET is already set by controller → Branch A (request primary) is skipped
  • createPrimary is false → Branch B (go primary + mkfs) is skipped

Even if checkFileSystem were fixed independently, this outer gate blocks re-entry to the DRBD mkfs path.

  • createPrimary is a plain boolean field in Resource.java (line 86), default false
  • Set by controller via StltRscDfnApiCallHandler.setCreatePrimary() (line 70) after satellite requests primary
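
The two branches can be sketched as a small decision function. This is a simplified model under the assumption that Branch A is taken only while PROP_PRIMARY_SET is unset and Branch B only while createPrimary is true; `DrbdGateModel`, `Action`, and `decide` are hypothetical names, and the real condInitialOrSkipSync() contains more logic than shown.

```java
// Hypothetical model of the outer DRBD gate; not the actual DrbdLayer code.
class DrbdGateModel {
    enum Action { REQUEST_PRIMARY, GO_PRIMARY_AND_MKFS, SKIP }

    static Action decide(boolean primarySetProp, boolean createPrimary) {
        if (!primarySetProp) {
            return Action.REQUEST_PRIMARY;     // Branch A
        }
        if (createPrimary) {
            return Action.GO_PRIMARY_AND_MKFS; // Branch B
        }
        return Action.SKIP;                    // neither branch applies
    }
}
```

After an interrupted mkfs the state is `decide(true, false)`, which yields `SKIP` in this model, matching the stuck case described above.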

Timeout context

The default external command timeout is 45 seconds (ChildProcessHandler.java, line 20). For large volumes (100+ GB), mkfs can exceed this timeout (see #371).
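
As an illustration of how a fixed deadline turns a slow command into a failure, the sketch below runs a task with a timeout using the JDK's `Future.get(timeout, unit)`. It is a stand-in, not the ChildProcessHandler implementation: `TimeoutSketch` and `finishesWithin` are hypothetical names, and the task is a plain `Callable` rather than a real child process.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a command that outlives its deadline is reported as failed.
class TimeoutSketch {
    static boolean finishesWithin(Callable<Void> command, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Void> result = pool.submit(command);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;  // finished before the deadline
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the still-running command
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("command failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller that treats `false` as an error would see a long-running mkfs fail even though nothing is wrong with the device, which is the trigger this PR is concerned with.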

Fix

Move both flags to after their respective mkfs blocks complete successfully. If mkfs throws, the exception exits the method before the flag is cleared, so the next reconciliation retries.

MkfsUtils.java: Move disableCheckFileSystem() from line 136 (before the mkfs loop) to after the loop completes (line 253 in the patched file).

DrbdLayer.java: Move unsetCreatePrimary() from line 1773 (before the mkfs/sync block) to after it completes (line 1797 in the patched file).
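
The reordering can be sketched with a single gate. This is a minimal model of the idea, not the patched code: `FixedGate`, `reconcile`, and `isGateOpen` are hypothetical names, and a boolean parameter stands in for the mkfs call.

```java
// Hypothetical model of the post-fix ordering: clear the gate only on success.
class FixedGate {
    private boolean checkFileSystem = true;
    int mkfsAttempts = 0;

    void reconcile(boolean mkfsFails) {
        if (!checkFileSystem) {
            return; // already formatted successfully in an earlier run
        }
        mkfsAttempts++;
        if (mkfsFails) {
            // The exception leaves the method before the flag is cleared.
            throw new IllegalStateException("mkfs interrupted");
        }
        checkFileSystem = false; // reached only after mkfs completed
    }

    boolean isGateOpen() {
        return checkFileSystem;
    }
}
```

A failed run leaves `isGateOpen()` true, so the next `reconcile` retries; once a run succeeds, later runs are no-ops as before.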

Safety

  • The existing blkid check in hasFileSystem() (MkfsUtils.java line 70) prevents reformatting volumes that already have a filesystem — partially completed multi-volume runs are safe
  • setResourceUpToDate() (initial sync trigger) may run again on retry; DRBD handles redundant primary/secondary transitions gracefully
  • No change to persisted state: both flags are plain in-memory fields and are never written to the database
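
The first point, that a retry after a partial multi-volume run does not reformat anything, can be illustrated as follows. This is a sketch under assumptions: `GuardedMkfs` and `formatAll` are hypothetical, and the in-memory set stands in for what the real hasFileSystem() learns from blkid.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: a filesystem check makes the mkfs loop safe to re-run.
class GuardedMkfs {
    private final Set<String> formatted = new HashSet<>();
    final List<String> mkfsCalls = new ArrayList<>();

    // Stand-in for the blkid-based check.
    boolean hasFileSystem(String device) {
        return formatted.contains(device);
    }

    // Formats every device lacking a filesystem; failOn simulates an interruption.
    void formatAll(List<String> devices, String failOn) {
        for (String device : devices) {
            if (hasFileSystem(device)) {
                continue; // formatted in an earlier run, leave untouched
            }
            mkfsCalls.add(device);
            if (device.equals(failOn)) {
                throw new IllegalStateException("mkfs interrupted on " + device);
            }
            formatted.add(device);
        }
    }
}
```

If the first run fails on the second of three devices, the retry calls mkfs only for the second and third; the first device is skipped by the guard.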

Related

Both MkfsUtils.makeFileSystemOnMarked() and DrbdLayer.condInitialOrSkipSync()
clear their one-shot gate flags before mkfs runs. If mkfs is interrupted
(timeout or failure while the satellite stays running), the flags are
already cleared and mkfs is never retried, leaving the DRBD device
without a filesystem and the volume stuck in FailedMount.

Move both flags to after mkfs succeeds:

- MkfsUtils: move disableCheckFileSystem() from before the mkfs loop to
  after it completes. If mkfs throws, the exception exits the method
  before the flag is cleared, so the next reconciliation retries.

- DrbdLayer: move unsetCreatePrimary() from before the mkfs block to
  after it completes. This keeps the createPrimary gate open on failure
  so the DRBD path re-enters on the next device manager run.

The existing blkid check (hasFileSystem) already guards against
reformatting volumes that have a filesystem, so successfully formatted
volumes from a partial run are not reformatted.

@ghernadi (Contributor)

Honestly I am not sure what to make of this. Why should a second attempt succeed if the first one failed?

Besides, we are about to improve the timeout handling for mkfs as well as for drbdadm create-md to not timeout for large devices. Would that already help your use case or would you still want this PR to be applied?

@LoneExile (Author)

The timeout fix would cover the most common trigger, but the underlying issue remains: if mkfs is interrupted for any reason while the satellite stays running (transient I/O error, OOM-killed child process, etc.), the one-shot flags are already cleared and the volume is stuck forever. Only a satellite restart recovers it.

This PR just makes flag clearing conditional on success. There is no behavior change when mkfs succeeds, and a safe retry on the next reconciliation when it doesn't (guarded by blkid).

Happy to wait or adapt if you'd prefer to land the timeout fix first.
