feat: Workspace filesystem cleanup #391
7 issues
find-bugs: Found 7 issues (4 medium, 3 low)
Medium
Startup registry lock not released if SIGTERM/SIGINT arrives before listen callback - `src/daemon.ts:507-510`
releaseStartupRegistryLock is only invoked from three places: the listen-callback's try/finally, handleStartupServerError, and the outer catch. The signal handlers (SIGTERM/SIGINT → shutdown(0)) are registered at lines 507-508, but shutdown() does not release the startup registry lock. If a signal is delivered between server.listen(...) being scheduled and its callback firing, shutdown() will run, process.exit will occur via the cleanup pipeline, and the filesystem-based registry mutation lock will only be reclaimed via lease expiry (DAEMON_REGISTRY_LOCK_LEASE_MS = 30s). This can transiently block another daemon's startup for the same workspace.
Also found at:
src/daemon.ts:444-452
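One possible shape for a fix, sketched under assumptions: wrap the release in a run-once guard and call it from `shutdown()` as well as from the existing paths. The `once` helper, the call counter, and the `shutdown` signature below are hypothetical stand-ins for illustration, not the code in `src/daemon.ts`.

```typescript
// Hypothetical helper: wraps a release function so it runs at most once,
// no matter how many code paths call it.
type Release = () => void;

function once(release: Release): Release {
  let released = false;
  return () => {
    if (released) return;
    released = true;
    release();
  };
}

// Stand-in for the real lock release; it only counts calls here.
let releaseCount = 0;
const releaseStartupRegistryLockOnce = once(() => {
  releaseCount += 1;
});

// Hypothetical shutdown path: release the startup lock before exiting.
function shutdown(exit: (code: number) => void, code: number): void {
  releaseStartupRegistryLockOnce();
  exit(code);
}

// Simulate a signal arriving before the listen callback fires.
const exitCodes: number[] = [];
shutdown((code) => exitCodes.push(code), 0);

// The listen callback's finally block may still run afterwards; the guard
// turns this second call into a no-op.
releaseStartupRegistryLockOnce();
```

Because the guard is idempotent, the existing three call sites can stay as they are while `shutdown()` gains the release.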
canRemoveRegistryEntry treats missing instanceId as non-removable for live owners - `src/daemon/daemon-registry.ts:219-225`
When allowLiveOwner is true and the caller's pid matches the live entry's pid, removal still requires entry.instanceId !== undefined && options.instanceId === entry.instanceId. Older entries written before instanceId was introduced (the field is optional in the interface and validator) will have entry.instanceId === undefined, making them permanently un-removable by their own owning process even when pid matches. This can leave stale registry files that legitimately belong to the current live process and block subsequent daemon lifecycle operations that depend on cleanup.
Also found at:
src/daemon/daemon-registry.ts:132-140
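A minimal sketch of one way to handle legacy entries, assuming a pid-only fallback is acceptable when the entry has no `instanceId`. The types and the `canRemoveLiveEntry` name are hypothetical and only model the live-owner branch described above; a pid-only match is weaker than an instance match, so this is a trade-off rather than a clear fix.

```typescript
// Hypothetical shapes modelling only the fields mentioned in the finding.
interface RegistryEntry {
  pid: number;
  instanceId?: string; // optional: older entries may not have it
}

interface RemoveOptions {
  allowLiveOwner: boolean;
  pid: number;
  instanceId?: string;
}

function canRemoveLiveEntry(entry: RegistryEntry, options: RemoveOptions): boolean {
  // Only the owning process may remove a live entry.
  if (!options.allowLiveOwner || options.pid !== entry.pid) return false;

  // Legacy entry without instanceId: fall back to pid-only ownership
  // (an assumption of this sketch).
  if (entry.instanceId === undefined) return true;

  // Modern entry: require the instance ids to match as well.
  return options.instanceId === entry.instanceId;
}

// A legacy entry owned by pid 10 becomes removable by pid 10.
const legacyRemovable = canRemoveLiveEntry(
  { pid: 10 },
  { allowLiveOwner: true, pid: 10, instanceId: "a" },
);
```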
Ownerless lock-dir recovery skips post-quarantine ownership verification, allowing destruction of a freshly written lock - `src/utils/fs-lock.ts:117-124`
In tryRecoverExpiredLockDir, when shouldRecoverLockDir returns recovery.owner === null (the lock dir existed but had no owner.json and was older than the lease), the function quarantines the directory and immediately removes it without re-reading the quarantined contents. Between the initial owner read and the rename, another process could have completed createLock (mkdir succeeded earlier, then writeFile of owner.json finished), making the directory a valid live lock. We then rename it away and rm it, silently destroying that process's lock and letting two holders believe they own the same resource. The owner!=null branch guards against this with fsLockOwnersEqual, but the null branch does not.
Also found at:
src/utils/process-liveness.ts:7-12
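The missing check can be sketched as follows: after quarantining an ownerless lock directory, re-read the owner from the quarantined path and restore the directory if an owner has appeared. The `LockFs` interface, `recoverOwnerlessLockDir`, and the in-memory fake are hypothetical and exist only to make the race visible; they are not the API of `src/utils/fs-lock.ts`.

```typescript
// Hypothetical, minimal view of the filesystem operations involved.
interface LockFs {
  rename(from: string, to: string): void;
  readOwner(dir: string): string | null; // owner.json contents, or null if absent
  remove(dir: string): void;
}

type RecoveryResult = "removed" | "restored";

function recoverOwnerlessLockDir(fs: LockFs, lockDir: string, quarantineDir: string): RecoveryResult {
  fs.rename(lockDir, quarantineDir);

  // Re-read after the rename: a contender may have finished writing
  // owner.json between the initial read and the rename.
  if (fs.readOwner(quarantineDir) !== null) {
    fs.rename(quarantineDir, lockDir); // put the live lock back
    return "restored";
  }

  fs.remove(quarantineDir);
  return "removed";
}

// In-memory fake: maps a directory name to its owner.json contents (or null).
function makeFakeFs(dirs: Map<string, string | null>): LockFs {
  return {
    rename(from, to) {
      if (!dirs.has(from)) throw new Error(`ENOENT: ${from}`);
      const owner = dirs.get(from) ?? null;
      dirs.delete(from);
      dirs.set(to, owner);
    },
    readOwner(dir) {
      return dirs.get(dir) ?? null;
    },
    remove(dir) {
      dirs.delete(dir);
    },
  };
}

// Race case: owner.json was written just before the quarantine rename.
const racedDirs = new Map<string, string | null>([["lock", '{"pid":42}']]);
const racedResult = recoverOwnerlessLockDir(makeFakeFs(racedDirs), "lock", "lock.stale");
```

In a real implementation the restore rename can itself fail if another contender has already recreated `lockDir`, so that path needs its own handling.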
Scheduled sweep cooldown can be bypassed before completion, allowing concurrent scheduling for same scope - `src/utils/workspace-filesystem-lifecycle.ts:404-428`
scheduleWorkspaceFilesystemLifecycleSweep only updates lastScheduledAtByScope and lastScheduledAtByPreKey after the sweep completes (in .then). Between scheduling and completion, the runningScheduledSweeps set guards against same-scope re-entry, but the pre-key cooldown check reads lastScheduledAtByPreKey, which has not yet been written. Two callers using different preKey values (e.g. one passing workspaceKey, another passing logDir that resolves to the same scope) can therefore both pass the cooldown gates; one of them then early-returns at the runningScheduledSweeps.has check, so no second sweep starts for that scope. However, a caller using only a logDir override that maps to a different scheduleKey is not stopped by that check and could schedule a redundant concurrent sweep targeting overlapping paths. The cooldown is best-effort, but the asymmetry between pre-key and scope keys means rapid bursts of artifact-created events can trigger more sweeps than intended.
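One way to close the window, sketched under assumptions, is to stamp both cooldown maps at scheduling time rather than on completion. The `SweepScheduler` class and `trySchedule` signature are hypothetical; only the three collection names come from the finding. This sketch does not address sweeps whose distinct scope keys cover overlapping paths.

```typescript
class SweepScheduler {
  private readonly lastScheduledAtByPreKey = new Map<string, number>();
  private readonly lastScheduledAtByScope = new Map<string, number>();
  private readonly runningScheduledSweeps = new Set<string>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  private inCooldown(stamps: Map<string, number>, key: string, now: number): boolean {
    const last = stamps.get(key);
    return last !== undefined && now - last < this.cooldownMs;
  }

  // Returns true if the sweep was scheduled, false if it was suppressed.
  trySchedule(preKey: string, scopeKey: string, now: number, sweep: () => Promise<void>): boolean {
    if (this.inCooldown(this.lastScheduledAtByPreKey, preKey, now)) return false;
    if (this.inCooldown(this.lastScheduledAtByScope, scopeKey, now)) return false;
    if (this.runningScheduledSweeps.has(scopeKey)) return false;

    // Stamp both maps before the sweep starts, not in .then().
    this.lastScheduledAtByPreKey.set(preKey, now);
    this.lastScheduledAtByScope.set(scopeKey, now);
    this.runningScheduledSweeps.add(scopeKey);

    // Clear the running marker on success or failure; errors are swallowed
    // in this sketch.
    const done = (): void => {
      this.runningScheduledSweeps.delete(scopeKey);
    };
    void sweep().then(done, done);
    return true;
  }
}

const scheduler = new SweepScheduler(1000);
const noopSweep = async (): Promise<void> => {};

const first = scheduler.trySchedule("workspaceKey", "scopeA", 0, noopSweep);
// Same preKey, different scope, first sweep still in flight: suppressed
// because the pre-key stamp already exists.
const second = scheduler.trySchedule("workspaceKey", "scopeB", 10, noopSweep);
```

Stamping early means a sweep that fails still consumes its cooldown window, which may or may not be desirable for this code path.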
Low
Failed quarantine restore leaks .stale.<pid>.<uuid> directories indefinitely - `src/utils/fs-lock.ts:56-63`
restoreQuarantinedLockDir intentionally leaves the quarantined directory in place when rename-back fails (e.g., because another contender now holds lockDir). Because the quarantine name embeds the current pid and a fresh UUID, nothing else will ever reclaim or clean it up from this code path. Over time, repeated contention produces an unbounded number of orphan .stale.* directories under the lock parent, a slow disk-fill / inode-exhaustion DoS on long-lived workspaces.
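A cleanup pass could select orphaned quarantine directories by name, as in the sketch below. It assumes the suffix is literally `.stale.<pid>.<uuid>` with a 36-character UUID and that a liveness probe is available; `selectOrphanedQuarantineDirs` and `isProcessAlive` are hypothetical names. PID reuse can make a dead owner look alive, so an age threshold may be needed in addition.

```typescript
// Assumed naming scheme: "<lockDir>.stale.<pid>.<uuid>".
const STALE_NAME = /\.stale\.(\d+)\.[0-9a-f-]{36}$/i;

// Returns the entries that look like quarantine directories whose creating
// process is no longer alive. Pure function: the caller does the deletion.
function selectOrphanedQuarantineDirs(
  names: string[],
  isProcessAlive: (pid: number) => boolean,
): string[] {
  return names.filter((name) => {
    const match = STALE_NAME.exec(name);
    return match !== null && !isProcessAlive(Number(match[1]));
  });
}

const orphaned = selectOrphanedQuarantineDirs(
  [
    "registry.lock",
    "registry.lock.stale.100.123e4567-e89b-12d3-a456-426614174000",
    "registry.lock.stale.200.123e4567-e89b-12d3-a456-426614174001",
  ],
  (pid) => pid === 200, // pretend only pid 200 is still running
);
```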
Lock treated as still valid when expiresAtMs equals now - `src/utils/fs-lock.ts:80-86`
shouldRecoverLockDir compares staleOwner.expiresAtMs with now in a way that treats the boundary as not expired: when the clock equals expiresAtMs exactly, the lock is considered live and recovery is refused, even though the lease has nominally elapsed. This is a minor off-by-one that delays recovery by one tick but does not cause correctness issues; consider treating the lease as expired once now >= expiresAtMs.
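The intended boundary behaviour can be stated as a small predicate. `isLeaseExpired` is a hypothetical name used for illustration; it is not the comparison as written in `shouldRecoverLockDir`.

```typescript
// A lease is treated as expired as soon as the clock reaches expiresAtMs.
function isLeaseExpired(expiresAtMs: number, now: number): boolean {
  return now >= expiresAtMs;
}

const beforeExpiry = isLeaseExpired(1_000, 999); // false: lease still live
const atExpiry = isLeaseExpired(1_000, 1_000); // true: boundary counts as expired
const afterExpiry = isLeaseExpired(1_000, 1_001); // true
```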
Orphaned log file when helper-pid rename fails - `src/utils/simulator-steps.ts:377`
When renameHelperLogPathOrThrow fails, it kills the detached helper and throws, causing the outer catch to close the file descriptor. However, the original log file at the owner-only path (osLogFilePath before rename) is never unlinked, leaving an orphaned zero-or-partial-byte log file in the logs directory. While workspace-scoped cleanup may eventually sweep these, an immediate cleanup on the failure path would prevent accumulation of artifacts when rename consistently fails (e.g., permission issues).
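A sketch of immediate cleanup on the failure path, assuming it is acceptable to delete the original log when the rename fails. `LogFileOps` and `renameLogOrCleanUp` are hypothetical names; the real `renameHelperLogPathOrThrow` also kills the detached helper, which is omitted here.

```typescript
// Hypothetical, minimal view of the file operations involved.
interface LogFileOps {
  rename(from: string, to: string): void;
  unlink(path: string): void;
}

function renameLogOrCleanUp(ops: LogFileOps, from: string, to: string): void {
  try {
    ops.rename(from, to);
  } catch (error) {
    // Best-effort removal of the original log so it is not orphaned.
    try {
      ops.unlink(from);
    } catch {
      // Ignore: the lifecycle sweep remains the fallback.
    }
    throw error; // preserve the original failure for the caller
  }
}

// Fake ops where rename always fails, recording what gets unlinked.
const unlinkedPaths: string[] = [];
const failingOps: LogFileOps = {
  rename() {
    throw new Error("EACCES");
  },
  unlink(path) {
    unlinkedPaths.push(path);
  },
};

let renameFailure = "";
try {
  renameLogOrCleanUp(failingOps, "helper.log", "helper.1234.log");
} catch (error) {
  renameFailure = error instanceof Error ? error.message : String(error);
}
```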
Duration: 18m 13s · Tokens: 1.2M in / 20.8k out · Cost: $4.35 (+merge: $0.00)
Annotations
Check warning on line 510 in src/daemon.ts
sentry-warden / warden: find-bugs
Startup registry lock not released if SIGTERM/SIGINT arrives before listen callback
`releaseStartupRegistryLock` is only invoked from three places: the listen-callback's try/finally, `handleStartupServerError`, and the outer catch. The signal handlers (`SIGTERM`/`SIGINT` → `shutdown(0)`) are registered at lines 507-508, but `shutdown()` does not release the startup registry lock. If a signal is delivered between `server.listen(...)` being scheduled and its callback firing, `shutdown()` will run, `process.exit` will occur via the cleanup pipeline, and the filesystem-based registry mutation lock will only be reclaimed via lease expiry (DAEMON_REGISTRY_LOCK_LEASE_MS = 30s). This can transiently block another daemon's startup for the same workspace.
Check warning on line 452 in src/daemon.ts
sentry-warden / warden: find-bugs
[9KQ-F3Q] Startup registry lock not released if SIGTERM/SIGINT arrives before listen callback (additional location)
Check warning on line 225 in src/daemon/daemon-registry.ts
sentry-warden / warden: find-bugs
canRemoveRegistryEntry treats missing instanceId as non-removable for live owners
When `allowLiveOwner` is true and the caller's pid matches the live entry's pid, removal still requires `entry.instanceId !== undefined && options.instanceId === entry.instanceId`. Older entries written before instanceId was introduced (the field is optional in the interface and validator) will have `entry.instanceId === undefined`, making them permanently un-removable by their own owning process even when pid matches. This can leave stale registry files that legitimately belong to the current live process and block subsequent daemon lifecycle operations that depend on cleanup.
Check warning on line 140 in src/daemon/daemon-registry.ts
sentry-warden / warden: find-bugs
[RBQ-3TW] canRemoveRegistryEntry treats missing instanceId as non-removable for live owners (additional location)
Check warning on line 124 in src/utils/fs-lock.ts
sentry-warden / warden: find-bugs
Ownerless lock-dir recovery skips post-quarantine ownership verification, allowing destruction of a freshly written lock
In `tryRecoverExpiredLockDir`, when `shouldRecoverLockDir` returns `recovery.owner === null` (the lock dir existed but had no `owner.json` and was older than the lease), the function quarantines the directory and immediately removes it without re-reading the quarantined contents. Between the initial owner read and the rename, another process could have completed `createLock` (`mkdir` succeeded earlier, then `writeFile` of `owner.json` finished), making the directory a valid live lock. We then rename it away and `rm` it, silently destroying that process's lock and letting two holders believe they own the same resource. The `owner !== null` branch guards against this with `fsLockOwnersEqual`, but the null branch does not.
Check warning on line 12 in src/utils/process-liveness.ts
sentry-warden / warden: find-bugs
[C42-7XN] Ownerless lock-dir recovery skips post-quarantine ownership verification, allowing destruction of a freshly written lock (additional location)
Check warning on line 428 in src/utils/workspace-filesystem-lifecycle.ts
sentry-warden / warden: find-bugs
Scheduled sweep cooldown can be bypassed before completion, allowing concurrent scheduling for same scope
`scheduleWorkspaceFilesystemLifecycleSweep` only updates `lastScheduledAtByScope` and `lastScheduledAtByPreKey` after the sweep completes (in `.then`). Between scheduling and completion, the `runningScheduledSweeps` set guards against same-scope re-entry, but the pre-key cooldown check reads `lastScheduledAtByPreKey`, which has not yet been written. Two callers using different `preKey` values (e.g. one passing `workspaceKey`, another passing `logDir` that resolves to the same scope) can therefore both pass the cooldown gates; one of them then early-returns at the `runningScheduledSweeps.has` check, so no second sweep starts for that scope. However, a caller using only a `logDir` override that maps to a different scheduleKey is not stopped by that check and could schedule a redundant concurrent sweep targeting overlapping paths. The cooldown is best-effort, but the asymmetry between pre-key and scope keys means rapid bursts of artifact-created events can trigger more sweeps than intended.