fix(runner): self-restart resilience under systemd / external process supervision#814
fix(runner): self-restart resilience under systemd / external process supervision#814heavygee wants to merge 4 commits into
Conversation
…tart The heartbeat in cli/src/runner/run.ts triggers spawnHappyCLI(['runner','start']) + process.exit(0) when getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion when a fresh CLI invocation inspects the live runner. For operators who own process supervision (systemd, tmux, custom rebuild pipelines, etc.), source-file mtimes shift for reasons unrelated to npm upgrades. The clean exit defeats Restart=on-failure under systemd and leaves the machine offline. Setting HAPI_DISABLE_VERSION_HANDOFF=1 in the runner's environment now skips both checks while keeping the rest of the heartbeat (session pruning, state file persistence) intact. Default behavior is unchanged for npm consumers. Co-authored-by: Cursor <cursoragent@cursor.com>
…ndoff
The mtime-driven self-restart in cli/src/runner/run.ts spawned
`hapi runner start` with no arguments, then process.exit(0)'d
unconditionally after a 10s sleep. Two failure modes:
1. The forwarded `runner start-sync` lost the operator's --workspace-root
flags (anything passed at the original invocation). Browse + spawn
silently degraded to "no workspace roots".
2. If the replacement runner failed to come up at all (build was mid-flight,
binary missing, etc.) the original runner still exited cleanly. Under
systemd Restart=on-failure that means no runner is brought back, and
the machine drops off the hub until manual intervention.
Changes:
- persistence.ts: add startedWithArgv?: string[] to RunnerLocallyPersistedState
- run.ts: snapshot process.argv.slice(2) at startup, persist it on initial
state write and on every heartbeat, replay it as the new runner's argv
(default to ['runner','start-sync'] when nothing was captured)
- controlClient.ts: new waitForRunnerHandoff(oldPid, {timeoutMs}) polls
runner.state.json for a different live PID
- run.ts: only clearInterval + process.exit(0) when handoff is confirmed.
On spawn failure or 30s timeout, refresh the mtime baseline (so we don't
respawn-loop on the same drift) and stay alive so the machine keeps
serving.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Findings
-
[Major] Failed handoffs persist the new mtime on the old runner, making stale code look current. Both failure paths assign
startedWithCliMtimeMs = installedCliMtimeMs, and the heartbeat later writes that value intorunner.state.json; becauseisRunnerRunningCurrentlyInstalledHappyVersion()prefers mtime when present, futurehapi runner startcalls will treat the still-old in-memory runner as current after one failed spawn/timeout. Evidence:cli/src/runner/run.ts:854,cli/src/runner/run.ts:867.
Suggested fix:let nextHandoffAttemptAt = 0; if (installedCliMtimeMs !== startedWithCliMtimeMs && Date.now() >= nextHandoffAttemptAt) { // attempt handoff } const deferHandoffRetry = () => { nextHandoffAttemptAt = Date.now() + 5 * 60_000; heartbeatRunning = false; }; // On spawn failure / handoff timeout, call deferHandoffRetry(); // do not mutate startedWithCliMtimeMs, so state still reports the runner as stale.
-
[Major]
HAPI_DISABLE_VERSION_HANDOFFis only checked in the invoking CLI process, not the live runner that wrote the state. Under the documented systemd/custom-supervision use case, the env var is commonly set only on the service; a shellhapi runner startwithout that env will still see mtime drift and stop the supervised runner during rebuild. Evidence:cli/src/runner/controlClient.ts:192.
Suggested fix:export interface RunnerLocallyPersistedState { // ... startedWithVersionHandoffDisabled?: boolean; } const versionHandoffDisabled = process.env.HAPI_DISABLE_VERSION_HANDOFF === '1'; writeRunnerState({ // ... startedWithVersionHandoffDisabled: versionHandoffDisabled }); if (process.env.HAPI_DISABLE_VERSION_HANDOFF === '1' || state.startedWithVersionHandoffDisabled === true) { logger.debug('[RUNNER CONTROL] Version handoff disabled, skipping mtime/version drift check'); } else { // existing mtime/version check }
Questions
- None.
Summary
Review mode: initial
Two runner self-restart correctness issues found. Both are in the new handoff resilience path and can either mask a stale runner after a failed replacement or fail to honor the new supervision opt-out in the common service-only environment configuration.
Testing
- Not run (automation).
HAPI Bot
Two Major correctness fixes flagged by upstream Codex review on PR tiann#814: (1) Stale-mtime poisoning on failed handoff (run.ts:854,867) The previous failure paths assigned startedWithCliMtimeMs = installedCliMtimeMs which the next heartbeat persisted to runner.state.json. Downstream isRunnerRunningCurrentlyInstalledHappyVersion() then reported the still-stale runner as current, masking the failure until the *next* genuine mtime change. Symptom: an mtime change that briefly failed to hand off would be silently forgotten. Fix: leave startedWithCliMtimeMs immutable. Gate handoff entry on a new nextHandoffAttemptAt timestamp; failure paths bump it by HANDOFF_RETRY_BACKOFF_MS (5 min) via deferHandoffRetry(). The heartbeat continues to write the honest "still on the old code" mtime, and the runner naturally re-attempts after the cooldown. (2) HAPI_DISABLE_VERSION_HANDOFF not honored by live runner (controlClient.ts:192, persistence.ts) The env var was only checked in the invoking CLI process. Under the documented systemd use case the env is set on the service unit but NOT on the operator's interactive shell - so a shell `hapi runner start` would still treat mtime drift as stale and kill the supervised runner during a rebuild. The exact regression this layer was built to prevent. Fix: capture HAPI_DISABLE_VERSION_HANDOFF at runner start time into state.startedWithVersionHandoffDisabled, persisted via the heartbeat. The controlClient mtime check now OR's the live env var with the persisted snapshot, so any caller honours the running runner's opt-out regardless of their own environment. Tests: cli typecheck clean; 14/14 runner unit tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>
|
All four Codex Major findings are now addressed via thread replies (resolved):
Leaving this struck-through rather than deleting so the conversation thread record is preserved. Apologies for the off-protocol original. |
There was a problem hiding this comment.
Findings
- [Major] Raw argv capture drops the top-level command in compiled runner handoff —
process.argv.slice(2)is not the normalized CLI argv this repo uses elsewhere; in compiled mode it persists['start-sync', ...], so self-restart spawnshapi start-sync ...andresolveCommand()falls back to the default Claude command instead ofrunner start-sync. Evidence:cli/src/runner/run.ts:672.
Suggested fix:import { getCliArgs } from '@/utils/cliArgs'; const startedWithArgv = getCliArgs(); const handoffArgv = startedWithArgv[0] === 'runner' ? startedWithArgv : ['runner', 'start-sync'];
- [Major] Handoff wait is bypassed by replacement startup killing the old runner — the spawned replacement still enters
startRunner(), sees the old runner as stale, and callsstopRunner()before it can acquire the lock/write new state. That means the old process can exit via/stopbeforewaitForRunnerHandoff()confirms a different PID; if the replacement then fails before writing state, the machine still goes offline. Evidence:cli/src/runner/run.ts:892.
Suggested fix:// Replacement startup: in explicit handoff mode, do not ask the child to stop the old runner. const handoffFromPid = Number(process.env.HAPI_RUNNER_HANDOFF_FROM_PID); if (!runningRunnerVersionMatches) { const state = await readRunnerState(); if (!Number.isFinite(handoffFromPid) || state?.pid !== handoffFromPid) { await stopRunner(); } } // Old runner stale-version path: only use waitForRunnerHandoff with a mode // where the old runner owns shutdown/retry, not the replacement child. const child = spawnHappyCLI(handoffArgv, { detached: true, stdio: 'ignore', env: { ...process.env, HAPI_RUNNER_HANDOFF_FROM_PID: String(process.pid) } }); child.unref();
Questions
- None.
Summary
- Review mode: follow-up after new commits.
- The previous bot findings were addressed directly, but the latest full diff still leaves self-restart broken for compiled binaries and does not actually guarantee the old runner stays alive until replacement state is written.
Testing
- Not run (automation).
HAPI Bot
Two additional Major findings on the runner self-restart layer that were not addressed in a49fc57: 1. run.ts:672 - process.argv.slice(2) returns ['start-sync', ...] in compiled binary mode (raw argv is [hapi, runner, start-sync, ...]), so the handoff spawned `hapi start-sync ...` which resolveCommand treats as an unknown top-level and falls back to Claude. Replaced with getCliArgs() (the project's canonical argv normalizer) plus a defensive guard that falls back to ['runner', 'start-sync'] if the captured argv does not begin with 'runner'. 2. run.ts:892 - waitForRunnerHandoff did not actually keep the old runner alive. The child's startRunner() unconditionally called stopRunner() before acquiring the lock or writing its own state, so the parent's /stop handler resolved shutdown and exited BEFORE the child committed. If the child then failed (lock contention, auth error, anything between stopRunner and writeRunnerState), the machine went offline with no runner at all. New handoff protocol: - Parent sets HAPI_RUNNER_HANDOFF_FROM_PID=<pid> on the spawned child's env, then releases the lock BEFORE entering waitForRunnerHandoff (breaks the parent-holds-lock / child-needs-lock-to-write-state deadlock). - On wait-timeout the parent re-acquires the lock (long-retry, 30s) and defers retry; if re-acquire fails (third party took the lock) the parent exits cleanly so it does not stay alive without the lock invariant. - Child detects the env signal; if state.pid matches and that pid is alive, this is an authorized handoff: skip stopRunner(), skip the version-match early-exit, and acquire the lock with a longer retry window (60 attempts x 500ms) so it waits through the parent's asynchronous release. CLI typecheck clean. 14/14 runner unit tests still pass. The wider 46/664 failures in the CLI suite are pre-existing in this branch (unrelated: AppServerEventConverter, cursorEventConverter, hook server, Query) - baseline before this commit has 42+; my changes do not regress them. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Findings
- None.
Summary
- Review mode: follow-up after new commits.
- No reportable issues found in the latest diff at >=80% confidence.
- Residual risk: the new self-restart handoff and
HAPI_DISABLE_VERSION_HANDOFFpersistence paths do not appear to have automated coverage in the touched tests.
Testing
- Not run (automation).
HAPI Bot
Postmortem context: tiann#814 #issuecomment-4639449666 (2026-06-06). Top-level PR comment was created instead of replying to the bot's review threads and resolving them. New canonical helper: scripts/tooling/hapi-pr-reply.sh (shim at ~/.local/bin/hapi-pr-reply) Atomically: 1. POST reply to the review thread containing <comment_id>. 2. Run resolveReviewThread GraphQL mutation on that thread. Looks up the GraphQL thread id by REST comment id, validates inputs, and aborts at the right step on any failure so you never end up with "reply succeeded but thread looks unresolved" drift. Companion doc: docs/tooling/pr-reply.md explains the helper, the underlying API calls (for when the helper cannot be used), and the Cursor beforeShellExecution hard-blocks that now refuse gh pr comment and git push origin on a PR with unresolved review threads (personal hooks live outside this repo and are not committed here). Co-authored-by: Cursor <cursoragent@cursor.com>
Closes #816
Problem
The runner's heartbeat in
cli/src/runner/run.tstriggers a self-restart whenevergetInstalledCliMtimeMs()differs fromstartedWithCliMtimeMs. The same mtime guard fires incontrolClient.isRunnerRunningCurrentlyInstalledHappyVersion. This is great for npm consumers - clean zero-downtime upgrades when a new version lands on disk.It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons unrelated to an actual npm upgrade - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit simultaneously:
hapi runner startandprocess.exit(0)s. systemd'sRestart=on-failuredoes not restart on clean exit. Machine drops off the hub until manual intervention.--workspace-root) is silently lost. Browse + spawn degrade tono workspace rootswithout warning.In practice these hit together: even if you patch systemd to restart on clean exit, the new runner comes up with the wrong configuration.
Fix
Two commits:
1.
HAPI_DISABLE_VERSION_HANDOFF=1opt-out (commit 1)When that env var is set in the runner's environment, both mtime/version drift checks are skipped. Heartbeat continues normally (session pruning, state-file persistence still run). Default behavior is unchanged for npm consumers - this is purely opt-in for operators who own process supervision.
2. Preserve argv across self-restart + verify handoff (commit 2)
Even when version-handoff IS enabled, the existing self-restart path had two correctness bugs:
process.argv.slice(2)at startup, persist it on initial state write and every heartbeat, replay it as the new runner's argv (default to['runner','start-sync']when nothing was captured).waitForRunnerHandoff(oldPid, {timeoutMs})incontrolClient.tspollsrunner.state.jsonfor a different live PID before the dying runner exits. On spawn failure or 30s timeout, the old runner refreshes its mtime baseline (so it doesn't respawn-loop on the same drift) and stays alive so the machine keeps serving the hub.Files
cli/src/persistence.ts(+7): addstartedWithArgv?: string[]toRunnerLocallyPersistedState.cli/src/runner/controlClient.ts(+58/-2): env-var guard in version check; newwaitForRunnerHandoffpoller.cli/src/runner/run.ts(+96/-10): argv snapshot/persist/replay, both env-var guard sites, handoff confirmation before exit.Compatibility
Environment=HAPI_DISABLE_VERSION_HANDOFF=1. Recommended for any setup where the supervisor owns the runner's lifecycle.Tested
In-production on a systemd-supervised runner (Debian, Bun runtime, custom rebuild pipeline) for ~6 days.
HAPI_DISABLE_VERSION_HANDOFF=1set in the unit file; runner has not self-killed on any rebuild since. Self-restart paths exercised via deliberate mtime touches; argv preservation verified by inspecting the replacement runner's/proc/<pid>/cmdline.