Skip to content

fix(runner): self-restart resilience under systemd / external process supervision#814

Open
heavygee wants to merge 4 commits into
tiann:mainfrom
heavygee:fix/runner-systemd-self-restart-resilience
Open

fix(runner): self-restart resilience under systemd / external process supervision#814
heavygee wants to merge 4 commits into
tiann:mainfrom
heavygee:fix/runner-systemd-self-restart-resilience

Conversation

@heavygee
Copy link
Copy Markdown
Collaborator

@heavygee heavygee commented Jun 6, 2026

Closes #816

Problem

The runner's heartbeat in cli/src/runner/run.ts triggers a self-restart whenever getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion. This is great for npm consumers - clean zero-downtime upgrades when a new version lands on disk.

It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons unrelated to an actual npm upgrade - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit simultaneously:

  1. The runner spawns hapi runner start and process.exit(0)s. systemd's Restart=on-failure does not restart on clean exit. Machine drops off the hub until manual intervention.
  2. The replacement runner is spawned with no argv - anything the operator passed at the original invocation (e.g. --workspace-root) is silently lost. Browse + spawn degrade to no workspace roots without warning.

In practice these hit together: even if you patch systemd to restart on clean exit, the new runner comes up with the wrong configuration.

Fix

Two commits:

1. HAPI_DISABLE_VERSION_HANDOFF=1 opt-out (commit 1)

When that env var is set in the runner's environment, both mtime/version drift checks are skipped. Heartbeat continues normally (session pruning, state-file persistence still run). Default behavior is unchanged for npm consumers - this is purely opt-in for operators who own process supervision.

2. Preserve argv across self-restart + verify handoff (commit 2)

Even when version-handoff IS enabled, the existing self-restart path had two correctness bugs:

  • argv loss: snapshot process.argv.slice(2) at startup, persist it on initial state write and every heartbeat, replay it as the new runner's argv (default to ['runner','start-sync'] when nothing was captured).
  • fail-silent on respawn failure: new waitForRunnerHandoff(oldPid, {timeoutMs}) in controlClient.ts polls runner.state.json for a different live PID before the dying runner exits. On spawn failure or 30s timeout, the old runner refreshes its mtime baseline (so it doesn't respawn-loop on the same drift) and stays alive so the machine keeps serving the hub.

Files

  • cli/src/persistence.ts (+7): add startedWithArgv?: string[] to RunnerLocallyPersistedState.
  • cli/src/runner/controlClient.ts (+58/-2): env-var guard in version check; new waitForRunnerHandoff poller.
  • cli/src/runner/run.ts (+96/-10): argv snapshot/persist/replay, both env-var guard sites, handoff confirmation before exit.

Compatibility

  • npm consumers: zero behavior change. Env var is unset; mtime drift triggers self-restart exactly as before. Self-restart now correctly preserves argv (this is a pure improvement; no downside).
  • systemd / tmux / supervisord operators: opt-in via Environment=HAPI_DISABLE_VERSION_HANDOFF=1. Recommended for any setup where the supervisor owns the runner's lifecycle.

Tested

In-production on a systemd-supervised runner (Debian, Bun runtime, custom rebuild pipeline) for ~6 days. HAPI_DISABLE_VERSION_HANDOFF=1 set in the unit file; runner has not self-killed on any rebuild since. Self-restart paths exercised via deliberate mtime touches; argv preservation verified by inspecting the replacement runner's /proc/<pid>/cmdline.

heavygee and others added 2 commits June 6, 2026 12:19
…tart

The heartbeat in cli/src/runner/run.ts triggers spawnHappyCLI(['runner','start'])
+ process.exit(0) when getInstalledCliMtimeMs() differs from startedWithCliMtimeMs.
The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion
when a fresh CLI invocation inspects the live runner.

For operators who own process supervision (systemd, tmux, custom rebuild
pipelines, etc.), source-file mtimes shift for reasons unrelated to npm
upgrades. The clean exit defeats Restart=on-failure under systemd and
leaves the machine offline.

Setting HAPI_DISABLE_VERSION_HANDOFF=1 in the runner's environment now skips
both checks while keeping the rest of the heartbeat (session pruning, state
file persistence) intact. Default behavior is unchanged for npm consumers.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ndoff

The mtime-driven self-restart in cli/src/runner/run.ts spawned
`hapi runner start` with no arguments, then process.exit(0)'d
unconditionally after a 10s sleep. Two failure modes:

1. The forwarded `runner start-sync` lost the operator's --workspace-root
   flags (anything passed at the original invocation). Browse + spawn
   silently degraded to "no workspace roots".
2. If the replacement runner failed to come up at all (build was mid-flight,
   binary missing, etc.) the original runner still exited cleanly. Under
   systemd Restart=on-failure that means no runner is brought back, and
   the machine drops off the hub until manual intervention.

Changes:

- persistence.ts: add startedWithArgv?: string[] to RunnerLocallyPersistedState
- run.ts: snapshot process.argv.slice(2) at startup, persist it on initial
  state write and on every heartbeat, replay it as the new runner's argv
  (default to ['runner','start-sync'] when nothing was captured)
- controlClient.ts: new waitForRunnerHandoff(oldPid, {timeoutMs}) polls
  runner.state.json for a different live PID
- run.ts: only clearInterval + process.exit(0) when handoff is confirmed.
  On spawn failure or 30s timeout, refresh the mtime baseline (so we don't
  respawn-loop on the same drift) and stay alive so the machine keeps
  serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Failed handoffs persist the new mtime on the old runner, making stale code look current. Both failure paths assign startedWithCliMtimeMs = installedCliMtimeMs, and the heartbeat later writes that value into runner.state.json; because isRunnerRunningCurrentlyInstalledHappyVersion() prefers mtime when present, future hapi runner start calls will treat the still-old in-memory runner as current after one failed spawn/timeout. Evidence: cli/src/runner/run.ts:854, cli/src/runner/run.ts:867.
    Suggested fix:

    let nextHandoffAttemptAt = 0;
    
    if (installedCliMtimeMs !== startedWithCliMtimeMs && Date.now() >= nextHandoffAttemptAt) {
      // attempt handoff
    }
    
    const deferHandoffRetry = () => {
      nextHandoffAttemptAt = Date.now() + 5 * 60_000;
      heartbeatRunning = false;
    };
    
    // On spawn failure / handoff timeout, call deferHandoffRetry();
    // do not mutate startedWithCliMtimeMs, so state still reports the runner as stale.
  • [Major] HAPI_DISABLE_VERSION_HANDOFF is only checked in the invoking CLI process, not the live runner that wrote the state. Under the documented systemd/custom-supervision use case, the env var is commonly set only on the service; a shell hapi runner start without that env will still see mtime drift and stop the supervised runner during rebuild. Evidence: cli/src/runner/controlClient.ts:192.
    Suggested fix:

    export interface RunnerLocallyPersistedState {
      // ...
      startedWithVersionHandoffDisabled?: boolean;
    }
    
    const versionHandoffDisabled = process.env.HAPI_DISABLE_VERSION_HANDOFF === '1';
    writeRunnerState({
      // ...
      startedWithVersionHandoffDisabled: versionHandoffDisabled
    });
    
    if (process.env.HAPI_DISABLE_VERSION_HANDOFF === '1' || state.startedWithVersionHandoffDisabled === true) {
      logger.debug('[RUNNER CONTROL] Version handoff disabled, skipping mtime/version drift check');
    } else {
      // existing mtime/version check
    }

Questions

  • None.

Summary
Review mode: initial
Two runner self-restart correctness issues found. Both are in the new handoff resilience path and can either mask a stale runner after a failed replacement or fail to honor the new supervision opt-out in the common service-only environment configuration.

Testing

  • Not run (automation).

HAPI Bot

Comment thread cli/src/runner/run.ts Outdated
Comment thread cli/src/runner/controlClient.ts Outdated
Two Major correctness fixes flagged by upstream Codex review on PR tiann#814:

(1) Stale-mtime poisoning on failed handoff (run.ts:854,867)

The previous failure paths assigned
  startedWithCliMtimeMs = installedCliMtimeMs
which the next heartbeat persisted to runner.state.json. Downstream
isRunnerRunningCurrentlyInstalledHappyVersion() then reported the
still-stale runner as current, masking the failure until the *next*
genuine mtime change. Symptom: an mtime change that briefly failed
to hand off would be silently forgotten.

Fix: leave startedWithCliMtimeMs immutable. Gate handoff entry on
a new nextHandoffAttemptAt timestamp; failure paths bump it by
HANDOFF_RETRY_BACKOFF_MS (5 min) via deferHandoffRetry(). The
heartbeat continues to write the honest "still on the old code"
mtime, and the runner naturally re-attempts after the cooldown.

(2) HAPI_DISABLE_VERSION_HANDOFF not honored by live runner
    (controlClient.ts:192, persistence.ts)

The env var was only checked in the invoking CLI process. Under the
documented systemd use case the env is set on the service unit but
NOT on the operator's interactive shell - so a shell `hapi runner
start` would still treat mtime drift as stale and kill the supervised
runner during a rebuild. The exact regression this layer was built
to prevent.

Fix: capture HAPI_DISABLE_VERSION_HANDOFF at runner start time into
state.startedWithVersionHandoffDisabled, persisted via the heartbeat.
The controlClient mtime check now OR's the live env var with the
persisted snapshot, so any caller honours the running runner's
opt-out regardless of their own environment.

Tests: cli typecheck clean; 14/14 runner unit tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
@heavygee
Copy link
Copy Markdown
Collaborator Author

heavygee commented Jun 6, 2026

[superseded] This top-level comment was the wrong shape - this project responds to bot review findings on the threads themselves, not via standalone PR comments.

All four Codex Major findings are now addressed via thread replies (resolved):

  • run.ts stale-mtime - replied + resolved (a49fc57)
  • controlClient.ts HAPI_DISABLE_VERSION_HANDOFF persistence - replied + resolved (a49fc57)
  • run.ts:672 argv slice in compiled binary mode - replied + resolved (4a03f42)
  • run.ts:892 waitForRunnerHandoff race with child stopRunner() - replied + resolved (4a03f42)

Leaving this struck-through rather than deleting so the conversation thread record is preserved. Apologies for the off-protocol original.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Raw argv capture drops the top-level command in compiled runner handoff — process.argv.slice(2) is not the normalized CLI argv this repo uses elsewhere; in compiled mode it persists ['start-sync', ...], so self-restart spawns hapi start-sync ... and resolveCommand() falls back to the default Claude command instead of runner start-sync. Evidence: cli/src/runner/run.ts:672.
    Suggested fix:
    import { getCliArgs } from '@/utils/cliArgs';
    
    const startedWithArgv = getCliArgs();
    const handoffArgv = startedWithArgv[0] === 'runner'
      ? startedWithArgv
      : ['runner', 'start-sync'];
  • [Major] Handoff wait is bypassed by replacement startup killing the old runner — the spawned replacement still enters startRunner(), sees the old runner as stale, and calls stopRunner() before it can acquire the lock/write new state. That means the old process can exit via /stop before waitForRunnerHandoff() confirms a different PID; if the replacement then fails before writing state, the machine still goes offline. Evidence: cli/src/runner/run.ts:892.
    Suggested fix:
    // Replacement startup: in explicit handoff mode, do not ask the child to stop the old runner.
    const handoffFromPid = Number(process.env.HAPI_RUNNER_HANDOFF_FROM_PID);
    if (!runningRunnerVersionMatches) {
      const state = await readRunnerState();
      if (!Number.isFinite(handoffFromPid) || state?.pid !== handoffFromPid) {
        await stopRunner();
      }
    }
    
    // Old runner stale-version path: only use waitForRunnerHandoff with a mode
    // where the old runner owns shutdown/retry, not the replacement child.
    const child = spawnHappyCLI(handoffArgv, {
      detached: true,
      stdio: 'ignore',
      env: { ...process.env, HAPI_RUNNER_HANDOFF_FROM_PID: String(process.pid) }
    });
    child.unref();

Questions

  • None.

Summary

  • Review mode: follow-up after new commits.
  • The previous bot findings were addressed directly, but the latest full diff still leaves self-restart broken for compiled binaries and does not actually guarantee the old runner stays alive until replacement state is written.

Testing

  • Not run (automation).

HAPI Bot

Comment thread cli/src/runner/run.ts Outdated
Comment thread cli/src/runner/run.ts
Two additional Major findings on the runner self-restart layer that
were not addressed in a49fc57:

1. run.ts:672 - process.argv.slice(2) returns ['start-sync', ...] in
   compiled binary mode (raw argv is [hapi, runner, start-sync, ...]),
   so the handoff spawned `hapi start-sync ...` which resolveCommand
   treats as an unknown top-level and falls back to Claude. Replaced
   with getCliArgs() (the project's canonical argv normalizer) plus a
   defensive guard that falls back to ['runner', 'start-sync'] if the
   captured argv does not begin with 'runner'.

2. run.ts:892 - waitForRunnerHandoff did not actually keep the old
   runner alive. The child's startRunner() unconditionally called
   stopRunner() before acquiring the lock or writing its own state,
   so the parent's /stop handler resolved shutdown and exited BEFORE
   the child committed. If the child then failed (lock contention,
   auth error, anything between stopRunner and writeRunnerState),
   the machine went offline with no runner at all.

   New handoff protocol:
   - Parent sets HAPI_RUNNER_HANDOFF_FROM_PID=<pid> on the spawned
     child's env, then releases the lock BEFORE entering
     waitForRunnerHandoff (breaks the parent-holds-lock /
     child-needs-lock-to-write-state deadlock).
   - On wait-timeout the parent re-acquires the lock (long-retry, 30s)
     and defers retry; if re-acquire fails (third party took the
     lock) the parent exits cleanly so it does not stay alive without
     the lock invariant.
   - Child detects the env signal; if state.pid matches and that pid
     is alive, this is an authorized handoff: skip stopRunner(),
     skip the version-match early-exit, and acquire the lock with a
     longer retry window (60 attempts x 500ms) so it waits through
     the parent's asynchronous release.

CLI typecheck clean. 14/14 runner unit tests still pass. The wider
46/664 failures in the CLI suite are pre-existing in this branch
(unrelated: AppServerEventConverter, cursorEventConverter, hook
server, Query) - baseline before this commit has 42+; my changes do
not regress them.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • None.

Summary

  • Review mode: follow-up after new commits.
  • No reportable issues found in the latest diff at >=80% confidence.
  • Residual risk: the new self-restart handoff and HAPI_DISABLE_VERSION_HANDOFF persistence paths do not appear to have automated coverage in the touched tests.

Testing

  • Not run (automation).

HAPI Bot

heavygee added a commit to heavygee/hapi that referenced this pull request Jun 6, 2026
Postmortem context: tiann#814 #issuecomment-4639449666 (2026-06-06).
Top-level PR comment was created instead of replying to the bot's review
threads and resolving them.

New canonical helper:
  scripts/tooling/hapi-pr-reply.sh (shim at ~/.local/bin/hapi-pr-reply)

Atomically:
  1. POST reply to the review thread containing <comment_id>.
  2. Run resolveReviewThread GraphQL mutation on that thread.

Looks up the GraphQL thread id by REST comment id, validates inputs,
and aborts at the right step on any failure so you never end up with
"reply succeeded but thread looks unresolved" drift.

Companion doc: docs/tooling/pr-reply.md explains the helper, the
underlying API calls (for when the helper cannot be used), and the
Cursor beforeShellExecution hard-blocks that now refuse gh pr comment
and git push origin on a PR with unresolved review threads (personal
hooks live outside this repo and are not committed here).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(runner): mtime-driven self-restart kills systemd-supervised runners and loses argv

1 participant