fix(runner): self-restart resilience under systemd / external process supervision by heavygee · Pull Request #814 · tiann/hapi

heavygee · 2026-06-06T11:20:49Z

Closes #816

Problem

The runner's heartbeat in cli/src/runner/run.ts triggers a self-restart whenever getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion. This is great for npm consumers - clean zero-downtime upgrades when a new version lands on disk.

It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons unrelated to an actual npm upgrade - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit simultaneously:

The runner spawns hapi runner start and process.exit(0)s. systemd's Restart=on-failure does not restart on clean exit. Machine drops off the hub until manual intervention.
The replacement runner is spawned with no argv - anything the operator passed at the original invocation (e.g. --workspace-root) is silently lost. Browse + spawn degrade to no workspace roots without warning.

In practice these hit together: even if you patch systemd to restart on clean exit, the new runner comes up with the wrong configuration.

Fix

Two commits:

1. `HAPI_DISABLE_VERSION_HANDOFF=1` opt-out (commit 1)

When that env var is set in the runner's environment, both mtime/version drift checks are skipped. Heartbeat continues normally (session pruning, state-file persistence still run). Default behavior is unchanged for npm consumers - this is purely opt-in for operators who own process supervision.

2. Preserve argv across self-restart + verify handoff (commit 2)

Even when version-handoff IS enabled, the existing self-restart path had two correctness bugs:

argv loss: snapshot process.argv.slice(2) at startup, persist it on initial state write and every heartbeat, replay it as the new runner's argv (default to ['runner','start-sync'] when nothing was captured).
fail-silent on respawn failure: new waitForRunnerHandoff(oldPid, {timeoutMs}) in controlClient.ts polls runner.state.json for a different live PID before the dying runner exits. On spawn failure or 30s timeout, the old runner refreshes its mtime baseline (so it doesn't respawn-loop on the same drift) and stays alive so the machine keeps serving the hub.

Files

cli/src/persistence.ts (+7): add startedWithArgv?: string[] to RunnerLocallyPersistedState.
cli/src/runner/controlClient.ts (+58/-2): env-var guard in version check; new waitForRunnerHandoff poller.
cli/src/runner/run.ts (+96/-10): argv snapshot/persist/replay, both env-var guard sites, handoff confirmation before exit.

Compatibility

npm consumers: zero behavior change. Env var is unset; mtime drift triggers self-restart exactly as before. Self-restart now correctly preserves argv (this is a pure improvement; no downside).
systemd / tmux / supervisord operators: opt-in via Environment=HAPI_DISABLE_VERSION_HANDOFF=1. Recommended for any setup where the supervisor owns the runner's lifecycle.

Tested

In-production on a systemd-supervised runner (Debian, Bun runtime, custom rebuild pipeline) for ~6 days. HAPI_DISABLE_VERSION_HANDOFF=1 set in the unit file; runner has not self-killed on any rebuild since. Self-restart paths exercised via deliberate mtime touches; argv preservation verified by inspecting the replacement runner's /proc/<pid>/cmdline.

…tart The heartbeat in cli/src/runner/run.ts triggers spawnHappyCLI(['runner','start']) + process.exit(0) when getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion when a fresh CLI invocation inspects the live runner. For operators who own process supervision (systemd, tmux, custom rebuild pipelines, etc.), source-file mtimes shift for reasons unrelated to npm upgrades. The clean exit defeats Restart=on-failure under systemd and leaves the machine offline. Setting HAPI_DISABLE_VERSION_HANDOFF=1 in the runner's environment now skips both checks while keeping the rest of the heartbeat (session pruning, state file persistence) intact. Default behavior is unchanged for npm consumers. Co-authored-by: Cursor <cursoragent@cursor.com>

…ndoff The mtime-driven self-restart in cli/src/runner/run.ts spawned `hapi runner start` with no arguments, then process.exit(0)'d unconditionally after a 10s sleep. Two failure modes: 1. The forwarded `runner start-sync` lost the operator's --workspace-root flags (anything passed at the original invocation). Browse + spawn silently degraded to "no workspace roots". 2. If the replacement runner failed to come up at all (build was mid-flight, binary missing, etc.) the original runner still exited cleanly. Under systemd Restart=on-failure that means no runner is brought back, and the machine drops off the hub until manual intervention. Changes: - persistence.ts: add startedWithArgv?: string[] to RunnerLocallyPersistedState - run.ts: snapshot process.argv.slice(2) at startup, persist it on initial state write and on every heartbeat, replay it as the new runner's argv (default to ['runner','start-sync'] when nothing was captured) - controlClient.ts: new waitForRunnerHandoff(oldPid, {timeoutMs}) polls runner.state.json for a different live PID - run.ts: only clearInterval + process.exit(0) when handoff is confirmed. On spawn failure or 30s timeout, refresh the mtime baseline (so we don't respawn-loop on the same drift) and stay alive so the machine keeps serving. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions

Findings

[Major] Failed handoffs persist the new mtime on the old runner, making stale code look current. Both failure paths assign startedWithCliMtimeMs = installedCliMtimeMs, and the heartbeat later writes that value into runner.state.json; because isRunnerRunningCurrentlyInstalledHappyVersion() prefers mtime when present, future hapi runner start calls will treat the still-old in-memory runner as current after one failed spawn/timeout. Evidence: cli/src/runner/run.ts:854, cli/src/runner/run.ts:867.
Suggested fix:
```
let nextHandoffAttemptAt = 0;

if (installedCliMtimeMs !== startedWithCliMtimeMs && Date.now() >= nextHandoffAttemptAt) {
  // attempt handoff
}

const deferHandoffRetry = () => {
  nextHandoffAttemptAt = Date.now() + 5 * 60_000;
  heartbeatRunning = false;
};

// On spawn failure / handoff timeout, call deferHandoffRetry();
// do not mutate startedWithCliMtimeMs, so state still reports the runner as stale.
```

[Major] HAPI_DISABLE_VERSION_HANDOFF is only checked in the invoking CLI process, not the live runner that wrote the state. Under the documented systemd/custom-supervision use case, the env var is commonly set only on the service; a shell hapi runner start without that env will still see mtime drift and stop the supervised runner during rebuild. Evidence: cli/src/runner/controlClient.ts:192.
Suggested fix:

export interface RunnerLocallyPersistedState {
  // ...
  startedWithVersionHandoffDisabled?: boolean;
}

const versionHandoffDisabled = process.env.HAPI_DISABLE_VERSION_HANDOFF === '1';
writeRunnerState({
  // ...
  startedWithVersionHandoffDisabled: versionHandoffDisabled
});

if (process.env.HAPI_DISABLE_VERSION_HANDOFF === '1' || state.startedWithVersionHandoffDisabled === true) {
  logger.debug('[RUNNER CONTROL] Version handoff disabled, skipping mtime/version drift check');
} else {
  // existing mtime/version check
}

Questions

None.

Summary
Review mode: initial
Two runner self-restart correctness issues found. Both are in the new handoff resilience path and can either mask a stale runner after a failed replacement or fail to honor the new supervision opt-out in the common service-only environment configuration.

Testing

Not run (automation).

HAPI Bot

Two Major correctness fixes flagged by upstream Codex review on PR tiann#814: (1) Stale-mtime poisoning on failed handoff (run.ts:854,867) The previous failure paths assigned startedWithCliMtimeMs = installedCliMtimeMs which the next heartbeat persisted to runner.state.json. Downstream isRunnerRunningCurrentlyInstalledHappyVersion() then reported the still-stale runner as current, masking the failure until the *next* genuine mtime change. Symptom: an mtime change that briefly failed to hand off would be silently forgotten. Fix: leave startedWithCliMtimeMs immutable. Gate handoff entry on a new nextHandoffAttemptAt timestamp; failure paths bump it by HANDOFF_RETRY_BACKOFF_MS (5 min) via deferHandoffRetry(). The heartbeat continues to write the honest "still on the old code" mtime, and the runner naturally re-attempts after the cooldown. (2) HAPI_DISABLE_VERSION_HANDOFF not honored by live runner (controlClient.ts:192, persistence.ts) The env var was only checked in the invoking CLI process. Under the documented systemd use case the env is set on the service unit but NOT on the operator's interactive shell - so a shell `hapi runner start` would still treat mtime drift as stale and kill the supervised runner during a rebuild. The exact regression this layer was built to prevent. Fix: capture HAPI_DISABLE_VERSION_HANDOFF at runner start time into state.startedWithVersionHandoffDisabled, persisted via the heartbeat. The controlClient mtime check now OR's the live env var with the persisted snapshot, so any caller honours the running runner's opt-out regardless of their own environment. Tests: cli typecheck clean; 14/14 runner unit tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>

heavygee · 2026-06-06T15:33:03Z

~~[superseded] This top-level comment was the wrong shape - this project responds to bot review findings on the threads themselves, not via standalone PR comments.~~

All four Codex Major findings are now addressed via thread replies (resolved):

run.ts stale-mtime - replied + resolved (a49fc57)
controlClient.ts HAPI_DISABLE_VERSION_HANDOFF persistence - replied + resolved (a49fc57)
run.ts:672 argv slice in compiled binary mode - replied + resolved (4a03f42)
run.ts:892 waitForRunnerHandoff race with child stopRunner() - replied + resolved (4a03f42)

Leaving this struck-through rather than deleting so the conversation thread record is preserved. Apologies for the off-protocol original.

github-actions

Findings

[Major] Raw argv capture drops the top-level command in compiled runner handoff — process.argv.slice(2) is not the normalized CLI argv this repo uses elsewhere; in compiled mode it persists ['start-sync', ...], so self-restart spawns hapi start-sync ... and resolveCommand() falls back to the default Claude command instead of runner start-sync. Evidence: cli/src/runner/run.ts:672.
Suggested fix:
```
import { getCliArgs } from '@/utils/cliArgs';

const startedWithArgv = getCliArgs();
const handoffArgv = startedWithArgv[0] === 'runner'
  ? startedWithArgv
  : ['runner', 'start-sync'];
```

[Major] Handoff wait is bypassed by replacement startup killing the old runner — the spawned replacement still enters startRunner(), sees the old runner as stale, and calls stopRunner() before it can acquire the lock/write new state. That means the old process can exit via /stop before waitForRunnerHandoff() confirms a different PID; if the replacement then fails before writing state, the machine still goes offline. Evidence: cli/src/runner/run.ts:892.
Suggested fix:

// Replacement startup: in explicit handoff mode, do not ask the child to stop the old runner.
const handoffFromPid = Number(process.env.HAPI_RUNNER_HANDOFF_FROM_PID);
if (!runningRunnerVersionMatches) {
  const state = await readRunnerState();
  if (!Number.isFinite(handoffFromPid) || state?.pid !== handoffFromPid) {
    await stopRunner();
  }
}

// Old runner stale-version path: only use waitForRunnerHandoff with a mode
// where the old runner owns shutdown/retry, not the replacement child.
const child = spawnHappyCLI(handoffArgv, {
  detached: true,
  stdio: 'ignore',
  env: { ...process.env, HAPI_RUNNER_HANDOFF_FROM_PID: String(process.pid) }
});
child.unref();

Questions

None.

Summary

Review mode: follow-up after new commits.
The previous bot findings were addressed directly, but the latest full diff still leaves self-restart broken for compiled binaries and does not actually guarantee the old runner stays alive until replacement state is written.

Testing

Not run (automation).

HAPI Bot

Two additional Major findings on the runner self-restart layer that were not addressed in a49fc57: 1. run.ts:672 - process.argv.slice(2) returns ['start-sync', ...] in compiled binary mode (raw argv is [hapi, runner, start-sync, ...]), so the handoff spawned `hapi start-sync ...` which resolveCommand treats as an unknown top-level and falls back to Claude. Replaced with getCliArgs() (the project's canonical argv normalizer) plus a defensive guard that falls back to ['runner', 'start-sync'] if the captured argv does not begin with 'runner'. 2. run.ts:892 - waitForRunnerHandoff did not actually keep the old runner alive. The child's startRunner() unconditionally called stopRunner() before acquiring the lock or writing its own state, so the parent's /stop handler resolved shutdown and exited BEFORE the child committed. If the child then failed (lock contention, auth error, anything between stopRunner and writeRunnerState), the machine went offline with no runner at all. New handoff protocol: - Parent sets HAPI_RUNNER_HANDOFF_FROM_PID=<pid> on the spawned child's env, then releases the lock BEFORE entering waitForRunnerHandoff (breaks the parent-holds-lock / child-needs-lock-to-write-state deadlock). - On wait-timeout the parent re-acquires the lock (long-retry, 30s) and defers retry; if re-acquire fails (third party took the lock) the parent exits cleanly so it does not stay alive without the lock invariant. - Child detects the env signal; if state.pid matches and that pid is alive, this is an authorized handoff: skip stopRunner(), skip the version-match early-exit, and acquire the lock with a longer retry window (60 attempts x 500ms) so it waits through the parent's asynchronous release. CLI typecheck clean. 14/14 runner unit tests still pass. The wider 46/664 failures in the CLI suite are pre-existing in this branch (unrelated: AppServerEventConverter, cursorEventConverter, hook server, Query) - baseline before this commit has 42+; my changes do not regress them. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions

Findings

None.

Summary

Review mode: follow-up after new commits.
No reportable issues found in the latest diff at >=80% confidence.
Residual risk: the new self-restart handoff and HAPI_DISABLE_VERSION_HANDOFF persistence paths do not appear to have automated coverage in the touched tests.

Testing

Not run (automation).

HAPI Bot

Postmortem context: tiann#814 #issuecomment-4639449666 (2026-06-06). Top-level PR comment was created instead of replying to the bot's review threads and resolving them. New canonical helper: scripts/tooling/hapi-pr-reply.sh (shim at ~/.local/bin/hapi-pr-reply) Atomically: 1. POST reply to the review thread containing <comment_id>. 2. Run resolveReviewThread GraphQL mutation on that thread. Looks up the GraphQL thread id by REST comment id, validates inputs, and aborts at the right step on any failure so you never end up with "reply succeeded but thread looks unresolved" drift. Companion doc: docs/tooling/pr-reply.md explains the helper, the underlying API calls (for when the helper cannot be used), and the Cursor beforeShellExecution hard-blocks that now refuse gh pr comment and git push origin on a PR with unresolved review threads (personal hooks live outside this repo and are not committed here). Co-authored-by: Cursor <cursoragent@cursor.com>

heavygee and others added 2 commits June 6, 2026 12:19

github-actions Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread cli/src/runner/run.ts Outdated

Comment thread cli/src/runner/controlClient.ts Outdated

heavygee mentioned this pull request Jun 6, 2026

bug(runner): mtime-driven self-restart kills systemd-supervised runners and loses argv #816

Open

github-actions Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread cli/src/runner/run.ts Outdated

Comment thread cli/src/runner/run.ts

github-actions Bot reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(runner): self-restart resilience under systemd / external process supervision#814

fix(runner): self-restart resilience under systemd / external process supervision#814
heavygee wants to merge 4 commits into
tiann:mainfrom
heavygee:fix/runner-systemd-self-restart-resilience

heavygee commented Jun 6, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

heavygee commented Jun 6, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

heavygee commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

1. HAPI_DISABLE_VERSION_HANDOFF=1 opt-out (commit 1)

2. Preserve argv across self-restart + verify handoff (commit 2)

Files

Compatibility

Tested

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

heavygee commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

heavygee commented Jun 6, 2026 •

edited

Loading

1. `HAPI_DISABLE_VERSION_HANDOFF=1` opt-out (commit 1)

heavygee commented Jun 6, 2026 •

edited

Loading