Problem
The runner's heartbeat in cli/src/runner/run.ts triggers a self-restart whenever getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion. This is the right behavior for the npm consumer path - zero-downtime upgrade when a new version lands on disk.
It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons unrelated to an actual npm upgrade - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit at the same time:
1. systemd cannot restart the runner
The runner spawns hapi runner start and then unconditionally calls process.exit(0). systemd's Restart=on-failure does not restart a service after a clean exit, so the machine drops off the hub until someone runs systemctl restart manually.
2. argv is silently lost across the handoff
The replacement runner is spawned with no argv. Anything the operator passed at the original invocation (--workspace-root, --port, etc.) is dropped. Browse + spawn degrade to "no workspace roots" without any warning.
In practice these compound: even after patching the unit to restart on clean exit, the new runner comes up with the wrong configuration.
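To make the two failure modes concrete, here is a minimal model of the behavior described above. It is a sketch, not the code in cli/src/runner/run.ts: planHandoffOnDrift and HandoffPlan are hypothetical names, and the only facts taken from this report are the mtime comparison, the hapi runner start respawn with no argv, and the exit code 0.

```typescript
// Hypothetical model of the reported behavior; not the actual code in run.ts.
interface HandoffPlan {
  command: string;
  args: string[];
  exitCode: number;
}

// Returns the handoff the runner would perform, or null when the mtimes match.
function planHandoffOnDrift(
  startedWithCliMtimeMs: number,
  installedCliMtimeMs: number,
  originalArgv: string[],
): HandoffPlan | null {
  if (installedCliMtimeMs === startedWithCliMtimeMs) {
    return null; // no drift: keep running
  }
  // Any mtime change is treated as an upgrade, even a rebuild of the same version.
  // originalArgv is deliberately ignored here: that is failure mode 2.
  void originalArgv;
  return {
    command: "hapi",
    args: ["runner", "start"], // operator flags such as --workspace-root are gone
    exitCode: 0, // clean exit, so Restart=on-failure never fires (failure mode 1)
  };
}
```

Calling planHandoffOnDrift(1000, 2000, ["--workspace-root=/some/path"]) yields a plan whose args contain no workspace root and whose exit code is 0, which is the combination a supervisor cannot recover from on its own.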
Repro
- Run a HAPI runner under systemd with Restart=on-failure (or Restart=always, which is fine too - the bug is broader than this one unit).
- Pass --workspace-root=/some/path at launch.
- While the runner is alive, touch any file under the installed CLI tree (e.g. via a local rebuild that updates dist mtimes without a real version bump).
- Wait ~10 s for the heartbeat to fire.
Observed:
- Runner logs the version-handoff path and exits 0.
- systemd does not restart it under Restart=on-failure.
- Even if it does restart (manually or under Restart=always), the new runner has no workspace roots and the machine appears empty in the hub UI.
Expected:
- Runner either stays alive (when external supervision is in play and the mtime drift is benign), or properly hands off with argv preserved and the old runner only exits after confirming the new one is live.
Fix
PR #814 proposes a two-commit fix:
- HAPI_DISABLE_VERSION_HANDOFF=1 opt-out. When set in the runner's environment, both mtime/version drift checks are skipped. The heartbeat otherwise continues normally (session pruning and state-file persistence still run). Default behavior is unchanged for npm consumers.
- Preserve argv + verify handoff. Even with version handoff enabled, snapshot process.argv.slice(2) at startup, persist it, and replay it as the new runner's argv. A new waitForRunnerHandoff(oldPid, {timeoutMs}) polls runner.state.json for a different live PID; the old runner calls process.exit(0) only after the handoff is confirmed. On spawn failure or a 30 s timeout, it refreshes the mtime baseline and stays alive.
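The second commit's handoff check could look roughly like the sketch below. It is not the code in PR #814. The assumptions are that runner.state.json is JSON with a numeric pid field, and that the state file path and poll interval are passed in as options (statePath and pollMs are hypothetical; the report only names timeoutMs). isPidAlive is an illustrative helper.

```typescript
import { readFileSync } from "node:fs";

// Assumption: the state file is JSON carrying the owning runner's pid.
interface RunnerState {
  pid?: number;
}

// Snapshot the operator's flags once at startup so a replacement can replay them.
const startupArgv: string[] = process.argv.slice(2);

// Signal 0 performs an existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Resolves true once the state file names a different, live PID; false on timeout.
async function waitForRunnerHandoff(
  oldPid: number,
  opts: { timeoutMs: number; statePath: string; pollMs?: number },
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  const pollMs = opts.pollMs ?? 250;
  while (Date.now() < deadline) {
    try {
      const state = JSON.parse(readFileSync(opts.statePath, "utf8")) as RunnerState;
      if (typeof state.pid === "number" && state.pid !== oldPid && isPidAlive(state.pid)) {
        return true;
      }
    } catch {
      // State file missing or mid-write: retry on the next poll.
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false;
}
```

In this sketch the old runner would call process.exit(0) only when the promise resolves to true. On false it would refresh its mtime baseline and keep serving, so a failed spawn never leaves the machine without a runner.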
This has been running in production on a systemd-supervised runner for ~6 days with HAPI_DISABLE_VERSION_HANDOFF=1 in the unit file, with no self-kills since.
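For readers who want to see how the opt-out interacts with the heartbeat, a hedged sketch follows. HeartbeatDeps and heartbeatTick are hypothetical names, the callbacks stand in for the runner's real work, and the ordering of the drift check relative to maintenance is an assumption. Only the environment variable name and the "maintenance still runs" behavior come from this report.

```typescript
// Hypothetical heartbeat tick; the callbacks stand in for the runner's real work.
interface HeartbeatDeps {
  env: Record<string, string | undefined>;
  hasCliDrifted: () => boolean; // e.g. installed mtime differs from the startup mtime
  handOff: () => void;
  pruneSessions: () => void;
  persistState: () => void;
}

function heartbeatTick(deps: HeartbeatDeps): void {
  const handoffDisabled = deps.env.HAPI_DISABLE_VERSION_HANDOFF === "1";
  // With the opt-out set, the drift check is skipped entirely.
  if (!handoffDisabled && deps.hasCliDrifted()) {
    deps.handOff();
    return;
  }
  // Regular maintenance runs regardless of the opt-out.
  deps.pruneSessions();
  deps.persistState();
}
```

With HAPI_DISABLE_VERSION_HANDOFF unset, the same tick still hands off on drift, which keeps the default npm upgrade path unchanged.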
Linked in PR #814.