bug(runner): mtime-driven self-restart kills systemd-supervised runners and loses argv

## Problem

The runner's heartbeat in `cli/src/runner/run.ts` triggers a self-restart whenever `getInstalledCliMtimeMs()` differs from `startedWithCliMtimeMs`. The same mtime guard fires in `controlClient.isRunnerRunningCurrentlyInstalledHappyVersion`. This is the right behavior for the npm consumer path - zero-downtime upgrade when a new version lands on disk.
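A minimal sketch of the comparison described above, to make the trigger concrete. The function name `cliMtimeDrifted` is hypothetical; the real check lives in `cli/src/runner/run.ts` and may be structured differently.

```typescript
// Hypothetical helper: returns true when the heartbeat would treat the
// installed CLI as "changed" and start a self-restart.
function cliMtimeDrifted(
  startedWithCliMtimeMs: number,
  installedCliMtimeMs: number,
): boolean {
  // Any difference counts, including a rebuild that rewrites identical bytes:
  // the comparison sees only timestamps, not versions or file contents.
  return installedCliMtimeMs !== startedWithCliMtimeMs;
}
```

Because the guard looks only at mtimes, it cannot tell a real npm upgrade from a benign touch of the same files.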

It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons **unrelated to an actual npm upgrade** - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit at the same time:

### 1. systemd cannot restart the runner

The runner spawns `hapi runner start` and then unconditionally calls `process.exit(0)`. systemd's `Restart=on-failure` does **not** restart a service after a clean exit, so the machine drops off the hub until someone runs `systemctl restart` manually.

### 2. argv is silently lost across the handoff

The replacement runner is spawned with **no argv**. Anything the operator passed at the original invocation (`--workspace-root`, `--port`, etc.) is dropped. Browse + spawn degrade to "no workspace roots" without any warning.

In practice these compound: even after patching the unit to restart on clean exit, the new runner comes up with the wrong configuration.
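The argv loss can be illustrated with a small sketch. It assumes the usual Node layout `[runtime, script, ...rest]` for `process.argv`; the helper names and the example script path are hypothetical and not taken from the HAPI source.

```typescript
// Hypothetical helper: capture everything after the runtime and script path.
// Assumes process.argv is laid out as [runtime, script, ...rest].
function snapshotArgv(processArgv: readonly string[]): string[] {
  return processArgv.slice(2);
}

// Current behavior as described in this issue: the replacement is spawned
// as a bare `hapi runner start`, so operator flags are not carried over.
function replacementArgsWithoutReplay(): string[] {
  return ["runner", "start"];
}

// Preserving behavior: replay the snapshot taken at startup.
function replacementArgsWithReplay(snapshot: readonly string[]): string[] {
  return [...snapshot];
}

// Example invocation (the path is illustrative only).
const exampleArgv = [
  "node",
  "/usr/bin/hapi",
  "runner",
  "start",
  "--workspace-root=/some/path",
];
const exampleSnapshot = snapshotArgv(exampleArgv);
```

With the example above, the replayed arguments still contain `--workspace-root=/some/path`, while the bare spawn does not.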

## Repro

1. Run a HAPI runner under systemd with `Restart=on-failure`. (`Restart=always` also reproduces the argv loss; the bug is broader than this one unit setting.)
2. Pass `--workspace-root=/some/path` at launch.
3. While the runner is alive, touch any file under the installed CLI tree (e.g. via a local rebuild that updates dist mtimes without a real version bump).
4. Wait ~10 s for the heartbeat to fire.

Observed:
- Runner logs the version-handoff path and exits 0.
- systemd does not restart it under `Restart=on-failure`.
- Even if it does restart (manually or under `Restart=always`), the new runner has no workspace roots and the machine appears empty in the hub UI.

Expected:
- Runner either stays alive (when external supervision is in play and the mtime drift is benign), or properly hands off with argv preserved and the old runner only exits after confirming the new one is live.

## Fix

PR #814 proposes a two-commit fix:

1. **`HAPI_DISABLE_VERSION_HANDOFF=1` opt-out.** When set in the runner's environment, both mtime/version drift checks are skipped. Heartbeat continues normally (session pruning, state-file persistence still run). Default behavior is unchanged for npm consumers.
2. **Preserve argv + verify handoff.** Even with version-handoff enabled, snapshot `process.argv.slice(2)` at startup, persist it, and replay it as the new runner's argv. A new `waitForRunnerHandoff(oldPid, {timeoutMs})` polls `runner.state.json` for a different live PID; the old runner calls `process.exit(0)` only after the handoff is confirmed. On spawn failure or a 30 s timeout, it refreshes the mtime baseline and stays alive.
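The two commits can be sketched roughly as follows. This is not the code from the PR: the `pid` field name in `runner.state.json`, the injected `readState` / `isAlive` helpers, the `pollMs` option, and the `Sketch` suffix are all assumptions made so the example is self-contained.

```typescript
// Assumed shape of runner.state.json; only the PID matters here.
interface RunnerState {
  pid: number;
}

// Dependencies are injected so the sketch does not touch the real filesystem.
interface HandoffDeps {
  readState: () => RunnerState | null; // e.g. parse runner.state.json
  isAlive: (pid: number) => boolean; // e.g. a process.kill(pid, 0) probe
  timeoutMs: number;
  pollMs: number; // hypothetical polling interval
}

// Commit 1: drift checks are skipped when the opt-out is set to "1".
function versionHandoffEnabled(
  env: Record<string, string | undefined>,
): boolean {
  return env["HAPI_DISABLE_VERSION_HANDOFF"] !== "1";
}

// Commit 2: a handoff counts as confirmed only when the state file names
// a different PID and that process is alive.
function handoffConfirmed(
  state: RunnerState | null,
  oldPid: number,
  isAlive: (pid: number) => boolean,
): boolean {
  return state !== null && state.pid !== oldPid && isAlive(state.pid);
}

// Polls until the handoff is confirmed or the timeout elapses.
async function waitForRunnerHandoffSketch(
  oldPid: number,
  deps: HandoffDeps,
): Promise<boolean> {
  const deadline = Date.now() + deps.timeoutMs;
  while (Date.now() < deadline) {
    if (handoffConfirmed(deps.readState(), oldPid, deps.isAlive)) {
      return true; // caller may now exit 0
    }
    await new Promise<void>((resolve) => setTimeout(resolve, deps.pollMs));
  }
  return false; // caller refreshes the mtime baseline and stays alive
}
```

In this sketch the old runner would exit only when `waitForRunnerHandoffSketch` resolves to `true`; a `false` result corresponds to the stay-alive path described in the list above.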

This has been running in production on a systemd-supervised runner for ~6 days with `HAPI_DISABLE_VERSION_HANDOFF=1` in the unit file, with no self-kills since.

Linked in PR #814.
