bug(runner): mtime-driven self-restart kills systemd-supervised runners and loses argv #816

@heavygee

Problem

The runner's heartbeat in cli/src/runner/run.ts triggers a self-restart whenever getInstalledCliMtimeMs() differs from startedWithCliMtimeMs. The same mtime guard fires in controlClient.isRunnerRunningCurrentlyInstalledHappyVersion. This is the right behavior for the npm consumer path - zero-downtime upgrade when a new version lands on disk.
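To make the trigger concrete, here is a minimal sketch of an mtime comparison of this kind. It is not the code from run.ts: readMtimeMs and hasMtimeDrift are hypothetical helpers, and the file that the real getInstalledCliMtimeMs() inspects is not shown in this issue. The demo illustrates why a touch or rebuild looks like an upgrade: the bytes are identical but the mtime moves.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for getInstalledCliMtimeMs(); the real function's
// target file is an unknown here.
function readMtimeMs(file: string): number | undefined {
  try {
    return fs.statSync(file).mtimeMs;
  } catch {
    return undefined; // missing file: treat as "unknown", not as drift
  }
}

// True when the on-disk mtime differs from the value recorded at startup.
function hasMtimeDrift(
  startedWith: number | undefined,
  current: number | undefined,
): boolean {
  if (startedWith === undefined || current === undefined) return false;
  return startedWith !== current;
}

// Demo: touching a file shifts its mtime although its content is unchanged.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "mtime-drift-"));
const entry = path.join(dir, "index.js");
fs.writeFileSync(entry, "// same bytes\n");
const startedWithCliMtimeMs = readMtimeMs(entry);

const later = new Date(Date.now() + 60_000);
fs.utimesSync(entry, later, later); // roughly what `touch` or a rebuild does

const driftAfterTouch = hasMtimeDrift(startedWithCliMtimeMs, readMtimeMs(entry));
console.log({ driftAfterTouch });
```

A check like this cannot tell a real version bump from a benign mtime change, which is the root of both failure modes below.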

It breaks badly when the runner is owned by an external process supervisor (systemd, tmux, supervisord, etc.) and source-file mtimes shift for reasons unrelated to an actual npm upgrade - e.g. local builds, file syncs, custom rebuild pipelines. Two failure modes hit at the same time:

1. systemd cannot restart the runner

The runner spawns hapi runner start and then unconditionally calls process.exit(0). systemd's Restart=on-failure does not restart a unit after a clean exit, so the machine drops off the hub until someone runs systemctl restart manually.

2. argv is silently lost across the handoff

The replacement runner is spawned with no argv. Anything the operator passed at the original invocation (--workspace-root, --port, etc.) is dropped. Browse + spawn degrade to "no workspace roots" without any warning.
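The difference between a bare respawn and one that replays the original arguments can be sketched as follows. This is an illustration only: snapshotArgv, persistArgv, loadArgv, the JSON file layout, and the /usr/bin/hapi path in the demo are all assumptions, not the runner's actual code or on-disk format.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed: the fixed subcommand the replacement is spawned with today.
const BARE_COMMAND: readonly string[] = ["runner", "start"];

// Snapshot everything after the node binary and the script path.
function snapshotArgv(argv: readonly string[] = process.argv): string[] {
  return argv.slice(2);
}

// Persist the snapshot so a later handoff can read it back.
function persistArgv(file: string, args: readonly string[]): void {
  fs.writeFileSync(file, JSON.stringify({ argv: args }), "utf8");
}

// Load the saved argv; fall back to the bare command if nothing usable exists.
function loadArgv(file: string): string[] {
  try {
    const parsed = JSON.parse(fs.readFileSync(file, "utf8")) as { argv?: unknown };
    if (Array.isArray(parsed.argv) && parsed.argv.every((a) => typeof a === "string")) {
      return parsed.argv as string[];
    }
  } catch {
    // missing or corrupt file: fall through to the bare command
  }
  return [...BARE_COMMAND];
}

// Demo with a made-up invocation (the binary path is hypothetical).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "argv-replay-"));
const demoFile = path.join(demoDir, "runner.argv.json");
const original = snapshotArgv([
  "node", "/usr/bin/hapi", "runner", "start", "--workspace-root=/some/path",
]);
persistArgv(demoFile, original);
const replayed = loadArgv(demoFile);
console.log({ bare: BARE_COMMAND, replayed });
```

Spawning with the bare command is what drops --workspace-root; spawning with the replayed array keeps the operator's flags.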

In practice these compound: even after patching the unit to restart on clean exit, the new runner comes up with the wrong configuration.

Repro

  1. Run a HAPI runner under systemd with Restart=on-failure (Restart=always also works for this repro; the bug is broader than this one setting).
  2. Pass --workspace-root=/some/path at launch.
  3. While the runner is alive, touch any file under the installed CLI tree (e.g. via a local rebuild that updates dist mtimes without a real version bump).
  4. Wait ~10 s for the heartbeat to fire.

Observed:

  • Runner logs the version-handoff path and exits 0.
  • systemd does not restart it under Restart=on-failure.
  • Even if it does restart (manually or under Restart=always), the new runner has no workspace roots and the machine appears empty in the hub UI.

Expected:

  • Runner either stays alive (when external supervision is in play and the mtime drift is benign), or properly hands off with argv preserved and the old runner only exits after confirming the new one is live.

Fix

PR #814 proposes a two-commit fix:

  1. HAPI_DISABLE_VERSION_HANDOFF=1 opt-out. When set in the runner's environment, both mtime/version drift checks are skipped. Heartbeat continues normally (session pruning, state-file persistence still run). Default behavior is unchanged for npm consumers.
  2. Preserve argv + verify handoff. Even with version-handoff enabled, the runner snapshots process.argv.slice(2) at startup, persists it, and replays it as the new runner's argv. A new waitForRunnerHandoff(oldPid, {timeoutMs}) polls runner.state.json for a different live PID; the old runner calls process.exit(0) only after the handoff is confirmed. On spawn failure or after a 30 s timeout, it refreshes the mtime baseline and stays alive.
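The two items above can be sketched roughly as follows. This is not the code from PR #814: the shape of runner.state.json (a numeric pid field), the stateFile and pollMs options, and the helpers isPidAlive and readStatePid are assumptions made for illustration. Only the names HAPI_DISABLE_VERSION_HANDOFF and waitForRunnerHandoff and the timeoutMs option come from the description above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed shape of runner.state.json; only the pid field matters here.
interface RunnerState {
  pid?: number;
}

// Item 1: opt-out switch read from the runner's environment.
function isVersionHandoffDisabled(env: NodeJS.ProcessEnv = process.env): boolean {
  return env.HAPI_DISABLE_VERSION_HANDOFF === "1";
}

// Signal 0 sends nothing; it only checks whether the process exists.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

function readStatePid(stateFile: string): number | undefined {
  try {
    const state = JSON.parse(fs.readFileSync(stateFile, "utf8")) as RunnerState;
    return typeof state.pid === "number" ? state.pid : undefined;
  } catch {
    return undefined; // missing or half-written file: keep polling
  }
}

// Item 2: resolve true once the state file names a different, live PID;
// resolve false when the timeout elapses first.
async function waitForRunnerHandoff(
  oldPid: number,
  opts: { stateFile: string; timeoutMs: number; pollMs?: number },
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  const pollMs = opts.pollMs ?? 250;
  for (;;) {
    const pid = readStatePid(opts.stateFile);
    if (pid !== undefined && pid !== oldPid && isPidAlive(pid)) return true;
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}

// Intended use inside the heartbeat (comment only, nothing is spawned here):
//   if (isVersionHandoffDisabled()) return;            // skip drift checks
//   const ok = await waitForRunnerHandoff(process.pid, { stateFile, timeoutMs: 30_000 });
//   if (ok) process.exit(0); else refresh the mtime baseline and stay alive
```

Returning false instead of throwing keeps the "stay alive" path simple: the caller only exits on a confirmed handoff and treats everything else as a reason to keep running.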

This has been running in production on a systemd-supervised runner for ~6 days with HAPI_DISABLE_VERSION_HANDOFF=1 in the unit file, with no self-kills since.

Linked in PR #814.
