fix(hub,cli): four hub-restart-cascade cleanup bugs (#913 #914 #916 #919)#923
fix(hub,cli): four hub-restart-cascade cleanup bugs (#913 #914 #916 #919)#923heavygee wants to merge 5 commits into
Conversation
tiann#916 tiann#919) These four contained bugs were uncovered by a 2026-06-15 hub-restart incident where `hapi-restart-hub` SIGTERMed 23 cursor ACP sessions. Each fix lands independently of the architectural tiann#915 (hub-restart cascade-archive) and the hypothesis-pending tiann#917 (reopen creates dead session); audit-trail correctness and idempotency wins stand on their own. tiann#913 - cursor ACP fresh-session: cursorSessionId persistence race. Fresh ACP sessions could be SIGTERMed during the async `update-metadata` ACK round-trip, stranding the on-disk ACP store with no DB handle. Add `ApiSessionClient.flushMetadata()` and await it after `onSessionFoundWithProtocol` on the fresh-session branch. Resume-path pre-registration (PR tiann#834) is unchanged. tiann#914 - default archiveReason 'User terminated' polluted audit trail. Hub-restart-cascade SIGTERMs went through the same path as web-UI Archive clicks, both writing archiveReason='User terminated'. New default is 'Hub restart'; the KillSession RPC handler (the authoritative user-archive signal) now explicitly stamps 'User terminated' before cleanupAndExit. SIGINT (local-terminal Ctrl-C) keeps the 'User terminated' label too. tiann#916 - POST /api/sessions/:id/archive returned 500 when CLI is gone. `rpcGateway.killSession` threw a generic Error when no target socket was registered, and the archive route surfaced that as 500. Add typed `RpcTargetMissingError`, narrow on it in `syncEngine.archiveSession`, fall back to a hub-side `markSessionArchivedFromHub` write so lifecycleState still flips to 'archived'. Drop the requireActive guard on the route and 2xx-noop for already-archived rows. tiann#919 - sessionCache: three metadata writers threw on version-mismatch without refresh, producing forever-409 on rename/reopen until an unrelated event triggered a cache refresh. `renameSession`, `clearSessionArchiveMetadata`, `restoreSessionArchiveMetadata` now retry-with-refresh (5 attempts, then throw) mirroring the existing good pattern in `mergeSessions`. Refs tiann#913 Refs tiann#914 Refs tiann#916 Refs tiann#919 AI disclosure: implementation by Claude Sonnet 4.5 (Cursor agent peer) under operator supervision. Issue triage by a sibling discovery agent. Per CONTRIBUTING.md AI-assisted contributions policy. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Findings
- [Major] Runner-driven stop paths now get mislabeled as hub restarts — the new default makes every SIGTERM archive as
Hub restartunless a caller stamps a different reason first. Web Archive now does that throughregisterKillSessionHandler, buthapi runner stop-session, runner webhook timeout cleanup, and orphan cleanup terminate child sessions with SIGTERM directly (cli/src/runner/run.ts:267,cli/src/runner/run.ts:587). Those are operator/runner actions, not hub restarts, yet the changed SIGTERM handler keeps the new default atcli/src/agent/runnerLifecycle.ts:122, so archived metadata becomes misleading and the audit-trail fix regresses another supported termination path.
Suggested fix:// propagate an explicit reason from runner stop/cleanup paths, e.g. via env on spawn const archiveReason = process.env.HAPI_ARCHIVE_REASON if (archiveReason) { lifecycle.setArchiveReason(archiveReason) }
Summary
- Review mode: initial
- Found one audit-trail regression introduced by the SIGTERM default change. Remaining risk: I did not run the full test suite in this automation pass.
Testing
- Not run (automation)
HAPI Bot
| // production. If a future code path needs to distinguish "operator | ||
| // killed the host process" from "hub restart", it can call | ||
| // setArchiveReason() before the runner exits. | ||
| process.on('SIGTERM', () => { |
There was a problem hiding this comment.
[Major] Runner-driven stop paths now get mislabeled as hub restarts. The new default makes every SIGTERM archive as Hub restart unless a caller stamps a different reason first. Web Archive now does that through registerKillSessionHandler, but hapi runner stop-session, runner webhook timeout cleanup, and orphan cleanup terminate child sessions with SIGTERM directly (cli/src/runner/run.ts:267, cli/src/runner/run.ts:587). Those are operator/runner actions, not hub restarts, yet this SIGTERM handler keeps the new default, so archived metadata becomes misleading and the audit-trail fix regresses another supported termination path.
Suggested fix:
// propagate an explicit reason from runner stop/cleanup paths, e.g. via env on spawn
const archiveReason = process.env.HAPI_ARCHIVE_REASON
if (archiveReason) {
lifecycle.setArchiveReason(archiveReason)
}…archive reason Addresses bot review of tiann#923: with the tiann#914 default-archiveReason flip to 'Hub restart', runner-driven SIGTERM paths (`hapi runner stop-session`, webhook-timeout cleanup at run.ts:587, orphan-cleanup at run.ts:267) all mislabel as 'Hub restart' which is also inaccurate audit-trail noise. Smallest defensible change: parameterise the lifecycle default via HAPI_DEFAULT_ARCHIVE_REASON env, and have the runner set 'Stopped by runner' on spawn. Terminal-launched sessions (no runner parent, no env var) still default to 'Hub restart' since hub-restart cascade documented at tiann#915 is the most plausible SIGTERM source for those. Explicit overrides via setArchiveReason (KillSession RPC, SIGINT Ctrl-C, markCrash uncaught exception) still win. Two new unit tests cover the env-var default and the override precedence. Refs tiann#914. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Bot review (Codex PR Review run 27567766623) caught a real Major: with the #914 default flipped to Verified active paths (hub never calls
Pushed Out of scope for this PR (flagging for maintainer): |
There was a problem hiding this comment.
Findings
- [Major] Missing-CLI archive can report success even when the hub archive metadata write fails —
markSessionArchivedFromHubreturns silently onupdateSessionMetadataerrors or exhausted version retries, andarchiveSessionthen still marks the in-memory session ended and lets/archivereturn{ ok: true }. That regresses the stated #916 contract that non-RPC failures still surface as 5xx; after a DB/write failure the operator sees success while the persisted row can remain unarchived/running. Evidencehub/src/sync/sessionCache.ts:532, caller continues athub/src/sync/syncEngine.ts:445andhub/src/sync/syncEngine.ts:450.
Suggested fix:markSessionArchivedFromHub(sessionId: string, reason: string): void { // ... if (result.result === 'error') { throw new Error('Failed to archive session metadata from hub') } // ... throw new Error('Session was modified concurrently while archiving from hub') }
Summary
- Review mode: follow-up after new commits
- Found one archive idempotency/data-persistence issue. Remaining risk: I did not run the full test suite in this automation pass.
Testing
- Not run (automation)
HAPI Bot
|
|
||
| if (result.result === 'error') { | ||
| throw new Error('Failed to update session metadata') | ||
| if (result.result === 'error') { |
There was a problem hiding this comment.
[Major] Missing-CLI archive can report success even when the hub archive metadata write fails. This branch catches RpcTargetMissingError, calls markSessionArchivedFromHub, then always continues to handleSessionEnd, but this helper returns silently when updateSessionMetadata returns error and also returns silently after exhausting version retries. The route will still return { ok: true }, leaving the persisted lifecycle row unarchived while the operator sees success; that also contradicts the PR's intended behavior that non-RPC errors (DB write failure, etc.) still propagate as 5xx.
Suggested fix:
markSessionArchivedFromHub(sessionId: string, reason: string): void {
// ...
if (result.result === 'error') {
throw new Error('Failed to archive session metadata from hub')
}
// ...
throw new Error('Session was modified concurrently while archiving from hub')
}… 5xx Addresses second-round bot review of tiann#923 (Major): `markSessionArchivedFromHub` silently returned on DB write errors and on exhausted version-retry attempts, which would let `/archive` claim 200 OK while the row stayed unarchived. That regresses the tiann#916 acceptance criterion that non-RPC errors during archive must still propagate as 5xx. Both fall-through paths now throw, matching the contract of the sibling writers in this file (renameSession, mergeSessions). The sessionModel test suite gains two cases that spy on `store.sessions.updateSessionMetadata` to force `error` and `version-mismatch` shapes and asserts the helper throws. The existing route test at `hub/src/web/routes/sessions.test.ts:1015` already covers the route-level 500 propagation for any error thrown out of `archiveSession`, so no new route test is needed. Imports `spyOn` from `bun:test` to match this test file's runtime (the rest of the hub package uses bun:test, not vitest). Refs tiann#916. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Bot follow-up review on Pushed
|
There was a problem hiding this comment.
Findings
- [Major] Runner-spawned sessions are mislabeled on hub restart —
cli/src/runner/run.ts:468now injectsHAPI_DEFAULT_ARCHIVE_REASON='Stopped by runner'into every runner-spawned child. BecauserunnerLifecycleuses that env var as the child’s default archive reason, a systemd/hub restart that SIGTERMs the runner and its children directly still archives those child sessions as “Stopped by runner”, not “Hub restart”. That reintroduces audit-trail pollution for the same restart-cascade path this PR is trying to fix; the runner’s own SIGTERM handler only shuts the runner down and does not selectively restamp already-spawned child env before the external signal reaches them.
Suggested fix:// Keep the lifecycle default as Hub restart for externally delivered SIGTERM. // Stamp runner-owned stops at the kill sites instead of via static spawn env. const stopSession = (sessionId: string): boolean => { // before killProcessByChildProcess(...) void session.childProcess?.send?.({ type: 'set-archive-reason', reason: 'Stopped by runner' }) void killProcessByChildProcess(session.childProcess) return true }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous
markSessionArchivedFromHubpersistence failure finding is addressed, but the follow-up runner env override creates a new audit-attribution regression for hub-restart SIGTERM cascades.
Testing
- Not run (automation). Suggested coverage: runner-spawned child receiving plain SIGTERM without a stop-session request archives as
Hub restart; stop-session/webhook-timeout paths archive asStopped by runner.
HAPI Bot
| // runner. Genuine user actions (web Archive via KillSession | ||
| // RPC, terminal Ctrl-C via SIGINT) still override to | ||
| // 'User terminated' in runnerLifecycle. | ||
| HAPI_DEFAULT_ARCHIVE_REASON: 'Stopped by runner' |
There was a problem hiding this comment.
[Major] This static env override applies to every runner-spawned child, including children killed by the same external hub/systemd restart cascade this PR is trying to classify as Hub restart. runnerLifecycle reads HAPI_DEFAULT_ARCHIVE_REASON as the default, and the runner's own SIGTERM handler only shuts the runner down; it does not restamp already-spawned children before the external SIGTERM reaches them. Result: runner-spawned sessions in the restart cascade are archived as Stopped by runner, polluting the audit trail again.
Suggested fix:
// Keep the child default as Hub restart for externally delivered SIGTERM.
// Stamp runner-owned stops at the kill sites instead of via static spawn env.
const stopSession = (sessionId: string): boolean => {
// before killProcessByChildProcess(...)
void session.childProcess?.send?.({
type: 'set-archive-reason',
reason: 'Stopped by runner'
})
void killProcessByChildProcess(session.childProcess)
return true
}Reverts `1c8972a3`. Bot review round 3 surfaced that the env-on-spawn approach (the bot's own round-1 suggestion shape) mislabels hub-restart-cascade SIGTERMs against runner-spawned children: systemd killcgroup on `hapi-runner.service` stop sends SIGTERM to all runner-children directly, and those would archive as 'Stopped by runner' instead of 'Hub restart'. The two suggestions are mutually incompatible without adding an IPC channel (stdio: 'ipc' on spawn) so the runner can stamp setArchiveReason via childProcess.send() before SIGTERMing. That is a refactor, not a smallest-defensible change. Going back to the simple shape: SIGTERM default is 'Hub restart' for everyone, runner-internal stop paths share that label. The audit-trail-correctness criterion from the tiann#914 issue is met (SIGTERM no longer falsely labels as 'User terminated'). Finer attribution between cascade vs runner-stop is deferred as a follow-up. Refs tiann#914. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Bot review round 3 on `123c625e` flagged that `HAPI_DEFAULT_ARCHIVE_REASON='Stopped by runner'` on spawn mislabels hub-restart-cascade SIGTERMs against runner-spawned children: systemd cgroup-kill on `hapi-runner.service` stop hits runner-children directly with SIGTERM, and those now archive as 'Stopped by runner' instead of the 'Hub restart' that the cascade narrative wants. This is the mirror image of the bot's round-1 finding. The two suggestions are mutually incompatible without adding an IPC channel (`stdio: 'ipc'` on spawn + `childProcess.send()`) so the runner can stamp `setArchiveReason` via IPC before SIGTERMing - which is a refactor, not a smallest-defensible change. Pushed `a97b9dca` to revert the env-var override. The SIGTERM default is now plain 'Hub restart' for everyone, including runner-internal stop paths (`hapi runner stop-session`, webhook-timeout cleanup, orphan cleanup). The original #914 acceptance criterion - "Hub-restart cascade archives recorded with a reason that is NOT 'User terminated'" - is still met. Finer attribution between cascade vs runner-stop is deferred as a follow-up that needs the IPC channel. `bun typecheck` clean. `bun run test` 982/983 (same pre-existing `runner.integration.test.ts` flake on upstream/main). @tiann happy to add the IPC channel as a separate PR if you'd rather have the finer attribution in-flight; just wanted to keep this bundle scoped to the four issues it's titled for. |
There was a problem hiding this comment.
Findings
- [Major] Clean completions are archived as hub restarts — changing the lifecycle default to
Hub restartmeans every normal agent-loop completion that only callssetSessionEndReason('completed')now writesarchiveReason: 'Hub restart'duringarchiveAndClose. For example, Codex/Claude/Cursor/Gemini/Kimi/OpenCode all set the end reason tocompletedbeforecleanupAndExit, but none set an archive reason on that path, so successful exits pollute the audit trail as restart cascades. Evidence:cli/src/agent/runnerLifecycle.ts:46(changed default), with the write atcli/src/agent/runnerLifecycle.ts:59and completion call sites such ascli/src/codex/runCodex.ts:377.
Suggested fix:const setSessionEndReason = (reason: SessionEndReason) => { sessionEndReason = reason if (reason === 'completed' && archiveReason === 'Hub restart') { archiveReason = 'Session completed' } }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous runner env override finding is addressed, but the new lifecycle default still leaks into clean completion paths and misattributes successful exits as hub restarts.
Testing
- Not run (automation). Suggested coverage: a lifecycle cleanup after
setSessionEndReason('completed')writes a non-restart archive reason while SIGTERM/default cleanup still writesHub restart.
HAPI Bot
| // the runner can stamp `setArchiveReason` before SIGTERMing; tracked as | ||
| // a follow-up to keep this PR focussed on the user-action lie that | ||
| // motivated #914. | ||
| let archiveReason = 'Hub restart' |
There was a problem hiding this comment.
[Major] This changed default is also used by ordinary cleanup() / cleanupAndExit() calls, not only by the SIGTERM handler. The clean completion paths in the launchers set only sessionEndReason to completed before cleanup (for example runCodex.ts:377), and archiveAndClose persists whatever archiveReason currently holds. Result: successful agent exits now get archived with archiveReason: 'Hub restart', which pollutes the audit trail this PR is trying to correct.
Suggested fix:
const setSessionEndReason = (reason: SessionEndReason) => {
sessionEndReason = reason
if (reason === 'completed' && archiveReason === 'Hub restart') {
archiveReason = 'Session completed'
}
}Addresses bot review round 4 of tiann#923 (Major): every agent runner (runClaude, runCodex, runCursor, runGemini, runKimi, runOpencode) calls setSessionEndReason('completed') on the natural exit path without touching archiveReason. With the SIGTERM default flipped to 'Hub restart', clean completions were now archived as restart cascades. Fix: setSessionEndReason flips archiveReason to 'Session completed' when it transitions to 'completed' AND no caller has already overridden the archive reason. This covers all six agent runners with a single setter change (no per-runner edits). Two new tests cover the natural-completion default and the override precedence (explicit setArchiveReason still wins). Refs tiann#914. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Round 4 bot finding is legitimate and was pre-existing — even before this PR, clean completions wrote `archiveReason: 'User terminated'` (the OLD default), which was also wrong. My default-flip just changed the noise from one wrong label to another. Worth fixing while we're here. Verified all six agent runners call `setSessionEndReason('completed')` before `cleanupAndExit()` on the natural exit path without setting an archive reason: `runClaude.ts:461`, `runCodex.ts:377`, `runCursor.ts:203`, `runGemini.ts:195`, `runKimi.ts:185`, `runOpencode.ts:308`. Pushed `98b10319`. Applied the bot's suggested shape: the setter now flips `archiveReason` to 'Session completed' when transitioning to 'completed' AND no caller has overridden first. One change covers all six runners. Two new tests cover the natural-completion default and the override-precedence guard (explicit `setArchiveReason('User terminated')` before `setSessionEndReason('completed')` still wins). `bun typecheck` clean, 984/985 (same pre-existing `runner.integration.test.ts` flake). |
There was a problem hiding this comment.
Findings
- No issues found.
Questions
- None.
Summary
- Review mode: follow-up after new commits
- No high-confidence correctness, security, data-loss, performance, or maintainability issues found in the current diff. The previous bot finding about clean completions being archived as hub restarts is addressed in this head.
Testing
- Not run (automation).
HAPI Bot
Summary
Four contained bugs uncovered by the 2026-06-15 hub-restart cascade incident
where
systemctl restart hapi-hub.serviceSIGTERMed 23 cursor ACP sessions.Each fix is small and lands independently of the architectural #915
(cascade-archive root cause) and the hypothesis-pending #917. They produce
audit-trail correctness and idempotency wins that stand on their own.
Refs #913, #914, #916, #919.
#913 - cursor ACP fresh-session
cursorSessionIdpersistence raceFresh ACP sessions could be SIGTERMed mid-flight in the async
update-metadataACK round-trip, stranding the on-disk ACP store with no DBhandle. Distinct from #834 (which fixed the resume path via
pre-registration before
session/load).Fix: add
ApiSessionClient.flushMetadata()(drains the metadata lock,5s timeout) and
awaitit afteronSessionFoundWithProtocolon thefresh-session branch in
cursorAcpRemoteLauncher.ts. The resume-pathpre-registration shape stays intact. A 5s no-ACK timeout logs a warning so
the operator has a diagnostic if hub backpressure ever drops the write.
#914 - default
archiveReasonpolluted audit trail with "User terminated"runnerLifecycle.tsdefaultedarchiveReason = 'User terminated'. SIGTERMfrom a hub-restart cascade hit the same cleanup path that the web-UI
Archive click did, both recording "User terminated" — but only one was
intentional.
Fix:
runnerLifecycle.tsflipped to'Hub restart'.registerKillSessionHandler(the hub-driven kill-RPC, sent only onweb-UI Archive) now calls
setArchiveReason('User terminated')beforecleanupAndExit.SIGINT(local-terminal Ctrl-C) keeps'User terminated'.runClaude,runCodex,runCursor,runGemini,runKimi,runOpencode) updated to pass the lifecycle object instead ofthe bare closure. Legacy closure shape still accepted for
back-compat (
runAgentSession.ts).#916 -
POST /api/sessions/:id/archivereturned 500 when CLI is goneArchive route called
engine.archiveSession→rpcGateway.killSession→sessionRpc, which threw a genericErrorwhen no target socket wasregistered. The unhandled throw became 500. Operator clicking Archive on
split-brain zombies (after a cascade) saw nothing but a toast.
Fix:
RpcTargetMissingError(code:handler-not-registeredorsocket-disconnected) thrown fromrpcGateway.rpcCall.syncEngine.archiveSessionnarrows on it: still flipslifecycleStateto
archivedvia a newsessionCache.markSessionArchivedFromHubhelper(writes
archivedBy='hub',archiveReason='Archived from hub (CLI unreachable)'),then
handleSessionEndmarks inactive in cache.requireActive: trueand short-circuits 2xx with{ ok: true, alreadyArchived: true }for already-archived rows.#919 - sessionCache: three metadata writers threw on version-mismatch without refresh
renameSession,clearSessionArchiveMetadata, andrestoreSessionArchiveMetadatafollowed a one-shot-throw pattern onversion-mismatch. The cache stayed stale, so the caller's retry hit thesame mismatch → forever-409 on
PATCH /sessions/:id,POST /sessions/:id/reopen, and the reopen rollback path. The codebasealready had the good pattern:
sessionCache.mergeSessionsretries with afresh snapshot.
Fix: rewrite all three helpers as
for (attempt = 0; attempt < 5; …)retry-with-refresh loops. After 5 attempts of genuine concurrent contention
they still throw the original 409 message, so the existing operator-facing
contract is preserved.
§6 gates
bun typecheck— passes on the branch againstupstream/main@93d00414.bun run test— 982/983 tests pass. The single failure is thepre-existing
runner.integration.test.ts > should detect version mismatch and kill old runnerflake, confirmed to fail on upstream/main before thebranch too (
expected '0.0.0-integration-test-...' to be '0.20.2'—version-substitution issue unrelated to this diff).
cli/src/agent/runnerLifecycle.test.ts— default reason, override,crash override.
cli/src/claude/registerKillSessionHandler.test.ts— RPC stamps'User terminated'before cleanup; back-compat with bare closure.cli/src/cursor/cursorAcpRemoteLauncher.test.ts— fresh-pathflushMetadataordering; resume-path fix(cursor): pre-register cursorSessionId on resume before ACP session/load completes #834 shape preserved.hub/src/sync/rpcGateway.test.ts—RpcTargetMissingErrortypedthrown for both no-target cases.
hub/src/sync/sessionModel.test.ts—markSessionArchivedFromHubidempotency + self-heal; three sessionCache: renameSession / clearSessionArchiveMetadata / restoreSessionArchiveMetadata throw on version-mismatch without refresh; forever-409 until out-of-band cache refresh #919 self-heal tests for the metadata
writers.
hub/src/web/routes/sessions.test.ts—/archiveroute: active 2xx,already-archived 2xx idempotent, missing-CLI 2xx (via engine mock),
DB-failure 500, unknown id 404.
driver/integration; 529/529 hub tests + 409/409 relevant cli testspass on the integrated soup build.
Coordination notes
mergeSessionDatacascade-delete) is handed off to PR feat(web,hub): scratchlist v2 - hub sync via typed table + session-updated piggyback #896 peer.This branch touches
sessionCache.tsbut in different methods(
renameSession,clearSessionArchiveMetadata,restoreSessionArchiveMetadata, newmarkSessionArchivedFromHub) — thefeat(web,hub): scratchlist v2 - hub sync via typed table + session-updated piggyback #896 peer's
mergeSessionDatawork is untouched.session) intentionally NOT addressed in this PR; if architecture: hub restart cascade-archives every idle agent session; sessions should be resilient to hub bouncing #915 lands later and
reshapes the cascade behavior, the four fixes here still stand
independently because their wins are audit-trail correctness and
idempotency, not preventing the cascade.
AI disclosure
Per
CONTRIBUTING.md: implementation by Claude Sonnet 4.5 (Cursor IDEagent, auto-model) acting as a feature peer in a multi-agent HAPI fork
workflow. Issue triage and codepath verification were done by a sibling
discovery agent (
c8ac5e86). Operator supervised; tests + cold review +soup verification gates passed before raising the PR. The code follows the
operator's friction-mode disposition: smallest defensible change per bug.
Made with Cursor