fix: use completed state for successful clusters (visible to heroshot) #416

Open
EivMeyer wants to merge 10 commits into dev from fix/completed-state-lifecycle
@EivMeyer

Collaborator

Summary

  • Successful --ship clusters now transition to state=completed instead of state=killed
  • Completed clusters remain in zeroshot list so heroshot can detect success
  • Prevents heroshot from marking shipped items as FAILED (which blocked the entire downstream DAG)

Root Cause

When a --ship cluster finished its work (PR merged, issue closed), zeroshot set state=killed and removed the cluster from memory and disk, which made it invisible to zeroshot list. Heroshot's markShippedFromCompletedClusters() checks for state=completed, a state that was never set, so that check was dead code. As a result, heroshot respawned the item 3 times, marked it FAILED, and blocked all of its dependents.

Changes

  • src/orchestrator.js: stop() uses completed state when completedSuccessfully=true
  • src/orchestrator.js: _saveClusters() persists completed clusters (not deleted)
  • src/orchestrator.js: resume() guards against resuming completed clusters
  • tests/unit/worktree-auto-cleanup.test.js: Updated state assertions
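
The three orchestrator changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in src/orchestrator.js: the helper names (`nextStateOnStop`, `clustersToPersist`, `assertResumable`) and the cluster object shape are hypothetical, and the `stopped` fallback is an assumption based on the state the tests expected before this change. Only the `completed` and `killed` state names and the `completedSuccessfully` flag come from this PR.

```javascript
// Hypothetical sketch: names and object shapes are assumptions,
// not the real orchestrator API.

// Pick the terminal state that stop() records.
function nextStateOnStop({ completedSuccessfully = false } = {}) {
  return completedSuccessfully ? 'completed' : 'stopped';
}

// Decide which clusters are written to disk. Killed clusters are dropped;
// completed ones are kept so that `zeroshot list` can still report them.
function clustersToPersist(clusters) {
  return clusters.filter((cluster) => cluster.state !== 'killed');
}

// Guard run before resuming: a completed cluster has nothing left to resume.
function assertResumable(cluster) {
  if (cluster.state === 'completed') {
    throw new Error(`Cluster ${cluster.id} already completed; nothing to resume`);
  }
}
```

With this shape, a consumer such as heroshot can treat `completed` as the success signal, while `killed` clusters still disappear from the persisted list.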

Test plan

  • 8/8 worktree auto-cleanup tests pass
  • 22/22 heroshot supervisor tests pass (including new regression test)

🤖 Generated with Claude Code

EivMeyer and others added 6 commits February 16, 2026 16:22
Sonnet validators produced false positives on codebase conventions
(ESM .js imports flagged as bugs, correct routing targets flagged as
inconsistent). Opus has better nuance for convention-aware validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mpletion

Worktrees were preserved indefinitely after runs completed, even with
--ship (PR merged). stop() always kept worktrees for resume capability,
but resume makes no sense after a successful PR merge.

Now CLUSTER_COMPLETE passes { completedSuccessfully: true } to stop().
When completedSuccessfully + autoPr + worktree are all true, stop()
removes the worktree (kill behavior) instead of preserving it. Failed
runs and user-initiated stops still preserve worktrees for resume.
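
The cleanup rule described above amounts to a three-way conjunction. A minimal sketch, assuming a hypothetical helper `shouldRemoveWorktree` and an options object whose field names mirror the flags named in this commit message (the real stop() signature may differ):

```javascript
// Hypothetical helper: the worktree is removed only when the run completed
// successfully AND a PR was shipped automatically AND a worktree was in use.
// Failed runs and user-initiated stops fall through to "preserve".
function shouldRemoveWorktree({ completedSuccessfully, autoPr, worktree } = {}) {
  return Boolean(completedSuccessfully && autoPr && worktree);
}

// A user-initiated stop passes no completedSuccessfully flag, so the
// worktree is kept for a later resume.
const keepForResume = !shouldRemoveWorktree({ autoPr: true, worktree: true });
```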

Also adds _teardownWorktreeCompose() to free host ports on stop, and
replaces execSync with safe-exec to comply with ESLint rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes from flying-jungle-51 postmortem:

1. verify_github_pr hook: When git-pusher structured output has no
   pr_number/pr_url (agent failed to create PR), throw clear error
   instead of falling through to `gh pr view` which picks up
   unrelated open PRs and reports misleading "Agent LIED" error.

2. isolation-manager: Run configurable setup command after worktree
   creation. Reads `worktree.setup` from .zeroshot/settings.json
   (e.g., "npm ci") so all agents have dependencies available.

Also includes previously uncommitted polling logic for GitHub API
eventual consistency (gentle-hydra-56 regression).
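
The first fix above (failing fast when the structured output carries no PR reference) could look roughly like this. The helper name `requirePrReference` and the error wording are hypothetical; only the `pr_number` and `pr_url` field names come from this commit message.

```javascript
// Hypothetical helper: return a PR reference from the agent's structured
// output, or throw a clear error instead of falling back to a lookup that
// could match an unrelated open PR.
function requirePrReference(output) {
  const prNumber = output?.pr_number;
  const prUrl = output?.pr_url;
  if (prNumber == null && !prUrl) {
    throw new Error(
      'git-pusher output has no pr_number or pr_url: the agent did not create a PR'
    );
  }
  // Prefer the concrete number when both are present.
  return prNumber != null ? String(prNumber) : prUrl;
}
```

Throwing here keeps the failure attributable to the missing PR rather than to whatever PR a later lookup happens to find.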

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Planning requires deeper reasoning about architecture and approach.
Level2 (Sonnet) is sufficient for implementation but planners benefit
from the stronger model to produce better implementation plans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Successfully completed --ship clusters now transition to state=completed
(visible in zeroshot list) instead of state=killed (removed from memory).

This fixes a critical heroshot integration bug where completed clusters
vanished from zeroshot list, causing heroshot to exhaust retries and
mark shipped items as failed, blocking the entire downstream DAG.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EivMeyer EivMeyer enabled auto-merge February 23, 2026 09:22
EivMeyer and others added 4 commits February 23, 2026 10:31
Tests that wait for clusters reaching terminal state after CLUSTER_COMPLETE
now expect 'completed' instead of 'stopped', matching the orchestrator
change where stop({completedSuccessfully: true}) sets state to 'completed'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract 5 new modules from the 960-line monolith:
- pr-verification.js: GitHub PR verification with retry logic
- hook-sandbox.js: Shared VM sandbox builder (deduplicated)
- hook-transform.js: Transform script execution
- hook-template.js: Template variable substitution
- hook-logic.js: Hook logic evaluation

Main file becomes slim dispatcher (116 lines) importing from modules.
All files pass strict ESLint (300 max lines, complexity 10, depth 3).

Key bug fix: PR fetch now retries 6 times (5s apart) for concrete PR
numbers, preventing false HALLUCINATED errors caused by GitHub API
eventual consistency after merge + branch delete.
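
The retry behavior described above can be sketched as a fixed-delay retry loop. This is an illustration under stated assumptions, not the code in pr-verification.js: the function name `fetchPrWithRetry`, the injectable `fetchPr` and `sleep` parameters, and the reading of "6 times" as six attempts in total are all assumptions.

```javascript
// Hypothetical sketch of a fixed-delay retry around a PR lookup.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchPrWithRetry(
  fetchPr,
  { attempts = 6, delayMs = 5000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      // Succeeds as soon as the API reflects the merged PR.
      return await fetchPr();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Injecting `sleep` keeps the loop testable without waiting out the real delays.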

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d branches

Three root causes fixed:

1. merge_commit_sha was required in structured output, but after
   --delete-branch the remote branch is gone and gh pr view hangs
   trying to fetch it → agent stuck in retry loop. Now optional.

2. Non-merge-queue path used --auto which requires branch protection.
   On unprotected branches (dev), --auto silently does nothing.
   Changed to --delete-branch (direct merge + cleanup).

3. Merge queue poll loop had no timeout on gh pr view and no
   iteration cap → could hang forever. Added timeout 30 per call
   and bounded to 90 iterations (30 min max).
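
The third fix (a per-call timeout plus an iteration cap) can be sketched as below. This is a JavaScript illustration only; the commit message suggests the real loop wraps `gh pr view`, and the names `withTimeout` and `pollUntilMerged`, the `intervalMs` default, and the injectable `getState` and `sleep` parameters are assumptions. `'MERGED'` is the state string this sketch waits for.

```javascript
// Hypothetical sketch of a bounded poll loop with a per-call timeout.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function pollUntilMerged(
  getState,
  { maxIterations = 90, callTimeoutMs = 30000, intervalMs = 20000, sleep = defaultSleep } = {}
) {
  for (let i = 0; i < maxIterations; i += 1) {
    try {
      const state = await withTimeout(getState(), callTimeoutMs);
      if (state === 'MERGED') return true;
    } catch (error) {
      // A hung or failed call consumes one iteration instead of blocking forever.
    }
    await sleep(intervalMs);
  }
  // Iteration cap reached without seeing a merge.
  return false;
}
```

Both bounds matter: the timeout caps each call, and the iteration cap limits the loop as a whole.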

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>