Crash recovery: detect and clean up stale running experiments #6

@alokwhitewolf

If Claude Code crashes mid-run, or the shell dies, or the machine reboots while evo run <id> is in flight, the experiment is left with status: "running" in graph.json forever. There's no command to detect this or recover.

What you have to do today:

  1. Notice via evo scratchpad or evo status that something is stuck on running
  2. Manually figure out whether the underlying subprocess is actually still alive
  3. If not, run evo discard <id> to mark it dead and let the orchestrator move on
  4. Accept that any partial trace data is lost, because the run never completed

That's all manual and easy to forget. For autonomous setups, it can leave the optimization loop spinning on phantom in-flight experiments.
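
For reference, the manual liveness check in step 2 can be sketched roughly as follows. This is a minimal illustration for POSIX systems only. It assumes the runner has recorded the subprocess PID somewhere, which the issue does not confirm, and `pid_is_alive` is a hypothetical helper name, not part of evo.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        # Signal 0 sends nothing; it only performs the existence and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A live PID is only weak evidence: PIDs are reused, and after a reboot the recorded PID may belong to an unrelated process, so a check like this is best combined with a start-time or timeout test.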

What I'd want:

  • An evo recover command that walks all running nodes, checks whether each node's subprocess is alive (PID-based or timeout-based), and either (a) marks the node failed with a "process disappeared" reason or (b) prompts to retry.
  • evo status should surface stale running nodes prominently rather than burying them in the dump.
  • Optionally: a heartbeat file that the runner touches periodically, so detection doesn't depend on tracking PIDs across crashes.
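
A rough sketch of what heartbeat-based detection could look like, to make the proposal concrete. Everything here is an assumption for illustration: the node dictionary shape, the `<id>.heartbeat` file naming, the `failure_reason` field, and the 120-second threshold are hypothetical and do not describe evo's actual graph.json layout.

```python
from pathlib import Path
import time
from typing import Optional

STALE_AFTER_SECONDS = 120.0  # hypothetical threshold; tune to the heartbeat interval


def heartbeat_age(path: Path, now: float) -> Optional[float]:
    """Seconds since the heartbeat file was last touched, or None if it is missing."""
    try:
        return now - path.stat().st_mtime
    except FileNotFoundError:
        return None


def find_stale(nodes: dict, heartbeat_dir: Path,
               stale_after: float = STALE_AFTER_SECONDS,
               now: Optional[float] = None) -> list:
    """Return ids of nodes marked running whose heartbeat is missing or too old."""
    now = time.time() if now is None else now
    stale = []
    for node_id, node in nodes.items():
        if node.get("status") != "running":
            continue
        # Assumed naming scheme: one heartbeat file per experiment id.
        age = heartbeat_age(heartbeat_dir / f"{node_id}.heartbeat", now)
        if age is None or age > stale_after:
            stale.append(node_id)
    return stale


def mark_failed(nodes: dict, stale_ids: list,
                reason: str = "process disappeared") -> None:
    """Flip stale nodes to failed in place and record the reason."""
    for node_id in stale_ids:
        nodes[node_id]["status"] = "failed"
        nodes[node_id]["failure_reason"] = reason
```

Note that this sketch treats a missing heartbeat file as stale, which could misfire for a run that has only just started; a real implementation would probably want a grace period based on the node's start time.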

Related but separate: concurrent Claude Code sessions on the same workspace can race on next_id allocation in evo new even with advisory locking. Edge case but real for power users.
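
One possible way to sidestep the race, sketched under assumptions: instead of a read-increment-write on a shared counter, each session claims an id by atomically creating a marker file. The `claims_dir` layout and the `claim_next_id` name are hypothetical, and this is not a description of how evo new currently allocates ids.

```python
import os
from pathlib import Path


def claim_next_id(claims_dir: Path, start: int = 0) -> int:
    """Claim the lowest unclaimed integer id at or above `start`."""
    claims_dir.mkdir(parents=True, exist_ok=True)
    candidate = start
    while True:
        marker = claims_dir / f"{candidate}.claim"
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so at most
            # one caller can win any given id.
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            candidate += 1  # someone else holds this id; try the next one
            continue
        os.close(fd)
        return candidate
```

The exclusive-create guarantee holds on local filesystems; behaviour on network filesystems varies and would need checking before relying on it.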

Metadata

Labels: enhancement
