Crash recovery: detect and clean up stale running experiments

If Claude Code crashes mid-run, or the shell dies, or the machine reboots while `evo run <id>` is in flight, the experiment is left with `status: "running"` in `graph.json` forever. There's no command to detect this or recover.

What you have to do today:

1. Notice via `evo scratchpad` or `evo status` that an experiment is stuck in `running`
2. Manually figure out whether the underlying subprocess is actually still alive
3. If not, run `evo discard <id>` to mark it dead and let the orchestrator move on
4. Lose any partial trace data because the run never completed

That's all manual and easy to forget. For autonomous setups, it can leave the optimization loop spinning on phantom in-flight experiments.
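For reference, step 2 is roughly the check below when done by hand. This is only a sketch and assumes the runner's PID was recorded somewhere, which as far as I can tell evo does not do today. `pid_alive` is a hypothetical helper name, and the signal-0 probe is POSIX-only.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper: assumes the runner's PID was saved at launch.
    """
    if pid <= 0:
        # 0 and negative values address process groups in kill(2), not one process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending anything
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

Note the weakness: after a reboot or a long uptime the PID may have been reused by an unrelated process, so a PID check alone can report a dead run as alive.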

What I'd want:

- `evo recover` command that walks all `running` nodes, checks whether their subprocess is alive (PID-based or timeout-based), and either (a) marks them `failed` with a "process disappeared" reason, or (b) prompts to retry.
- `evo status` to surface stale `running` nodes prominently rather than burying them in the dump.
- Optionally: a heartbeat file the runner touches periodically, so detection doesn't depend on tracking PIDs across crashes.
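To make the proposal concrete, here is a rough sketch of what a heartbeat-based `evo recover` pass could look like. Everything about the layout is an assumption for illustration: I'm guessing `graph.json` has a top-level `nodes` mapping keyed by experiment id, that the runner touches `<id>.heartbeat` in some directory, and the `reason` field and 60-second threshold are made up. It is not a description of how evo stores things today.

```python
import json
import os
import time
from pathlib import Path
from typing import List, Optional


def heartbeat_age(path: Path, now: float) -> Optional[float]:
    """Seconds since the heartbeat file was last touched, or None if it is missing."""
    try:
        return now - path.stat().st_mtime
    except FileNotFoundError:
        return None


def recover(graph_path: Path, heartbeat_dir: Path,
            max_age_s: float = 60.0, now: Optional[float] = None) -> List[str]:
    """Mark `running` nodes with a missing or stale heartbeat as `failed`.

    Assumed (hypothetical) layout: {"nodes": {"<id>": {"status": ...}}}.
    Returns the ids that were marked failed.
    """
    now = time.time() if now is None else now
    graph = json.loads(graph_path.read_text())
    recovered = []
    for node_id, node in graph.get("nodes", {}).items():
        if node.get("status") != "running":
            continue
        age = heartbeat_age(heartbeat_dir / f"{node_id}.heartbeat", now)
        if age is None or age > max_age_s:
            node["status"] = "failed"
            node["reason"] = "process disappeared"  # hypothetical field name
            recovered.append(node_id)
    if recovered:
        # Write to a temp file and rename so a crash mid-write cannot corrupt graph.json.
        tmp = graph_path.with_suffix(".json.tmp")
        tmp.write_text(json.dumps(graph, indent=2))
        os.replace(tmp, graph_path)
    return recovered
```

The heartbeat approach sidesteps PID reuse entirely: a run counts as alive only if something has touched its file recently. The threshold would need to be comfortably larger than the runner's touch interval.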

Related but separate: concurrent Claude Code sessions on the same workspace can race on `next_id` allocation in `evo new` even with advisory locking. Edge case but real for power users.
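I haven't read how `evo new` takes its lock, so this is a guess at the shape of a fix rather than a diagnosis: one way to close this kind of race is to hold a single exclusive lock across the whole read, increment, and write. The sketch below uses a standalone counter file, which is purely hypothetical (`allocate_id` and the file are not evo's), and `fcntl.flock` is POSIX-only.

```python
import fcntl
import os
from pathlib import Path


def allocate_id(counter_path: Path) -> int:
    """Return the next unused id and persist the incremented counter.

    Hypothetical sketch: the lock is held for the full read-modify-write,
    so two sessions cannot both observe the same value.
    """
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other holder remains
        raw = f.read().strip()
        current = int(raw) if raw else 0
        f.seek(0)
        f.truncate()
        f.write(str(current + 1))
        f.flush()
        os.fsync(f.fileno())
        return current  # closing the file releases the lock
```

Advisory locks only help if every writer takes the same lock, so any code path that updates `next_id` without going through this would reintroduce the race.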
