feat: add test-both override for empirical deadlock resolution#1
feat: add test-both override for empirical deadlock resolution#1claytona500 wants to merge 1 commit into
Conversation
When the critique loop stagnates (ESCALATE), the only options today are add-note, force-proceed, or abort — all of which punt the decision to the human without evidence. This adds a test-both override that invokes a judge agent to evaluate the current plan against an alternative approach, then renders a verdict (approach_a, approach_b, or synthesis) based on empirical assessment. Changes: - New test-both.json schema for structured judge output - Judge prompt in prompts.py that evaluates both approaches against unresolved flags - _override_test_both handler in cli.py with full state machine integration - Mock worker for test-both in workers.py - Default agent routing (claude) in _core.py - Updated infer_next_steps to surface test-both for ESCALATE/ABORT - Documentation in instructions.md - 15 new tests covering all verdict paths, state transitions, and schema Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hey Clayton — nice idea, the problem you're targeting is real. Some thoughts before we go deeper: v0.3.0 contextWe just shipped v0.3.0 on main which changes how the orchestrator handles ESCALATE. The orchestrator now has:
This covers the most common stagnation case: the critic keeps flagging things that are implementation details or phantom concerns. The orchestrator investigates and force-proceeds. Where does the alternative come from?The judge prompt asks the judge to both propose approach B and rule between A and B. That's a conflict of interest — it's judging its own proposal. The "empirical" framing implies independent evaluation, but it's really one agent arguing with itself. Does stagnation actually need a whole new approach?When the loop stagnates, it's usually because:
Case 3 is the only one where an alternative plan would help. But generating a credible alternative in a single agent call is a lot to ask — it's doing in one step what the full clarify→plan→critique loop does in many. Simpler alternatives that feed into the existing systemA few things that might solve the same problem with less machinery:
Not saying noThe core insight — "break deadlocks with evidence instead of just punting" — is good. I'm just wondering if we can get there by making the existing actors smarter (orchestrator judgment, fresh critic sessions, robustness tuning) rather than adding a new actor. What do you think? The PR will also need a rebase against v0.3.0 — |
Security Review: PR #1 — "test-both override for empirical deadlock resolution"Repository: peteromallet/megaplan SummaryThis PR adds a The override is structurally similar to existing overrides ( Findings1. Prompt Injection via Untrusted File Content (Medium)OWASP Category: Injection def _test_both_prompt(state: PlanState, plan_dir: Path) -> str:
latest_plan = latest_plan_path(plan_dir, state).read_text(encoding="utf-8")
latest_meta = read_json(latest_plan_meta_path(plan_dir, state))
...
return textwrap.dedent(
f"""
...
Current plan (Approach A):
{latest_plan}
Plan metadata:
{json_dump(latest_meta).strip()}
Unresolved flags from the critic:
{json_dump(open_flags).strip()}
...
"""
).strip()Risk: The prompt template embeds Impact: An attacker could inject instructions that cause the LLM to:
Recommendation:
2. Unvalidated LLM Output Used in Control-Flow Branching (Medium)OWASP Category: Insecure Design / Security Misconfiguration verdict = worker.payload["verdict"]
rationale = worker.payload["verdict_rationale"]
...
if verdict == "approach_a":
gate = run_gate_checks(plan_dir, state)
atomic_write_json(plan_dir / "gate.json", gate)
if gate["passed"]:
final_plan = latest_plan_path(plan_dir, state).read_text(encoding="utf-8")
atomic_write_text(plan_dir / "final.md", final_plan)
state["current_state"] = STATE_GATED
next_step = "execute" if gate["passed"] else "integrate"
elif verdict == "approach_b":
next_step = "integrate"
else:
next_step = "integrate"
evaluation = copy.deepcopy(state["last_evaluation"])
evaluation["recommendation"] = "SKIP" if verdict == "approach_a" else "CONTINUE"Risk: The LLM's The payload is validated against the JSON schema only after the entire flow, but the schema only checks structure, not semantic correctness. There is no bounds-checking or sanity-checking on the verdict value before it drives control flow. Impact: An attacker who can manipulate LLM output (via the prompt injection vector described above) could:
Recommendation:
3. Hidden Root Bypass via Private Attribute (Low)OWASP Category: Broken Access Control root = args._test_both_root if hasattr(args, "_test_both_root") else Path.cwd()Risk: Existing overrides ( Impact: Low — requires code-level access. However, the asymmetry with other override functions makes this an inconsistency worth noting. Recommendation:
4. Unvalidated LLM Output Written to Disk (Low)OWASP Category: Security Misconfiguration atomic_write_json(plan_dir / test_both_filename, worker.payload)Risk: The full worker payload (LLM output) is written to Other overrides also write LLM output to disk, so this is consistent with the existing codebase. But Impact: Low — file is in Recommendation:
5. TOCTOU Race Condition in State Management (Low)OWASP Category: Security Misconfiguration # Line 1022: state is read before function call (in caller)
# Line 1027: long-running LLM call via run_step_with_worker
worker, agent, mode, refreshed = run_step_with_worker("test-both", state, plan_dir, args, root=root)
# Line 1078: state is saved after LLM returns
save_state(plan_dir, state)Risk: The state is read from disk before the LLM call (which can take 30+ seconds) and written back after. Between read and write, another process or concurrent invocation could modify the state file, and those changes would be silently overwritten. The This is a pre-existing pattern in all override functions (not unique to
Impact: Low — concurrent access to the same plan directory is unlikely in the normal workflow. In CI/CD pipelines where plans run sequentially, this is not exploitable. Recommendation:
6. Injection Surface via
|
| # | Severity | Category | Finding | File |
|---|---|---|---|---|
| 1 | Medium | Injection | Prompt injection via unvalidated plan content embedded in LLM prompt | prompts.py:259-295 |
| 2 | Medium | Insecure Design | LLM verdict field drives control-flow branching without validation | cli.py:1035-1055 |
| 3 | Low | Access Control | Hidden _test_both_root parameter bypasses default root directory |
cli.py:1026 |
| 4 | Low | Security Misconfiguration | LLM output written to disk before schema validation | cli.py:1033 |
| 5 | Low | Security Misconfiguration | TOCTOU race window in state management during long LLM call | cli.py:1022-1085 |
| 6 | Info | Injection | project_dir embedded unsanitized in LLM prompt |
prompts.py:270 |
| 7 | Info | Security Misconfiguration | args.reason stored unsanitized in metadata |
cli.py:1070 |
Verdict
PASS with recommendations. The PR adds a well-structured feature that follows existing code patterns. No critical vulnerabilities were found. The two Medium-severity findings (prompt injection and unvalidated verdict control flow) are the main concerns and should be addressed before deployment to production environments where the plan files may contain untrusted content.
Recommended action items:
- Add input sanitization to prompt interpolation in
_test_both_prompt - Add enum validation for the
verdictfield before control-flow branching - Validate payload against schema before writing to disk
|
Closing — stale, conflicting with main; test-both override approach not pursued. |
Security Review: PR #1 - Test-Both Override for Deadlock ResolutionExecutive SummaryThis PR introduces a new Overall Risk: HIGH Findings Summary
Critical Vulnerabilities1. Path Traversal via
|
| Finding | OWASP Category | Severity | Fix Complexity |
|---|---|---|---|
Path Traversal via _test_both_root |
Injection / Access Control | Critical | Medium |
| TOCTOU Race Condition | Broken Access Control | High | High |
| Evaluation Mutation Without Audit | Insufficient Logging | High | Low |
| Unsafe JSON Deserialization | Insecure Deserialization | Medium | Low |
| Hardcoded Agent Routing | Security Misconfiguration | Medium | Medium |
| Missing Authorization Check | Broken Access Control | Low | Medium |
| Inconsistent Error Handling | N/A | Info | Low |
Recommendations Priority
- IMMEDIATE (Critical): Fix path traversal by removing
hasattr()pattern and adding proper argument validation - HIGH PRIORITY: Implement optimistic locking to prevent TOCTOU race conditions
- HIGH PRIORITY: Preserve original evaluation in audit trail before mutation
- MEDIUM PRIORITY: Add maxLength constraints to schema and validate payload sizes
- MEDIUM PRIORITY: Move agent routing to configuration with validation
- LOW PRIORITY: Add authorization checks for override operations
Conclusion
The test-both override mechanism introduces significant security risks primarily around:
- Path traversal enabling malicious schema loading
- Race conditions allowing state machine bypass
- Audit trail gaps hiding security decisions
While the feature serves a legitimate purpose (breaking deadlocks), the implementation requires hardening before production use. The attack surface is especially concerning because this is an override mechanism designed to bypass normal safety checks - making it a high-value target for attackers.
Recommendation: Do not merge until Critical and High severity issues are resolved.
Summary
When the critique loop hits ESCALATE, the current options are
add-note,force-proceed, orabort— all of which punt the decision to the human without evidence. This PR adds atest-bothoverride that breaks deadlocks empirically:approach_a,approach_b, orsynthesisMotivation
Inspired by adversarial convergence patterns where competing approaches are tested empirically rather than debated endlessly. When the same critique concerns recur across iterations and neither force-proceed nor add-note resolves the impasse,
test-bothgives the orchestrator an evidence-based path forward.Changes
schemas.py— Newtest-both.jsonschema with structured approach comparison and verdict enumprompts.py— Judge prompt that evaluates both approaches against unresolved flagsworkers.py— Mock worker, schema filename mapping, session key fortest-bothstep_core.py— Default agent routing (claude) fortest-bothcli.py—_override_test_bothhandler with full state machine integration; updatedinfer_next_stepsand argparse choicesinstructions.md— Documentation for the new override optiontests/test_test_both.py— 15 new tests covering all verdict paths, state transitions, schema, and mocktests/test_megaplan.py,tests/test_schemas.py— Updated existing parametrized testsUsage
Test plan
test-bothonly available from EVALUATED state with ESCALATE/ABORT recommendation🤖 Generated with Claude Code