fix: run command graders from trial cwd #56

Merged: EdwardIrby merged 21 commits into main from fix/run-command-graders on May 7, 2026


@EdwardIrby (Member)

Summary

  • run command graders from the trial cwd instead of the harness process cwd
  • add a regression test for command grader cwd behavior
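
A minimal sketch of the behavior described in the summary, not the code from this PR. It assumes a command grader is a shell command whose exit code decides pass or fail. The names `runCommandGrader`, `CommandGraderResult`, and `trialCwd` are hypothetical. The relevant part is the explicit `cwd` option: without it the child process inherits the harness's `process.cwd()`. The usage lines at the end assume a POSIX shell where `pwd` is available.

```typescript
import { spawnSync } from "node:child_process";
import { mkdtempSync, realpathSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical result shape; the project's real grader result type is not shown in this PR.
type CommandGraderResult = {
  pass: boolean;
  exitCode: number | null;
  stdout: string;
};

// Hypothetical helper: runs a grader command in the trial's working directory
// rather than inheriting the harness process cwd.
function runCommandGrader(command: string, trialCwd: string): CommandGraderResult {
  const result = spawnSync(command, {
    cwd: trialCwd, // the fix in spirit: pin the child to the trial directory
    shell: true,
    encoding: "utf8",
  });
  return {
    pass: result.status === 0,
    exitCode: result.status,
    stdout: result.stdout ?? "",
  };
}

// Regression-style check: the command should observe the trial directory as its cwd.
const trialCwd = realpathSync(mkdtempSync(join(tmpdir(), "trial-")));
const graded = runCommandGrader("pwd", trialCwd);
console.log(graded.pass, graded.stdout.trim() === trialCwd);
```

Resolving the temp directory with `realpathSync` keeps the comparison stable on systems where the temp path is a symlink.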

Tests

  • bun run check
  • bun test src/

EdwardIrby and others added 21 commits March 10, 2026 21:22
Full replacement of src/ and bin/ with trial runner, schemas,
and CLI utilities ported from agent-build. Strips BP-specific
types (no RISK_TAG, SelectionBid, TOOL_STATUS). Inlines
simplified TrajectoryStepSchema with 3 step types.

BREAKING: New CLI entry point, new exports structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These skills taught the old pipeline's concepts (Docker evals,
calibration flow, headless browser adapters) which no longer exist
after the trial-runner replacement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Gemini Code Assist MCP server config from the old pipeline.
No longer relevant after the trial-runner replacement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Old integration test wrapper for the previous pipeline's
integration_tests/ directory which no longer exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Delete docker-compose.test.yml, Dockerfile.test, and the
test-integration CI job — all referenced integration_tests/
which no longer exists. CI now runs a single test job with
check + test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

API keys were for the deleted Docker integration tests.
The trial runner delegates credential management to adapters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AGENTS.md rewritten to reflect new trial runner architecture:
updated structure diagram, skills table, capabilities, commands.
Removed stale Docker test references and old skill names.

Added src/tests/cli.spec.ts with coverage for:
- CLI routing (trials/compare/calibrate/unknown/no-command)
- parseCli meta flags (--help, -h, --schema input)
- Input validation (valid JSON, invalid JSON, missing input)
- Export contract verification (runTrial, schemas)

All 65 tests pass. Types clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…/run-command-graders

# Conflicts:
#	src/eval.ts
#	src/tests/eval.spec.ts
EdwardIrby requested a review from alisonailea as a code owner on May 7, 2026 04:46
EdwardIrby merged commit 16735bd into main on May 7, 2026
1 check passed
EdwardIrby deleted the fix/run-command-graders branch on May 7, 2026 06:08