
feat(issue-22): manual inference validation script + workflow #23

Merged: rycerzes merged 1 commit into main from feat/inference-validation on Apr 26, 2026

@SourasishBasu
Contributor

Closes #22.

What

  • inference.py at repo root: drives one LLM-backed episode per deployed Space via the FrontierSweEnv WebSocket client. Pi inside the container holds the agent + grader credentials and runs a multi-turn loop behind /step; this script keeps the WS session open, sends a natural-language nudge per outer step, and reads back the observation. Emits structured [PREFLIGHT] / [START] / [STEP] / [END] lines on stdout with score clamped to the (0.01, 0.99) open interval.
  • .github/workflows/validate-inference.yml: workflow_dispatch only. Four-shard matrix across notebook, postgres, type-checker, libexpat-to-x86asm. Per-shard timeout-minutes: 20. Each shard waits for the Space's /health to return 200 before running inference.
  • .env.example: documents FSWE_SPACE_URL + optional knobs for local runs.
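The log format and score clamp described above can be sketched roughly as follows. This is a minimal illustration, not the code in inference.py: the helper names `clamp_score` and `log_line` and the `key=value` field layout are assumptions, and only the tag names and the 0.01 / 0.99 bounds come from this PR. The sketch treats the bounds themselves as valid outputs, so the score always stays strictly inside (0, 1).

```python
import math

# Bounds quoted in the PR description.
SCORE_MIN, SCORE_MAX = 0.01, 0.99


def clamp_score(raw: float) -> float:
    """Clamp a raw score into [SCORE_MIN, SCORE_MAX]; NaN maps to SCORE_MIN."""
    if math.isnan(raw):
        return SCORE_MIN
    return max(SCORE_MIN, min(SCORE_MAX, raw))


def log_line(tag: str, **fields: object) -> str:
    """Format one '[TAG] key=value ...' line (field layout is hypothetical)."""
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"[{tag}] {body}".rstrip()
```

For example, `log_line("END", score=f"{clamp_score(1.0):.2f}")` yields `[END] score=0.99`, so a perfect raw score never reports as exactly 1.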

Why manual-only

A full run is ~3-10 min wall time per Space and ~$0.15-0.45 in HF Router tokens. Running this on every main push would burn ~$1.50-4.50/day during active dev (about ten runs a day at that per-run cost) with no proportional signal: pi inside the Space is the same code that already passed /health checks. We'll trigger this manually before submission to confirm the full agent loop works end-to-end.

What's needed to run

| Where | Variable | Type | Notes |
| --- | --- | --- | --- |
| GitHub repo | HF_OWNER | variable | already set; workflow constructs Space URL from ${HF_OWNER}-frontier-swe-${task}.hf.space |
| Local | FSWE_SPACE_URL | env var | required; placeholder in .env.example |
| Local | MAX_STEPS, TASK_COUNT, MESSAGE_TIMEOUT | env vars | optional |
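For local runs, the same Space URL pattern can be reproduced in a few lines. This is a sketch assuming the `https://` scheme and a hypothetical helper name `space_url`; the task names and the `${HF_OWNER}-frontier-swe-${task}.hf.space` pattern are taken from this PR, and the owner `acme` in the usage note is a placeholder.

```python
# The four tasks in the workflow matrix, as listed in this PR.
TASKS = ("notebook", "postgres", "type-checker", "libexpat-to-x86asm")


def space_url(hf_owner: str, task: str) -> str:
    """Build a Space base URL from the owner and task name.

    Mirrors the ${HF_OWNER}-frontier-swe-${task}.hf.space pattern; the
    https:// scheme is an assumption of this sketch.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"https://{hf_owner}-frontier-swe-{task}.hf.space"


def all_space_urls(hf_owner: str) -> dict[str, str]:
    """Map every task in the matrix to its Space base URL."""
    return {task: space_url(hf_owner, task) for task in TASKS}
```

With the placeholder owner `acme`, `space_url("acme", "postgres")` gives `https://acme-frontier-swe-postgres.hf.space`, which is the kind of value FSWE_SPACE_URL expects.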

The runner does not need any agent/grader API keys. Pi inside each Space already has them (propagated by sync-hf-spaces at deploy time).

Local validation

| Space | Wall time | done output |
| --- | --- | --- |
| type-checker | 17-26s warm, 425s cold | valid [START]/[STEP]/[END] |
| notebook | ~100s | valid [START]/[STEP]/[END] |
| postgres | ~170s | valid [START]/[STEP]/[END] |
| libexpat-to-x86asm | ~592s | valid [START]/[STEP]/[END] |

MESSAGE_TIMEOUT=900 covers the worst observed latency.
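One way the local knobs could be read is sketched below. The variable names come from this PR, but the `Settings` type, the `load_settings` helper, and every default value are assumptions of the sketch. In particular, the PR does not say that 900 is the built-in default for MESSAGE_TIMEOUT, only that this value was used.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    space_url: str
    max_steps: int
    task_count: int
    message_timeout: float  # seconds


def load_settings(env: Mapping[str, str]) -> Settings:
    """Read runner settings from an environment-like mapping."""
    url = env.get("FSWE_SPACE_URL", "").strip()
    if not url:
        raise ValueError("FSWE_SPACE_URL is required")
    return Settings(
        space_url=url.rstrip("/"),
        max_steps=int(env.get("MAX_STEPS", "1")),    # hypothetical default
        task_count=int(env.get("TASK_COUNT", "1")),  # hypothetical default
        # 900 is the value quoted in this PR; using it as a default is assumed.
        message_timeout=float(env.get("MESSAGE_TIMEOUT", "900")),
    )
```

Calling `load_settings(os.environ)` after importing `os` would pick up a local .env that has already been exported. Against the slowest run in the table (~592s), a 900s timeout leaves roughly 300s of headroom.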

Test plan

  • After merge: trigger validate-inference from the Actions UI, confirm all 4 shards green
  • Confirm no automatic invocations on subsequent pushes / merges

🤖 Generated with Claude Code

inference.py drives one LLM-backed episode per Space via the FrontierSweEnv
WebSocket client. Pi inside the container holds the agent + grader keys
(propagated by sync-hf-spaces) and runs a multi-turn LLM loop behind /step.
This script keeps a WS session open, sends a natural-language nudge, and
reads back the observation. Emits a structured
[PREFLIGHT] / [START] / [STEP] / [END] log format with score clamped to
the (0.01, 0.99) open interval.

validate-inference.yml is workflow_dispatch only — manually triggered before
submission. Four-shard matrix across notebook, postgres, type-checker, and
libexpat-to-x86asm. Each shard waits for /health before running inference.py.
20-min per-shard timeout. MESSAGE_TIMEOUT=900 covers HF Router cold-start
latency (observed up to ~600s).
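The /health wait that each shard performs could be approximated as below. The workflow may well do this in shell; this is only a Python sketch using the standard library. The function names, the 600s deadline, and the 10s polling interval are assumptions, and only the "poll /health until it returns 200" behaviour comes from this PR.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a non-2xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def wait_for_health(
    base_url: str,
    probe: Callable[[str], int] = http_status,
    deadline_s: float = 600.0,   # hypothetical overall budget
    interval_s: float = 10.0,    # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it returns 200 or the deadline passes."""
    health_url = base_url.rstrip("/") + "/health"
    stop_at = time.monotonic() + deadline_s
    while True:
        if probe(health_url) == 200:
            return True
        if time.monotonic() >= stop_at:
            return False
        sleep(interval_s)
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a network; a runner would call `wait_for_health(url)` and only start inference.py when it returns `True`.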

.env.example documents FSWE_SPACE_URL + the optional knobs for local runs;
the runner does not need agent/grader API keys locally.

Local proofs across all 4 deployed Spaces produce valid log output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rycerzes rycerzes merged commit 10f8583 into main Apr 26, 2026
1 check passed
@rycerzes rycerzes deleted the feat/inference-validation branch April 26, 2026 07:20

Development

Successfully merging this pull request may close these issues.

Inference validation script + manual CI workflow
