feat(issue-22): manual inference validation script + workflow by SourasishBasu · Pull Request #23 · 3xcaffeine/frontier-swe-openenv

SourasishBasu · 2026-04-26T05:52:02Z

Closes #22.

What

inference.py at repo root: drives one LLM-backed episode per deployed Space via the FrontierSweEnv WebSocket client. Pi inside the container holds the agent + grader credentials and runs a multi-turn loop behind /step; this script keeps the WS session open, sends a natural-language nudge per outer step, and reads back the observation. Emits structured [PREFLIGHT] / [START] / [STEP] / [END] lines on stdout with score clamped to the (0.01, 0.99) open interval.
.github/workflows/validate-inference.yml: workflow_dispatch only. Four-shard matrix across notebook, postgres, type-checker, libexpat-to-x86asm. Per-shard timeout-minutes: 20. Each shard waits for the Space's /health to return 200 before running inference.
.env.example: documents FSWE_SPACE_URL + optional knobs for local runs.
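The score clamping and tagged log lines described above could look roughly like the sketch below. It is illustrative only: `clamp_score` and `log_line` are hypothetical names, the `key=value` field layout is an assumption, and the real inference.py may format its lines differently. The sketch clamps to the endpoints 0.01 and 0.99, which keeps the score strictly inside (0, 1). Whether the actual script also excludes the endpoints themselves is not stated in this PR.

```python
import math

# Bounds taken from the PR description; everything else here is an assumption.
SCORE_MIN, SCORE_MAX = 0.01, 0.99


def clamp_score(raw: float) -> float:
    """Clamp a raw score into [0.01, 0.99]; NaN falls back to the minimum."""
    if math.isnan(raw):
        return SCORE_MIN
    return max(SCORE_MIN, min(SCORE_MAX, raw))


def log_line(tag: str, **fields) -> str:
    """Format one tagged line such as '[STEP] step=1 done=False'."""
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"[{tag}] {body}".rstrip()


if __name__ == "__main__":
    print(log_line("START", task="postgres"))
    print(log_line("STEP", step=1, done=False))
    print(log_line("END", score=f"{clamp_score(1.3):.2f}"))
```

Keeping the clamp in one small function makes it easy to check that an out-of-range or NaN grader value can never reach stdout unclamped.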

Why manual-only

A full run is ~3-10 min wall time per Space and ~$0.15-0.45 in HF Router tokens. Running this on every main push would burn ~$1.50-4.50/day during active dev with no proportional signal — pi inside the Space is the same code that already passed /health checks. We'll trigger this manually before submission to confirm the full agent loop works end-to-end.

What's needed to run

Where	Variable	Type	Notes
GitHub repo	`HF_OWNER`	variable	already set; workflow constructs Space URL from `${HF_OWNER}-frontier-swe-${task}.hf.space`
Local	`FSWE_SPACE_URL`	env var	required; placeholder in `.env.example`
Local	`MAX_STEPS`, `TASK_COUNT`, `MESSAGE_TIMEOUT`	env vars	optional
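As a rough illustration of how the Space URL pattern in the table and the per-shard /health wait fit together, here is a minimal sketch. It makes several assumptions: the `https://` scheme, the retry count and delay, and the helper names `space_url`, `http_status`, and `wait_for_health` are all hypothetical, and the actual workflow may perform this wait in shell rather than Python.

```python
import time
from typing import Callable
from urllib.request import urlopen

# Task names from the four-shard matrix in the PR description.
TASKS = ["notebook", "postgres", "type-checker", "libexpat-to-x86asm"]


def space_url(owner: str, task: str) -> str:
    """Build a Space URL from the documented pattern (https scheme assumed)."""
    return f"https://{owner}-frontier-swe-{task}.hf.space"


def http_status(url: str) -> int:
    """Return the HTTP status, or 0 on network failure or an HTTP error code."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return 0


def wait_for_health(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
    attempts: int = 30,      # hypothetical retry budget
    delay: float = 10.0,     # hypothetical pause between polls, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch_status(f"{base_url}/health") == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A caller would loop over `TASKS`, build each URL with `space_url(owner, task)`, and only start an episode once `wait_for_health` returns `True`. The injectable `fetch_status` and `sleep` parameters exist so the polling logic can be exercised without network access.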

The runner does not need any agent/grader API keys. Pi inside each Space already has them (propagated by sync-hf-spaces at deploy time).

Local validation

Space	Wall time	Done	Output
type-checker	17-26s warm, 425s cold	✅	valid `[START]/[STEP]/[END]`
notebook	~100s	✅	valid `[START]/[STEP]/[END]`
postgres	~170s	✅	valid `[START]/[STEP]/[END]`
libexpat-to-x86asm	~592s	✅	valid `[START]/[STEP]/[END]`

MESSAGE_TIMEOUT=900 (seconds) covers the worst observed latency (up to ~600s).
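One common way to enforce a per-message timeout like this in an async client is `asyncio.wait_for`. The sketch below is only an illustration of that pattern: `with_timeout` is a hypothetical helper, the 900-second default is taken from the value quoted above, and how inference.py or the FrontierSweEnv client actually applies MESSAGE_TIMEOUT is not shown in this PR.

```python
import asyncio

# Value quoted in the PR; treating it as a default in seconds is an assumption.
DEFAULT_MESSAGE_TIMEOUT = 900.0


async def with_timeout(awaitable, seconds: float = DEFAULT_MESSAGE_TIMEOUT):
    """Await one reply; return None if it takes longer than `seconds`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def _demo() -> None:
    async def slow_reply() -> str:
        # Stand-in for a step call that waits on a cold Space.
        await asyncio.sleep(0.2)
        return "observation"

    print(await with_timeout(slow_reply(), seconds=1.0))   # reply arrives in time
    print(await with_timeout(slow_reply(), seconds=0.01))  # times out


if __name__ == "__main__":
    asyncio.run(_demo())
```

With the latencies in the table, a 900-second budget leaves roughly 300 seconds of headroom over the slowest observed run.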

Test plan

After merge: trigger validate-inference from the Actions UI, confirm all 4 shards green
Confirm no automatic invocations on subsequent pushes / merges

🤖 Generated with Claude Code

inference.py drives one LLM-backed episode per Space via the FrontierSweEnv WebSocket client. Pi inside the container holds the agent + grader keys (propagated by sync-hf-spaces) and runs a multi-turn LLM loop behind /step. This script keeps a WS session open, sends a natural-language nudge, and reads back the observation. Emits a structured [PREFLIGHT] / [START] / [STEP] / [END] log format with score clamped to the (0.01, 0.99) open interval.

validate-inference.yml is workflow_dispatch only, manually triggered before submission. Four-shard matrix across notebook, postgres, type-checker, and libexpat-to-x86asm. Each shard waits for /health before running inference.py. 20-min per-shard timeout. MESSAGE_TIMEOUT=900 covers HF Router cold-start latency (observed up to ~600s).

.env.example documents FSWE_SPACE_URL + the optional knobs for local runs; the runner does not need agent/grader API keys locally. Local proofs across all 4 deployed Spaces produce valid log output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rycerzes merged commit 10f8583 into main Apr 26, 2026
1 check passed

rycerzes deleted the feat/inference-validation branch April 26, 2026 07:20
