feat(issue-22): manual inference validation script + workflow #23
Merged
inference.py drives one LLM-backed episode per Space via the FrontierSweEnv WebSocket client. Pi inside the container holds the agent + grader keys (propagated by sync-hf-spaces) and runs a multi-turn LLM loop behind /step. This script keeps a WS session open, sends a natural-language nudge, and reads back the observation. It emits a structured [PREFLIGHT] / [START] / [STEP] / [END] log format with the score clamped to the (0.01, 0.99) open interval.

validate-inference.yml is workflow_dispatch only: it is triggered manually before submission. It runs a four-shard matrix across notebook, postgres, type-checker, and libexpat-to-x86asm. Each shard waits for /health before running inference.py, with a 20-min per-shard timeout. MESSAGE_TIMEOUT=900 covers HF Router cold-start latency (observed up to ~600s).

.env.example documents FSWE_SPACE_URL + the optional knobs for local runs; the runner does not need agent/grader API keys locally.

Local proofs across all 4 deployed Spaces produce valid log output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
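The tagged log lines and score clamping described in the commit message can be sketched roughly as follows. This is not the code from inference.py: the helper names `clamp_score` and `format_line` and the key=value field layout are hypothetical, and the sketch assumes "clamped" means pinning the value between 0.01 and 0.99, so the bounds themselves can be emitted. Only the bracketed tags and the two bounds come from the description.

```python
from typing import Any

# Bounds taken from the (0.01, 0.99) interval named in the description.
SCORE_MIN = 0.01
SCORE_MAX = 0.99


def clamp_score(raw: float) -> float:
    """Pin a raw score between SCORE_MIN and SCORE_MAX (hypothetical helper)."""
    return max(SCORE_MIN, min(SCORE_MAX, raw))


def format_line(tag: str, **fields: Any) -> str:
    """Render one tagged log line such as '[STEP] step=1 score=0.99'.

    The key=value layout is an assumption made for illustration; only the
    bracketed tags ([PREFLIGHT], [START], [STEP], [END]) come from the PR.
    """
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"[{tag}] {body}".rstrip()


if __name__ == "__main__":
    print(format_line("START", task="notebook"))
    # Out-of-range scores are pulled back to the nearest bound.
    print(format_line("STEP", step=1, score=f"{clamp_score(1.3):.2f}"))
    print(format_line("END", score=f"{clamp_score(-0.2):.2f}"))
```

Keeping the clamp in one small function means every tag that reports a score goes through the same bounds, which makes the stdout lines easier to parse downstream.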
Closes #22.
What
- inference.py at repo root: drives one LLM-backed episode per deployed Space via the FrontierSweEnv WebSocket client. Pi inside the container holds the agent + grader credentials and runs a multi-turn loop behind /step; this script keeps the WS session open, sends a natural-language nudge per outer step, and reads back the observation. Emits structured [PREFLIGHT] / [START] / [STEP] / [END] lines on stdout with score clamped to the (0.01, 0.99) open interval.
- .github/workflows/validate-inference.yml: workflow_dispatch only. Four-shard matrix across notebook, postgres, type-checker, libexpat-to-x86asm. Per-shard timeout-minutes: 20. Each shard waits for the Space's /health to return 200 before running inference.
- .env.example: documents FSWE_SPACE_URL + optional knobs for local runs.

Why manual-only
A full run is ~3-10 min wall time per Space and ~$0.15-0.45 in HF Router tokens. Running this on every main push would burn ~$1.50-4.50/day during active dev with no proportional signal, since Pi inside the Space is the same code that already passed /health checks. We'll trigger this manually before submission to confirm the full agent loop works end-to-end.

What's needed to run
- HF_OWNER: Space hosts follow the pattern ${HF_OWNER}-frontier-swe-${task}.hf.space.
- FSWE_SPACE_URL: documented in .env.example.
- Optional knobs: MAX_STEPS, TASK_COUNT, MESSAGE_TIMEOUT.

The runner does not need any agent/grader API keys. Pi inside each Space already has them (propagated by sync-hf-spaces at deploy time).

Local validation
All four deployed Spaces emitted [START]/[STEP]/[END] lines in local runs.

MESSAGE_TIMEOUT=900 covers the worst observed latency.

Test plan
- Trigger validate-inference from the Actions UI and confirm all 4 shards are green.

🤖 Generated with Claude Code
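For reference, the per-shard wait for /health to return 200 could look roughly like the sketch below. The PR does not show how the workflow implements this wait, so everything here is an assumption: the helper names `space_url`, `http_status`, and `wait_for_health`, the https scheme, and the 300 s / 5 s defaults are all hypothetical. Only the host pattern and the "wait for 200" behaviour come from the description.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def space_url(owner: str, task: str) -> str:
    """Build a Space base URL from the ${HF_OWNER}-frontier-swe-${task} pattern."""
    return f"https://{owner}-frontier-swe-{task}.hf.space"


def http_status(url: str, timeout_s: float = 10.0) -> int:
    """Return the HTTP status for a GET on url, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # Server answered, but with a non-2xx status.
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar.


def wait_for_health(
    base_url: str,
    timeout_s: float = 300.0,   # Assumed budget, not taken from the PR.
    interval_s: float = 5.0,    # Assumed poll interval, not taken from the PR.
    probe: Optional[Callable[[str], int]] = None,
) -> bool:
    """Poll base_url/health until it returns 200 or the deadline passes."""
    check = probe or http_status
    deadline = time.monotonic() + timeout_s
    while True:
        if check(f"{base_url}/health") == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example use (performs real network calls, so it is left commented out):
# ok = wait_for_health(space_url("my-owner", "notebook"))
```

The `probe` parameter exists only so the polling loop can be exercised without network access; a shard would call `wait_for_health` once and skip inference if it returns False.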