feat(#231): LLM benchmark v2 — typed assertions, skill-trace, LLM-as-Judge, CI migration#1
ZhouChaunge wants to merge 9 commits into
Code Review
This pull request enhances the LLM benchmarking suite and improves Python environment management. Key updates include the introduction of v2 assertions for benchmarks—supporting LLM-as-Judge and skill-hit tracing—and a more robust Python dependency synchronization check that parses requirements files and verifies installed versions. Feedback focuses on improving the JSON extraction logic for nested objects in the judge component, addressing the lack of support for nested requirements files in the parser, and mitigating potential command-line length limitations when running the Python check script on Windows.
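To illustrate the nested-object feedback: a single non-greedy regex tends to stop at the first closing brace, so a judge reply with a nested object gets truncated. A minimal sketch of a brace-balancing alternative follows. The helper name `extractFirstJsonObject` is hypothetical and this is not necessarily how `judge.cjs` does it.

```javascript
// Hypothetical helper (not taken from judge.cjs): return the first balanced
// {...} span in `text` that parses as JSON, or null if none is found.
function extractFirstJsonObject(text) {
  for (let start = text.indexOf('{'); start !== -1; start = text.indexOf('{', start + 1)) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Braces inside string literals must not affect the depth counter.
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === '{') depth++;
      else if (ch === '}' && --depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          break; // balanced but not valid JSON; try the next '{'
        }
      }
    }
  }
  return null;
}
```

Tracking string state matters because a judge's free-text explanation can itself contain braces.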
Pull request overview
Implements the v2 LLM benchmark framework (typed assertions + skill-hit tracing + LLM-as-Judge), upgrades benchmark scenarios to the v2 schema with retries, migrates CI coverage from llm-integration to a dedicated llm-benchmark workflow, and tightens CLI analysis-Python setup by checking pinned requirement versions.
Changes:
- Added skill-tracing + LLM-as-Judge modules and refactored benchmark evaluation to async, type-dispatched v2 assertions (with v1 auto-upgrade).
- Upgraded benchmark scenarios to v2 schema (`category`, `tags`, `maxRetries`, `skills`, typed `assertions`).
- Introduced `llm-benchmark` CI workflow and simplified `llm-integration` workflow to routing-only; enhanced CLI Python setup to detect requirements drift.
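As a rough illustration of the requirements-drift check, the sketch below parses `name==version` pins and compares them with installed versions. The function names `parsePinnedRequirements` and `findDrift` are hypothetical, and the real helpers in `scripts/cli/runtime.js` may differ. Lines starting with `-` (such as `-r other.txt`) are skipped, so nested requirements files are not followed.

```javascript
// Hypothetical sketch; not the implementation in scripts/cli/runtime.js.
// Collect exact pins (name==version) from requirements.txt-style text.
function parsePinnedRequirements(text) {
  const pins = new Map();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/\s+#.*$/, '').trim(); // strip inline comments
    // Skip blanks, full-line comments, and option lines such as "-r other.txt".
    if (!line || line.startsWith('#') || line.startsWith('-')) continue;
    const match = line.match(/^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?\s*==\s*([^\s;]+)/);
    if (match) {
      // Normalize names the way Python packaging does: lowercase, runs of -_. become "-".
      pins.set(match[1].toLowerCase().replace(/[-_.]+/g, '-'), match[2]);
    }
  }
  return pins;
}

// Compare pins against a Map of installed versions keyed by normalized name.
function findDrift(pins, installed) {
  const drift = [];
  for (const [name, wanted] of pins) {
    const actual = installed.get(name) ?? null;
    if (actual !== wanted) drift.push({ name, wanted, actual });
  }
  return drift;
}
```

A non-empty drift list would be the signal to reinstall the analysis environment.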
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/runner.mjs` | Updates CLI help text for routing-only llm-integration and new v2 llm-benchmark command details. |
| `tests/llm-benchmark/scenarios/beam.json` | Migrates beam scenarios to v2 schema, adds retries and natural_language assertions. |
| `tests/llm-benchmark/scenarios/double-span-beam.json` | Migrates scenario to v2 schema and adds skill routing assertions. |
| `tests/llm-benchmark/scenarios/frame.json` | Migrates frame scenarios to v2 schema, adds retries and natural_language assertions. |
| `tests/llm-benchmark/scenarios/portal-frame.json` | Migrates portal-frame scenarios to v2 schema with routing assertions. |
| `tests/llm-benchmark/scenarios/truss.json` | Migrates truss scenario to v2 schema with routing assertions. |
| `tests/llm-benchmark/runner.cjs` | Adds per-scenario retry support and awaits async evaluation. |
| `tests/llm-benchmark/lib/evaluate.cjs` | Implements v2 assertion dispatch, v1→v2 upgrade, and async natural_language evaluation. |
| `tests/llm-benchmark/lib/skill-trace.cjs` | Extracts skillId / structureType from detect_structure_type tool result messages. |
| `tests/llm-benchmark/lib/judge.cjs` | Adds HTTP-based LLM judge client + prompt building and JSON response parsing. |
| `scripts/cli/runtime.js` | Adds requirements parsing + runtime check helpers to detect Python pinned-version drift. |
| `scripts/cli/main.js` | Uses the new requirements drift checks when determining whether analysis Python env is “ready”. |
| `backend/tests/analysis-python-setup.test.mjs` | Adds tests covering requirements drift reinstalls + requirements parsing/script generation. |
| `AGENTS.md` | Documents new benchmark runner commands and clarifies routing-only integration tests. |
| `.github/workflows/llm-integration.yml` | Renames to routing-only and hardwires CI execution to routing. |
| `.github/workflows/llm-benchmark.yml` | Adds a dedicated benchmark workflow with PR-comment trigger, artifact upload, and PR reporting. |
…race, LLM-as-Judge, CI migration

Build the LLM-as-Judge benchmark evaluation framework as specified in structureclaw#231:

- **v2 assertion types**: structural_type | has_model | has_analysis | has_report | skill_match | natural_language
- **Skill-hit tracing** (lib/skill-trace.cjs): parse detect_structure_type tool call from Agent messages to verify skill routing accuracy
- **LLM-as-Judge** (lib/judge.cjs): natural_language assertion evaluator with temperature=0, timeout=30s, max_tokens=500; model via LLM_JUDGE_MODEL env var
- **Refactored evaluate.cjs**: async type-dispatched evaluators; v1 scenarios auto-upgraded to v2 assertions
- **maxRetries support** in runner.cjs: retry failed scenarios up to N times
- **Scenario schema v2**: all 5 scenario files updated with category, tags, maxRetries, skills section, and typed assertions array
- **CI migration**: new llm-benchmark.yml workflow; llm-integration.yml updated to routing-only (extraction/pipeline/clarification removed)
- **Docs**: tests/runner.mjs help text and AGENTS.md updated

- tests/llm-benchmark/lib/skill-trace.cjs (new)
- tests/llm-benchmark/lib/judge.cjs (new)
- tests/llm-benchmark/lib/evaluate.cjs (refactored)
- tests/llm-benchmark/runner.cjs (maxRetries + async evaluateScenario)
- tests/llm-benchmark/scenarios/*.json (v1 -> v2 schema upgrade)
- .github/workflows/llm-benchmark.yml (new CI workflow)
- .github/workflows/llm-integration.yml (routing-only)
- tests/runner.mjs (updated help)
- AGENTS.md (updated test docs)

Closes structureclaw#231
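The type-dispatch idea can be sketched as a lookup table keyed by assertion `type`. This is a simplified, synchronous illustration. The evaluators in the PR are async, `natural_language` is left out because it needs a judge call, and the result shape (`state`, `model`, `analysisResult`, `report`, `skillTrace`) and the assertion fields `expected` and `allowed` are assumptions.

```javascript
// Simplified synchronous sketch; field names on `r` and `a` are assumed.
const EVALUATORS = new Map([
  ['structural_type', (a, r) => r.state?.structuralTypeKey === a.expected],
  ['has_model', (a, r) => r.model != null],
  ['has_analysis', (a, r) => r.analysisResult != null],
  ['has_report', (a, r) => typeof r.report === 'string' && r.report.length > 0],
  ['skill_match', (a, r) => Array.isArray(a.allowed) && a.allowed.includes(r.skillTrace?.skillId)],
]);

// Look up the evaluator for one assertion; unknown types and thrown errors
// become failed outcomes instead of aborting the whole scenario.
function dispatchAssertion(assertion, result) {
  const evaluate = EVALUATORS.get(assertion.type);
  if (!evaluate) {
    return { type: assertion.type, passed: false, reason: 'unknown assertion type' };
  }
  try {
    return { type: assertion.type, passed: Boolean(evaluate(assertion, result)) };
  } catch (err) {
    return { type: assertion.type, passed: false, reason: String(err.message) };
  }
}

function evaluateAssertions(assertions, result) {
  const outcomes = assertions.map((a) => dispatchAssertion(a, result));
  return { passed: outcomes.every((o) => o.passed), outcomes };
}
```

Keeping one small function per type means a new assertion type only needs a new table entry.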
CRITICAL:
- Replace shell env injection of comment body with github-script for safe parsing (prevents command injection)
- Replace template interpolation in github-script with env vars (prevents script injection)
- Force HTTPS-only in judge.cjs, remove HTTP downgrade that leaked API keys

HIGH:
- Remove push trigger on master (was running full benchmark on every merge)
- Add 100KB response body limit in judge HTTP client
- Fix JSON extraction regex to handle nested braces correctly
- Fix timeout race condition in judge with settled flag
- Exclude /test-llm-benchmark from /test-llm trigger

MEDIUM:
- Add try/catch around dispatchAssertion to preserve partial metrics
- Fix evalHasAnalysis false positive on any non-empty object
- Fix evalSkillMatch vacuous assertion (empty allowed = match any)
- Clamp maxRetries >= 0, add null guard after retry loop
- Skip retry on execution errors (only retry LLM variance)
- Remove unused `skills` from upgradeExpect return
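The retry items in the MEDIUM group can be pictured roughly as follows. This is a sketch under assumptions: `clampMaxRetries`, `runWithRetries`, and the `passed` / `executionError` fields on the attempt result are hypothetical names, not the actual `runner.cjs` code.

```javascript
// Hypothetical sketch of the retry rules described above.
// Coerce maxRetries to a non-negative integer; anything invalid becomes 0.
function clampMaxRetries(value) {
  const n = Number.parseInt(value, 10);
  return Number.isFinite(n) && n > 0 ? n : 0;
}

// Run a scenario once, then retry up to maxRetries more times, but only
// when the failure is an assertion failure (LLM variance), not an
// execution error. `runOnce` is an assumed async callback.
async function runWithRetries(scenario, runOnce) {
  const maxRetries = clampMaxRetries(scenario.maxRetries);
  let last = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    last = await runOnce(scenario, attempt);
    if (last.passed || last.executionError) break;
  }
  if (last === null) throw new Error('scenario produced no result'); // null guard
  return last;
}
```

Because the loop always runs at least once, the null guard is purely defensive.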
…lm-integration

- Add validateSkillRouting to backend-validations.js (reuses discovery/fixtures modules, runs detectStructuralType assertions)
- Register "Skill routing regression" as a backend-regression step
- Remove llm-integration runner and dead code (extraction/pipeline/clarification executors, context, selection, trace, retry, reporting, real-llm-client, server, summarizer)
- Remove llm-integration.yml CI workflow
- Clean fixture files to routing-only scenarios (beam, double-span-beam, frame, portal-frame, truss)
- Replace resolveIntegrationContext with resolveRegressionContext in llm-benchmark runner (removes dependency on deleted context.js)
- Remove llm-integration and llm-summary commands from runner.mjs
- Update AGENTS.md test documentation

Net: -2348 lines. Routing tests now run as part of every backend-regression without needing LLM_API_KEY.
- Fail fast with clear error when judge API key is missing
- Handle LLM_BASE_URL that already includes /v1 (avoid double /v1/v1)
- Align upload-artifact to v7 (consistent with other workflows)
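One way to avoid the doubled `/v1/v1` is to append a version segment only when the base URL does not already end in one. The sketch below assumes an OpenAI-compatible `/chat/completions` path and uses a hypothetical `buildJudgeUrl` helper. Neither is confirmed by the PR text.

```javascript
// Hypothetical helper; the /chat/completions suffix is an assumption
// (OpenAI-compatible endpoint), not taken from judge.cjs.
function buildJudgeUrl(baseUrl) {
  const url = new URL(baseUrl);
  if (url.protocol !== 'https:') {
    throw new Error('judge endpoint must use HTTPS');
  }
  const path = url.pathname.replace(/\/+$/, ''); // drop trailing slashes
  const versioned = /\/v\d+$/.test(path); // already ends in /v1, /v4, ...
  url.pathname = `${versioned ? path : `${path}/v1`}/chat/completions`;
  return url.toString();
}
```

Parsing with `URL` instead of concatenating strings also rejects non-HTTPS base URLs before any request is made.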
cf9681b to bf3e26b
…lysis results

Analysis engine returns results nested under analysisResult.data with dict-typed displacement/reaction fields instead of arrays. Check both top-level and data sub-object, and accept non-empty objects in addition to arrays.
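A minimal sketch of that check, assuming the result keys are named `displacements` and `reactions` (the actual key list in `evaluate.cjs` is not shown in this PR text) and using a hypothetical `hasAnalysis` helper:

```javascript
// Assumed key names; the real list in evaluate.cjs may differ.
const ANALYSIS_RESULT_KEYS = ['displacements', 'reactions'];

// True for non-empty arrays and for non-empty plain objects (dict-typed fields).
function isNonEmpty(value) {
  if (Array.isArray(value)) return value.length > 0;
  return value !== null && typeof value === 'object' && Object.keys(value).length > 0;
}

// Look for a known result key at the top level or under `.data`.
function hasAnalysis(analysisResult) {
  if (analysisResult === null || typeof analysisResult !== 'object') return false;
  const candidates = [analysisResult, analysisResult.data];
  return candidates.some(
    (c) => c !== null && typeof c === 'object' && ANALYSIS_RESULT_KEYS.some((k) => isNonEmpty(c[k]))
  );
}
```

Requiring a known key, instead of accepting any non-empty object, keeps an object such as `{ status: 'ok' }` from counting as an analysis result.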
…I workflow consistency
- Add turns-based scenario format for multi-turn conversations with
backward-compatible v1 (single message) auto-wrapping
- Add has_interaction_questions assertion to detect agent prompting
- Add normalizeScenario, mergeTurnResults and multi-turn execution loop
with shared conversationId across turns
- Update report to display per-turn assertion results
- Add beam-multi-turn-incomplete scenario as template
- Fix e2e workflow: convert curl+jq to github-script, fix injection
risk by using env: + process.env instead of inline ${{ }}
- Fix benchmark workflow: remove dead push condition, add explicit
LLM_JUDGE_API_KEY env var
- Fix judge URL path regex to handle any versioned base URL (/v4, etc)
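The backward-compatible wrapping of single-message scenarios could look roughly like this. The field names `message` and `assertions`, and the shape returned by `mergeTurnResults`, are assumptions made for illustration.

```javascript
// Hypothetical sketch: wrap a single-message scenario into a one-element
// `turns` array so the runner only has to handle the turns-based format.
function normalizeScenario(scenario) {
  if (Array.isArray(scenario.turns) && scenario.turns.length > 0) return scenario;
  const { message, assertions = [], ...rest } = scenario;
  return { ...rest, turns: [{ message, assertions }] };
}

// Assumed merge rule: a scenario passes only if every turn passes.
function mergeTurnResults(turnResults) {
  return { passed: turnResults.every((t) => t.passed), turns: turnResults };
}
```

Normalizing up front keeps the execution loop free of format checks, and per-turn results stay available for the report.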
…ti-turn benchmark

- Fix has_interaction_questions to detect ask_user_clarification tool (not handler methods build_questions/compute_missing)
- Fix extractSkillTrace to scan from end for latest routing decision
- Fix LOG_LEVEL restore to avoid setting "undefined" string
- Show failed toolCalls in multi-turn report output
- Extract ANALYSIS_RESULT_KEYS as module-level constant
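Scanning from the end matters in multi-turn runs, where an earlier turn may have routed to a different skill. A sketch of the idea follows. The message shape (`role: 'tool'`, `name`, JSON `content`) is an assumption and may not match what the agent actually emits.

```javascript
// Sketch only; the message shape here is assumed, not taken from skill-trace.cjs.
// Walk the messages backwards and return the most recent routing decision.
function extractSkillTrace(messages) {
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg?.role !== 'tool' || msg.name !== 'detect_structure_type') continue;
    try {
      const payload = typeof msg.content === 'string' ? JSON.parse(msg.content) : msg.content;
      if (payload && (payload.skillId || payload.structureType)) {
        return {
          skillId: payload.skillId ?? null,
          structureType: payload.structureType ?? null,
        };
      }
    } catch {
      // Malformed tool output; keep scanning earlier messages.
    }
  }
  return null;
}
```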
Octokit request() expects a method+route string, not a bare URL. Switch to the typed REST client for reliability in both e2e and benchmark workflows.
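For reference, the typed-client form of a PR comment call inside `actions/github-script` looks roughly like this. The wrapper name `postBenchmarkComment` is hypothetical, and the PR text does not say which endpoint the workflows actually call.

```javascript
// Hypothetical wrapper for use inside actions/github-script, where
// `github` (Octokit client) and `context` are provided by the action.
function postBenchmarkComment(github, context, body) {
  // Typed REST client: no hand-written route string needed.
  return github.rest.issues.createComment({
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: context.issue.number,
    body,
  });
  // The request() equivalent needs a "METHOD /route" string, for example:
  // github.request('POST /repos/{owner}/{repo}/issues/{issue_number}/comments', { ... })
}
```

The typed client validates parameter names against the route, which is the reliability gain the commit message refers to.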
Summary

Implements #231: LLM-as-Judge benchmark evaluation framework, covering v2 assertions, skill-hit tracing, the LLM judge evaluator, the scenario schema upgrade, and CI migration from `llm-integration` to `llm-benchmark`.

What Changed

New files

| File | Description |
|---|---|
| `tests/llm-benchmark/lib/skill-trace.cjs` | Extracts `skillId` and `structureType` from `messages` by parsing `detect_structure_type` tool call responses |
| `tests/llm-benchmark/lib/judge.cjs` | LLM judge for `natural_language` assertions (`temperature=0`, `timeout=30s`, `max_tokens=500`, model via `LLM_JUDGE_MODEL`) |
| `.github/workflows/llm-benchmark.yml` | Uploads the `benchmark-results.json` artifact; triggered by a `/test-llm-benchmark` comment |

Refactored

| File | Description |
|---|---|
| `tests/llm-benchmark/lib/evaluate.cjs` | Dispatches on `type` to typed evaluators (6 v2 types); v1 scenarios are auto-upgraded |
| `tests/llm-benchmark/runner.cjs` | `maxRetries`: retries failed scenarios up to N times, uses final result |

Schema v1 → v2 upgrade

All 5 scenario files updated from flat v1 fields to the new v2 structure with `category`, `tags`, `maxRetries`, `skills`, and a typed `assertions` array. Backward compatible: v1 format still works.

CI Migration

- New `llm-benchmark.yml`: full benchmark CI with artifact upload
- Updated `llm-integration.yml`: routing-only (deterministic, no LLM key needed)

V2 Assertion Types

- `structural_type`: checks `state.structuralTypeKey`
- `has_model`
- `has_analysis`
- `has_report`
- `skill_match`: parses the `detect_structure_type` tool call to verify routing
- `natural_language`: evaluated by the LLM judge

Testing

Done When ✅

- `llm-benchmark` supports all v2 assertion types including `natural_language`
- CI migrated from `llm-integration` to `llm-benchmark`

Closes structureclaw#231