feat(#231): LLM benchmark v2 — typed assertions, skill-trace, LLM-as-Judge, CI migration by ZhouChaunge · Pull Request #1 · ZhouChaunge/structureclaw

ZhouChaunge · 2026-05-07T16:07:24Z

Summary

Implements #231: LLM-as-Judge benchmark evaluation framework — v2 assertions, skill-hit tracing, LLM judge evaluator, scenario schema upgrade, and CI migration from llm-integration to llm-benchmark.

What Changed

New files

File	Purpose
`tests/llm-benchmark/lib/skill-trace.cjs`	Extracts skill match results (`skillId` and `structureType`) from Agent `messages` by parsing `detect_structure_type` tool call responses
`tests/llm-benchmark/lib/judge.cjs`	LLM-as-Judge evaluator for `natural_language` assertions with fixed parameters (`temperature=0`, `timeout=30s`, `max_tokens=500`); model set via `LLM_JUDGE_MODEL`
`.github/workflows/llm-benchmark.yml`	CI workflow for the benchmark suite; uploads the `benchmark-results.json` artifact; triggered by a `/test-llm-benchmark` comment

Refactored

File	Change
`tests/llm-benchmark/lib/evaluate.cjs`	Converted to async; dispatches each assertion by `type` to a typed evaluator (all 6 v2 assertion types); v1 scenarios are auto-upgraded
`tests/llm-benchmark/runner.cjs`	Supports `maxRetries`: retries a failed scenario up to N times and reports the final result
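
To make the `maxRetries` behavior concrete, here is a minimal sketch of a retry loop. It is not the code in `runner.cjs`: the function name `runWithRetries`, the `runOnce` callback, and the `passed` / `attempts` result fields are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a per-scenario retry loop (not the actual runner.cjs code).
// `runOnce` is an assumed async callback that resolves to { passed: boolean, ... }.
async function runWithRetries(scenario, runOnce) {
  // A missing, non-integer, or negative maxRetries is treated as 0 (one attempt).
  const requested = Number.isInteger(scenario.maxRetries) ? scenario.maxRetries : 0;
  const maxRetries = Math.max(0, requested);

  let result = null;
  let attempts = 0;
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    attempts += 1;
    result = await runOnce(scenario, attempt);
    if (result && result.passed) break; // stop as soon as one attempt passes
  }
  // Whatever the last attempt produced is the result that gets reported.
  return { ...result, attempts };
}
```

With `maxRetries: 2` a scenario gets at most three attempts, and the loop exits early on the first pass, so a flaky LLM response does not fail the scenario outright.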

Schema v1 → v2 upgrade

All 5 scenario files updated from flat v1 fields to the new v2 structure with category, tags, maxRetries, skills, and typed assertions array. Backward compatible — v1 format still works.
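
As a rough illustration of what backward compatibility could look like, the sketch below upgrades a flat v1 scenario into the v2 shape. The v1 field names under `expect` (`structuralType`, `hasModel`, `hasAnalysis`, `hasReport`), the `expected` key on assertions, the default values, and the function name `upgradeScenario` are assumptions; the PR does not show the actual v1 layout.

```javascript
// Hypothetical v1 -> v2 upgrade. All v1 field names under `expect` are assumed.
function upgradeScenario(scenario) {
  // A scenario that already carries a typed assertions array is treated as v2.
  if (Array.isArray(scenario.assertions)) return scenario;

  const expect = scenario.expect || {};
  const assertions = [];
  if (typeof expect.structuralType === 'string') {
    assertions.push({ type: 'structural_type', expected: expect.structuralType });
  }
  if (expect.hasModel) assertions.push({ type: 'has_model' });
  if (expect.hasAnalysis) assertions.push({ type: 'has_analysis' });
  if (expect.hasReport) assertions.push({ type: 'has_report' });

  // Drop the flat v1 block and fill in v2 defaults (the defaults are assumptions too).
  const { expect: _v1Expect, ...rest } = scenario;
  return { category: 'uncategorized', tags: [], maxRetries: 0, ...rest, assertions };
}
```

Upgrading at load time means the evaluators only ever see one shape, which keeps the type dispatch free of v1 special cases.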

CI Migration

Added llm-benchmark.yml: full benchmark CI with artifact upload
Updated llm-integration.yml: routing-only (deterministic, no LLM key needed in CI)

V2 Assertion Types

Type	Description
`structural_type`	Checks `state.structuralTypeKey`
`has_model`	Verifies node/element counts
`has_analysis`	Checks analysis result presence
`has_report`	Checks report markdown length > 100 chars
`skill_match`	Parses `detect_structure_type` tool call to verify routing
`natural_language`	LLM-as-Judge evaluates free-form criterion
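
The type dispatch can be pictured as a lookup table of evaluators keyed by assertion `type`. The sketch below covers three of the six types and is only an illustration: apart from `state.structuralTypeKey`, the state field names (`model.nodes`, `model.elements`, `reportMarkdown`), the assertion keys (`expected`, `minNodes`, `minElements`), and the function name are assumptions.

```javascript
// Hypothetical evaluator table; only `structuralTypeKey` is named in the PR.
const evaluators = {
  structural_type: (assertion, state) => state.structuralTypeKey === assertion.expected,
  has_model: (assertion, state) => {
    const model = state.model || {};
    const nodes = Array.isArray(model.nodes) ? model.nodes.length : 0;
    const elements = Array.isArray(model.elements) ? model.elements.length : 0;
    return nodes >= (assertion.minNodes || 1) && elements >= (assertion.minElements || 1);
  },
  has_report: (assertion, state) =>
    typeof state.reportMarkdown === 'string' && state.reportMarkdown.length > 100,
};

// Async so that an evaluator such as natural_language could await an LLM call.
async function evaluateAssertions(assertions, state) {
  const results = [];
  for (const assertion of assertions) {
    const evaluator = evaluators[assertion.type];
    if (!evaluator) {
      results.push({ type: assertion.type, passed: false, reason: 'unknown assertion type' });
      continue;
    }
    try {
      results.push({ type: assertion.type, passed: Boolean(await evaluator(assertion, state)) });
    } catch (err) {
      // One throwing evaluator should not discard the results gathered so far.
      results.push({ type: assertion.type, passed: false, reason: String(err.message) });
    }
  }
  return results;
}
```

Keeping every evaluator behind the same `(assertion, state)` signature lets synchronous checks and the asynchronous judge share one loop.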

Testing

# Module load check
node -e "require('./tests/llm-benchmark/lib/skill-trace.cjs'); require('./tests/llm-benchmark/lib/judge.cjs'); require('./tests/llm-benchmark/lib/evaluate.cjs'); console.log('OK')"
# → All modules load OK

# Run benchmark (requires LLM_API_KEY)
node tests/runner.mjs llm-benchmark
node tests/runner.mjs llm-benchmark --scenario beam-static-6m

# Run routing tests (no LLM key needed)
node tests/runner.mjs llm-integration

Done When ✅

llm-benchmark supports all v2 assertion types including natural_language
Skill-hit tracking verifies skill routing accuracy
v1 scenario format backward compatible
CI migrated from llm-integration to llm-benchmark
AGENTS.md test documentation updated

Closes structureclaw#231

gemini-code-assist

Code Review

This pull request enhances the LLM benchmarking suite and improves Python environment management. Key updates include the introduction of v2 assertions for benchmarks—supporting LLM-as-Judge and skill-hit tracing—and a more robust Python dependency synchronization check that parses requirements files and verifies installed versions. Feedback focuses on improving the JSON extraction logic for nested objects in the judge component, addressing the lack of support for nested requirements files in the parser, and mitigating potential command-line length limitations when running the Python check script on Windows.

Copilot

Pull request overview

Implements the v2 LLM benchmark framework (typed assertions + skill-hit tracing + LLM-as-Judge), upgrades benchmark scenarios to the v2 schema with retries, migrates CI coverage from llm-integration to a dedicated llm-benchmark workflow, and tightens CLI analysis-Python setup by checking pinned requirement versions.

Changes:

Added skill-tracing + LLM-as-Judge modules and refactored benchmark evaluation to async, type-dispatched v2 assertions (with v1 auto-upgrade).
Upgraded benchmark scenarios to v2 schema (category, tags, maxRetries, skills, typed assertions).
Introduced llm-benchmark CI workflow and simplified llm-integration workflow to routing-only; enhanced CLI Python setup to detect requirements drift.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

File	Description
`tests/runner.mjs`	Updates CLI help text for routing-only `llm-integration` and new v2 `llm-benchmark` command details.
`tests/llm-benchmark/scenarios/beam.json`	Migrates beam scenarios to v2 schema, adds retries and `natural_language` assertions.
`tests/llm-benchmark/scenarios/double-span-beam.json`	Migrates scenario to v2 schema and adds skill routing assertions.
`tests/llm-benchmark/scenarios/frame.json`	Migrates frame scenarios to v2 schema, adds retries and `natural_language` assertions.
`tests/llm-benchmark/scenarios/portal-frame.json`	Migrates portal-frame scenarios to v2 schema with routing assertions.
`tests/llm-benchmark/scenarios/truss.json`	Migrates truss scenario to v2 schema with routing assertions.
`tests/llm-benchmark/runner.cjs`	Adds per-scenario retry support and awaits async evaluation.
`tests/llm-benchmark/lib/evaluate.cjs`	Implements v2 assertion dispatch, v1→v2 upgrade, and async `natural_language` evaluation.
`tests/llm-benchmark/lib/skill-trace.cjs`	Extracts `skillId` / `structureType` from `detect_structure_type` tool result messages.
`tests/llm-benchmark/lib/judge.cjs`	Adds HTTP-based LLM judge client + prompt building and JSON response parsing.
`scripts/cli/runtime.js`	Adds requirements parsing + runtime check helpers to detect Python pinned-version drift.
`scripts/cli/main.js`	Uses the new requirements drift checks when determining whether analysis Python env is “ready”.
`backend/tests/analysis-python-setup.test.mjs`	Adds tests covering requirements drift reinstalls + requirements parsing/script generation.
`AGENTS.md`	Documents new benchmark runner commands and clarifies routing-only integration tests.
`.github/workflows/llm-integration.yml`	Renames to routing-only and hardwires CI execution to `routing`.
`.github/workflows/llm-benchmark.yml`	Adds a dedicated benchmark workflow with PR-comment trigger, artifact upload, and PR reporting.

…race, LLM-as-Judge, CI migration

Build the LLM-as-Judge benchmark evaluation framework as specified in structureclaw#231:

- **v2 assertion types**: structural_type | has_model | has_analysis | has_report | skill_match | natural_language
- **Skill-hit tracing** (lib/skill-trace.cjs): parse detect_structure_type tool call from Agent messages to verify skill routing accuracy
- **LLM-as-Judge** (lib/judge.cjs): natural_language assertion evaluator with temperature=0, timeout=30s, max_tokens=500; model via LLM_JUDGE_MODEL env var
- **Refactored evaluate.cjs**: async type-dispatched evaluators; v1 scenarios auto-upgraded to v2 assertions
- **maxRetries support** in runner.cjs: retry failed scenarios up to N times
- **Scenario schema v2**: all 5 scenario files updated with category, tags, maxRetries, skills section, and typed assertions array
- **CI migration**: new llm-benchmark.yml workflow; llm-integration.yml updated to routing-only (extraction/pipeline/clarification removed)
- **Docs**: tests/runner.mjs help text and AGENTS.md updated

- tests/llm-benchmark/lib/skill-trace.cjs (new)
- tests/llm-benchmark/lib/judge.cjs (new)
- tests/llm-benchmark/lib/evaluate.cjs (refactored)
- tests/llm-benchmark/runner.cjs (maxRetries + async evaluateScenario)
- tests/llm-benchmark/scenarios/*.json (v1 -> v2 schema upgrade)
- .github/workflows/llm-benchmark.yml (new CI workflow)
- .github/workflows/llm-integration.yml (routing-only)
- tests/runner.mjs (updated help)
- AGENTS.md (updated test docs)

Closes structureclaw#231

CRITICAL:
- Replace shell env injection of comment body with github-script for safe parsing (prevents command injection)
- Replace template interpolation in github-script with env vars (prevents script injection)
- Force HTTPS-only in judge.cjs, remove HTTP downgrade that leaked API keys

HIGH:
- Remove push trigger on master (was running full benchmark on every merge)
- Add 100KB response body limit in judge HTTP client
- Fix JSON extraction regex to handle nested braces correctly
- Fix timeout race condition in judge with settled flag
- Exclude /test-llm-benchmark from /test-llm trigger

MEDIUM:
- Add try/catch around dispatchAssertion to preserve partial metrics
- Fix evalHasAnalysis false positive on any non-empty object
- Fix evalSkillMatch vacuous assertion (empty allowed = match any)
- Clamp maxRetries >= 0, add null guard after retry loop
- Skip retry on execution errors (only retry LLM variance)
- Remove unused `skills` from upgradeExpect return
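
On the nested-braces item: the commit fixes a regex, but the same goal can also be reached with a small brace-depth scanner, which is the technique sketched here instead. This is an alternative illustration, not the code in `judge.cjs`, and `extractFirstJsonObject` is a hypothetical name.

```javascript
// Return the first balanced top-level {...} span in `text`, parsed as JSON, or null.
// Braces inside JSON string literals are ignored so they cannot unbalance the scan.
function extractFirstJsonObject(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i += 1) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth += 1;
    else if (ch === '}') {
      depth -= 1;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          return null; // balanced braces, but not valid JSON
        }
      }
    }
  }
  return null; // braces never closed
}
```

One limitation of this sketch: it only tries the first `{` in the text, so a judge reply that opens with prose containing a stray brace would yield `null` rather than the later JSON object.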

…lm-integration

- Add validateSkillRouting to backend-validations.js (reuses discovery/fixtures modules, runs detectStructuralType assertions)
- Register "Skill routing regression" as a backend-regression step
- Remove llm-integration runner and dead code (extraction/pipeline/clarification executors, context, selection, trace, retry, reporting, real-llm-client, server, summarizer)
- Remove llm-integration.yml CI workflow
- Clean fixture files to routing-only scenarios (beam, double-span-beam, frame, portal-frame, truss)
- Replace resolveIntegrationContext with resolveRegressionContext in llm-benchmark runner (removes dependency on deleted context.js)
- Remove llm-integration and llm-summary commands from runner.mjs
- Update AGENTS.md test documentation

Net: -2348 lines. Routing tests now run as part of every backend-regression without needing LLM_API_KEY.

- Fail fast with clear error when judge API key is missing
- Handle LLM_BASE_URL that already includes /v1 (avoid double /v1/v1)
- Align upload-artifact to v7 (consistent with other workflows)
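
The double `/v1/v1` problem can be avoided by checking whether the base URL path already ends in a version segment before appending one. The sketch below assumes an OpenAI-compatible `chat/completions` endpoint and a default of `/v1`; both are assumptions, as is the function name `buildJudgeUrl`.

```javascript
// Hypothetical URL builder. The `chat/completions` path and the `/v1` default
// are assumptions about the judge endpoint, not taken from judge.cjs.
function buildJudgeUrl(baseUrl) {
  const url = new URL(baseUrl);
  // Drop trailing slashes so "/v1/" and "/v1" are treated the same way.
  const path = url.pathname.replace(/\/+$/, '');
  // Any trailing "/v<digits>" segment counts as an existing API version.
  const hasVersion = /\/v\d+$/.test(path);
  const versionedPath = hasVersion ? path : `${path}/v1`;
  url.pathname = `${versionedPath}/chat/completions`;
  return url.toString();
}
```

Matching `/v\d+` rather than the literal `/v1` keeps the helper working for providers that expose a different version segment.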

…lysis results

Analysis engine returns results nested under analysisResult.data with dict-typed displacement/reaction fields instead of arrays. Check both top-level and data sub-object, and accept non-empty objects in addition to arrays.
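
A check along these lines might look as follows. The key names `displacements` and `reactions` are guesses based on the wording above, and `hasAnalysisResult` is a hypothetical name, so treat this as a sketch of the idea rather than the actual evaluator.

```javascript
// Assumed result keys; the commit only mentions displacement/reaction fields.
const RESULT_KEYS = ['displacements', 'reactions'];

// True for a non-empty array or a non-empty plain object.
function isNonEmpty(value) {
  if (Array.isArray(value)) return value.length > 0;
  return value !== null && typeof value === 'object' && Object.keys(value).length > 0;
}

// Look for result fields at the top level and under `.data`.
function hasAnalysisResult(analysisResult) {
  if (analysisResult === null || typeof analysisResult !== 'object') return false;
  const candidates = [analysisResult, analysisResult.data];
  return candidates.some(
    (candidate) =>
      candidate !== null &&
      typeof candidate === 'object' &&
      RESULT_KEYS.some((key) => isNonEmpty(candidate[key]))
  );
}
```

Requiring a known result key, rather than any non-empty object, is what keeps a payload such as `{ status: 'ok' }` from counting as an analysis result.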

…I workflow consistency

- Add turns-based scenario format for multi-turn conversations with backward-compatible v1 (single message) auto-wrapping
- Add has_interaction_questions assertion to detect agent prompting
- Add normalizeScenario, mergeTurnResults and multi-turn execution loop with shared conversationId across turns
- Update report to display per-turn assertion results
- Add beam-multi-turn-incomplete scenario as template
- Fix e2e workflow: convert curl+jq to github-script, fix injection risk by using env: + process.env instead of inline ${{ }}
- Fix benchmark workflow: remove dead push condition, add explicit LLM_JUDGE_API_KEY env var
- Fix judge URL path regex to handle any versioned base URL (/v4, etc)
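
The auto-wrapping of single-message scenarios into the turns-based format could be sketched like this. The function names `normalizeScenario` and `mergeTurnResults` come from the commit message, but the bodies, the field names (`message`, `turns`, `assertions`, `passed`), and the all-turns-must-pass rule are assumptions.

```javascript
// Hypothetical normalization: wrap a single-message scenario into one turn.
function normalizeScenario(scenario) {
  // Already turns-based: nothing to do.
  if (Array.isArray(scenario.turns) && scenario.turns.length > 0) return scenario;
  const { message, assertions = [], ...rest } = scenario;
  return { ...rest, turns: [{ message, assertions }] };
}

// Hypothetical merge rule: the scenario passes only if every turn passed.
function mergeTurnResults(turnResults) {
  return {
    passed: turnResults.length > 0 && turnResults.every((turn) => turn.passed),
    turns: turnResults,
  };
}
```

After normalization the execution loop can always iterate over `turns`, so single-turn and multi-turn scenarios go through the same code path.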

…ti-turn benchmark

- Fix has_interaction_questions to detect ask_user_clarification tool (not handler methods build_questions/compute_missing)
- Fix extractSkillTrace to scan from end for latest routing decision
- Fix LOG_LEVEL restore to avoid setting "undefined" string
- Show failed toolCalls in multi-turn report output
- Extract ANALYSIS_RESULT_KEYS as module-level constant
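
Scanning from the end matters in multi-turn runs, where an earlier `detect_structure_type` call may have been superseded by a later one. The sketch below assumes tool results are messages shaped like `{ role: 'tool', name, content }` with a JSON payload in `content`; that shape is an assumption, while `extractSkillTrace`, `skillId`, and `structureType` are names taken from this PR.

```javascript
// Hypothetical message shape: { role: 'tool', name: string, content: string | object }.
function extractSkillTrace(messages) {
  // Walk backwards so the most recent routing decision wins.
  for (let i = messages.length - 1; i >= 0; i -= 1) {
    const msg = messages[i];
    if (!msg || msg.role !== 'tool' || msg.name !== 'detect_structure_type') continue;
    try {
      const payload = typeof msg.content === 'string' ? JSON.parse(msg.content) : msg.content;
      if (payload && typeof payload === 'object') {
        return {
          skillId: payload.skillId ?? null,
          structureType: payload.structureType ?? null,
        };
      }
    } catch {
      // Malformed payload: keep scanning earlier messages.
    }
  }
  return null; // no routing decision found
}
```

Returning `null` when nothing is found lets a `skill_match` evaluator distinguish "no routing happened" from "routed to the wrong skill".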

Octokit request() expects a method+route string, not a bare URL. Switch to the typed REST client for reliability in both e2e and benchmark workflows.

Copilot AI review requested due to automatic review settings May 7, 2026 16:07

Copilot started reviewing on behalf of ZhouChaunge May 7, 2026 16:08

gemini-code-assist Bot reviewed May 7, 2026

Comment thread tests/llm-benchmark/lib/judge.cjs Outdated

Comment thread scripts/cli/runtime.js

Comment thread scripts/cli/runtime.js

Copilot AI reviewed May 7, 2026

Comment thread tests/llm-benchmark/lib/judge.cjs Outdated

Comment thread .github/workflows/llm-integration.yml Outdated

Comment thread .github/workflows/llm-benchmark.yml Outdated

ZhouChaunge and others added 5 commits May 10, 2026 22:43

fix(structureclaw#251): address Copilot/Gemini review feedback

062b93d

chore: trigger PR merge status recheck

bf3e26b

guyi2000 force-pushed the feat/231-llm-benchmark-v2 branch from cf9681b to bf3e26b May 10, 2026 14:45

guyi2000 added 4 commits May 10, 2026 23:18

fix(ci): use github.rest.pulls.get instead of github.request(prUrl)

9f347d2

ZhouChaunge wants to merge 9 commits into master from feat/231-llm-benchmark-v2