feat(#231): LLM benchmark v2 — typed assertions, skill-trace, LLM-as-Judge, CI migration #1

Open
ZhouChaunge wants to merge 9 commits into master from feat/231-llm-benchmark-v2

@ZhouChaunge

Summary

Implements #231: LLM-as-Judge benchmark evaluation framework — v2 assertions, skill-hit tracing, LLM judge evaluator, scenario schema upgrade, and CI migration from llm-integration to llm-benchmark.


What Changed

New files

| File | Purpose |
| --- | --- |
| tests/llm-benchmark/lib/skill-trace.cjs | Extracts skill match results (skillId, structureType) from Agent messages by parsing detect_structure_type tool call responses |
| tests/llm-benchmark/lib/judge.cjs | LLM-as-Judge evaluator for natural_language assertions with fixed parameters (temperature=0, timeout=30s, max_tokens=500); model configured via LLM_JUDGE_MODEL |
| .github/workflows/llm-benchmark.yml | CI workflow for the benchmark suite; uploads the benchmark-results.json artifact; triggered by a /test-llm-benchmark comment |
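
The skill-trace idea can be sketched as follows. This is a minimal illustration, not the code in skill-trace.cjs: the message shape (`role`, `name`, JSON-string `content`) is an assumption, since the PR does not show the Agent message format. It scans from the end of the message list so that the latest routing decision wins.

```javascript
// Hypothetical message shape: { role: 'tool', name: string, content: JSON string }.
function extractSkillTrace(messages) {
  // Scan from the end so the most recent routing decision wins.
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (!msg || msg.role !== 'tool' || msg.name !== 'detect_structure_type') continue;
    try {
      const payload = typeof msg.content === 'string' ? JSON.parse(msg.content) : msg.content;
      if (payload && typeof payload === 'object') {
        return {
          skillId: payload.skillId ?? null,
          structureType: payload.structureType ?? null,
        };
      }
    } catch {
      // Ignore malformed tool output and keep scanning earlier messages.
    }
  }
  return null; // no routing call found
}
```

Returning `null` rather than an empty object lets a skill_match evaluator distinguish "no routing call happened" from "routing happened but picked another skill".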

Refactored

| File | Change |
| --- | --- |
| tests/llm-benchmark/lib/evaluate.cjs | Converted to async; dispatches each assertion by type to a typed evaluator (all 6 v2 assertion types); v1 scenarios are auto-upgraded |
| tests/llm-benchmark/runner.cjs | Supports maxRetries: retries a failed scenario up to N times and uses the final result |
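
A rough sketch of the maxRetries behaviour, assuming a hypothetical `runOnce` callback that resolves to an object with a boolean `passed` field (the real runner's result shape is not shown in this PR). The loop stops at the first passing attempt and otherwise returns the last attempt's result.

```javascript
// runOnce is a hypothetical async function resolving to { passed: boolean, ... }.
async function runWithRetries(runOnce, maxRetries) {
  // Treat missing, fractional, or negative values as "no retries".
  const retries = Math.max(0, Number.isInteger(maxRetries) ? maxRetries : 0);
  let result = null;
  for (let attempt = 0; attempt <= retries; attempt++) {
    result = await runOnce(attempt);
    if (result.passed) break; // stop at the first passing attempt
  }
  return result; // the final attempt's result, pass or fail
}
```

With `maxRetries: 2` a scenario runs at most three times: one initial attempt plus two retries.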

Schema v1 → v2 upgrade

All 5 scenario files were updated from flat v1 fields to the new v2 structure with category, tags, maxRetries, skills, and a typed assertions array. The change is backward compatible: the v1 format still works.
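
One way the v1 auto-upgrade could look is sketched below. The v1 field names used here (`expect.structuralTypeKey`, `minNodes`, `hasAnalysis`, `hasReport`) and the `normalizeToV2` helper are hypothetical, because the actual v1 schema is not shown in this PR; only the v2 assertion type names come from the description.

```javascript
// Hypothetical v1 field names; the real v1 schema is not shown in this PR.
function upgradeExpect(expect = {}) {
  const assertions = [];
  if (expect.structuralTypeKey) {
    assertions.push({ type: 'structural_type', expected: expect.structuralTypeKey });
  }
  if (expect.minNodes != null || expect.minElements != null) {
    assertions.push({
      type: 'has_model',
      minNodes: expect.minNodes ?? 0,
      minElements: expect.minElements ?? 0,
    });
  }
  if (expect.hasAnalysis) assertions.push({ type: 'has_analysis' });
  if (expect.hasReport) assertions.push({ type: 'has_report' });
  return assertions;
}

function normalizeToV2(scenario) {
  // v2 scenarios already carry a typed assertions array; pass them through untouched.
  if (Array.isArray(scenario.assertions)) return scenario;
  const { expect, ...rest } = scenario;
  return { ...rest, assertions: upgradeExpect(expect) };
}
```

Upgrading at load time means the evaluators only ever see one shape, which keeps the per-type evaluation code free of version checks.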

CI Migration

  • Added llm-benchmark.yml: full benchmark CI with artifact upload
  • Updated llm-integration.yml: routing-only (deterministic, no LLM key needed in CI)

V2 Assertion Types

| Type | Description |
| --- | --- |
| structural_type | Checks state.structuralTypeKey |
| has_model | Verifies node/element counts |
| has_analysis | Checks analysis result presence |
| has_report | Checks report markdown length > 100 chars |
| skill_match | Parses the detect_structure_type tool call to verify routing |
| natural_language | LLM-as-Judge evaluates a free-form criterion |
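
Type dispatch over these assertion types might be sketched as below. Only three of the six types are shown, and the state and assertion field names other than `state.structuralTypeKey` (for example `state.model.nodes`, `state.report`, `assertion.expected`) are assumptions for illustration.

```javascript
// Evaluator table keyed by assertion type (three of the six types shown).
const evaluators = {
  structural_type: (a, state) => state.structuralTypeKey === a.expected,
  has_model: (a, state) =>
    (state.model?.nodes?.length ?? 0) >= (a.minNodes ?? 1) &&
    (state.model?.elements?.length ?? 0) >= (a.minElements ?? 1),
  has_report: (a, state) => typeof state.report === 'string' && state.report.length > 100,
};

async function dispatchAssertion(assertion, state) {
  const known = Object.prototype.hasOwnProperty.call(evaluators, assertion.type);
  if (!known) {
    return { type: assertion.type, passed: false, reason: 'unknown assertion type' };
  }
  try {
    // await allows async evaluators such as an LLM judge call.
    const passed = Boolean(await evaluators[assertion.type](assertion, state));
    return { type: assertion.type, passed };
  } catch (err) {
    // A throwing evaluator fails its own assertion without aborting the rest.
    return { type: assertion.type, passed: false, reason: String(err.message || err) };
  }
}
```

Keeping the dispatcher async even for synchronous checks lets natural_language share the same code path as the purely structural assertions.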

Testing

# Module load check
node -e "require('./tests/llm-benchmark/lib/skill-trace.cjs'); require('./tests/llm-benchmark/lib/judge.cjs'); require('./tests/llm-benchmark/lib/evaluate.cjs'); console.log('OK')"
# → prints OK when all three modules load

# Run benchmark (requires LLM_API_KEY)
node tests/runner.mjs llm-benchmark
node tests/runner.mjs llm-benchmark --scenario beam-static-6m

# Run routing tests (no LLM key needed)
node tests/runner.mjs llm-integration

Done When ✅

  • llm-benchmark supports all v2 assertion types including natural_language
  • Skill-hit tracking verifies skill routing accuracy
  • v1 scenario format backward compatible
  • CI migrated from llm-integration to llm-benchmark
  • AGENTS.md test documentation updated

Closes structureclaw#231

Copilot AI review requested due to automatic review settings May 7, 2026 16:07

@gemini-code-assist (bot) left a comment


Code Review

This pull request enhances the LLM benchmarking suite and improves Python environment management. Key updates include the introduction of v2 assertions for benchmarks—supporting LLM-as-Judge and skill-hit tracing—and a more robust Python dependency synchronization check that parses requirements files and verifies installed versions. Feedback focuses on improving the JSON extraction logic for nested objects in the judge component, addressing the lack of support for nested requirements files in the parser, and mitigating potential command-line length limitations when running the Python check script on Windows.
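
To make the nested-requirements point concrete, here is a small sketch of a pinned-requirements parser that records `-r` includes instead of silently dropping them. `parsePinnedRequirements` is a hypothetical helper written for this illustration, not the parser in scripts/cli/runtime.js, and it deliberately ignores unpinned specifiers such as `>=`.

```javascript
// Hypothetical helper; not the parser in scripts/cli/runtime.js.
function parsePinnedRequirements(text) {
  const pins = {};
  const includes = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/\s+#.*$/, '').trim(); // strip trailing comments
    if (!line || line.startsWith('#')) continue;
    const include = line.match(/^(?:-r|--requirement)\s+(.+)$/);
    if (include) {
      includes.push(include[1]); // nested file: the caller decides whether to recurse
      continue;
    }
    const pin = line.match(/^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?\s*==\s*([^\s;]+)/);
    if (pin) {
      // Normalize names the way PEP 503 does: lowercase, runs of -_. become "-".
      pins[pin[1].toLowerCase().replace(/[-_.]+/g, '-')] = pin[2];
    }
  }
  return { pins, includes };
}
```

Returning the includes separately keeps the parser free of file I/O; a caller can resolve each path relative to the including file and merge the results.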

Comment thread tests/llm-benchmark/lib/judge.cjs Outdated
Comment thread scripts/cli/runtime.js
Comment thread scripts/cli/runtime.js

Copilot AI left a comment


Pull request overview

Implements the v2 LLM benchmark framework (typed assertions + skill-hit tracing + LLM-as-Judge), upgrades benchmark scenarios to the v2 schema with retries, migrates CI coverage from llm-integration to a dedicated llm-benchmark workflow, and tightens CLI analysis-Python setup by checking pinned requirement versions.

Changes:

  • Added skill-tracing + LLM-as-Judge modules and refactored benchmark evaluation to async, type-dispatched v2 assertions (with v1 auto-upgrade).
  • Upgraded benchmark scenarios to v2 schema (category, tags, maxRetries, skills, typed assertions).
  • Introduced llm-benchmark CI workflow and simplified llm-integration workflow to routing-only; enhanced CLI Python setup to detect requirements drift.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/runner.mjs | Updates CLI help text for routing-only llm-integration and new v2 llm-benchmark command details. |
| tests/llm-benchmark/scenarios/beam.json | Migrates beam scenarios to v2 schema, adds retries and natural_language assertions. |
| tests/llm-benchmark/scenarios/double-span-beam.json | Migrates scenario to v2 schema and adds skill routing assertions. |
| tests/llm-benchmark/scenarios/frame.json | Migrates frame scenarios to v2 schema, adds retries and natural_language assertions. |
| tests/llm-benchmark/scenarios/portal-frame.json | Migrates portal-frame scenarios to v2 schema with routing assertions. |
| tests/llm-benchmark/scenarios/truss.json | Migrates truss scenario to v2 schema with routing assertions. |
| tests/llm-benchmark/runner.cjs | Adds per-scenario retry support and awaits async evaluation. |
| tests/llm-benchmark/lib/evaluate.cjs | Implements v2 assertion dispatch, v1→v2 upgrade, and async natural_language evaluation. |
| tests/llm-benchmark/lib/skill-trace.cjs | Extracts skillId / structureType from detect_structure_type tool result messages. |
| tests/llm-benchmark/lib/judge.cjs | Adds HTTP-based LLM judge client + prompt building and JSON response parsing. |
| scripts/cli/runtime.js | Adds requirements parsing + runtime check helpers to detect Python pinned-version drift. |
| scripts/cli/main.js | Uses the new requirements drift checks when determining whether analysis Python env is “ready”. |
| backend/tests/analysis-python-setup.test.mjs | Adds tests covering requirements drift reinstalls + requirements parsing/script generation. |
| AGENTS.md | Documents new benchmark runner commands and clarifies routing-only integration tests. |
| .github/workflows/llm-integration.yml | Renames to routing-only and hardwires CI execution to routing. |
| .github/workflows/llm-benchmark.yml | Adds a dedicated benchmark workflow with PR-comment trigger, artifact upload, and PR reporting. |


Comment thread tests/llm-benchmark/lib/judge.cjs Outdated
Comment thread .github/workflows/llm-integration.yml Outdated
Comment thread .github/workflows/llm-benchmark.yml Outdated
ZhouChaunge and others added 5 commits May 10, 2026 22:43
…race, LLM-as-Judge, CI migration

Build the LLM-as-Judge benchmark evaluation framework as specified in structureclaw#231:

- **v2 assertion types**: structural_type | has_model | has_analysis | has_report |
  skill_match | natural_language
- **Skill-hit tracing** (lib/skill-trace.cjs): parse detect_structure_type tool
  call from Agent messages to verify skill routing accuracy
- **LLM-as-Judge** (lib/judge.cjs): natural_language assertion evaluator with
  temperature=0, timeout=30s, max_tokens=500; model via LLM_JUDGE_MODEL env var
- **Refactored evaluate.cjs**: async type-dispatched evaluators; v1 scenarios
  auto-upgraded to v2 assertions
- **maxRetries support** in runner.cjs: retry failed scenarios up to N times
- **Scenario schema v2**: all 5 scenario files updated with category, tags,
  maxRetries, skills section, and typed assertions array
- **CI migration**: new llm-benchmark.yml workflow; llm-integration.yml
  updated to routing-only (extraction/pipeline/clarification removed)
- **Docs**: tests/runner.mjs help text and AGENTS.md updated
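
The fixed judge parameters listed above could be assembled roughly as follows. This sketch assumes an OpenAI-style chat completions request body; the prompt wording, the `{ pass, reason }` reply contract, and the choice to throw when LLM_JUDGE_MODEL is unset are all assumptions rather than the behaviour of judge.cjs.

```javascript
const JUDGE_TIMEOUT_MS = 30_000; // the 30s timeout named in the commit message

// Builds a chat-completions-style request body for the judge.
// Prompt text and reply contract are hypothetical.
function buildJudgeRequest(criterion, agentOutput, env = process.env) {
  const model = env.LLM_JUDGE_MODEL;
  if (!model) throw new Error('LLM_JUDGE_MODEL is not set');
  return {
    model,
    temperature: 0, // as deterministic as the provider allows
    max_tokens: 500,
    messages: [
      {
        role: 'system',
        content: 'You are a strict evaluator. Reply only with JSON: {"pass": boolean, "reason": string}.',
      },
      { role: 'user', content: `Criterion:\n${criterion}\n\nAgent output:\n${agentOutput}` },
    ],
  };
}
```

Passing `env` as a parameter keeps the builder testable without mutating `process.env`.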

- tests/llm-benchmark/lib/skill-trace.cjs  (new)
- tests/llm-benchmark/lib/judge.cjs         (new)
- tests/llm-benchmark/lib/evaluate.cjs      (refactored)
- tests/llm-benchmark/runner.cjs            (maxRetries + async evaluateScenario)
- tests/llm-benchmark/scenarios/*.json      (v1 -> v2 schema upgrade)
- .github/workflows/llm-benchmark.yml       (new CI workflow)
- .github/workflows/llm-integration.yml    (routing-only)
- tests/runner.mjs                          (updated help)
- AGENTS.md                                 (updated test docs)

Closes structureclaw#231
CRITICAL:
- Replace shell env injection of comment body with github-script
  for safe parsing (prevents command injection)
- Replace template interpolation in github-script with env vars
  (prevents script injection)
- Force HTTPS-only in judge.cjs, remove HTTP downgrade that leaked
  API keys
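
An HTTPS-only guard of the kind described in the last item can be sketched in a few lines. `assertHttpsUrl` is a hypothetical name; the sketch relies only on the standard WHATWG `URL` class available as a global in Node.js.

```javascript
// Rejects any judge endpoint that is not HTTPS so the API key is never sent in clear text.
function assertHttpsUrl(rawUrl) {
  const url = new URL(rawUrl); // throws TypeError on malformed input
  if (url.protocol !== 'https:') {
    throw new Error(`Judge endpoint must use HTTPS, got "${url.protocol}//${url.host}"`);
  }
  return url;
}
```

Failing loudly, instead of falling back to plain HTTP, means a misconfigured base URL surfaces as a configuration error before any Authorization header is sent.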

HIGH:
- Remove push trigger on master (was running full benchmark on
  every merge)
- Add 100KB response body limit in judge HTTP client
- Fix JSON extraction regex to handle nested braces correctly
- Fix timeout race condition in judge with settled flag
- Exclude /test-llm-benchmark from /test-llm trigger
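
For the nested-brace item, note that a single regex cannot reliably balance braces. The sketch below swaps in a different technique, a small brace-depth scanner that also skips braces inside JSON strings; it is an illustration of the problem, not the fix that landed in judge.cjs, and `extractFirstJsonObject` is a hypothetical name.

```javascript
// Returns the first parseable top-level JSON object embedded in text, or null.
function extractFirstJsonObject(text) {
  for (let start = text.indexOf('{'); start !== -1; start = text.indexOf('{', start + 1)) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Inside a string, only track escapes and the closing quote.
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === '{') depth++;
      else if (ch === '}' && --depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          break; // balanced but not valid JSON; try the next '{'
        }
      }
    }
  }
  return null;
}
```

This tolerates judge replies that wrap the verdict in prose, and it returns `null` instead of throwing so the caller can record a failed assertion with a clear reason.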

MEDIUM:
- Add try/catch around dispatchAssertion to preserve partial metrics
- Fix evalHasAnalysis false positive on any non-empty object
- Fix evalSkillMatch vacuous assertion (empty allowed = match any)
- Clamp maxRetries >= 0, add null guard after retry loop
- Skip retry on execution errors (only retry LLM variance)
- Remove unused `skills` from upgradeExpect return
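
The vacuous-assertion item can be illustrated as follows, reading the parenthetical as a description of the bug (an empty allowed list used to match any skill). The `allowed` field name comes from the commit note; the trace shape and the reason strings are assumptions.

```javascript
// trace is the { skillId, structureType } pair from skill tracing, or null.
function evalSkillMatch(assertion, trace) {
  const allowed = Array.isArray(assertion.allowed) ? assertion.allowed : [];
  if (allowed.length === 0) {
    // An empty list must not silently mean "match anything".
    return { passed: false, reason: 'skill_match needs at least one allowed skill' };
  }
  if (!trace || !trace.skillId) {
    return { passed: false, reason: 'no detect_structure_type call found' };
  }
  const passed = allowed.includes(trace.skillId);
  return { passed, reason: passed ? 'ok' : `unexpected skill "${trace.skillId}"` };
}
```

Treating an empty list as a failure turns a misconfigured scenario into a visible red result rather than a pass that checks nothing.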
…lm-integration

- Add validateSkillRouting to backend-validations.js (reuses
  discovery/fixtures modules, runs detectStructuralType assertions)
- Register "Skill routing regression" as a backend-regression step
- Remove llm-integration runner and dead code (extraction/pipeline/
  clarification executors, context, selection, trace, retry, reporting,
  real-llm-client, server, summarizer)
- Remove llm-integration.yml CI workflow
- Clean fixture files to routing-only scenarios (beam, double-span-beam,
  frame, portal-frame, truss)
- Replace resolveIntegrationContext with resolveRegressionContext in
  llm-benchmark runner (removes dependency on deleted context.js)
- Remove llm-integration and llm-summary commands from runner.mjs
- Update AGENTS.md test documentation

Net: -2348 lines. Routing tests now run as part of every
backend-regression without needing LLM_API_KEY.
- Fail fast with clear error when judge API key is missing
- Handle LLM_BASE_URL that already includes /v1 (avoid double /v1/v1)
- Align upload-artifact to v7 (consistent with other workflows)
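
The double `/v1/v1` problem can be avoided with a small URL join like the one below. It assumes an OpenAI-compatible `/chat/completions` path, which this PR does not state explicitly, and it treats any trailing `/v<digits>` segment as already versioned, which goes slightly beyond the `/v1` case named above. `buildChatCompletionsUrl` is a hypothetical name.

```javascript
// Joins a base URL with the chat completions path without duplicating the version segment.
function buildChatCompletionsUrl(baseUrl) {
  const trimmed = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  const hasVersion = /\/v\d+$/.test(trimmed);
  return `${trimmed}${hasVersion ? '' : '/v1'}/chat/completions`;
}
```

For example, both `https://api.example.com` and `https://api.example.com/v1/` resolve to the same endpoint.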
@guyi2000 force-pushed the feat/231-llm-benchmark-v2 branch from cf9681b to bf3e26b on May 10, 2026 14:45
guyi2000 added 4 commits May 10, 2026 23:18
…lysis results

Analysis engine returns results nested under analysisResult.data with
dict-typed displacement/reaction fields instead of arrays. Check both
top-level and data sub-object, and accept non-empty objects in addition
to arrays.
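
The check described in this commit message might look like the sketch below. The key names in `RESULT_KEYS` are hypothetical stand-ins for the displacement and reaction fields; the real list of keys is not shown here.

```javascript
const RESULT_KEYS = ['displacements', 'reactions']; // hypothetical key names

function isNonEmpty(value) {
  if (Array.isArray(value)) return value.length > 0;
  return value !== null && typeof value === 'object' && Object.keys(value).length > 0;
}

function hasAnalysisResults(analysisResult) {
  if (!analysisResult || typeof analysisResult !== 'object') return false;
  // Results may sit at the top level or under a nested `data` object.
  const candidates = [analysisResult, analysisResult.data].filter(
    (c) => c !== null && typeof c === 'object',
  );
  return candidates.some((c) => RESULT_KEYS.some((key) => isNonEmpty(c[key])));
}
```

Checking named result keys, instead of accepting any non-empty object, avoids passing on a payload that only carries status or error metadata.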
…I workflow consistency

- Add turns-based scenario format for multi-turn conversations with
  backward-compatible v1 (single message) auto-wrapping
- Add has_interaction_questions assertion to detect agent prompting
- Add normalizeScenario, mergeTurnResults and multi-turn execution loop
  with shared conversationId across turns
- Update report to display per-turn assertion results
- Add beam-multi-turn-incomplete scenario as template
- Fix e2e workflow: convert curl+jq to github-script, fix injection
  risk by using env: + process.env instead of inline ${{ }}
- Fix benchmark workflow: remove dead push condition, add explicit
  LLM_JUDGE_API_KEY env var
- Fix judge URL path regex to handle any versioned base URL (/v4, etc)
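
The turns-based format and its single-message auto-wrapping can be sketched as below. Field names other than `turns` (such as `message` and `assertions`) are assumptions about the scenario schema, and `sendTurn` is a hypothetical callback standing in for the agent call; the PR's own normalizeScenario may differ.

```javascript
// Wraps a single-message scenario into the turns-based shape.
function normalizeScenario(scenario) {
  if (Array.isArray(scenario.turns) && scenario.turns.length > 0) return scenario;
  const { message, assertions = [], ...rest } = scenario;
  return { ...rest, turns: [{ message, assertions }] };
}

// Runs turns in order; sendTurn is a hypothetical async (conversationId, message) => state.
async function runTurns(scenario, sendTurn, conversationId) {
  const results = [];
  for (const turn of normalizeScenario(scenario).turns) {
    // Every turn reuses the same conversationId so the agent keeps its context.
    results.push(await sendTurn(conversationId, turn.message));
  }
  return results;
}
```

Running turns sequentially matters here: a later turn usually answers a clarification question raised by an earlier one.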
…ti-turn benchmark

- Fix has_interaction_questions to detect ask_user_clarification tool
  (not handler methods build_questions/compute_missing)
- Fix extractSkillTrace to scan from end for latest routing decision
- Fix LOG_LEVEL restore to avoid setting "undefined" string
- Show failed toolCalls in multi-turn report output
- Extract ANALYSIS_RESULT_KEYS as module-level constant
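
The LOG_LEVEL item refers to a common Node.js pitfall: assigning `undefined` to a `process.env` property stores the string "undefined". A sketch of a safe save-and-restore follows; `withLogLevel` is a hypothetical helper, not a function from this PR.

```javascript
// Runs fn with LOG_LEVEL temporarily overridden, then restores the previous state.
async function withLogLevel(level, fn) {
  const previous = process.env.LOG_LEVEL;
  process.env.LOG_LEVEL = level;
  try {
    return await fn();
  } finally {
    // process.env coerces values to strings, so restoring `undefined` by
    // assignment would leave LOG_LEVEL set to "undefined". Delete instead.
    if (previous === undefined) delete process.env.LOG_LEVEL;
    else process.env.LOG_LEVEL = previous;
  }
}
```

The `finally` block ensures the variable is restored even when the wrapped scenario throws.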
Octokit request() expects a method+route string, not a bare URL.
Switch to the typed REST client for reliability in both e2e and
benchmark workflows.
Development

Successfully merging this pull request may close these issues.

feat: LLM benchmark evaluation framework — v2 assertions, judge, skill-trace, CI migration / LLM 基准评估框架

3 participants