34 changes: 34 additions & 0 deletions .github/workflows/eval.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
name: Run Skill Evaluations

on:
  pull_request:
    branches: [main]
    paths:

[blocker] CI workflow is missing copilot CLI install and token secrets — azd waza run crashes with exec: "copilot": executable file not found

Why: The workflow installs Azure Developer CLI and the waza extension, but the copilot-sdk engine requires the GitHub Copilot CLI binary and a GITHUB_TOKEN with Copilot access. The CI run log shows:

[ERROR] copilot failed to start: failed to start CLI server: exec: "copilot": executable file not found in $PATH

This crashes waza's Go runtime (nil-pointer dereference in copilot-sdk/go.(*Client).Stop).

Additionally, the paths filter only triggers on evals/** and skills/** — changes to .waza.yaml or the workflow itself bypass CI.

Fix:

  1. Add a copilot CLI install step:
       - name: Install GitHub Copilot CLI
         run: npm install -g @github/copilot
  2. Set GITHUB_TOKEN (or COPILOT_GITHUB_TOKEN) from secrets; the default GITHUB_TOKEN lacks Copilot access.
  3. Add .waza.yaml and .github/workflows/eval.yml to the paths filter.
  4. Alternatively, use waza's mock executor in CI for config-validation-only runs (no API keys needed).

Ref: GitHub Copilot CLI in Actions, waza README — mock executor
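
A pre-flight check along these lines could fail the job with a readable message before `azd waza run` reaches the crash described above. This is only a sketch: the `preflight` helper is hypothetical and not part of waza or azd, and it only checks that a token variable is present, not that the token actually has Copilot access.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    `env` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    # The copilot-sdk engine needs a `copilot` binary on $PATH.
    if which("copilot") is None:
        problems.append('"copilot" executable not found in $PATH')
    # Either variable name mentioned in the fix list is accepted here.
    if not (env.get("COPILOT_GITHUB_TOKEN") or env.get("GITHUB_TOKEN")):
        problems.append("no GITHUB_TOKEN or COPILOT_GITHUB_TOKEN set")
    return problems
```

In a workflow step, the script could print each entry of `preflight()` and exit non-zero when the list is non-empty, so the failure shows up as a clear message rather than a Go panic.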

Reviewed at 02fae01

      - 'evals/**'
      - 'skills/**'

permissions:
  contents: read

jobs:
  eval:
    name: Run Evaluations
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Azure Developer CLI
        uses: Azure/setup-azd@v2
      - name: Install waza extension
        run: |
          azd config set alpha.extensions on
          azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
          azd ext install microsoft.azd.waza
      - name: Run evaluations
        run: azd waza run --output-dir ./results
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: ./results
          retention-days: 30
4 changes: 4 additions & 0 deletions .gitignore
@@ -5,3 +5,7 @@ build/
.env.*
.DS_Store
.claude/worktrees/

# waza eval outputs and caches (local to each run; not source-of-truth)
.waza-results/
.waza-cache/
31 changes: 31 additions & 0 deletions .waza.yaml
@@ -0,0 +1,31 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/config.schema.json

paths:
  skills: skills
  evals: evals
  results: .waza-results
defaults:
  engine: copilot-sdk
  model: claude-sonnet-4.6
  timeout: 300
  parallel: false
  workers: 4
  verbose: false
  sessionLog: false
cache:
  enabled: false
  dir: .waza-cache
server:
  port: 3000
  resultsDir: results/
dev:
  model: claude-sonnet-4-20250514
  target: medium-high
  maxIterations: 5
tokens:
  warningThreshold: 500
  fallbackLimit: 1000
graders:
  programTimeout: 30
storage:
  containerName: waza-results
32 changes: 32 additions & 0 deletions evals/create-agent-tui/eval.yaml
@@ -0,0 +1,32 @@
name: create-agent-tui-eval
description: |
  TODO: scaffolding only — tasks are generic stubs. Author real tasks +
  graders before running baseline. See evals/openrouter-tts for a worked
  example. Per project memory, this skill's graders need to drive the
  generated TUI via pilotty, not just assert on file contents.
skill: create-agent-tui
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-sonnet-4.6
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
    description: Did the skill complete the assigned task?
graders:
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 0"
  - type: text
    name: relevant_content
    config:
      regex_match:
        - "(?i)(explain|describe|analyze|implement)"
tasks:
  - "tasks/*.yaml"
3 changes: 3 additions & 0 deletions evals/create-agent-tui/fixtures/sample.py
@@ -0,0 +1,3 @@
def hello(name):
    """Greet someone by name."""
    return f"Hello, {name}!"
16 changes: 16 additions & 0 deletions evals/create-agent-tui/tasks/basic-usage.yaml
@@ -0,0 +1,16 @@
id: basic-usage-001
name: Basic Usage
description: |
  Test that the skill handles a typical request correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Help me with this task"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
  outcomes:
    - type: task_completed
11 changes: 11 additions & 0 deletions evals/create-agent-tui/tasks/edge-case.yaml
@@ -0,0 +1,11 @@
id: edge-case-001
name: Edge Case - Empty Input
description: |
  Test that the skill handles edge cases gracefully.
tags:
  - edge-case
inputs:
  prompt: ""
expected:
  outcomes:
    - type: task_completed
13 changes: 13 additions & 0 deletions evals/create-agent-tui/tasks/should-not-trigger.yaml
@@ -0,0 +1,13 @@
id: should-not-trigger-001
name: Should Not Trigger
description: |
  Test that the skill does NOT activate on unrelated prompts.
  This validates trigger specificity.
tags:
  - anti-trigger
  - negative-test
inputs:
  prompt: "What is the weather today?"
expected:
  output_not_contains:

[suggestion][codex] Negative test only forbids the literal phrase "skill activated" — actual script/tool invocation passes silently

Why: output_not_contains: ["skill activated"] is far too narrow. An agent that wrongly activates the skill (runs scripts, generates code) but never outputs the exact string "skill activated" will pass this test, even though the skill should NOT have activated for "What is the weather today?" The same pattern exists in the create-headless-agent and openrouter-agent-migration stubs.

Fix: Add assertions that check no skill-specific scripts or commands were invoked:

expected:
  output_not_contains:
    - "skill activated"
  code_assertions:
    - '"create-tui" not in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"])'
    - '"inquirer" not in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"]).lower()'
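
To make the gap concrete, here is a small sketch of both checks run against a made-up transcript. The `tool_calls` shape (`name` plus an `arguments` dict) follows the assertions above; the transcript itself, including the `npx create-tui` command, is hypothetical and only for illustration.

```python
# Hypothetical transcript: the agent wrongly ran a skill command for the
# weather prompt but never printed the literal phrase "skill activated".
output = "I scaffolded a TUI project for you."
tool_calls = [
    {"name": "bash", "arguments": {"command": "npx create-tui my-app"}},
    {"name": "read_file", "arguments": {"path": "README.md"}},
]


def bash_commands(calls):
    """Join the command strings of every bash tool call."""
    return " ".join(
        tc["arguments"].get("command", "") for tc in calls if tc["name"] == "bash"
    )


# The narrow string check passes, so the false activation goes unnoticed.
narrow_check_passes = "skill activated" not in output

# The tool-call assertion fails, which flags the unwanted activation.
tool_check_passes = "create-tui" not in bash_commands(tool_calls)

print(narrow_check_passes, tool_check_passes)
```

With this transcript the string check alone would report a pass, while the tool-call assertion reports a failure.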

Reviewed at 02fae01

    - "skill activated"
31 changes: 31 additions & 0 deletions evals/create-headless-agent/eval.yaml
@@ -0,0 +1,31 @@
name: create-headless-agent-eval
description: |
  TODO: scaffolding only — tasks are generic stubs. Author real tasks +
  graders before running baseline. See evals/openrouter-tts for a worked
  example.
skill: create-headless-agent
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-sonnet-4.6
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
    description: Did the skill complete the assigned task?
graders:
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 0"
  - type: text
    name: relevant_content
    config:
      regex_match:
        - "(?i)(explain|describe|analyze|implement)"
tasks:
  - "tasks/*.yaml"
3 changes: 3 additions & 0 deletions evals/create-headless-agent/fixtures/sample.py
@@ -0,0 +1,3 @@
def hello(name):
    """Greet someone by name."""
    return f"Hello, {name}!"
16 changes: 16 additions & 0 deletions evals/create-headless-agent/tasks/basic-usage.yaml
@@ -0,0 +1,16 @@
id: basic-usage-001
name: Basic Usage
description: |
  Test that the skill handles a typical request correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Help me with this task"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
  outcomes:
    - type: task_completed
11 changes: 11 additions & 0 deletions evals/create-headless-agent/tasks/edge-case.yaml
@@ -0,0 +1,11 @@
id: edge-case-001
name: Edge Case - Empty Input
description: |
  Test that the skill handles edge cases gracefully.
tags:
  - edge-case
inputs:
  prompt: ""
expected:
  outcomes:
    - type: task_completed
13 changes: 13 additions & 0 deletions evals/create-headless-agent/tasks/should-not-trigger.yaml
@@ -0,0 +1,13 @@
id: should-not-trigger-001
name: Should Not Trigger
description: |
  Test that the skill does NOT activate on unrelated prompts.
  This validates trigger specificity.
tags:
  - anti-trigger
  - negative-test
inputs:
  prompt: "What is the weather today?"
expected:
  output_not_contains:
    - "skill activated"
31 changes: 31 additions & 0 deletions evals/openrouter-agent-migration/eval.yaml
@@ -0,0 +1,31 @@
name: openrouter-agent-migration-eval
description: |
  TODO: scaffolding only — tasks are generic stubs. Author real tasks +
  graders before running baseline. See evals/openrouter-tts for a worked
  example.
skill: openrouter-agent-migration
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-sonnet-4.6
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
    description: Did the skill complete the assigned task?
graders:
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 0"
  - type: text
    name: relevant_content
    config:
      regex_match:
        - "(?i)(explain|describe|analyze|implement)"
tasks:
  - "tasks/*.yaml"
3 changes: 3 additions & 0 deletions evals/openrouter-agent-migration/fixtures/sample.py
@@ -0,0 +1,3 @@
def hello(name):
    """Greet someone by name."""
    return f"Hello, {name}!"
16 changes: 16 additions & 0 deletions evals/openrouter-agent-migration/tasks/basic-usage.yaml
@@ -0,0 +1,16 @@
id: basic-usage-001
name: Basic Usage
description: |
  Test that the skill handles a typical request correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Help me with this task"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
  outcomes:
    - type: task_completed
11 changes: 11 additions & 0 deletions evals/openrouter-agent-migration/tasks/edge-case.yaml
@@ -0,0 +1,11 @@
id: edge-case-001
name: Edge Case - Empty Input
description: |
  Test that the skill handles edge cases gracefully.
tags:
  - edge-case
inputs:
  prompt: ""
expected:
  outcomes:
    - type: task_completed
13 changes: 13 additions & 0 deletions evals/openrouter-agent-migration/tasks/should-not-trigger.yaml
@@ -0,0 +1,13 @@
id: should-not-trigger-001
name: Should Not Trigger
description: |
  Test that the skill does NOT activate on unrelated prompts.
  This validates trigger specificity.
tags:
  - anti-trigger
  - negative-test
inputs:
  prompt: "What is the weather today?"
expected:
  output_not_contains:
    - "skill activated"
33 changes: 33 additions & 0 deletions evals/openrouter-images/eval.yaml
@@ -0,0 +1,33 @@
name: openrouter-images-eval
description: |
  Evaluation suite for the openrouter-images skill. Validates that the
  agent picks the right bundled script (generate.ts for new images,
  edit.ts for modifications) and invokes it with correct flags.
skill: openrouter-images
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-opus-4.7
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
    description: Did the agent pick the right script and flags?

hooks:
  before_run:
    - command: "mkdir -p ~/.agents/skills && rsync -a --delete /Users/matt.apperson/Development/skills/.worktrees/setup-waza/skills/openrouter-images/ /Users/matt.apperson/.agents/skills/openrouter-images/"

[blocker] before_run hardcodes developer-local /Users/matt.apperson paths — eval fails on any other machine or CI runner

Why: The rsync -a --delete source path /Users/matt.apperson/Development/skills/.worktrees/setup-waza/skills/openrouter-images/ exists only on the author's laptop. On CI runners or other developers' machines, rsync produces a silent empty directory (or errors), so the agent never loads the skill content and every task fails.

The same pattern appears in evals/openrouter-stt/eval.yaml:23, evals/openrouter-tts/eval.yaml:26, evals/openrouter-video/eval.yaml:23, and evals/openrouter-models/eval.yaml:36-39.

Fix: Use a repo-relative path derived from the checkout location. Waza runs eval.yaml from the eval directory, so the repo root is ../../..:

before_run:
  - command: "mkdir -p ~/.agents/skills && rsync -a ../../../skills/openrouter-images/ ~/.agents/skills/openrouter-images/"
  - command: "cd ~/.agents/skills/openrouter-images/scripts && npm install --silent"

Or use waza's skill_directories config (already used in openrouter-models) to point at ../../../skills.

Ref: waza README — skill_directories
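
Since the same hardcoded path shows up in several eval files, a tiny lint could catch it before review. The following is a sketch only: `find_hardcoded_home_paths` is a hypothetical helper, the regex covers just macOS-style `/Users/<name>/` and Linux-style `/home/<name>/` prefixes, and the sample text uses a placeholder user and skill name.

```python
import re

# Matches absolute paths under a macOS or Linux home directory.
HOME_PATH = re.compile(r"/(?:Users|home)/[^/\s]+/")


def find_hardcoded_home_paths(text):
    """Return (line_number, line) pairs that contain a user-specific path."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if HOME_PATH.search(line)
    ]


# Placeholder eval.yaml fragment: line 3 is machine-specific, line 4 is not.
sample = """hooks:
  before_run:
    - command: "rsync -a /Users/someone/dev/skills/foo/ ~/.agents/skills/foo/"
    - command: "rsync -a ../../../skills/foo/ ~/.agents/skills/foo/"
"""

hits = find_hardcoded_home_paths(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

Run over each `evals/*/eval.yaml`, a check like this would flag the developer-local hooks while leaving `~/`-relative and repo-relative paths alone.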

Reviewed at 02fae01

    - command: "cd /Users/matt.apperson/.agents/skills/openrouter-images/scripts && npm install --silent"

graders:
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 50"

tasks:
  - "tasks/*.yaml"
46 changes: 46 additions & 0 deletions evals/openrouter-images/tasks/01-generate-basic.yaml
@@ -0,0 +1,46 @@
id: generate-basic-001
name: Generate Basic Image
description: |
  Decision tree says "generate from text" → generate.ts. Agent should
  invoke it, not call the Responses API directly.
tags:
  - happy-path
  - generate

inputs:
  prompt: |
    Generate an image of a red panda wearing sunglasses and save it
    somewhere reasonable.

graders:
  - type: code
    name: invoked_generate_script
    config:
      language: python
      assertions:
        - '"generate.ts" in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"])'
        - '"red panda" in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"]).lower()'

  - type: prompt
    name: generate_quality
    config:
      model: openai/gpt-chat-latest
      continue_session: true
      prompt: |
        The user asked for a basic image generation. Call
        set_waza_grade_pass or set_waza_grade_fail once per criterion
        (3 calls total).

        1) Used generate.ts: invoked the skill's generate.ts script
        (not edit.ts, not a raw curl to /api/v1/responses).

        2) Correct prompt: passed "a red panda wearing sunglasses" or
        very close as the script's positional prompt argument.

        3) Reports the result: tells the user the model used and where
        the image was saved (per the skill's Presenting Results
        guidance).

expected:
  outcomes:
    - type: task_completed
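
For readers checking the two code assertions in this task by hand, the sketch below evaluates the same expression strings against two made-up `tool_calls` lists. It is an assumption about how such string assertions might be evaluated, not waza's actual grader implementation; `run_assertions` and both sample transcripts (including the `--output` flag) are hypothetical.

```python
# The two assertion strings, copied from the grader config above.
ASSERTIONS = [
    '"generate.ts" in " ".join([tc["arguments"].get("command", "") '
    'for tc in tool_calls if tc["name"] == "bash"])',
    '"red panda" in " ".join([tc["arguments"].get("command", "") '
    'for tc in tool_calls if tc["name"] == "bash"]).lower()',
]


def run_assertions(assertions, tool_calls):
    """Evaluate each expression with `tool_calls` in scope; return booleans."""
    return [bool(eval(expr, {"tool_calls": tool_calls})) for expr in assertions]


# Hypothetical transcript where the agent used the bundled script.
used_script = [
    {
        "name": "bash",
        "arguments": {
            "command": 'npx tsx scripts/generate.ts "A Red Panda wearing sunglasses" --output out.png'
        },
    }
]

# Hypothetical transcript where the agent called the API directly instead.
raw_curl = [
    {
        "name": "bash",
        "arguments": {"command": "curl https://openrouter.ai/api/v1/responses"},
    }
]

print(run_assertions(ASSERTIONS, used_script))
print(run_assertions(ASSERTIONS, raw_curl))
```

Note that the second assertion lowercases the joined command, so the prompt match is case-insensitive, while the script-name match is not.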