[wip] Add eval framework for payload agent skills#480
Conversation
Adds a hermetic evaluation framework for the payload analysis skill with 14 test cases covering rejected payloads, accepted-with-failures, revert candidates, false positives, infrastructure-only failures, and install failures. Key components: - eval/cases/: 15 test cases (14 payload + 1 install) with input.yaml and annotations.yaml defining expected outcomes - eval/shims/: curl, gcloud, gh shims that intercept external calls and serve from local archives for fully reproducible offline runs - eval/scripts/: archive compression, trimming (65GB->15GB), and GCS artifact download utilities - eval.yaml: eval config with 6 judges (analysis_quality, revert_scoring_accuracy, output file validators, HTML structure) - plugins/ci/skills/archive-payload-result/: skill + script to build hermetic archives from production payload agent runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a reproducible payload-analysis evaluation harness: registers a new ChangesPayload Analysis Evaluation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🧹 Nitpick comments (9)
plugins/ci/skills/archive-payload-result/SKILL.md (1)
131-135: 💤 Low valueSpecify language identifier for fenced code block.
The code block shows YAML content but lacks a language identifier, which reduces readability and breaks some Markdown processors.
📝 Proposed fix
2. Parse each JUnit XML to extract underlying job URLs from `<system-out>` blocks. The `<system-out>` contains YAML with a `humanurl` field: - ``` + ```yaml humanurl: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{underlying-job-name}/{build-id} ```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/SKILL.md` around lines 131 - 135, The fenced code block in SKILL.md that shows the YAML snippet starting with "humanurl: https://prow.ci.openshift.org/..." lacks a language identifier; update that fenced block to use the YAML language tag (i.e., add "yaml" immediately after the opening triple backticks) so the snippet is rendered and highlighted correctly.plugins/ci/skills/archive-payload-result/extract_session_data.py (7)
165-165: 💤 Low valueRename unused loop variables.
The loop variables
dirsandfilesare not used in the loop body. Consider renaming them to_dirsand_filesto indicate they're intentionally unused.♻️ Proposed fix
- for root, dirs, files in os.walk(session_dir): + for root, _dirs, _files in os.walk(session_dir): candidate = os.path.join(root, tr_match.group(1))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line 165, The os.walk loop in extract_session_data.py uses unused loop variables named dirs and files; rename them to _dirs and _files in the for root, dirs, files in os.walk(session_dir) statement (or equivalent loop) to indicate they're intentionally unused and avoid linter warnings, ensuring the rest of the loop continues to use root as before.
202-204: 💤 Low valueConsider logging parse failures.
The broad exception catch for malformed JSONL is acceptable, but adding a debug/warning log would aid troubleshooting when extraction fails silently.
📊 Suggested improvement
try: msg = json.loads(line) - except Exception: + except json.JSONDecodeError: continueOr add logging:
except Exception as e: # Optional: log parse failures for debugging # print(f"DEBUG: Failed to parse line in {fpath}: {e}", file=sys.stderr) continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` around lines 202 - 204, Change the bare except that swallows JSON parse errors to capture the exception and log a debug/warning message before continuing: replace "except Exception: continue" with "except Exception as e:" and log the failure (including fpath, the problematic line variable and e) using the module/logger used in this file so malformed JSONL lines are visible for debugging while continuing to the next line.
429-437: 💤 Low valueRemove unnecessary f-string prefixes.
Lines 429, 436, and 437 use f-strings without any placeholders.
♻️ Proposed fix
print("\nDone. Next steps:") - print(f" 1. Download GCS artifacts:") + print(" 1. Download GCS artifacts:") print( f" while read p; do gcloud storage cp -r \"gs://$p/*\" \"{output_dir}/$p/\"; done < {output_dir}/failed-direct-jobs.txt" ) print( f" while read p; do gcloud storage cp -r \"gs://$p/*\" \"{output_dir}/$p/\"; done < {output_dir}/failed-aggregated-jobs.txt" ) - print(f" 2. Download reference outputs from the session's GCS artifacts") - print(f" 3. Create eval case in eval/cases/ with input.yaml and annotations.yaml") + print(" 2. Download reference outputs from the session's GCS artifacts") + print(" 3. Create eval case in eval/cases/ with input.yaml and annotations.yaml")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` around lines 429 - 437, The three print calls in extract_session_data.py that currently use f-strings without placeholders should be changed to normal string literals: replace the f-prefixed strings in the prints that output the numbered steps (the line printing " 1. Download GCS artifacts:" and the two multi-line gcloud cp instruction prints, plus the line printing " 2. Download reference outputs..." or " 3. Create eval case...") by removing the leading f so they are plain strings; update the print statements referencing those exact string literals in the file accordingly to avoid unnecessary f-string usage.
190-190: 💤 Low valueRename unused loop variables.
The
dirsvariable is not used in the loop body.♻️ Proposed fix
- for root, dirs, files in os.walk(session_dir): + for root, _dirs, files in os.walk(session_dir): for fname in sorted(files):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line 190, The for-loop over os.walk(session_dir) uses an unused variable named dirs; rename it to a conventional underscore placeholder (e.g., _ or _dirs) to signal it's intentionally unused and avoid linter complaints; update the loop header "for root, dirs, files in os.walk(session_dir):" to "for root, _, files in os.walk(session_dir):" (or similar) while keeping the rest of the logic that references root and files unchanged.
293-293: 💤 Low valueClarify operator precedence with parentheses.
The mixed
orandandoperators can be ambiguous. Adding parentheses improves readability.♻️ Proposed fix
- if cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd: + if cmd.strip().startswith("gh ") or ("&&" in cmd and "gh " in cmd): _extract_gh_responses(cmd, rc, output_dir, found)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line 293, The condition checking for GitHub CLI usage is ambiguous due to mixed or/and precedence; update the conditional that uses cmd (the line containing if cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd:) to use explicit parentheses, e.g. if cmd.strip().startswith("gh ") or (("&&" in cmd) and ("gh " in cmd)): so the intent is clear and evaluation order is unambiguous.
378-378: 💤 Low valueRename unused loop variable.
The
dirsvariable is not used in the loop body.♻️ Proposed fix
- for root, dirs, files in os.walk(tmpdir): + for root, _dirs, files in os.walk(tmpdir): for fname in files:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line 378, The loop variable `dirs` in the `for root, dirs, files in os.walk(tmpdir):` line is unused; update that tuple to use a throwaway name (e.g., `for root, _, files in os.walk(tmpdir):` or `for root, _dirs, files in os.walk(tmpdir):`) in the `extract_session_data.py` code so the intent is clear and linter warnings are resolved; ensure you rename only the unused variable and leave `root` and `files` unchanged.
375-375: 💤 Low valueTarball extraction is trusted-source only.
The
extractall()call is safe here since session tarballs come from the controlled GCS location. If extracting from untrusted sources in the future, consider using thefilterparameter (Python 3.12+) or validating paths.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line 375, The tar extraction call tf.extractall(tmpdir) in extract_session_data.py trusts input; replace it with a safe extraction routine that either (a) on Python 3.12+ uses the tarfile.extractall(..., filter=...) to skip unsafe members, or (b) manually iterates tf.getmembers(), for each member compute target = os.path.join(tmpdir, member.name), normalize with os.path.realpath/os.path.abspath and ensure target startswith the tmpdir realpath (reject members with absolute paths or containing ../) before calling tf.extract(member, tmpdir) (or extractfile/write for directories), so path-traversal attacks are prevented while keeping extraction for trusted GCS session tarballs.eval/scripts/compress-archives.sh (1)
76-78: 💤 Low valueConsider handling
statfailures explicitly.If both
statformats fail (unlikely but possible for filesystem edge cases), the function returns an empty string, which could cause arithmetic errors in the calling code (lines 94, 124). Consider adding a fallback or validation.🛡️ Suggested defensive fix
file_size() { - stat --format='%s' "$1" 2>/dev/null || stat -f '%z' "$1" 2>/dev/null + stat --format='%s' "$1" 2>/dev/null || stat -f '%z' "$1" 2>/dev/null || echo "0" }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval/scripts/compress-archives.sh` around lines 76 - 78, The file_size() helper can return an empty string if both stat variants fail, causing arithmetic errors downstream; update file_size to detect stat failures and return a safe numeric fallback (e.g., "0") or non-zero exit status and ensure it always echoes a number, and then add simple validation where it's used (the callers that consume file_size) to check the value is numeric before performing arithmetic and either coerce to 0 or abort with a clear error; reference the file_size function name for the change and the calling sites that use its result for arithmetic so they validate/handle non-numeric output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@eval.yaml`:
- Around line 53-54: Remove the stale guidance string "No test cases exist yet.
Run /eval-dataset to generate cases with real payload tags from recent CI
history." from the schema text so it no longer contradicts the added test cases;
locate the exact YAML value containing that sentence and replace it with an
accurate description or remove the line entirely (update the human-readable
description field that currently contains that sentence).
- Around line 165-167: The current check that fails when len(data) == 0 should
be relaxed to accept an empty JSON array as a valid "no-candidate" outcome;
change the logic so that when len(data) == 0 you return a successful/valid
result (e.g., (True, "no candidates") or equivalent in this function) and only
perform the required-field validation (the required = [...] list and subsequent
checks) when data is non-empty, ensuring the required-field loop/validation code
runs only for actual rows.
In `@eval/cases/case-006-archived-4.22-rejected/annotations.yaml`:
- Around line 12-14: The expected_failing_jobs entries are inconsistent with the
notes: update the job names so they match exactly between the
expected_failing_jobs list and the notes block — e.g., either change
"e2e-aws-ovn-techpreview" and "e2e-aws-ovn-techpreview-serial" in the
expected_failing_jobs array to "aws-ovn-techpreview" and
"aws-ovn-techpreview-serial-3of3", or change the notes to include the "e2e-"
prefixed names; ensure all occurrences (expected_failing_jobs and notes) use the
same exact strings so the validator can match them.
In `@eval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yaml`:
- Around line 12-14: The expected_failing_jobs entries include versioned
suffixes that don't match the unversioned names referenced in the notes; update
either the expected_failing_jobs array or the notes so they consistently
reference the same full job names found in the payload—specifically ensure
"expected_failing_jobs" contains the exact blocking job names (e.g.
aggregated-aws-ovn-upgrade-4.22-minor and aggregated-gcp-ovn-upgrade-4.22-micro)
or change the notes to the unversioned names to match the payload, and make the
same consistency fix for the related entries referenced in the notes block.
In `@eval/cases/case-010-5.0-ci-cno-networkpolicy-revert/annotations.yaml`:
- Around line 2-15: The declared expected_failed_job_count
(expected_failed_job_count) does not match the number of entries under
expected_candidates[0].expected_failing_jobs (three entries: "e2e-aws-ovn",
"e2e-azure-ovn", "e2e-gcp-ovn"); update expected_failed_job_count to 3 (or
remove one of the failing job entries if that was the intent) so the numeric
count and the expected_failing_jobs list are consistent.
In `@eval/README.md`:
- Around line 40-62: The README has multiple fenced code blocks missing language
tags (causing MD040); update each fenced block by adding an appropriate language
identifier (e.g., text or bash) for the shown blocks such as the archives tree
starting with "archives/{payload-tag}/", the payload marker line
"/ci:archive-payload-result 4.22.0-0.nightly-2026-03-20-053450", the eval cases
tree "eval/cases/{case-name}/", the eval command "/eval-mlflow --action
log-results --run-id <id>" and the top-level "eval/" block; ensure you change
the opening ``` to ```text or ```bash as appropriate for each occurrence (also
apply same fixes where noted around the other ranges mentioned in the review).
In `@eval/scripts/download-gcs-artifacts.sh`:
- Around line 12-14: The script sets BUCKET, PREFIX and DEST without validating
inputs, so add argument validation at the top: check that two args are provided
(e.g. if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then echo "Usage: $0
<prefix> <dest>"; exit 2; fi), only then set PREFIX="${1%/}" and DEST="$2", and
ensure the script exits non-zero on missing args to avoid creating unexpected
directories; reference the variables BUCKET, PREFIX and DEST when locating where
to add the check.
In `@eval/scripts/trim-archives.sh`:
- Around line 49-52: The default case in the case statement prints "Unknown
option" and calls the usage function but returns exit code 0; update the
unknown-option branch (the "*" case) to return a non-zero status by either
making usage emit a nonzero exit code or by calling usage followed by exit 1.
Locate the "*" case in trim-archives.sh and change it to: echo "Unknown option:
$1" >&2; usage; exit 1 (or modify usage to exit 1) so invalid options produce
exit code 1.
In `@eval/TODO.md`:
- Line 5: Update the typos in the TODO.md text: correct "peformance" to
"performance", "recocmend" to "recommend", and "insatll" to "install", and
search for and correct the same misspellings at the other occurrences referenced
(lines with the same words, e.g., the occurrences noted at 65 and 79); ensure
the file's task instructions read clearly and consistently after these fixes.
In `@plugins/ci/skills/archive-payload-result/SKILL.md`:
- Around line 77-78: Update the command that runs the extractor in SKILL.md to
use the correct script path: replace the incorrect
"scripts/extract_session_data.py" reference with
"plugins/ci/skills/archive-payload-result/extract_session_data.py" so the
command python3 extracts using the actual file (referenced by the script name
extract_session_data.py in the markdown snippet).
---
Nitpick comments:
In `@eval/scripts/compress-archives.sh`:
- Around line 76-78: The file_size() helper can return an empty string if both
stat variants fail, causing arithmetic errors downstream; update file_size to
detect stat failures and return a safe numeric fallback (e.g., "0") or non-zero
exit status and ensure it always echoes a number, and then add simple validation
where it's used (the callers that consume file_size) to check the value is
numeric before performing arithmetic and either coerce to 0 or abort with a
clear error; reference the file_size function name for the change and the
calling sites that use its result for arithmetic so they validate/handle
non-numeric output.
In `@plugins/ci/skills/archive-payload-result/extract_session_data.py`:
- Line 165: The os.walk loop in extract_session_data.py uses unused loop
variables named dirs and files; rename them to _dirs and _files in the for root,
dirs, files in os.walk(session_dir) statement (or equivalent loop) to indicate
they're intentionally unused and avoid linter warnings, ensuring the rest of the
loop continues to use root as before.
- Around line 202-204: Change the bare except that swallows JSON parse errors to
capture the exception and log a debug/warning message before continuing: replace
"except Exception: continue" with "except Exception as e:" and log the failure
(including fpath, the problematic line variable and e) using the module/logger
used in this file so malformed JSONL lines are visible for debugging while
continuing to the next line.
- Around line 429-437: The three print calls in extract_session_data.py that
currently use f-strings without placeholders should be changed to normal string
literals: replace the f-prefixed strings in the prints that output the numbered
steps (the line printing " 1. Download GCS artifacts:" and the two multi-line
gcloud cp instruction prints, plus the line printing " 2. Download reference
outputs..." or " 3. Create eval case...") by removing the leading f so they are
plain strings; update the print statements referencing those exact string
literals in the file accordingly to avoid unnecessary f-string usage.
- Line 190: The for-loop over os.walk(session_dir) uses an unused variable named
dirs; rename it to a conventional underscore placeholder (e.g., _ or _dirs) to
signal it's intentionally unused and avoid linter complaints; update the loop
header "for root, dirs, files in os.walk(session_dir):" to "for root, _, files
in os.walk(session_dir):" (or similar) while keeping the rest of the logic that
references root and files unchanged.
- Line 293: The condition checking for GitHub CLI usage is ambiguous due to
mixed or/and precedence; update the conditional that uses cmd (the line
containing if cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd:) to
use explicit parentheses, e.g. if cmd.strip().startswith("gh ") or (("&&" in
cmd) and ("gh " in cmd)): so the intent is clear and evaluation order is
unambiguous.
- Line 378: The loop variable `dirs` in the `for root, dirs, files in
os.walk(tmpdir):` line is unused; update that tuple to use a throwaway name
(e.g., `for root, _, files in os.walk(tmpdir):` or `for root, _dirs, files in
os.walk(tmpdir):`) in the `extract_session_data.py` code so the intent is clear
and linter warnings are resolved; ensure you rename only the unused variable and
leave `root` and `files` unchanged.
- Line 375: The tar extraction call tf.extractall(tmpdir) in
extract_session_data.py trusts input; replace it with a safe extraction routine
that either (a) on Python 3.12+ uses the tarfile.extractall(..., filter=...) to
skip unsafe members, or (b) manually iterates tf.getmembers(), for each member
compute target = os.path.join(tmpdir, member.name), normalize with
os.path.realpath/os.path.abspath and ensure target startswith the tmpdir
realpath (reject members with absolute paths or containing ../) before calling
tf.extract(member, tmpdir) (or extractfile/write for directories), so
path-traversal attacks are prevented while keeping extraction for trusted GCS
session tarballs.
In `@plugins/ci/skills/archive-payload-result/SKILL.md`:
- Around line 131-135: The fenced code block in SKILL.md that shows the YAML
snippet starting with "humanurl: https://prow.ci.openshift.org/..." lacks a
language identifier; update that fenced block to use the YAML language tag
(i.e., add "yaml" immediately after the opening triple backticks) so the snippet
is rendered and highlighted correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0909acc0-1956-402e-801b-ade5c03e5c62
📒 Files selected for processing (46)
.claude-plugin/marketplace.jsondocs/data.jsoneval.yamleval/README.mdeval/TODO.mdeval/cases/case-001-rejected-single-failure/annotations.yamleval/cases/case-001-rejected-single-failure/input.yamleval/cases/case-002-rejected-multiple-failures/annotations.yamleval/cases/case-002-rejected-multiple-failures/input.yamleval/cases/case-003-accepted-with-failures/annotations.yamleval/cases/case-003-accepted-with-failures/input.yamleval/cases/case-004-ready-in-progress/annotations.yamleval/cases/case-004-ready-in-progress/input.yamleval/cases/case-005-rejected-streak/annotations.yamleval/cases/case-005-rejected-streak/input.yamleval/cases/case-006-archived-4.22-rejected/annotations.yamleval/cases/case-006-archived-4.22-rejected/input.yamleval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yamleval/cases/case-007-archived-4.22-ci-cco-revert/input.yamleval/cases/case-008-archived-4.22-ci-hypershift-revert/annotations.yamleval/cases/case-008-archived-4.22-ci-hypershift-revert/input.yamleval/cases/case-009-archived-4.22-cvo-revert/annotations.yamleval/cases/case-009-archived-4.22-cvo-revert/input.yamleval/cases/case-010-5.0-ci-cno-networkpolicy-revert/annotations.yamleval/cases/case-010-5.0-ci-cno-networkpolicy-revert/input.yamleval/cases/case-011-5.0-nightly-cmo-monitoring-revert/annotations.yamleval/cases/case-011-5.0-nightly-cmo-monitoring-revert/input.yamleval/cases/case-012-5.0-ci-hypershift-builder-fp/annotations.yamleval/cases/case-012-5.0-ci-hypershift-builder-fp/input.yamleval/cases/case-013-5.0-nightly-nto-testdata-revert/annotations.yamleval/cases/case-013-5.0-nightly-nto-testdata-revert/input.yamleval/cases/case-014-5.0-ci-infra-only-no-candidates/annotations.yamleval/cases/case-014-5.0-ci-infra-only-no-candidates/input.yamleval/cases/case-install-001-metal-ipi-ipv6-ckao/annotations.yamleval/cases/case-install-001-metal-ipi-ipv6-ckao/input.yamleval/reports/improvement-recommendations.mdeval/reports/payload-agent-analysis.mdeval/scripts/compress-archives.sheval/scripts/download-gcs-artifacts.sheval/scripts/trim-archives.sheval/shims/curleval/shims/gcloudeval/shims/ghplugins/ci/.claude-plugin/plugin.jsonplugins/ci/skills/archive-payload-result/SKILL.mdplugins/ci/skills/archive-payload-result/extract_session_data.py
| No test cases exist yet. Run /eval-dataset to generate cases with real | ||
| payload tags from recent CI history. |
There was a problem hiding this comment.
Remove stale dataset guidance from schema text.
This says no test cases exist, but this PR adds 15 cases. Keeping this text will confuse dataset generation workflow.
Proposed fix
- No test cases exist yet. Run /eval-dataset to generate cases with real
- payload tags from recent CI history.
+ Test cases are stored under eval/cases. Add new ones with real payload tags
+ from CI history when expanding coverage.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| No test cases exist yet. Run /eval-dataset to generate cases with real | |
| payload tags from recent CI history. | |
| Test cases are stored under eval/cases. Add new ones with real payload tags | |
| from CI history when expanding coverage. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@eval.yaml` around lines 53 - 54, Remove the stale guidance string "No test
cases exist yet. Run /eval-dataset to generate cases with real payload tags from
recent CI history." from the schema text so it no longer contradicts the added
test cases; locate the exact YAML value containing that sentence and replace it
with an accurate description or remove the line entirely (update the
human-readable description field that currently contains that sentence).
| if len(data) == 0: | ||
| return (False, "JSON array is empty") | ||
| required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"] |
There was a problem hiding this comment.
Allow valid zero-row autodl output for no-candidate cases.
The judge currently hard-fails on an empty JSON array, but this can be a valid outcome when no revert candidates are found (per your own pair-based schema).
Proposed fix
- if len(data) == 0:
- return (False, "JSON array is empty")
+ if len(data) == 0:
+ # Valid when there are no candidate PRs (e.g., infra-only/no-revert cases)
+ return (True, "Valid JSON: 0 rows (no candidate pairs)")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(data) == 0: | |
| return (False, "JSON array is empty") | |
| required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"] | |
| if len(data) == 0: | |
| # Valid when there are no candidate PRs (e.g., infra-only/no-revert cases) | |
| return (True, "Valid JSON: 0 rows (no candidate pairs)") | |
| required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@eval.yaml` around lines 165 - 167, The current check that fails when
len(data) == 0 should be relaxed to accept an empty JSON array as a valid
"no-candidate" outcome; change the logic so that when len(data) == 0 you return
a successful/valid result (e.g., (True, "no candidates") or equivalent in this
function) and only perform the required-field validation (the required = [...]
list and subsequent checks) when data is non-empty, ensuring the required-field
loop/validation code runs only for actual rows.
| expected_failing_jobs: | ||
| - "aggregated-aws-ovn-upgrade-4.22-minor" | ||
| - "aggregated-gcp-ovn-upgrade-4.22-micro" |
There was a problem hiding this comment.
Job name inconsistency between expected_failing_jobs and notes.
The job names in Lines 12-14 (expected_failing_jobs) include version-specific suffixes that are absent from the notes (Lines 18-19):
- Notes: "aggregated-aws-ovn-upgrade" vs. expected: "aggregated-aws-ovn-upgrade-4.22-minor"
- Notes: "aggregated-gcp-ovn-upgrade" vs. expected: "aggregated-gcp-ovn-upgrade-4.22-micro"
Verify that the full job names in expected_failing_jobs match the actual blocking job names in the payload. Update the notes to match, or vice versa, to avoid confusion during validation.
Also applies to: 16-23
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@eval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yaml` around
lines 12 - 14, The expected_failing_jobs entries include versioned suffixes that
don't match the unversioned names referenced in the notes; update either the
expected_failing_jobs array or the notes so they consistently reference the same
full job names found in the payload—specifically ensure "expected_failing_jobs"
contains the exact blocking job names (e.g.
aggregated-aws-ovn-upgrade-4.22-minor and aggregated-gcp-ovn-upgrade-4.22-micro)
or change the notes to the unversioned names to match the payload, and make the
same consistency fix for the related entries referenced in the notes block.
| python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG" | ||
| ``` |
There was a problem hiding this comment.
Correct the script path.
The extraction script is located at plugins/ci/skills/archive-payload-result/extract_session_data.py, not scripts/extract_session_data.py. This path mismatch will cause execution failures.
📝 Proposed fix
-python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"
+python3 plugins/ci/skills/archive-payload-result/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG" | |
| ``` | |
| python3 plugins/ci/skills/archive-payload-result/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/ci/skills/archive-payload-result/SKILL.md` around lines 77 - 78,
Update the command that runs the extractor in SKILL.md to use the correct script
path: replace the incorrect "scripts/extract_session_data.py" reference with
"plugins/ci/skills/archive-payload-result/extract_session_data.py" so the
command python3 extracts using the actual file (referenced by the script name
extract_session_data.py in the markdown snippet).
Update shims to use EVAL_ARCHIVES_DIR directly instead of searching
across multiple archives. Each eval case now gets its own isolated
archive via the templated path /archive/{payload_tag}/artifacts.
- Remove multi-archive glob patterns from gcloud, curl, gh shims
- Remove tarball extraction code (using OCI volumes now)
- Update eval.yaml to set EVAL_ARCHIVES_DIR per case using {payload_tag}
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
eval/shims/gcloud (1)
59-59: 💤 Low valueUnused variable
args.The
argsarray is declared but never used in the function.Proposed fix
local recursive=false local gs_path="" - local args=()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval/shims/gcloud` at line 59, Remove the unused local variable by deleting the declaration "local args=()" in eval/shims/gcloud (or, if it was intended to collect parameters, replace its declaration with proper usage where the function populates and uses "args"); ensure no other code references "args" remain so there are no unused-variable warnings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@eval/shims/curl`:
- Around line 81-83: The grouped case arm "--compressed|--retry|--retry-delay)"
treats all three as flags but --retry and --retry-delay require an argument;
update the option parsing in the case statement so --retry and --retry-delay are
handled in their own arms (e.g., "--retry") shift; RETRY="$1"; shift;; and
("--retry-delay") shift; RETRY_DELAY="$1"; shift;; or, if keeping grouping,
detect which option is present and perform an extra shift and assignment for its
argument instead of shifting only once; ensure the argument is consumed so it
isn't interpreted later as the URL.
---
Nitpick comments:
In `@eval/shims/gcloud`:
- Line 59: Remove the unused local variable by deleting the declaration "local
args=()" in eval/shims/gcloud (or, if it was intended to collect parameters,
replace its declaration with proper usage where the function populates and uses
"args"); ensure no other code references "args" remain so there are no
unused-variable warnings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 62d78959-616b-4207-9127-e83b0585f449
📒 Files selected for processing (4)
eval.yamleval/shims/curleval/shims/gcloudeval/shims/gh
🚧 Files skipped from review as they are similar to previous changes (2)
- eval/shims/gh
- eval.yaml
| --compressed|--retry|--retry-delay) | ||
| shift | ||
| ;; |
There was a problem hiding this comment.
--retry and --retry-delay require arguments but only shift once.
Both --retry <num> and --retry-delay <seconds> take arguments. Grouping them with --compressed (which is a boolean flag) causes incorrect parsing—the argument value is left on the stack and may be misinterpreted as the URL.
Proposed fix
- --compressed|--retry|--retry-delay)
- shift
+ --compressed)
+ shift
+ ;;
+ --retry|--retry-delay)
+ shift 2
;;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| --compressed|--retry|--retry-delay) | |
| shift | |
| ;; | |
| --compressed) | |
| shift | |
| ;; | |
| --retry|--retry-delay) | |
| shift 2 | |
| ;; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@eval/shims/curl` around lines 81 - 83, The grouped case arm
"--compressed|--retry|--retry-delay)" treats all three as flags but --retry and
--retry-delay require an argument; update the option parsing in the case
statement so --retry and --retry-delay are handled in their own arms (e.g.,
"--retry") shift; RETRY="$1"; shift;; and ("--retry-delay") shift;
RETRY_DELAY="$1"; shift;; or, if keeping grouping, detect which option is
present and perform an extra shift and assignment for its argument instead of
shifting only once; ensure the argument is consumed so it isn't interpreted
later as the URL.
The archive directories are read-only when mounted as OCI volumes. Move shim logs to /tmp and add error suppression to prevent failures.
Models are now configurable via environment variables: - EVAL_MODEL: model for skill execution (default: claude-sonnet-4-6) - EVAL_JUDGE_MODEL: model for judging (default: claude-sonnet-4-6)
The shims need to be in PATH for hermetic evaluation. Use
${PWD}/eval/shims:${PATH} to prepend them properly.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
eval-install-failure.yaml (1)
295-305: Consider the impact of 100% pass rate requirements.The
min_pass_rate: 1.0thresholds foroutput_files_existandfailure_classificationmean that if even a single test case fails these checks, the entire evaluation fails. While this enforces high standards, it may make the eval framework brittle during development or when adding new challenging test cases.Consider whether these should allow for a small failure rate (e.g., 0.9 or 0.95) to accommodate edge cases while still maintaining quality standards, or document that these strict requirements are intentional for release gating.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval-install-failure.yaml` around lines 295 - 305, The thresholds file currently enforces a 100% pass rate for output_files_exist and failure_classification via the keys output_files_exist.min_pass_rate and failure_classification.min_pass_rate; change these values to a slightly lower tolerance (for example 0.95 or 0.9) if you want to allow occasional edge-case failures, or add a short comment near the thresholds block documenting that the 1.0 values are intentional release gates; update the min_pass_rate values or the documentation accordingly and keep the other threshold keys (root_cause_identification, pr_correlation, analysis_quality) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@eval-install-failure.yaml`:
- Around line 59-61: Documentation references the wrong environment variable
name; update the doc text that mentions EVAL_ARTIFACT_ARCHIVE to use the actual
variable EVAL_ARCHIVES_DIR used by the gcloud shim so the shim can find local
archives for hermetic evaluation (reference the gcloud shim and the env var
symbol EVAL_ARCHIVES_DIR in your change).
---
Nitpick comments:
In `@eval-install-failure.yaml`:
- Around line 295-305: The thresholds file currently enforces a 100% pass rate
for output_files_exist and failure_classification via the keys
output_files_exist.min_pass_rate and failure_classification.min_pass_rate;
change these values to a slightly lower tolerance (for example 0.95 or 0.9) if
you want to allow occasional edge-case failures, or add a short comment near the
thresholds block documenting that the 1.0 values are intentional release gates;
update the min_pass_rate values or the documentation accordingly and keep the
other threshold keys (root_cause_identification, pr_correlation,
analysis_quality) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8ac4c64-15d1-43cc-a671-ec5fd2ca2c8f
📒 Files selected for processing (2)
eval-install-failure.yamleval.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- eval.yaml
The env section in eval.yaml doesn't expand shell variables like
${PWD} at parse time. PATH must be set by the runner/deployment
before invoking the skill.
Simple eval that tests the hello-world:echo command. Used to verify the eval harness works without waiting 10+ minutes for complex skill evals. - eval-hello-world.yaml: Config with inline check + LLM judge - case-001-default: Tests "Hello world" (no args) - case-002-named: Tests "Hello Claude" (with name arg) Expected runtime: ~30 seconds per case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Make name argument optional ({name?}) since case-001 uses empty string
- Change output path from "." to "output" (harness rejects project root)
- Add traces.stdout for capturing skill output
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Hardcode judge to claude-opus-4-6, drop skill model (comes from CLI --model)
- Fix env var syntax: ${VAR} → $VAR (harness uses $VAR, not shell expansion)
- Fix inline checks: outputs key is "conversation", not "transcript"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- plugins/ci/evals/cases/payload-agent/ — payload analysis cases - plugins/ci/evals/cases/install-analysis/ — install failure cases - plugins/ci/evals/shims/ — shared shims for ci evals - plugins/hello-world/evals/ — hello-world eval cases - Eval configs live alongside their cases - Updated dataset.path in all configs to match new locations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cases, shims, and configs moved to plugins/ in previous commit. Historical runs, reports, and scripts archived separately. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove "archived" prefix from payload-agent cases 006-009 and add docs/evals.md covering directory layout, authoring, running, and known gotchas. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2a8e0fb to
32e4c3b
Compare
32e4c3b to
f9e28b8
Compare
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Adds an evaluation framework for Claude Code skills using agent-eval-harness. Includes three
eval suites (hello-world, install failure analysis, payload analysis) with hermetic test cases, CLI shims for offline execution, and an OpenShift deployment for automated runs with MLflow tracking.
What this PR does / why we need it:
Eval suites
plugins/hello-world/evals/): Minimal eval to verify the harness works. 2 cases, inline check + LLM judge.plugins/ci/evals/eval-install-analysis.yaml): Evaluatesci:analyze-prow-job-install-failureagainst archived Prowjob artifacts. 1 case, 4 judges covering output files, root cause identification, PR correlation, failure classification, and analysis quality.
plugins/ci/evals/eval-payload-agent.yaml): Evaluatesci:analyze-payloadagainst archived payload data. 14 cases coveringrejected payloads, revert candidates, false positives, infrastructure-only failures. 6 judges validating output files, YAML/JSON schema, HTML report
structure, analysis quality, and revert scoring accuracy.
Infrastructure
plugins/ci/evals/shims/):gcloud,curl,ghshims intercept external API calls and serve from local archives whenEVAL_ARCHIVES_DIRis set, making evals fully reproducible offline.quay.io/stbenjam/payload-data, mounted as image volumes in the deployment.
plugins/ci/skills/archive-payload-result/): Builds hermetic archives from production payload agent runs.Documentation
docs/evals.md: Guide covering directory layout, authoring new evals, running locally and on OpenShift, MLflow integration, and known gotchas.Special notes for your reviewer:
All eval configs use
claude-opus-4-6as the judge model (hardcoded, independent of the skill model under test). Key harness compatibility notes aredocumented in
docs/evals.md— notably that env vars use$VARsyntax (not${VAR}), inline checks use theconversationkey (nottranscript),and
outputs[0].pathmust be a named subdirectory.Checklist: