[wip] Add eval framework for payload agent skills by stbenjam · Pull Request #480 · openshift-eng/ai-helpers

stbenjam · 2026-05-15T14:02:55Z

Adds an evaluation framework for Claude Code skills using agent-eval-harness. Includes three
eval suites (hello-world, install failure analysis, payload analysis) with hermetic test cases, CLI shims for offline execution, and an OpenShift deployment for automated runs with MLflow tracking.

What this PR does / why we need it:

Eval suites

hello-world (plugins/hello-world/evals/): Minimal eval to verify the harness works. 2 cases, inline check + LLM judge.
install failure analysis (plugins/ci/evals/eval-install-analysis.yaml): Evaluates ci:analyze-prow-job-install-failure against archived Prow
job artifacts. 1 case, 4 judges covering output files, root cause identification, PR correlation, failure classification, and analysis quality.
payload analysis (plugins/ci/evals/eval-payload-agent.yaml): Evaluates ci:analyze-payload against archived payload data. 14 cases covering
rejected payloads, revert candidates, false positives, infrastructure-only failures. 6 judges validating output files, YAML/JSON schema, HTML report
structure, analysis quality, and revert scoring accuracy.

Infrastructure

Shims (plugins/ci/evals/shims/): gcloud, curl, gh shims intercept external API calls and serve from local archives when
EVAL_ARCHIVES_DIR is set, making evals fully reproducible offline.
Payload archives: Archived CI artifacts stored as OCI images on
quay.io/stbenjam/payload-data, mounted as image volumes in the deployment.
Archive skill (plugins/ci/skills/archive-payload-result/): Builds hermetic archives from production payload agent runs.

Documentation

docs/evals.md: Guide covering directory layout, authoring new evals, running locally and on OpenShift, MLflow integration, and known gotchas.

Special notes for your reviewer:

All eval configs use claude-opus-4-6 as the judge model (hardcoded, independent of the skill model under test). Key harness compatibility notes are
documented in docs/evals.md — notably that env vars use $VAR syntax (not ${VAR}), inline checks use the conversation key (not transcript),
and outputs[0].path must be a named subdirectory.

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.

Adds a hermetic evaluation framework for the payload analysis skill with 14 test cases covering rejected payloads, accepted-with-failures, revert candidates, false positives, infrastructure-only failures, and install failures. Key components: - eval/cases/: 15 test cases (14 payload + 1 install) with input.yaml and annotations.yaml defining expected outcomes - eval/shims/: curl, gcloud, gh shims that intercept external calls and serve from local archives for fully reproducible offline runs - eval/scripts/: archive compression, trimming (65GB->15GB), and GCS artifact download utilities - eval.yaml: eval config with 6 judges (analysis_quality, revert_scoring_accuracy, output file validators, HTML structure) - plugins/ci/skills/archive-payload-result/: skill + script to build hermetic archives from production payload agent runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-05-15T14:03:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [stbenjam]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-15T14:03:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds a reproducible payload-analysis evaluation harness: registers a new archive-payload-result skill, implements a Python extractor and archive builder workflow, provides hermetic shims and archive-management scripts, adds eval configurations and judges (including LLM-based judges), supplies 15 annotated test cases, README/TODO, and analysis reports; bumps plugin versions and docs.

Changes

Payload Analysis Evaluation

Layer / File(s)	Summary
Plugin registry & marketplace `.claude-plugin/marketplace.json`, `docs/data.json`, `plugins/ci/.claude-plugin/plugin.json`	Register `archive-payload-result` in docs and bump `ci` plugin version to `0.0.41`.
Skill docs & extractor `plugins/ci/skills/archive-payload-result/SKILL.md`, `plugins/ci/skills/archive-payload-result/extract_session_data.py`	Skill docs describe archive build steps; Python CLI extracts API/gh/curl responses and failed-job lists from Claude session tarballs.
Eval configuration and judges `eval.yaml`, `eval-install-failure.yaml`	Defines case-mode eval for `{payload_tag}`, required outputs (HTML/YAML/JSON), structural validators, two LLM judges (`analysis_quality`, `revert_scoring_accuracy`), and thresholds; adds separate install-failure eval.
Eval docs, reports, TODO and cases `eval/README.md`, `eval/TODO.md`, `eval/reports/`, `eval/cases/`	README/TODO explain archive layout, shims, case creation, MLflow, costs; reports analyze agent performance; 15 case directories added/updated with `input.yaml` and `annotations.yaml`.
Hermetic shims `eval/shims/curl`, `eval/shims/gcloud`, `eval/shims/gh`	Shims intercept network/CLI calls when `EVAL_ARCHIVES_DIR` is set, serving cached Release Controller/Sippy/GitHub/raw responses and mapping `gs://` paths to archived files.
Archive management scripts `eval/scripts/compress-archives.sh`, `eval/scripts/download-gcs-artifacts.sh`, `eval/scripts/trim-archives.sh`	Bash utilities to compress/decompress archives, download GCS artifacts with rclone, and trim large/unneeded archive artifacts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

openshift-eng/ai-helpers#433: Related output schema additions (metadata.force_accept_recommended) referenced by judges.
openshift-eng/ai-helpers#341: Related HTML revert/recommendation layout used by eval HTML structure judge.
openshift-eng/ai-helpers#356: Related analyze-payload skill/docs changes and deeper history handling.

Suggested labels

do-not-merge/hold

Suggested reviewers

dgoodwin
bryan-cox

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
No Real People Names In Style References	✅ Passed	No real people names used as style references in plugin commands, skill docs, prompts, or style guidance. All content uses objective technical criteria.
No Assumed Git Remote Names	✅ Passed	No hardcoded git remote names found. Search across all PR files for patterns like "git push origin", "git fetch upstream", etc. returned zero violations.
Git Push Safety Rules	✅ Passed	No git push operations, force push variants, or pushes to main/master branches found. The PR adds evaluation framework components with no autonomous push workflows.
No Untrusted Mcp Servers	✅ Passed	No MCP server installations from untrusted sources found. PR introduces only local evaluation framework files, shell utilities, and standard library Python code.
Ai-Helpers Overlap Detection	✅ Passed	New skill archive-payload-result for eval archives. No overlap with existing skills. Similarity <15% with related skills.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[wip] Add eval framework for payload agent skills' accurately describes the main change: introducing a comprehensive evaluation framework with test cases, shims, scripts, configuration, and supporting skill/documentation for payload analysis.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 10

🧹 Nitpick comments (9)

plugins/ci/skills/archive-payload-result/SKILL.md (1)

131-135: 💤 Low value

Specify language identifier for fenced code block.

The code block shows YAML content but lacks a language identifier, which reduces readability and breaks some Markdown processors.

📝 Proposed fix

 2. Parse each JUnit XML to extract underlying job URLs from `<system-out>` blocks. The `<system-out>` contains YAML with a `humanurl` field:
-   ```
+   ```yaml
    humanurl: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{underlying-job-name}/{build-id}
    ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/SKILL.md` around lines 131 - 135,
The fenced code block in SKILL.md that shows the YAML snippet starting with
"humanurl: https://prow.ci.openshift.org/..." lacks a language identifier;
update that fenced block to use the YAML language tag (i.e., add "yaml"
immediately after the opening triple backticks) so the snippet is rendered and
highlighted correctly.

plugins/ci/skills/archive-payload-result/extract_session_data.py (7)

165-165: 💤 Low value

Rename unused loop variables.

The loop variables dirs and files are not used in the loop body. Consider renaming them to _dirs and _files to indicate they're intentionally unused.

♻️ Proposed fix

-    for root, dirs, files in os.walk(session_dir):
+    for root, _dirs, _files in os.walk(session_dir):
         candidate = os.path.join(root, tr_match.group(1))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line
165, The os.walk loop in extract_session_data.py uses unused loop variables
named dirs and files; rename them to _dirs and _files in the for root, dirs,
files in os.walk(session_dir) statement (or equivalent loop) to indicate they're
intentionally unused and avoid linter warnings, ensuring the rest of the loop
continues to use root as before.

202-204: 💤 Low value

Consider logging parse failures.

The broad exception catch for malformed JSONL is acceptable, but adding a debug/warning log would aid troubleshooting when extraction fails silently.

📊 Suggested improvement

                 try:
                     msg = json.loads(line)
-                except Exception:
+                except json.JSONDecodeError:
                     continue

Or add logging:

except Exception as e:
    # Optional: log parse failures for debugging
    # print(f"DEBUG: Failed to parse line in {fpath}: {e}", file=sys.stderr)
    continue

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` around
lines 202 - 204, Change the bare except that swallows JSON parse errors to
capture the exception and log a debug/warning message before continuing: replace
"except Exception: continue" with "except Exception as e:" and log the failure
(including fpath, the problematic line variable and e) using the module/logger
used in this file so malformed JSONL lines are visible for debugging while
continuing to the next line.

429-437: 💤 Low value

Remove unnecessary f-string prefixes.

Lines 429, 436, and 437 use f-strings without any placeholders.

♻️ Proposed fix

     print("\nDone. Next steps:")
-    print(f"  1. Download GCS artifacts:")
+    print("  1. Download GCS artifacts:")
     print(
         f"     while read p; do gcloud storage cp -r \"gs://$p/*\" \"{output_dir}/$p/\"; done < {output_dir}/failed-direct-jobs.txt"
     )
     print(
         f"     while read p; do gcloud storage cp -r \"gs://$p/*\" \"{output_dir}/$p/\"; done < {output_dir}/failed-aggregated-jobs.txt"
     )
-    print(f"  2. Download reference outputs from the session's GCS artifacts")
-    print(f"  3. Create eval case in eval/cases/ with input.yaml and annotations.yaml")
+    print("  2. Download reference outputs from the session's GCS artifacts")
+    print("  3. Create eval case in eval/cases/ with input.yaml and annotations.yaml")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` around
lines 429 - 437, The three print calls in extract_session_data.py that currently
use f-strings without placeholders should be changed to normal string literals:
replace the f-prefixed strings in the prints that output the numbered steps (the
line printing "  1. Download GCS artifacts:" and the two multi-line gcloud cp
instruction prints, plus the line printing "  2. Download reference outputs..."
or "  3. Create eval case...") by removing the leading f so they are plain
strings; update the print statements referencing those exact string literals in
the file accordingly to avoid unnecessary f-string usage.

190-190: 💤 Low value

Rename unused loop variables.

The dirs variable is not used in the loop body.

♻️ Proposed fix

-    for root, dirs, files in os.walk(session_dir):
+    for root, _dirs, files in os.walk(session_dir):
         for fname in sorted(files):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line
190, The for-loop over os.walk(session_dir) uses an unused variable named dirs;
rename it to a conventional underscore placeholder (e.g., _ or _dirs) to signal
it's intentionally unused and avoid linter complaints; update the loop header
"for root, dirs, files in os.walk(session_dir):" to "for root, _, files in
os.walk(session_dir):" (or similar) while keeping the rest of the logic that
references root and files unchanged.

293-293: 💤 Low value

Clarify operator precedence with parentheses.

The mixed or and and operators can be ambiguous. Adding parentheses improves readability.

♻️ Proposed fix

-                            if cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd:
+                            if cmd.strip().startswith("gh ") or ("&&" in cmd and "gh " in cmd):
                                 _extract_gh_responses(cmd, rc, output_dir, found)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line
293, The condition checking for GitHub CLI usage is ambiguous due to mixed
or/and precedence; update the conditional that uses cmd (the line containing if
cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd:) to use explicit
parentheses, e.g. if cmd.strip().startswith("gh ") or (("&&" in cmd) and ("gh "
in cmd)): so the intent is clear and evaluation order is unambiguous.

378-378: 💤 Low value

Rename unused loop variable.

The dirs variable is not used in the loop body.

♻️ Proposed fix

-            for root, dirs, files in os.walk(tmpdir):
+            for root, _dirs, files in os.walk(tmpdir):
                 for fname in files:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line
378, The loop variable `dirs` in the `for root, dirs, files in os.walk(tmpdir):`
line is unused; update that tuple to use a throwaway name (e.g., `for root, _,
files in os.walk(tmpdir):` or `for root, _dirs, files in os.walk(tmpdir):`) in
the `extract_session_data.py` code so the intent is clear and linter warnings
are resolved; ensure you rename only the unused variable and leave `root` and
`files` unchanged.

375-375: 💤 Low value

Tarball extraction is trusted-source only.

The extractall() call is safe here since session tarballs come from the controlled GCS location. If extracting from untrusted sources in the future, consider using the filter parameter (Python 3.12+) or validating paths.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py` at line
375, The tar extraction call tf.extractall(tmpdir) in extract_session_data.py
trusts input; replace it with a safe extraction routine that either (a) on
Python 3.12+ uses the tarfile.extractall(..., filter=...) to skip unsafe
members, or (b) manually iterates tf.getmembers(), for each member compute
target = os.path.join(tmpdir, member.name), normalize with
os.path.realpath/os.path.abspath and ensure target startswith the tmpdir
realpath (reject members with absolute paths or containing ../) before calling
tf.extract(member, tmpdir) (or extractfile/write for directories), so
path-traversal attacks are prevented while keeping extraction for trusted GCS
session tarballs.

eval/scripts/compress-archives.sh (1)

76-78: 💤 Low value

Consider handling stat failures explicitly.

If both stat formats fail (unlikely but possible for filesystem edge cases), the function returns an empty string, which could cause arithmetic errors in the calling code (lines 94, 124). Consider adding a fallback or validation.

🛡️ Suggested defensive fix

 file_size() {
-    stat --format='%s' "$1" 2>/dev/null || stat -f '%z' "$1" 2>/dev/null
+    stat --format='%s' "$1" 2>/dev/null || stat -f '%z' "$1" 2>/dev/null || echo "0"
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eval/scripts/compress-archives.sh` around lines 76 - 78, The file_size()
helper can return an empty string if both stat variants fail, causing arithmetic
errors downstream; update file_size to detect stat failures and return a safe
numeric fallback (e.g., "0") or non-zero exit status and ensure it always echoes
a number, and then add simple validation where it's used (the callers that
consume file_size) to check the value is numeric before performing arithmetic
and either coerce to 0 or abort with a clear error; reference the file_size
function name for the change and the calling sites that use its result for
arithmetic so they validate/handle non-numeric output.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval.yaml`:
- Around line 53-54: Remove the stale guidance string "No test cases exist yet.
Run /eval-dataset to generate cases with real payload tags from recent CI
history." from the schema text so it no longer contradicts the added test cases;
locate the exact YAML value containing that sentence and replace it with an
accurate description or remove the line entirely (update the human-readable
description field that currently contains that sentence).
- Around line 165-167: The current check that fails when len(data) == 0 should
be relaxed to accept an empty JSON array as a valid "no-candidate" outcome;
change the logic so that when len(data) == 0 you return a successful/valid
result (e.g., (True, "no candidates") or equivalent in this function) and only
perform the required-field validation (the required = [...] list and subsequent
checks) when data is non-empty, ensuring the required-field loop/validation code
runs only for actual rows.

In `@eval/cases/case-006-archived-4.22-rejected/annotations.yaml`:
- Around line 12-14: The expected_failing_jobs entries are inconsistent with the
notes: update the job names so they match exactly between the
expected_failing_jobs list and the notes block — e.g., either change
"e2e-aws-ovn-techpreview" and "e2e-aws-ovn-techpreview-serial" in the
expected_failing_jobs array to "aws-ovn-techpreview" and
"aws-ovn-techpreview-serial-3of3", or change the notes to include the "e2e-"
prefixed names; ensure all occurrences (expected_failing_jobs and notes) use the
same exact strings so the validator can match them.

In `@eval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yaml`:
- Around line 12-14: The expected_failing_jobs entries include versioned
suffixes that don't match the unversioned names referenced in the notes; update
either the expected_failing_jobs array or the notes so they consistently
reference the same full job names found in the payload—specifically ensure
"expected_failing_jobs" contains the exact blocking job names (e.g.
aggregated-aws-ovn-upgrade-4.22-minor and aggregated-gcp-ovn-upgrade-4.22-micro)
or change the notes to the unversioned names to match the payload, and make the
same consistency fix for the related entries referenced in the notes block.

In `@eval/cases/case-010-5.0-ci-cno-networkpolicy-revert/annotations.yaml`:
- Around line 2-15: The declared expected_failed_job_count
(expected_failed_job_count) does not match the number of entries under
expected_candidates[0].expected_failing_jobs (three entries: "e2e-aws-ovn",
"e2e-azure-ovn", "e2e-gcp-ovn"); update expected_failed_job_count to 3 (or
remove one of the failing job entries if that was the intent) so the numeric
count and the expected_failing_jobs list are consistent.

In `@eval/README.md`:
- Around line 40-62: The README has multiple fenced code blocks missing language
tags (causing MD040); update each fenced block by adding an appropriate language
identifier (e.g., text or bash) for the shown blocks such as the archives tree
starting with "archives/{payload-tag}/", the payload marker line
"/ci:archive-payload-result 4.22.0-0.nightly-2026-03-20-053450", the eval cases
tree "eval/cases/{case-name}/", the eval command "/eval-mlflow --action
log-results --run-id <id>" and the top-level "eval/" block; ensure you change
the opening ``` to ```text or ```bash as appropriate for each occurrence (also
apply same fixes where noted around the other ranges mentioned in the review).

In `@eval/scripts/download-gcs-artifacts.sh`:
- Around line 12-14: The script sets BUCKET, PREFIX and DEST without validating
inputs, so add argument validation at the top: check that two args are provided
(e.g. if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then echo "Usage: $0
<prefix> <dest>"; exit 2; fi), only then set PREFIX="${1%/}" and DEST="$2", and
ensure the script exits non-zero on missing args to avoid creating unexpected
directories; reference the variables BUCKET, PREFIX and DEST when locating where
to add the check.

In `@eval/scripts/trim-archives.sh`:
- Around line 49-52: The default case in the case statement prints "Unknown
option" and calls the usage function but returns exit code 0; update the
unknown-option branch (the "*" case) to return a non-zero status by either
making usage emit a nonzero exit code or by calling usage followed by exit 1.
Locate the "*" case in trim-archives.sh and change it to: echo "Unknown option:
$1" >&2; usage; exit 1 (or modify usage to exit 1) so invalid options produce
exit code 1.

In `@eval/TODO.md`:
- Line 5: Update the typos in the TODO.md text: correct "peformance" to
"performance", "recocmend" to "recommend", and "insatll" to "install", and
search for and correct the same misspellings at the other occurrences referenced
(lines with the same words, e.g., the occurrences noted at 65 and 79); ensure
the file's task instructions read clearly and consistently after these fixes.

In `@plugins/ci/skills/archive-payload-result/SKILL.md`:
- Around line 77-78: Update the command that runs the extractor in SKILL.md to
use the correct script path: replace the incorrect
"scripts/extract_session_data.py" reference with
"plugins/ci/skills/archive-payload-result/extract_session_data.py" so the
command python3 extracts using the actual file (referenced by the script name
extract_session_data.py in the markdown snippet).

---

Nitpick comments:
In `@eval/scripts/compress-archives.sh`:
- Around line 76-78: The file_size() helper can return an empty string if both
stat variants fail, causing arithmetic errors downstream; update file_size to
detect stat failures and return a safe numeric fallback (e.g., "0") or non-zero
exit status and ensure it always echoes a number, and then add simple validation
where it's used (the callers that consume file_size) to check the value is
numeric before performing arithmetic and either coerce to 0 or abort with a
clear error; reference the file_size function name for the change and the
calling sites that use its result for arithmetic so they validate/handle
non-numeric output.

In `@plugins/ci/skills/archive-payload-result/extract_session_data.py`:
- Line 165: The os.walk loop in extract_session_data.py uses unused loop
variables named dirs and files; rename them to _dirs and _files in the for root,
dirs, files in os.walk(session_dir) statement (or equivalent loop) to indicate
they're intentionally unused and avoid linter warnings, ensuring the rest of the
loop continues to use root as before.
- Around line 202-204: Change the bare except that swallows JSON parse errors to
capture the exception and log a debug/warning message before continuing: replace
"except Exception: continue" with "except Exception as e:" and log the failure
(including fpath, the problematic line variable and e) using the module/logger
used in this file so malformed JSONL lines are visible for debugging while
continuing to the next line.
- Around line 429-437: The three print calls in extract_session_data.py that
currently use f-strings without placeholders should be changed to normal string
literals: replace the f-prefixed strings in the prints that output the numbered
steps (the line printing "  1. Download GCS artifacts:" and the two multi-line
gcloud cp instruction prints, plus the line printing "  2. Download reference
outputs..." or "  3. Create eval case...") by removing the leading f so they are
plain strings; update the print statements referencing those exact string
literals in the file accordingly to avoid unnecessary f-string usage.
- Line 190: The for-loop over os.walk(session_dir) uses an unused variable named
dirs; rename it to a conventional underscore placeholder (e.g., _ or _dirs) to
signal it's intentionally unused and avoid linter complaints; update the loop
header "for root, dirs, files in os.walk(session_dir):" to "for root, _, files
in os.walk(session_dir):" (or similar) while keeping the rest of the logic that
references root and files unchanged.
- Line 293: The condition checking for GitHub CLI usage is ambiguous due to
mixed or/and precedence; update the conditional that uses cmd (the line
containing if cmd.strip().startswith("gh ") or "&&" in cmd and "gh " in cmd:) to
use explicit parentheses, e.g. if cmd.strip().startswith("gh ") or (("&&" in
cmd) and ("gh " in cmd)): so the intent is clear and evaluation order is
unambiguous.
- Line 378: The loop variable `dirs` in the `for root, dirs, files in
os.walk(tmpdir):` line is unused; update that tuple to use a throwaway name
(e.g., `for root, _, files in os.walk(tmpdir):` or `for root, _dirs, files in
os.walk(tmpdir):`) in the `extract_session_data.py` code so the intent is clear
and linter warnings are resolved; ensure you rename only the unused variable and
leave `root` and `files` unchanged.
- Line 375: The tar extraction call tf.extractall(tmpdir) in
extract_session_data.py trusts input; replace it with a safe extraction routine
that either (a) on Python 3.12+ uses the tarfile.extractall(..., filter=...) to
skip unsafe members, or (b) manually iterates tf.getmembers(), for each member
compute target = os.path.join(tmpdir, member.name), normalize with
os.path.realpath/os.path.abspath and ensure target startswith the tmpdir
realpath (reject members with absolute paths or containing ../) before calling
tf.extract(member, tmpdir) (or extractfile/write for directories), so
path-traversal attacks are prevented while keeping extraction for trusted GCS
session tarballs.

In `@plugins/ci/skills/archive-payload-result/SKILL.md`:
- Around line 131-135: The fenced code block in SKILL.md that shows the YAML
snippet starting with "humanurl: https://prow.ci.openshift.org/..." lacks a
language identifier; update that fenced block to use the YAML language tag
(i.e., add "yaml" immediately after the opening triple backticks) so the snippet
is rendered and highlighted correctly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0909acc0-1956-402e-801b-ade5c03e5c62

📥 Commits

Reviewing files that changed from the base of the PR and between 503e009 and dcaa2bb.

📒 Files selected for processing (46)

.claude-plugin/marketplace.json
docs/data.json
eval.yaml
eval/README.md
eval/TODO.md
eval/cases/case-001-rejected-single-failure/annotations.yaml
eval/cases/case-001-rejected-single-failure/input.yaml
eval/cases/case-002-rejected-multiple-failures/annotations.yaml
eval/cases/case-002-rejected-multiple-failures/input.yaml
eval/cases/case-003-accepted-with-failures/annotations.yaml
eval/cases/case-003-accepted-with-failures/input.yaml
eval/cases/case-004-ready-in-progress/annotations.yaml
eval/cases/case-004-ready-in-progress/input.yaml
eval/cases/case-005-rejected-streak/annotations.yaml
eval/cases/case-005-rejected-streak/input.yaml
eval/cases/case-006-archived-4.22-rejected/annotations.yaml
eval/cases/case-006-archived-4.22-rejected/input.yaml
eval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yaml
eval/cases/case-007-archived-4.22-ci-cco-revert/input.yaml
eval/cases/case-008-archived-4.22-ci-hypershift-revert/annotations.yaml
eval/cases/case-008-archived-4.22-ci-hypershift-revert/input.yaml
eval/cases/case-009-archived-4.22-cvo-revert/annotations.yaml
eval/cases/case-009-archived-4.22-cvo-revert/input.yaml
eval/cases/case-010-5.0-ci-cno-networkpolicy-revert/annotations.yaml
eval/cases/case-010-5.0-ci-cno-networkpolicy-revert/input.yaml
eval/cases/case-011-5.0-nightly-cmo-monitoring-revert/annotations.yaml
eval/cases/case-011-5.0-nightly-cmo-monitoring-revert/input.yaml
eval/cases/case-012-5.0-ci-hypershift-builder-fp/annotations.yaml
eval/cases/case-012-5.0-ci-hypershift-builder-fp/input.yaml
eval/cases/case-013-5.0-nightly-nto-testdata-revert/annotations.yaml
eval/cases/case-013-5.0-nightly-nto-testdata-revert/input.yaml
eval/cases/case-014-5.0-ci-infra-only-no-candidates/annotations.yaml
eval/cases/case-014-5.0-ci-infra-only-no-candidates/input.yaml
eval/cases/case-install-001-metal-ipi-ipv6-ckao/annotations.yaml
eval/cases/case-install-001-metal-ipi-ipv6-ckao/input.yaml
eval/reports/improvement-recommendations.md
eval/reports/payload-agent-analysis.md
eval/scripts/compress-archives.sh
eval/scripts/download-gcs-artifacts.sh
eval/scripts/trim-archives.sh
eval/shims/curl
eval/shims/gcloud
eval/shims/gh
plugins/ci/.claude-plugin/plugin.json
plugins/ci/skills/archive-payload-result/SKILL.md
plugins/ci/skills/archive-payload-result/extract_session_data.py

coderabbitai · 2026-05-15T14:13:39Z

+    No test cases exist yet. Run /eval-dataset to generate cases with real
+    payload tags from recent CI history.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove stale dataset guidance from schema text.

This says no test cases exist, but this PR adds 15 cases. Keeping this text will confuse dataset generation workflow.

Proposed fix

- No test cases exist yet. Run /eval-dataset to generate cases with real - payload tags from recent CI history. + Test cases are stored under eval/cases. Add new ones with real payload tags + from CI history when expanding coverage.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

No test cases exist yet. Run /eval-dataset to generate cases with real

payload tags from recent CI history.

Test cases are stored under eval/cases. Add new ones with real payload tags

from CI history when expanding coverage.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval.yaml` around lines 53 - 54, Remove the stale guidance string "No test cases exist yet. Run /eval-dataset to generate cases with real payload tags from recent CI history." from the schema text so it no longer contradicts the added test cases; locate the exact YAML value containing that sentence and replace it with an accurate description or remove the line entirely (update the human-readable description field that currently contains that sentence).

coderabbitai · 2026-05-15T14:13:39Z

+      if len(data) == 0:
+          return (False, "JSON array is empty")
+      required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Allow valid zero-row autodl output for no-candidate cases.

The judge currently hard-fails on an empty JSON array, but this can be a valid outcome when no revert candidates are found (per your own pair-based schema).

Proposed fix

- if len(data) == 0: - return (False, "JSON array is empty") + if len(data) == 0: + # Valid when there are no candidate PRs (e.g., infra-only/no-revert cases) + return (True, "Valid JSON: 0 rows (no candidate pairs)")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if len(data) == 0:

return (False, "JSON array is empty")

required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"]

if len(data) == 0:

# Valid when there are no candidate PRs (e.g., infra-only/no-revert cases)

return (True, "Valid JSON: 0 rows (no candidate pairs)")

required = ["payload_tag", "job_name", "failure_type", "root_cause_summary"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval.yaml` around lines 165 - 167, The current check that fails when len(data) == 0 should be relaxed to accept an empty JSON array as a valid "no-candidate" outcome; change the logic so that when len(data) == 0 you return a successful/valid result (e.g., (True, "no candidates") or equivalent in this function) and only perform the required-field validation (the required = [...] list and subsequent checks) when data is non-empty, ensuring the required-field loop/validation code runs only for actual rows.

coderabbitai · 2026-05-15T14:13:39Z

+    expected_failing_jobs:
+      - "aggregated-aws-ovn-upgrade-4.22-minor"
+      - "aggregated-gcp-ovn-upgrade-4.22-micro"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Job name inconsistency between expected_failing_jobs and notes.

The job names in Lines 12-14 (expected_failing_jobs) include version-specific suffixes that are absent from the notes (Lines 18-19):

Notes: "aggregated-aws-ovn-upgrade" vs. expected: "aggregated-aws-ovn-upgrade-4.22-minor"

Notes: "aggregated-gcp-ovn-upgrade" vs. expected: "aggregated-gcp-ovn-upgrade-4.22-micro"

Verify that the full job names in expected_failing_jobs match the actual blocking job names in the payload. Update the notes to match, or vice versa, to avoid confusion during validation.

Also applies to: 16-23

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval/cases/case-007-archived-4.22-ci-cco-revert/annotations.yaml` around lines 12 - 14, The expected_failing_jobs entries include versioned suffixes that don't match the unversioned names referenced in the notes; update either the expected_failing_jobs array or the notes so they consistently reference the same full job names found in the payload—specifically ensure "expected_failing_jobs" contains the exact blocking job names (e.g. aggregated-aws-ovn-upgrade-4.22-minor and aggregated-gcp-ovn-upgrade-4.22-micro) or change the notes to the unversioned names to match the payload, and make the same consistency fix for the related entries referenced in the notes block.

coderabbitai · 2026-05-15T14:13:40Z

+python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"
+```


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Correct the script path.

The extraction script is located at plugins/ci/skills/archive-payload-result/extract_session_data.py, not scripts/extract_session_data.py. This path mismatch will cause execution failures.

📝 Proposed fix

-python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG" +python3 plugins/ci/skills/archive-payload-result/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"

```

python3 plugins/ci/skills/archive-payload-result/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ci/skills/archive-payload-result/SKILL.md` around lines 77 - 78, Update the command that runs the extractor in SKILL.md to use the correct script path: replace the incorrect "scripts/extract_session_data.py" reference with "plugins/ci/skills/archive-payload-result/extract_session_data.py" so the command python3 extracts using the actual file (referenced by the script name extract_session_data.py in the markdown snippet).

Update shims to use EVAL_ARCHIVES_DIR directly instead of searching across multiple archives. Each eval case now gets its own isolated archive via the templated path /archive/{payload_tag}/artifacts. - Remove multi-archive glob patterns from gcloud, curl, gh shims - Remove tarball extraction code (using OCI volumes now) - Update eval.yaml to set EVAL_ARCHIVES_DIR per case using {payload_tag} Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

eval/shims/gcloud (1)

59-59: 💤 Low value

Unused variable args.

The args array is declared but never used in the function.

Proposed fix

     local recursive=false
     local gs_path=""
-    local args=()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eval/shims/gcloud` at line 59, Remove the unused local variable by deleting
the declaration "local args=()" in eval/shims/gcloud (or, if it was intended to
collect parameters, replace its declaration with proper usage where the function
populates and uses "args"); ensure no other code references "args" remain so
there are no unused-variable warnings.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval/shims/curl`:
- Around line 81-83: The grouped case arm "--compressed|--retry|--retry-delay)"
treats all three as flags but --retry and --retry-delay require an argument;
update the option parsing in the case statement so --retry and --retry-delay are
handled in their own arms (e.g., "--retry") shift; RETRY="$1"; shift;; and
("--retry-delay") shift; RETRY_DELAY="$1"; shift;; or, if keeping grouping,
detect which option is present and perform an extra shift and assignment for its
argument instead of shifting only once; ensure the argument is consumed so it
isn't interpreted later as the URL.

---

Nitpick comments:
In `@eval/shims/gcloud`:
- Line 59: Remove the unused local variable by deleting the declaration "local
args=()" in eval/shims/gcloud (or, if it was intended to collect parameters,
replace its declaration with proper usage where the function populates and uses
"args"); ensure no other code references "args" remain so there are no
unused-variable warnings.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 62d78959-616b-4207-9127-e83b0585f449

📥 Commits

Reviewing files that changed from the base of the PR and between dcaa2bb and 8674d5d.

📒 Files selected for processing (4)

eval.yaml
eval/shims/curl
eval/shims/gcloud
eval/shims/gh

🚧 Files skipped from review as they are similar to previous changes (2)

eval/shims/gh
eval.yaml

coderabbitai · 2026-05-18T20:48:33Z

+            --compressed|--retry|--retry-delay)
+                shift
+                ;;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

--retry and --retry-delay require arguments but only shift once.

Both --retry <num> and --retry-delay <seconds> take arguments. Grouping them with --compressed (which is a boolean flag) causes incorrect parsing—the argument value is left on the stack and may be misinterpreted as the URL.

Proposed fix

- --compressed|--retry|--retry-delay) - shift + --compressed) + shift + ;; + --retry|--retry-delay) + shift 2 ;;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

--compressed|--retry|--retry-delay)

shift

;;

--compressed)

shift

;;

--retry|--retry-delay)

shift 2

;;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@eval/shims/curl` around lines 81 - 83, The grouped case arm "--compressed|--retry|--retry-delay)" treats all three as flags but --retry and --retry-delay require an argument; update the option parsing in the case statement so --retry and --retry-delay are handled in their own arms (e.g., "--retry") shift; RETRY="$1"; shift;; and ("--retry-delay") shift; RETRY_DELAY="$1"; shift;; or, if keeping grouping, detect which option is present and perform an extra shift and assignment for its argument instead of shifting only once; ensure the argument is consumed so it isn't interpreted later as the URL.

The archive directories are read-only when mounted as OCI volumes. Move shim logs to /tmp and add error suppression to prevent failures.

Models are now configurable via environment variables: - EVAL_MODEL: model for skill execution (default: claude-sonnet-4-6) - EVAL_JUDGE_MODEL: model for judging (default: claude-sonnet-4-6)

The shims need to be in PATH for hermetic evaluation. Use ${PWD}/eval/shims:${PATH} to prepend them properly.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

eval-install-failure.yaml (1)
295-305: Consider the impact of 100% pass rate requirements.

The min_pass_rate: 1.0 thresholds for output_files_exist and failure_classification mean that if even a single test case fails these checks, the entire evaluation fails. While this enforces high standards, it may make the eval framework brittle during development or when adding new challenging test cases.

Consider whether these should allow for a small failure rate (e.g., 0.9 or 0.95) to accommodate edge cases while still maintaining quality standards, or document that these strict requirements are intentional for release gating.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@eval-install-failure.yaml` around lines 295 - 305, The thresholds file
currently enforces a 100% pass rate for output_files_exist and
failure_classification via the keys output_files_exist.min_pass_rate and
failure_classification.min_pass_rate; change these values to a slightly lower
tolerance (for example 0.95 or 0.9) if you want to allow occasional edge-case
failures, or add a short comment near the thresholds block documenting that the
1.0 values are intentional release gates; update the min_pass_rate values or the
documentation accordingly and keep the other threshold keys
(root_cause_identification, pr_correlation, analysis_quality) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval-install-failure.yaml`:
- Around line 59-61: Documentation references the wrong environment variable
name; update the doc text that mentions EVAL_ARTIFACT_ARCHIVE to use the actual
variable EVAL_ARCHIVES_DIR used by the gcloud shim so the shim can find local
archives for hermetic evaluation (reference the gcloud shim and the env var
symbol EVAL_ARCHIVES_DIR in your change).

---

Nitpick comments:
In `@eval-install-failure.yaml`:
- Around line 295-305: The thresholds file currently enforces a 100% pass rate
for output_files_exist and failure_classification via the keys
output_files_exist.min_pass_rate and failure_classification.min_pass_rate;
change these values to a slightly lower tolerance (for example 0.95 or 0.9) if
you want to allow occasional edge-case failures, or add a short comment near the
thresholds block documenting that the 1.0 values are intentional release gates;
update the min_pass_rate values or the documentation accordingly and keep the
other threshold keys (root_cause_identification, pr_correlation,
analysis_quality) unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8ac4c64-15d1-43cc-a671-ec5fd2ca2c8f

📥 Commits

Reviewing files that changed from the base of the PR and between af2a262 and ff1b144.

📒 Files selected for processing (2)

eval-install-failure.yaml
eval.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

eval.yaml

The env section in eval.yaml doesn't expand shell variables like ${PWD} at parse time. PATH must be set by the runner/deployment before invoking the skill.

Simple eval that tests the hello-world:echo command. Used to verify the eval harness works without waiting 10+ minutes for complex skill evals. - eval-hello-world.yaml: Config with inline check + LLM judge - case-001-default: Tests "Hello world" (no args) - case-002-named: Tests "Hello Claude" (with name arg) Expected runtime: ~30 seconds per case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Make name argument optional ({name?}) since case-001 uses empty string - Change output path from "." to "output" (harness rejects project root) - Add traces.stdout for capturing skill output Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Hardcode judge to claude-opus-4-6, drop skill model (comes from CLI --model) - Fix env var syntax: ${VAR} → $VAR (harness uses $VAR, not shell expansion) - Fix inline checks: outputs key is "conversation", not "transcript" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- plugins/ci/evals/cases/payload-agent/ — payload analysis cases - plugins/ci/evals/cases/install-analysis/ — install failure cases - plugins/ci/evals/shims/ — shared shims for ci evals - plugins/hello-world/evals/ — hello-world eval cases - Eval configs live alongside their cases - Updated dataset.path in all configs to match new locations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Cases, shims, and configs moved to plugins/ in previous commit. Historical runs, reports, and scripts archived separately. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove "archived" prefix from payload-agent cases 006-009 and add docs/evals.md covering directory layout, authoring, running, and known gotchas. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

openshift-ci · 2026-05-21T13:43:26Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

stbenjam changed the title ~~Add eval framework for ci:analyze-payload skill~~ [wip] Add eval framework for ci:analyze-payload skill May 15, 2026

openshift-ci Bot requested review from bryan-cox and dgoodwin May 15, 2026 14:03

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2026

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

stbenjam marked this pull request as draft May 18, 2026 20:34

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

not-stbenjam added 4 commits May 18, 2026 17:47

Restore eval-install-failure.yaml for install case testing

17b4a3a

Fix shims to log to /tmp for read-only OCI volumes

af2a262

The archive directories are read-only when mounted as OCI volumes. Move shim logs to /tmp and add error suppression to prevent failures.

Use EVAL_MODEL env var for model configuration

712101e

Models are now configurable via environment variables: - EVAL_MODEL: model for skill execution (default: claude-sonnet-4-6) - EVAL_JUDGE_MODEL: model for judging (default: claude-sonnet-4-6)

Fix PATH in eval configs to include shims directory

ff1b144

The shims need to be in PATH for hermetic evaluation. Use ${PWD}/eval/shims:${PATH} to prepend them properly.

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread plugins/ci/evals/eval-install-analysis.yaml

not-stbenjam and others added 11 commits May 18, 2026 18:23

Remove PATH from eval env (must be set by runner)

234d190

The env section in eval.yaml doesn't expand shell variables like ${PWD} at parse time. PATH must be set by the runner/deployment before invoking the skill.

Add multi-line annotation to reproduce harness bug

c29095c

Fix inline check to get annotations from outputs dict

a2415ad

Fix LLM judge prompt to use supported template syntax

9fea3a1

Remove env var syntax from models config - YAML doesn't expand env vars

fdbd334

Remove old eval/ directory

c27f0fc

Cases, shims, and configs moved to plugins/ in previous commit. Historical runs, reports, and scripts archived separately. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add eval documentation and clean up case names

f9e28b8

Remove "archived" prefix from payload-agent cases 006-009 and add docs/evals.md covering directory layout, authoring, running, and known gotchas. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

stbenjam changed the title ~~[wip] Add eval framework for ci:analyze-payload skill~~ [wip] Add eval framework for payload agent skills May 19, 2026

not-stbenjam force-pushed the eval-framework branch from 2a8e0fb to 32e4c3b Compare May 20, 2026 01:53

not-stbenjam force-pushed the eval-framework branch from 32e4c3b to f9e28b8 Compare May 20, 2026 01:53

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026

		No test cases exist yet. Run /eval-dataset to generate cases with real
		payload tags from recent CI history.

		python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"
		```

	python3 scripts/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"
	```
	python3 plugins/ci/skills/archive-payload-result/extract_session_data.py "$SESSION_TARBALL" "$ARCHIVE" "$TAG"

Conversation

stbenjam commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Eval suites

Infrastructure

Documentation

Special notes for your reviewer:

Checklist:

Uh oh!

openshift-ci Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stbenjam commented May 15, 2026 •

edited

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading