OKD-370: Add promptfoo evals for agentic-docs plugin by kenjpais · Pull Request #477 · openshift-eng/ai-helpers

kenjpais · 2026-05-15T00:56:08Z

Summary

Add promptfoo-based eval workflows to the agentic-docs plugin via [PR #437].

Introduces two new skills:

- `/agentic-docs:generate-evals`: generate repository-specific promptfoo eval suites from templates
- `/agentic-docs:evaluate`: validate provider configuration and execute eval suites with automated analysis

Features

generate-evals

- Template-based eval generation
- Repository-scoped eval configs

evaluate

- Bundled eval execution workflow
- Spawns a judge sub-agent to analyze results, session logs, and metrics
- Pass/fail quality reporting
- Cost and latency regression checks
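The cost and latency regression checks compare a new run against a baseline. The Python sketch below illustrates that comparison only; the `RunMetrics` shape, the 10% default tolerance, and `find_regressions` are hypothetical and are not taken from the evaluate skill.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical shape)."""
    total_cost_usd: float
    avg_latency_ms: float


def find_regressions(baseline: RunMetrics, current: RunMetrics,
                     tolerance: float = 0.10) -> list[str]:
    """Return a message for each metric that grew by more than `tolerance`
    (a fraction, e.g. 0.10 = 10%) relative to the baseline run."""
    regressions = []
    checks = {
        "cost": (baseline.total_cost_usd, current.total_cost_usd),
        "latency": (baseline.avg_latency_ms, current.avg_latency_ms),
    }
    for name, (before, after) in checks.items():
        # A zero baseline cannot be compared proportionally, so skip it.
        if before > 0 and after > before * (1 + tolerance):
            pct = (after / before - 1) * 100
            regressions.append(f"{name} regressed by {pct:.1f}%")
    return regressions


if __name__ == "__main__":
    base = RunMetrics(total_cost_usd=0.40, avg_latency_ms=12_000)
    new = RunMetrics(total_cost_usd=0.52, avg_latency_ms=12_500)
    print(find_regressions(base, new))  # prints ['cost regressed by 30.0%']
```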

These workflows add deterministic and LLM-judged validation for:

- skill routing
- command execution
- output quality
- regression detection

Assertion types used

- `skill-used` / `not-skill-used`
- `icontains` / `not-icontains`
- `llm-rubric`
- `cost` / `latency`
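The deterministic types in this list have simple pass/fail semantics. As an illustration only (promptfoo evaluates these itself during a run), the sketch below mimics `icontains`, `not-icontains`, `cost`, and `latency` on a single result. The `type`/`value`/`threshold` keys follow promptfoo's assertion format; the `check_assertion` helper is hypothetical. `skill-used` and `llm-rubric` are left out because they depend on agent traces and a grading model.

```python
from typing import Any


def check_assertion(assertion: dict[str, Any], output: str,
                    cost_usd: float, latency_ms: float) -> bool:
    """Evaluate one deterministic assertion against a single test result.

    Hypothetical helper for illustration; promptfoo performs these checks
    itself when it runs an eval.
    """
    kind = assertion["type"]
    if kind == "icontains":
        # Case-insensitive substring match.
        return assertion["value"].lower() in output.lower()
    if kind == "not-icontains":
        return assertion["value"].lower() not in output.lower()
    if kind == "cost":
        # Passes when the call cost no more than the threshold (USD).
        return cost_usd <= assertion["threshold"]
    if kind == "latency":
        # Passes when the call finished within the threshold (milliseconds).
        return latency_ms <= assertion["threshold"]
    raise ValueError(f"unsupported assertion type: {kind}")


if __name__ == "__main__":
    asserts = [
        {"type": "icontains", "value": "AGENTS.md"},
        {"type": "not-icontains", "value": "error"},
        {"type": "cost", "threshold": 0.05},
        {"type": "latency", "threshold": 30_000},
    ]
    output = "Read agents.md and ai-docs/ before editing the controller."
    results = [check_assertion(a, output, cost_usd=0.03, latency_ms=21_000)
               for a in asserts]
    print(results)  # prints [True, True, True, True]
```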

Test coverage

Test infrastructure: Both skills include their own test suites in evals/evals.json

Summary by CodeRabbit

New Features
- Added agentic-docs plugin for creating and maintaining AI-optimized documentation for OpenShift.
- Added /agentic-docs:evaluate to run comparative documentation evaluations and produce structured reports.
- Added /agentic-docs:generate-evals to generate repository-specific evaluation suites.
- Added /metrics:ai-docs-telemetry to analyze ai-docs usage from session logs.
Documentation
- Updated plugin registry and docs index; added README/command docs and usage examples for new commands.

Introduces tier-1 platform documentation skills for creating and maintaining AI-optimized documentation in openshift/enhancements.

Skills:

/agentic-docs:platform (/platform-docs): Creates tier-1 platform documentation with:
- AGENTS.md navigation index
- DESIGN_PHILOSOPHY.md and KNOWLEDGE_GRAPH.md
- platform/, domain/, practices/, decisions/, workflows/, references/
- Automated discovery, structure creation, template population, and validation

/agentic-docs:update-platform-docs (/update-platform-docs): Incrementally updates tier-1 documentation with:
- Automatic gap detection (scans existing ai-docs/ for missing files)
- Targeted additions without full regeneration
- Smart navigation updates (auto-updates indexes and AGENTS.md)
- Validation of naming conventions, line counts, and link integrity

Introduces a tier-2 lean component documentation skill for creating structured component-level documentation in OpenShift repositories.

Skills:

/agentic-docs:component (/component-docs): Creates tier-2 lean component documentation with:
- Component-specific CRDs and architecture only
- Pointers to tier-1 for generic patterns
- Component ADRs and exec-plan tracking
- AGENTS.md entry point
- DEVELOPMENT.md and TESTING.md guides
- Domain concepts and ecosystem maps

Platform documentation in openshift/enhancements/ai-docs/ already exists and was created using this skill. Remove the /platform-docs skill that was designed to create it from scratch; it's no longer needed.

Changes:
- Remove entire skills/platform/ directory
- Keep /update-platform-docs for incremental updates to existing platform docs
- Keep /component-docs for creating component-level documentation
- Update README to clarify platform docs "already exist"
- Simplify tier architecture description (tier-1/tier-2 → platform/component)
- Update component skill templates to reference "platform docs" consistently
- Update validation scripts to remove platform-specific checks
- Remove platform-docs from marketplace registration

This simplifies the plugin to focus on its two active use cases:
1. Creating new component documentation (/component-docs)
2. Updating existing platform documentation (/update-platform-docs)

- Fix generate-evals to use only the anthropic:claude-sonnet-4-6 provider
- Rewrite the evaluate skill to use a 2-agent architecture:
  - Code claude sub-agent: runs the promptfoo tool
  - Judge claude sub-agent: evaluates results + metrics
- Integrate the metrics plugin for session telemetry
- Remove the manual test spawning approach
- Add comprehensive error handling and documentation

Critical fixes:

1. EVALUATE SKILL (v5.0):
   - Actually spawn judge sub-agent after code agent completes (was missing)
   - Use bundled scripts/run-eval.sh instead of raw promptfoo commands
   - Add explicit step-by-step workflow with Agent tool examples
   - Fix sequential execution (code → collect metrics → judge)
   - Add comprehensive error handling for 100% error rate scenario
   - Document common issues and fixes
2. GENERATE-EVALS SKILL:
   - Fix provider format to simple string: anthropic:claude-sonnet-4-6
   - Remove incorrect object format with id/config
   - Add explicit DO/DON'T examples for provider configuration
   - Change outputPath to ./promptfoo-results.json
   - Change prompts to use file://prompts/system.txt

Issues fixed:
- Test 1 failure: Judge sub-agent now explicitly spawned with results
- 100% error rate: Provider format corrected (was using API format, not promptfoo format)
- Missing workflow: Added complete sequential workflow with Agent() examples
- Script usage: Now uses bundled run-eval.sh for reliable execution

Remove unnecessary code sub-agent (Option B implementation):

BEFORE (v5.0, two sub-agents):
1. Spawn code sub-agent → run promptfoo
2. Spawn judge sub-agent → analyze results

AFTER (v6.0, one sub-agent):
1. Main agent runs run-eval.sh directly
2. Main agent collects session metrics
3. Spawn judge sub-agent → analyze results + metrics

Benefits:
- ✅ Simpler: One sub-agent instead of two
- ✅ Faster: ~20-30s saved (no code sub-agent spawn overhead)
- ✅ Cheaper: ~$0.02-0.05 saved per evaluation
- ✅ Clearer: Main agent runs tools, judge analyzes
- ✅ More reliable: Fewer moving parts, fewer failure modes

Technical changes:
- Removed Step 2 (spawn code sub-agent)
- Main agent now executes bash /scripts/run-eval.sh
- Main agent collects metrics directly from session
- Judge sub-agent receives results from main agent (not from code sub-agent)
- Updated all documentation and examples
- Added complete example workflow showing direct execution

Addresses user question: 'Why cannot the coding sub-agent directly pass results to judge?' Answer: It can't (sub-agents can't spawn sub-agents), but we don't need it anyway; the main agent can run the script directly.
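A rough Python sketch of the v6.0 sequence, assuming the bundled script writes `promptfoo-results.json` into the repository root (the `outputPath` named in the v5.0 notes). `run_eval` and `judge_input` are hypothetical helpers; in the skill, the main agent performs these steps with its own tools and passes the bundle to the judge sub-agent.

```python
import json
import subprocess
from pathlib import Path
from typing import Any


def run_eval(script: Path, repo_root: Path,
             results_file: str = "promptfoo-results.json") -> dict[str, Any]:
    """Step 1: run the bundled eval script directly (no code sub-agent)
    and load the JSON results it leaves in the repository root."""
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(["bash", str(script)], cwd=repo_root, check=True)
    return json.loads((repo_root / results_file).read_text())


def judge_input(results: dict[str, Any], metrics: dict[str, Any]) -> str:
    """Steps 2-3: bundle eval results with session metrics into one
    payload for the judge sub-agent (hypothetical format)."""
    return json.dumps({"results": results, "session_metrics": metrics},
                      indent=2, sort_keys=True)
```

Keeping the script run and metrics collection in the main agent means the judge receives one self-contained payload, which matches the "one sub-agent" design above.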

## Changes

### generate-evals skill (v2.0)
- Add canonical template at templates/promptfooconfig.example.yaml
- Update skill to always use template as foundation
- Document common provider format mistakes to avoid
- Switch from weight-based to llm-rubric assertions
- Use vars.prompt instead of vars.task_description

### evaluate skill (v6.2)
- Add provider validation before running promptfoo
- Add bundled run-eval.sh script for consistent execution
- Add test suite (evals/evals.json) with 3 test cases
- Document skill testing and iteration workflow

### Plugin version
- Bump agentic-docs plugin from 1.0.0 to 1.1.0 (MINOR)
- Reflects enhanced functionality in both skills

## Key improvements
- Prevents invalid Vertex AI provider format (vertex:anthropic:claude-...)
- Template-first approach ensures consistency
- Skills now include their own test infrastructure
- Better error detection and user guidance

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
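The provider validation added in evaluate v6.2 is only described in prose here. A hedged sketch of the idea, assuming the config has already been parsed into a dict (for example with a YAML library) and that the only accepted provider is the plain string `anthropic:claude-sonnet-4-6` named in the v5.0 notes. This is deliberately stricter than promptfoo itself, which accepts many provider forms; `provider_problems` and its messages are illustrative, not the skill's code.

```python
from typing import Any

EXPECTED_PROVIDER = "anthropic:claude-sonnet-4-6"


def provider_problems(config: dict[str, Any]) -> list[str]:
    """Return problems found in the `providers` entry of a parsed
    promptfooconfig; an empty list means it is safe to run promptfoo."""
    providers = config.get("providers")
    if not providers:
        return ["no providers configured"]
    if isinstance(providers, str):
        providers = [providers]  # treat a bare string as a one-element list
    problems = []
    for index, provider in enumerate(providers):
        if not isinstance(provider, str):
            # The PR replaced an object form (id/config) with a plain string.
            problems.append(f"providers[{index}] must be a plain string, "
                            f"got {type(provider).__name__}")
        elif provider.startswith("vertex:anthropic:"):
            problems.append(f"providers[{index}] uses the Vertex AI form "
                            f"{provider!r}; use {EXPECTED_PROVIDER!r}")
        elif provider != EXPECTED_PROVIDER:
            problems.append(f"providers[{index}] is {provider!r}; "
                            f"expected {EXPECTED_PROVIDER!r}")
    return problems
```

Running a check like this before promptfoo means a bad config produces fix instructions instead of a run where every test errors out.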

openshift-ci · 2026-05-15T00:56:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kenjpais
Once this PR has been reviewed and has the lgtm label, please assign theobarberbany for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-15T00:56:20Z

Walkthrough

Adds an agentic-docs plugin (generate-evals and evaluate skills, templates, scripts, docs), registers it in marketplace/docs registries and PLUGINS.md, and introduces ai-docs telemetry: a metrics command plus a Python script to extract ai-docs usage from Claude Code session logs.

Changes

agentic-docs Plugin and Evaluation Suite

| Layer / File(s) | Summary |
|---|---|
| Plugin registration and manifest (`.claude-plugin/marketplace.json`, `docs/data.json`, `plugins/agentic-docs/.claude-plugin/plugin.json`, `PLUGINS.md`) | Adds agentic-docs plugin entry to marketplace and docs registries, updates metrics command registration in `docs/data.json`, and updates PLUGINS.md TOC/sections. |
| Generate-evals skill and template (`plugins/agentic-docs/commands/generate-evals.md`, `plugins/agentic-docs/skills/generate-evals/SKILL.md`, `plugins/agentic-docs/skills/generate-evals/templates/promptfooconfig.example.yaml`) | Skill and example promptfoo template to generate repository-specific `promptfooconfig.yaml` and `EVALUATION.md` with multiple SME/convention test cases and rubric assertions. |
| Evaluate skill, eval configs, and runner (`plugins/agentic-docs/commands/evaluate.md`, `plugins/agentic-docs/skills/evaluate/SKILL.md`, `plugins/agentic-docs/skills/evaluate/evals/evals.json`, `plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh`) | Adds evaluation command docs, comprehensive SKILL describing baseline/with-docs comparative workflow, eval scenarios, evidence collection, judge sub-agent prompt, and a `run-eval.sh` helper to run promptfoo. |
| Metrics: ai-docs telemetry (`plugins/metrics/README.md`, `plugins/metrics/commands/ai-docs-telemetry.md`, `plugins/metrics/scripts/ai_docs_telemetry.py`) | Adds `/metrics:ai-docs-telemetry` documentation and a Python script that scans Claude Code JSONL sessions for ai-docs/AGENTS/CLAUDE reads and emits structured telemetry JSON; supports scanning recent sessions and single-session analysis. |
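To make the telemetry script's job concrete: Claude Code session logs are JSON Lines, one event per line. The sketch below counts `Read` tool calls that target ai-docs/, AGENTS.md, or CLAUDE.md. The event shape it expects (`message.content` holding blocks with `type: "tool_use"`, `name: "Read"`, and `input.file_path`) is an assumption about the log format, and `count_doc_reads` is a hypothetical name rather than a function from `ai_docs_telemetry.py`.

```python
import json
from collections import Counter
from typing import Iterable

DOC_MARKERS = ("ai-docs/", "AGENTS.md", "CLAUDE.md")


def count_doc_reads(lines: Iterable[str]) -> Counter:
    """Count Read tool calls per documentation file in a JSONL session log.

    Assumes each line is a JSON object whose `message.content` may hold
    tool_use blocks; lines that are not valid JSON are skipped.
    """
    reads: Counter = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") != "tool_use" or block.get("name") != "Read":
                continue
            path = (block.get("input") or {}).get("file_path", "")
            if any(marker in path for marker in DOC_MARKERS):
                reads[path] += 1
    return reads
```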

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

approved, lgtm

Suggested reviewers

theobarberbany
stleerh

🚥 Pre-merge checks | ✅ 10

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: adding promptfoo-based evaluation workflows (evals) to the agentic-docs plugin, which is the primary focus of this changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| No Real People Names In Style References | ✅ Passed | Comprehensive search across all PR files found no real people names used as style references, in example prompts, or in instructions. All style guidance uses explicit quality descriptions instead. |
| No Assumed Git Remote Names | ✅ Passed | No hardcoded git remote names (origin, upstream) found in any new files added by this PR. All scripts, documentation, and configurations avoid assuming git remote names. |
| Git Push Safety Rules | ✅ Passed | PR contains no git push, force push, or autonomous push workflows. New skills are for evaluation/analysis only with no version control operations. |
| No Untrusted Mcp Servers | ✅ Passed | PR introduces no MCP servers from untrusted sources. Only legitimate tools used: promptfoo via npx and Python standard library. No new package dependencies or external service calls. |
| Ai-Helpers Overlap Detection | ✅ Passed | No AI-helpers overlap detected. New agentic-docs and metrics commands occupy distinct domains (docs evaluation, eval generation, session telemetry) with <35% similarity to existing functionality. |


github-actions · 2026-05-15T00:56:46Z

skillsaw: additional violations

| Severity | Rule | File | Message |
|---|---|---|---|
| ⚠️ warning | `plugin-readme` | `plugins/agentic-docs/README.md` | Missing README.md (recommended) |
| ❌ error | `plugins-doc-up-to-date` | `PLUGINS.md` | PLUGINS.md is out of sync with plugin metadata. Run 'make update' to update. |

github-actions · 2026-05-15T00:56:46Z

⚠️ warning (agentskill-evals): evals[0] all assertions must be strings

github-actions · 2026-05-15T00:56:47Z

⚠️ warning (agentskill-evals): evals[1] all assertions must be strings

github-actions · 2026-05-15T00:56:48Z

⚠️ warning (agentskill-evals): evals[2] all assertions must be strings
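These three warnings point at the object-shaped assertions in `evaluate/evals/evals.json`, such as `{"name": "did_not_run_promptfoo", "description": "..."}`. Assuming the rule only requires each entry of `assertions` to be a JSON string, and that the file has a top-level `evals` list (as the `evals[0]` index in the warning suggests), one way to satisfy it while keeping the information is to fold each object into a single sentence. The `name: description` format and `stringify_assertions` are hypothetical choices for illustration.

```python
import json
from typing import Any


def stringify_assertions(doc: dict[str, Any]) -> list[str]:
    """Rewrite object assertions as strings in place and return the warnings
    a string-only rule would have raised before the rewrite."""
    warnings = []
    for index, case in enumerate(doc.get("evals", [])):
        assertions = case.get("assertions", [])
        if all(isinstance(a, str) for a in assertions):
            continue
        warnings.append(f"evals[{index}] all assertions must be strings")
        case["assertions"] = [
            a if isinstance(a, str) else f"{a['name']}: {a['description']}"
            for a in assertions
        ]
    return warnings


if __name__ == "__main__":
    doc = {"evals": [{"eval_name": "example", "assertions": [
        {"name": "did_not_run_promptfoo",
         "description": "Should NOT run promptfoo when invalid config is detected"},
    ]}]}
    print(stringify_assertions(doc))
    print(json.dumps(doc["evals"][0]["assertions"], indent=2))
```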

github-actions · 2026-05-15T00:56:50Z

❌ error (plugin-owners-required): Plugin 'agentic-docs' is missing an OWNERS file

openshift-ci-robot · 2026-05-15T00:59:13Z

@kenjpais: This pull request references OKD-370 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

In response to this:

Summary by CodeRabbit

New Features

- Added agentic-docs plugin for creating and maintaining AI-optimized documentation for OpenShift.
- Added /agentic-docs:evaluate command to assess documentation quality through behavioral validation.
- Added /agentic-docs:generate-evals command to generate repository-specific evaluation configurations.
- Added /metrics:ai-docs-telemetry command to analyze documentation usage patterns from Claude Code sessions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

plugins/metrics/commands/ai-docs-telemetry.md (2)
10-13: ⚡ Quick win

Add language specifiers to fenced code blocks.

For better syntax highlighting and rendering, specify the language for fenced code blocks. Since these are command examples, use bash:
📝 Proposed fix

```diff
 ## Synopsis
-```
+```bash
 /metrics:ai-docs-telemetry -scan [-project <name>]
 /metrics:ai-docs-telemetry -session <path-to-session.jsonl>
```

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @plugins/metrics/commands/ai-docs-telemetry.md around lines 10-13: the
fenced code block containing the command examples
`/metrics:ai-docs-telemetry -scan [-project <name>]` and
`/metrics:ai-docs-telemetry -session <path-to-session.jsonl>` needs a language
specifier for proper highlighting; update its opening fence to specify `bash`
and keep the closing fence so the two command lines are rendered as bash code.
</details>

---

`44-88`: _⚡ Quick win_

**Add language specifiers to example code blocks.**

The example code blocks should specify `bash` for better rendering:

<details>
<summary>📝 Proposed fix</summary>

```diff
 1. **Scan all recent sessions (last 7 days)**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan
    ```
```

```diff
 2. **Scan only enhancements repository**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan -project enhancements
    ```
```

```diff
 3. **Scan only machine-config-operator repository**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan -project machine-config-operator
    ```
```

```diff
 4. **Analyze a specific session**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl
    ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @plugins/metrics/commands/ai-docs-telemetry.md around lines 44-88: add
explicit language specifiers (bash) to all example fenced code blocks that show
command usage for the ai-docs telemetry tool (e.g., blocks containing
`/metrics:ai-docs-telemetry -scan`,
`/metrics:ai-docs-telemetry -scan -project enhancements`,
`/metrics:ai-docs-telemetry -scan -project machine-config-operator`,
`/metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl`,
and the bash pipeline examples using jq) by adding `bash` to each opening fence
so the snippets render correctly as shell commands.
</details>


<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude-plugin/marketplace.json:

Around line 250-255: The marketplace entry for "agentic-docs" has a version
mismatch: marketplace declares "version": "1.0.0" while the plugin manifest
(plugin.json) declares "1.1.0"; update the "version" value in the marketplace
JSON to match the plugin manifest's "1.1.0" (or vice‑versa if you intend to
downgrade) so both "agentic-docs" version fields are identical, and ensure
future releases update both the marketplace entry and the plugin.json together.
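A small CI-style check could catch this kind of drift. The sketch assumes `marketplace.json` has a top-level `plugins` list whose entries carry `name` and `version`, that each `plugin.json` has a top-level `version`, and that plugin directory names match plugin names; verify those shapes against the real files. Both function names are hypothetical.

```python
import json
from pathlib import Path


def version_mismatches(marketplace: dict, manifests: dict[str, dict]) -> list[str]:
    """Compare each marketplace entry's version with its plugin manifest.

    `manifests` maps plugin name -> parsed plugin.json contents.
    """
    problems = []
    for entry in marketplace.get("plugins", []):
        name = entry.get("name")
        manifest = manifests.get(name)
        if manifest is None or "version" not in entry:
            continue  # nothing to compare against
        if entry["version"] != manifest.get("version"):
            problems.append(f"{name}: marketplace {entry['version']} "
                            f"!= plugin.json {manifest.get('version')}")
    return problems


def load_manifests(plugins_dir: Path) -> dict[str, dict]:
    """Read plugins/<name>/.claude-plugin/plugin.json for every plugin,
    keyed by directory name (assumed to equal the plugin name)."""
    manifests = {}
    for path in plugins_dir.glob("*/.claude-plugin/plugin.json"):
        manifests[path.parent.parent.name] = json.loads(path.read_text())
    return manifests
```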

In @docs/data.json:

Around line 1826-1844: Update docs/data.json to match the actual plugin
contents: replace the empty "commands" array with the two command names
"generate-evals" and "evaluate"; replace the "skills" entries for "component"
and "update-platform-docs" with the actual skill objects for the new skills
(ids/names "generate-evals" and "evaluate" and appropriate descriptions matching
the PR); and change the "version" value from "1.0.0" to "1.1.0" to match
plugin.json (verify plugin.json for authoritative version). Ensure the keys
"commands", "skills", and "version" exactly reflect the new symbols
generate-evals and evaluate.
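The same drift can be detected from the filesystem. Assuming the layout in this PR's file list (`commands/<name>.md` and `skills/<id>/SKILL.md` under the plugin directory) and the `name`/`id` keys shown in the data.json hunk, a hypothetical `registry_drift` helper can report what a registry entry is missing or still lists after removal.

```python
from pathlib import Path
from typing import Any


def registry_drift(plugin_dir: Path, entry: dict[str, Any]) -> dict[str, set[str]]:
    """Compare a docs/data.json plugin entry with the plugin's files.

    "*_missing" holds names present on disk but absent from the entry;
    "*_stale" holds names listed in the entry with no file on disk.
    """
    cmds_disk = {p.stem for p in (plugin_dir / "commands").glob("*.md")}
    skills_disk = {p.parent.name for p in (plugin_dir / "skills").glob("*/SKILL.md")}
    cmds_listed = {c["name"] for c in entry.get("commands", [])}
    skills_listed = {s["id"] for s in entry.get("skills", [])}
    return {
        "commands_missing": cmds_disk - cmds_listed,
        "commands_stale": cmds_listed - cmds_disk,
        "skills_missing": skills_disk - skills_listed,
        "skills_stale": skills_listed - skills_disk,
    }
```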

In @plugins/agentic-docs/skills/evaluate/evals/evals.json:

Around line 6-17: This eval is internally inconsistent: the eval named
"happy-path-evaluation" and its prompt/expected_output describe a normal run but
the assertions (e.g., "detected_invalid_provider_config",
"did_not_run_promptfoo", "v60_runs_without_validation") expect invalid-provider
behavior; change this case to consistently represent an invalid-provider
scenario by renaming "eval_name" (e.g., "invalid-provider-evaluation"), updating
"prompt" to state the promptfooconfig.yaml contains an invalid Vertex AI
provider format, and adjust "expected_output" to assert detection of the invalid
provider, instructions to fix, reference to the generate-evals skill, and that
promptfoo is not run; keep the listed assertions as-is so the test suite checks
for detection, fix instructions, no run, and baseline v6.0 behavior.

In @plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh:

Around line 7-8: REPO_ROOT is being set to a plugin-relative path so promptfoo
runs in the wrong directory and misses promptfooconfig.yaml; update run-eval.sh
to compute the true repository root (e.g., use git rev-parse --show-toplevel or
resolve SCRIPT_DIR up to the repo root) and ensure the script cds into that
computed REPO_ROOT before invoking promptfoo (the area around the current
cd/execution that references REPO_ROOT). Also verify promptfoo is invoked with
the correct working directory or explicit config path so promptfooconfig.yaml in
the repo root is found.
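The fix itself belongs in run-eval.sh, but the lookup is easy to sketch in Python. This assumes promptfooconfig.yaml sits at the root of the repository being evaluated and that the script is started from inside that repository, so the search starts from the working directory rather than from the script's own (plugin) location. It prefers `git rev-parse --show-toplevel`, as the review suggests, and otherwise walks upward looking for the config file. `find_eval_root` and the `use_git` switch are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional

CONFIG_NAME = "promptfooconfig.yaml"


def find_eval_root(start: Path, use_git: bool = True) -> Optional[Path]:
    """Locate the root of the repository being evaluated.

    Prefers `git rev-parse --show-toplevel`; if git is unavailable or
    `start` is not in a work tree, walks upward looking for the config file.
    """
    start = start.resolve()
    if use_git:
        try:
            top = subprocess.run(
                ["git", "rev-parse", "--show-toplevel"],
                cwd=start, capture_output=True, text=True, check=True,
            ).stdout.strip()
            if top:
                return Path(top)
        except (OSError, subprocess.CalledProcessError):
            pass  # fall through to the upward search
    for directory in (start, *start.parents):
        if (directory / CONFIG_NAME).is_file():
            return directory
    return None
```

The caller should still confirm that `promptfooconfig.yaml` exists in the returned directory before running promptfoo there.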

In @plugins/metrics/scripts/ai_docs_telemetry.py:

Around line 102-107: The try/except around opening and reading session_path
currently catches broad Exception; narrow it to file-related exceptions (e.g.,
catch FileNotFoundError, PermissionError and IsADirectoryError or a general
OSError) when opening/reading the file so different failure modes aren’t masked,
keep the same error print to sys.stderr and return None as before; update the
block that opens session_path and reads content (the with open(session_path,
'r') as f: / content = f.read() section) to catch these specific exceptions
instead of Exception.
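A sketch of the narrowed handler. `OSError` is the parent of `FileNotFoundError`, `PermissionError`, and `IsADirectoryError`, so one clause covers all three; the sketch also catches `UnicodeDecodeError`, because a text-mode read can fail that way, which goes slightly beyond the review's list. `read_session` and the explicit UTF-8 encoding are assumptions; in `ai_docs_telemetry.py` this logic sits inline.

```python
import sys
from typing import Optional


def read_session(session_path: str) -> Optional[str]:
    """Return the raw JSONL text of a session, or None if it cannot be read."""
    try:
        with open(session_path, "r", encoding="utf-8") as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as exc:
        # Only file-access and decoding failures are expected here; anything
        # else (a real bug) propagates instead of being swallowed.
        print(f"Error reading {session_path}: {exc}", file=sys.stderr)
        return None
```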

Around line 204-209: The pre-filter around session_file.read_text() should
also check for "CLAUDE.md" in addition to "ai-docs/" and "AGENTS.md" so sessions
that only touched CLAUDE.md aren't skipped; update the conditional that
currently reads if not ("ai-docs/" in content or "AGENTS.md" in content) to
include "CLAUDE.md". Also replace the silent except: continue with logged error
handling—catch the exception from session_file.read_text(), log the exception
and the session_file (or its path) using the module's existing logger (e.g.,
logger.exception or logger.error) for visibility, then continue. Ensure you
modify the try/except block around session_file.read_text() and the conditional
that inspects content.
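A sketch of the pre-filter with the missing `CLAUDE.md` marker and logged failures, assuming a module-level `logging.getLogger(__name__)` logger. `candidate_sessions` and `MARKERS` are hypothetical names, and the narrowed `except` follows the previous comment rather than the original silent skip.

```python
import logging
from pathlib import Path
from typing import Iterable, Iterator

logger = logging.getLogger(__name__)

# Substrings that mark a session as worth parsing; CLAUDE.md is the
# marker the review says the current pre-filter misses.
MARKERS = ("ai-docs/", "AGENTS.md", "CLAUDE.md")


def candidate_sessions(session_files: Iterable[Path]) -> Iterator[tuple[Path, str]]:
    """Yield (path, content) for sessions that mention any doc marker."""
    for session_file in session_files:
        try:
            content = session_file.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            # Log with traceback instead of silently skipping the file.
            logger.exception("Could not read session file %s", session_file)
            continue
        if any(marker in content for marker in MARKERS):
            yield session_file, content
```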

Nitpick comments:
In @plugins/metrics/commands/ai-docs-telemetry.md:

Around line 10-13: the fenced code block containing the command examples
`/metrics:ai-docs-telemetry -scan [-project <name>]` and
`/metrics:ai-docs-telemetry -session <path-to-session.jsonl>` needs a language
specifier for proper highlighting; update its opening fence to specify `bash`
and keep the closing fence so the two command lines are rendered as bash code.

Around line 44-88: add explicit language specifiers (bash) to all example
fenced code blocks that show command usage for the ai-docs telemetry tool (e.g.,
blocks containing `/metrics:ai-docs-telemetry -scan`,
`/metrics:ai-docs-telemetry -scan -project enhancements`,
`/metrics:ai-docs-telemetry -scan -project machine-config-operator`,
`/metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl`,
and the bash pipeline examples using jq) by adding `bash` to each opening fence
so the snippets render correctly as shell commands.
</details>


---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Enterprise

**Run ID**: `f0fc34ce-a64e-46b8-b2b5-72896a629198`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 503e009a7755326342b30ef7efc736e5f89d079c and d2b43331a30d942ddda31300873a247280284e9e.

</details>

<details>
<summary>📒 Files selected for processing (13)</summary>

* `.claude-plugin/marketplace.json`
* `docs/data.json`
* `plugins/agentic-docs/.claude-plugin/plugin.json`
* `plugins/agentic-docs/commands/evaluate.md`
* `plugins/agentic-docs/commands/generate-evals.md`
* `plugins/agentic-docs/skills/evaluate/SKILL.md`
* `plugins/agentic-docs/skills/evaluate/evals/evals.json`
* `plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh`
* `plugins/agentic-docs/skills/generate-evals/SKILL.md`
* `plugins/agentic-docs/skills/generate-evals/templates/promptfooconfig.example.yaml`
* `plugins/metrics/README.md`
* `plugins/metrics/commands/ai-docs-telemetry.md`
* `plugins/metrics/scripts/ai_docs_telemetry.py`

</details>

</details>

coderabbitai · 2026-05-15T01:00:46Z

```diff
+    {
+      "commands": [],
+      "description": "Create and maintain AI-optimized documentation for OpenShift",
+      "has_readme": true,
+      "hooks": [],
+      "name": "agentic-docs",
+      "skills": [
+        {
+          "description": "Create lean component documentation for OpenShift repositories",
+          "id": "component",
+          "name": "component-docs"
+        },
+        {
+          "description": "Update existing platform documentation with automatic gap detection in openshift/enhancements",
+          "id": "update-platform-docs",
+          "name": "update-platform-docs"
+        }
+      ],
+      "version": "1.0.0"
```


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Plugin registration data is inconsistent with actual plugin contents.

The docs/data.json entry has three critical mismatches:

- Commands: Empty array, but the PR adds two commands (generate-evals and evaluate).
- Skills: Lists component and update-platform-docs, but the PR description and actual files define generate-evals and evaluate skills.
- Version: Shows "1.0.0", but plugin.json declares "1.1.0".

🔧 Proposed fix

```diff
     },
     {
-      "commands": [],
+      "commands": [
+        {
+          "argument_hint": "[repository-path]",
+          "description": "Generate repository-specific promptfoo evaluation suites for OpenShift documentation",
+          "name": "generate-evals",
+          "synopsis": "/agentic-docs:generate-evals [repository-path]"
+        },
+        {
+          "argument_hint": "[repository-path]",
+          "description": "Evaluate agentic documentation quality using promptfoo-based behavioral validation",
+          "name": "evaluate",
+          "synopsis": "/agentic-docs:evaluate [repository-path]"
+        }
+      ],
       "description": "Create and maintain AI-optimized documentation for OpenShift",
       "has_readme": true,
       "hooks": [],
       "name": "agentic-docs",
       "skills": [
         {
-          "description": "Create lean component documentation for OpenShift repositories",
-          "id": "component",
-          "name": "component-docs"
+          "description": "Generate repository-specific promptfoo evaluation suites tailored to OpenShift conventions and repository patterns",
+          "id": "generate-evals",
+          "name": "agentic-docs:generate-evals"
         },
         {
-          "description": "Update existing platform documentation with automatic gap detection in openshift/enhancements",
-          "id": "update-platform-docs",
-          "name": "update-platform-docs"
+          "description": "Evaluate agentic documentation quality using promptfoo-based behavioral validation with natural discovery testing",
+          "id": "evaluate",
+          "name": "agentic-docs:evaluate"
         }
       ],
-      "version": "1.0.0"
+      "version": "1.1.0"
     }
   ]
 }
```


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/data.json` around lines 1826 - 1844, Update docs/data.json to match the actual plugin contents: replace the empty "commands" array with the two command names "generate-evals" and "evaluate"; replace the "skills" entries for "component" and "update-platform-docs" with the actual skill objects for the new skills (ids/names "generate-evals" and "evaluate" and appropriate descriptions matching the PR); and change the "version" value from "1.0.0" to "1.1.0" to match plugin.json (verify plugin.json for authoritative version). Ensure the keys "commands", "skills", and "version" exactly reflect the new symbols generate-evals and evaluate.

coderabbitai · 2026-05-15T01:00:46Z

```diff
+      "eval_name": "happy-path-evaluation",
+      "prompt": "I just created documentation for the multiarch-tuning-operator repository at /Users/kpais/kpais-workspace/claude-tmp/multiarch-tuning-operator-test-plugin. I ran /agentic-docs:generate-evals and it created a promptfooconfig.yaml file with 43 test cases. Now I want to run the evaluation to see if the documentation is good. Can you evaluate it?",
+      "expected_output": "Should spawn code sub-agent to run promptfoo, collect metrics, spawn judge sub-agent to analyze results, and produce comprehensive evaluation report",
+      "files": [],
+      "setup_required": "Repository with promptfooconfig.yaml, ANTHROPIC_API_KEY set",
+      "assertions": [
+        {"name": "detected_invalid_provider_config", "description": "v6.1 should detect the invalid Vertex AI provider format in promptfooconfig.yaml"},
+        {"name": "provided_fix_instructions", "description": "Should provide clear instructions on how to fix the provider configuration"},
+        {"name": "referenced_generate_evals_skill", "description": "Should reference the generate-evals skill documentation for the correct format"},
+        {"name": "did_not_run_promptfoo", "description": "Should NOT run promptfoo when invalid config is detected"},
+        {"name": "clear_next_steps", "description": "Should provide clear next steps (edit config or regenerate)"},
+        {"name": "v60_runs_without_validation", "description": "v6.0 (baseline) should attempt to run promptfoo and encounter API errors"}
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make eval case 1 internally consistent.

Lines 6 and 8 define a happy-path run, but lines 12–17 assert invalid-provider handling and “did_not_run_promptfoo”. This contradiction can make the suite report misleading results.

Suggested fix

```diff
       "assertions": [
-        {"name": "detected_invalid_provider_config", "description": "v6.1 should detect the invalid Vertex AI provider format in promptfooconfig.yaml"},
-        {"name": "provided_fix_instructions", "description": "Should provide clear instructions on how to fix the provider configuration"},
-        {"name": "referenced_generate_evals_skill", "description": "Should reference the generate-evals skill documentation for the correct format"},
-        {"name": "did_not_run_promptfoo", "description": "Should NOT run promptfoo when invalid config is detected"},
-        {"name": "clear_next_steps", "description": "Should provide clear next steps (edit config or regenerate)"},
-        {"name": "v60_runs_without_validation", "description": "v6.0 (baseline) should attempt to run promptfoo and encounter API errors"}
+        {"name": "spawned_code_subagent", "description": "Should spawn code sub-agent to run promptfoo"},
+        {"name": "ran_promptfoo_tests", "description": "Should execute promptfoo evals successfully"},
+        {"name": "spawned_judge_subagent", "description": "Should spawn judge sub-agent to analyze results"},
+        {"name": "reported_quality_summary", "description": "Should report pass/fail quality summary"},
+        {"name": "reported_cost_latency", "description": "Should include cost and latency regression checks"},
+        {"name": "clear_next_steps", "description": "Should provide clear next steps based on results"}
       ]
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 "eval_name": "happy-path-evaluation",
 "prompt": "I just created documentation for the multiarch-tuning-operator repository at /Users/kpais/kpais-workspace/claude-tmp/multiarch-tuning-operator-test-plugin. I ran /agentic-docs:generate-evals and it created a promptfooconfig.yaml file with 43 test cases. Now I want to run the evaluation to see if the documentation is good. Can you evaluate it?",
 "expected_output": "Should spawn code sub-agent to run promptfoo, collect metrics, spawn judge sub-agent to analyze results, and produce comprehensive evaluation report",
 "files": [],
 "setup_required": "Repository with promptfooconfig.yaml, ANTHROPIC_API_KEY set",
 "assertions": [
-  {"name": "detected_invalid_provider_config", "description": "v6.1 should detect the invalid Vertex AI provider format in promptfooconfig.yaml"},
-  {"name": "provided_fix_instructions", "description": "Should provide clear instructions on how to fix the provider configuration"},
-  {"name": "referenced_generate_evals_skill", "description": "Should reference the generate-evals skill documentation for the correct format"},
-  {"name": "did_not_run_promptfoo", "description": "Should NOT run promptfoo when invalid config is detected"},
-  {"name": "clear_next_steps", "description": "Should provide clear next steps (edit config or regenerate)"},
-  {"name": "v60_runs_without_validation", "description": "v6.0 (baseline) should attempt to run promptfoo and encounter API errors"}
+  {"name": "spawned_code_subagent", "description": "Should spawn code sub-agent to run promptfoo"},
+  {"name": "ran_promptfoo_tests", "description": "Should execute promptfoo evals successfully"},
+  {"name": "spawned_judge_subagent", "description": "Should spawn judge sub-agent to analyze results"},
+  {"name": "reported_quality_summary", "description": "Should report pass/fail quality summary"},
+  {"name": "reported_cost_latency", "description": "Should include cost and latency regression checks"},
+  {"name": "clear_next_steps", "description": "Should provide clear next steps based on results"}
 ]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/agentic-docs/skills/evaluate/evals/evals.json` around lines 6 - 17, This eval is internally inconsistent: the eval named "happy-path-evaluation" and its prompt/expected_output describe a normal run but the assertions (e.g., "detected_invalid_provider_config", "did_not_run_promptfoo", "v60_runs_without_validation") expect invalid-provider behavior; change this case to consistently represent an invalid-provider scenario by renaming "eval_name" (e.g., "invalid-provider-evaluation"), updating "prompt" to state the promptfooconfig.yaml contains an invalid Vertex AI provider format, and adjust "expected_output" to assert detection of the invalid provider, instructions to fix, reference to the generate-evals skill, and that promptfoo is not run; keep the listed assertions as-is so the test suite checks for detection, fix instructions, no run, and baseline v6.0 behavior.

coderabbitai · 2026-05-15T01:00:46Z

+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+REPO_ROOT="$( cd "$SCRIPT_DIR/../.." && pwd )"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix target directory resolution before running promptfoo.

Line 8 resolves to a plugin-relative path, not the repository being evaluated, and Line 45–Line 46 then force execution there. As a result, promptfoo can fail to find promptfooconfig.yaml and the evaluation breaks.

Suggested fix

 SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
-REPO_ROOT="$( cd "$SCRIPT_DIR/../.." && pwd )"
+TARGET_REPO="${TARGET_REPO:-$PWD}"
@@
-# Change to repo root (where config and files are)
-cd "$REPO_ROOT"
+# Change to target repository (where promptfooconfig.yaml should exist)
+cd "$TARGET_REPO"
+
+if [ ! -f "promptfooconfig.yaml" ]; then
+  echo "❌ Error: promptfooconfig.yaml not found in $TARGET_REPO"
+  echo "   Run /agentic-docs:generate-evals first or set TARGET_REPO correctly."
+  exit 1
+fi

Also applies to: 45-50

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh` around lines 7 - 8, REPO_ROOT is being set to a plugin-relative path so promptfoo runs in the wrong directory and misses promptfooconfig.yaml; update run-eval.sh to compute the true repository root (e.g., use git rev-parse --show-toplevel or resolve SCRIPT_DIR up to the repo root) and ensure the script cds into that computed REPO_ROOT before invoking promptfoo (the area around the current cd/execution that references REPO_ROOT). Also verify promptfoo is invoked with the correct working directory or explicit config path so promptfooconfig.yaml in the repo root is found.
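For readers who want to test the resolution order in isolation, here is a minimal Python sketch of it. It is hypothetical and not part of run-eval.sh: `resolve_target_repo` honors a `TARGET_REPO` override (as in the suggested fix), otherwise falls back to `git rev-parse --show-toplevel` from the current directory (as the agent prompt suggests), and fails fast when `promptfooconfig.yaml` is missing.

```python
import os
import subprocess
from pathlib import Path

CONFIG_NAME = "promptfooconfig.yaml"


def resolve_target_repo(env=None, cwd=None):
    """Hypothetical helper: locate the repository whose evals should run.

    Assumed precedence, mirroring the review suggestions:
      1. TARGET_REPO environment variable, if set
      2. git top-level directory of the current working directory
      3. the current working directory itself
    """
    env = os.environ if env is None else env
    cwd = Path(cwd or os.getcwd())

    if env.get("TARGET_REPO"):
        repo = Path(env["TARGET_REPO"])
    else:
        try:
            top = subprocess.run(
                ["git", "rev-parse", "--show-toplevel"],
                cwd=cwd, capture_output=True, text=True, check=True,
            ).stdout.strip()
            repo = Path(top)
        except (OSError, subprocess.CalledProcessError):
            # git is not installed, or cwd is not inside a git work tree
            repo = cwd

    if not (repo / CONFIG_NAME).is_file():
        raise FileNotFoundError(
            f"{CONFIG_NAME} not found in {repo}; run /agentic-docs:generate-evals "
            "first or set TARGET_REPO correctly."
        )
    return repo.resolve()
```

Raising instead of exiting keeps the helper testable; a wrapper script would translate the exception into a non-zero exit code.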

coderabbitai · 2026-05-15T01:00:46Z

+    try:
+        with open(session_path, 'r') as f:
+            content = f.read()
+    except Exception as e:
+        print(f"Error reading session: {e}", file=sys.stderr)
+        return None


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Catch specific exceptions instead of broad Exception.

Catching broad Exception masks different error scenarios. Specify file-related exceptions for clearer error handling:

🛡️ Proposed fix

 try:
     with open(session_path, 'r') as f:
         content = f.read()
-except Exception as e:
+except (FileNotFoundError, PermissionError, IOError) as e:
     print(f"Error reading session: {e}", file=sys.stderr)
     return None


🧰 Tools

🪛 Ruff (0.15.12)

[warning] 105-105: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/metrics/scripts/ai_docs_telemetry.py` around lines 102 - 107, The try/except around opening and reading session_path currently catches broad Exception; narrow it to file-related exceptions (e.g., catch FileNotFoundError, PermissionError and IsADirectoryError or a general OSError) when opening/reading the file so different failure modes aren’t masked, keep the same error print to sys.stderr and return None as before; update the block that opens session_path and reads content (the with open(session_path, 'r') as f: / content = f.read() section) to catch these specific exceptions instead of Exception.
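A minimal sketch of the narrowed handler, assuming the surrounding function only needs the file's text or `None` (`read_session` is a hypothetical name). In Python 3, `IOError` is an alias of `OSError`, and `FileNotFoundError`, `PermissionError`, and `IsADirectoryError` are all subclasses of it, so catching `OSError` alone covers the tuple in the proposed fix. Decoding failures raise `UnicodeDecodeError`, which is not an `OSError`; whether to catch it depends on whether session logs can contain non-UTF-8 bytes, which is an assumption worth checking.

```python
import sys


def read_session(session_path):
    """Return the session log text, or None if it cannot be read."""
    try:
        with open(session_path, "r", encoding="utf-8") as f:
            return f.read()
    except OSError as e:
        # Covers FileNotFoundError, PermissionError and IsADirectoryError;
        # IOError is the same class as OSError in Python 3.
        print(f"Error reading session: {e}", file=sys.stderr)
        return None
    except UnicodeDecodeError as e:
        # Assumed policy: skip a log with invalid UTF-8 instead of crashing the scan.
        print(f"Error decoding session: {e}", file=sys.stderr)
        return None
```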

coderabbitai · 2026-05-15T01:00:46Z

+        try:
+            content = session_file.read_text()
+            if not ("ai-docs/" in content or "AGENTS.md" in content):
+                continue
+        except Exception:
+            continue


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pre-filter is missing "CLAUDE.md" check and lacks error logging.

Two issues:

Line 206 checks for "ai-docs/" and "AGENTS.md" but not "CLAUDE.md", even though the full processing at line 140 includes it. Sessions with only CLAUDE.md accesses will be incorrectly skipped.

The try-except silently continues without logging, making it difficult to diagnose issues.

🔧 Proposed fix

 # Quick pre-filter: check if file contains ai-docs markers
 try:
     content = session_file.read_text()
-    if not ("ai-docs/" in content or "AGENTS.md" in content):
+    if not ("ai-docs/" in content or "AGENTS.md" in content or "CLAUDE.md" in content):
         continue
-except Exception:
+except (FileNotFoundError, PermissionError, IOError) as e:
+    print(f"Warning: Could not read {session_file}: {e}", file=sys.stderr)
     continue

🧰 Tools

🪛 Ruff (0.15.12)

[error] 208-209: try-except-continue detected, consider logging the exception

(S112)

[warning] 208-208: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/metrics/scripts/ai_docs_telemetry.py` around lines 204 - 209, The pre-filter around session_file.read_text() should also check for "CLAUDE.md" in addition to "ai-docs/" and "AGENTS.md" so sessions that only touched CLAUDE.md aren't skipped; update the conditional that currently reads if not ("ai-docs/" in content or "AGENTS.md" in content) to include "CLAUDE.md". Also replace the silent except: continue with logged error handling—catch the exception from session_file.read_text(), log the exception and the session_file (or its path) using the module's existing logger (e.g., logger.exception or logger.error) for visibility, then continue. Ensure you modify the try/except block around session_file.read_text() and the conditional that inspects content.
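A sketch of the corrected pre-filter, assuming session logs are `*.jsonl` files in a single directory (the glob pattern and the `prefilter_sessions` name are assumptions, not the script's actual layout). It checks all three markers used by the full processing step and logs unreadable files through the standard `logging` module, as the agent prompt suggests, instead of silently continuing.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# The same markers the full processing step looks for.
AI_DOCS_MARKERS = ("ai-docs/", "AGENTS.md", "CLAUDE.md")


def prefilter_sessions(session_dir):
    """Yield session files whose text mentions at least one ai-docs marker."""
    for session_file in sorted(Path(session_dir).glob("*.jsonl")):
        try:
            content = session_file.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as e:
            # Log and skip instead of a silent `continue` (Ruff S112).
            logger.warning("Could not read %s: %s", session_file, e)
            continue
        if any(marker in content for marker in AI_DOCS_MARKERS):
            yield session_file
```

Keeping the markers in one tuple means the pre-filter and the full pass can share a single definition, which is what let the `CLAUDE.md` check drift in the first place.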

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

plugins/agentic-docs/skills/evaluate/SKILL.md (1)
1034-1034: 💤 Low value

Consider adding language identifiers to fenced code blocks.

Several fenced code blocks (at lines 1034, 1067, 1099, 1112, 1139, and 1179) lack language identifiers. Adding text, markdown, or other appropriate language tags would improve syntax highlighting and accessibility.
Example fix
-```
+```text
 ERROR: Evaluation configuration not found
 ...
 ```
Also applies to: 1067-1067, 1099-1099, 1112-1112, 1139-1139, 1179-1179
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/agentic-docs/skills/evaluate/SKILL.md` at line 1034, Several fenced
code blocks in the evaluate skill markdown currently open with bare ``` and lack
language hints (e.g., blocks containing "ERROR: Evaluation configuration not
found" and similar examples); update each opening fence from ``` to a suitable
language tag such as ```text or ```markdown (choose `text` for plain
error/output blocks and `markdown`/other for formatted snippets) so syntax
highlighting and accessibility are improved, ensuring every code fence in the
SKILL.md evaluate documentation has a language identifier.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/data.json`:
- Around line 1329-1332: The ai-docs-telemetry command metadata is inconsistent:
the argument_hint field contains "[-session <path>]" but the synopsis does not;
update the JSON so both match—either remove "[-session <path>]" from
argument_hint or add "[-session <path>]" into the synopsis string for the
"ai-docs-telemetry" entry so that "argument_hint" and "synopsis" are consistent.

In `@plugins/agentic-docs/skills/evaluate/SKILL.md`:
- Around line 1269-1272: The SKILL.md lists a non-existent command
'/agentic-docs:component' causing inaccurate docs; remove that entry (or replace
it with a real command such as '/agentic-docs:evaluate' if intended) from the
markdown and ensure the plugin command list matches the registry in
docs/data.json which only contains 'evaluate' and 'generate-evals'; update the
line in SKILL.md that currently contains '/agentic-docs:component' so the
documented commands exactly match the names in docs/data.json.

---

Nitpick comments:
In `@plugins/agentic-docs/skills/evaluate/SKILL.md`:
- Line 1034: Several fenced code blocks in the evaluate skill markdown currently
open with bare ``` and lack language hints (e.g., blocks containing "ERROR:
Evaluation configuration not found" and similar examples); update each opening
fence from ``` to a suitable language tag such as ```text or ```markdown (choose
`text` for plain error/output blocks and `markdown`/other for formatted
snippets) so syntax highlighting and accessibility are improved, ensuring every
code fence in the SKILL.md evaluate documentation has a language identifier.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3822426f-7a26-4c5b-8542-f6ccf8fdafed

📥 Commits

Reviewing files that changed from the base of the PR and between d2b4333 and dfda728.

📒 Files selected for processing (5)

.claude-plugin/marketplace.json
PLUGINS.md
docs/data.json
plugins/agentic-docs/.claude-plugin/plugin.json
plugins/agentic-docs/skills/evaluate/SKILL.md

✅ Files skipped from review due to trivial changes (1)

PLUGINS.md

🚧 Files skipped from review as they are similar to previous changes (1)

plugins/agentic-docs/.claude-plugin/plugin.json

coderabbitai · 2026-05-15T19:21:12Z

+          "argument_hint": "[-scan] [-project <name>] [-session <path>]",
+          "description": "Analyze Claude Code session logs for ai-docs usage patterns",
+          "name": "ai-docs-telemetry",
+          "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Minor: Inconsistency between argument_hint and synopsis.

The argument_hint includes [-session <path>] but the synopsis omits it. Either add -session to the synopsis or remove it from the argument_hint to keep them consistent.

📝 Proposed fix

 {
-  "argument_hint": "[-scan] [-project <name>] [-session <path>]",
+  "argument_hint": "[-scan] [-project <name>]",
   "description": "Analyze Claude Code session logs for ai-docs usage patterns",
   "name": "ai-docs-telemetry",
   "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]"
 }

Or if -session is intentional:

 {
   "argument_hint": "[-scan] [-project <name>] [-session <path>]",
   "description": "Analyze Claude Code session logs for ai-docs usage patterns",
   "name": "ai-docs-telemetry",
-  "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]"
+  "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>] [-session <path>]"
 }


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/data.json` around lines 1329 - 1332, The ai-docs-telemetry command metadata is inconsistent: the argument_hint field contains "[-session <path>]" but the synopsis does not; update the JSON so both match—either remove "[-session <path>]" from argument_hint or add "[-session <path>]" into the synopsis string for the "ai-docs-telemetry" entry so that "argument_hint" and "synopsis" are consistent.
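A small consistency check could flag this kind of drift automatically. The sketch is hypothetical: `flag_mismatch` compares the flags mentioned in `argument_hint` and `synopsis` for one command entry, using the field names from the diff above. How entries are nested inside docs/data.json is not shown here, so loading the file is left out.

```python
import re

# A flag is a "-name" token at the start of the string or after whitespace or "[",
# so the hyphens inside "/metrics:ai-docs-telemetry" are not mistaken for flags.
FLAG_RE = re.compile(r"(?:^|[\s\[])(-[a-z][\w-]*)")


def flags(text):
    """Return the set of flag names mentioned in a usage string."""
    return set(FLAG_RE.findall(text))


def flag_mismatch(command):
    """Return flags that appear in exactly one of argument_hint and synopsis."""
    return flags(command.get("argument_hint", "")) ^ flags(command.get("synopsis", ""))
```

For the entry above, the mismatch is `{"-session"}`; either proposed fix reduces it to an empty set.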

coderabbitai · 2026-05-15T19:21:13Z

+
+- `/agentic-docs:generate-evals` - Generate promptfooconfig.yaml before evaluating
+- `/agentic-docs:component` - Create component documentation to evaluate
+- `/metrics:ai-docs-telemetry` - Analyze documentation usage patterns


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that component command exists in docs registry
if grep -q '"name": "component"' docs/data.json; then
  echo "component command found in registry"
else
  echo "component command NOT found in registry - reference should be removed"
fi

Repository: openshift-eng/ai-helpers

Length of output: 138

Remove non-existent command reference.

Line 1271 references /agentic-docs:component, but this command does not exist in the plugin's command registry (docs/data.json contains only evaluate and generate-evals).

🔧 Proposed fix

 ## Related Commands
 - `/agentic-docs:generate-evals` - Generate promptfooconfig.yaml before evaluating
-- `/agentic-docs:component` - Create component documentation to evaluate
 - `/metrics:ai-docs-telemetry` - Analyze documentation usage patterns


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/agentic-docs/skills/evaluate/SKILL.md` around lines 1269 - 1272, The SKILL.md lists a non-existent command '/agentic-docs:component' causing inaccurate docs; remove that entry (or replace it with a real command such as '/agentic-docs:evaluate' if intended) from the markdown and ensure the plugin command list matches the registry in docs/data.json which only contains 'evaluate' and 'generate-evals'; update the line in SKILL.md that currently contains '/agentic-docs:component' so the documented commands exactly match the names in docs/data.json.
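The shell check above tests a single name. A slightly more general sketch, hypothetical and not part of the PR, extracts every `/agentic-docs:<name>` reference from a markdown file and reports names missing from a given set of registered commands. Building that set from docs/data.json is left out because the file's structure is not shown here.

```python
import re

# Matches the command name after the "/agentic-docs:" prefix, e.g. "generate-evals".
REF_RE = re.compile(r"/agentic-docs:([a-z0-9][a-z0-9-]*)")


def unknown_command_refs(markdown, registered):
    """Return referenced /agentic-docs:<name> commands absent from `registered`."""
    return sorted(set(REF_RE.findall(markdown)) - set(registered))
```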

github-actions · 2026-05-15T19:28:13Z

❌ error (plugins-doc-up-to-date): docs/data.json is out of sync with plugin metadata. Run 'make update' to update.

Prashanth684 · 2026-05-15T21:25:27Z

+    label: claude
+
+prompts:
+  - "{{prompt}}"


let's craft the prompt like this:

You are working in the <repo-name> repository.
{{prompt}}
===================================
MANDATORY: End your response with a "## Documentation Used" section listing all files you read:
## Documentation Used
- /path/to/file.md (reason)
DO NOT SKIP THIS SECTION.
===================================

so that we can later check in the rubric that the documentation was indeed used. Check https://github.com/openshift/enhancements/pull/1992/changes#diff-c7c3415c9cea54e2f9f4b6c84a6d9f381aaad790522f156c94dd39cf4af278d9 for an example.
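If generate-evals emits this wrapper, it can be kept as a single template string. The sketch below is hypothetical: `build_prompt_template` fills in the repository name at generation time and leaves `{{prompt}}` untouched so that promptfoo can substitute each test's `prompt` variable at run time.

```python
DOCS_USED_FOOTER = (
    "===================================\n"
    'MANDATORY: End your response with a "## Documentation Used" section '
    "listing all files you read:\n"
    "## Documentation Used\n"
    "- /path/to/file.md (reason)\n"
    "DO NOT SKIP THIS SECTION.\n"
    "==================================="
)


def build_prompt_template(repo_name):
    """Return the prompt template for the `prompts` list in promptfooconfig.yaml."""
    # Plain concatenation keeps {{prompt}} literal for promptfoo's own templating.
    return "You are working in the " + repo_name + " repository.\n{{prompt}}\n" + DOCS_USED_FOOTER
```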

Prashanth684 · 2026-05-15T21:25:55Z

+        What API changes and controller logic are needed?
+    assert:
+      - type: llm-rubric
+        value: "The output mentions platform-specific KMS services (AWS KMS and Azure Key Vault)"


The rubric should also check that the agentic documentation was actually used.
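An `llm-rubric` can ask the judge whether documentation was used, and a deterministic check on the "## Documentation Used" section can complement it. This is a sketch under two assumptions: that promptfoo's file-based Python assertion hook (`type: python` pointing at a file that defines `get_assert(output, context)`) is used, so verify the exact contract in the promptfoo docs; and that agentic docs live under `ai-docs/` or in `AGENTS.md`.

```python
import re

SECTION_RE = re.compile(r"^##\s+Documentation Used\s*$", re.MULTILINE)
DOC_HINTS = ("ai-docs/", "AGENTS.md")  # assumed locations of agentic docs


def documentation_used(output):
    """Return file paths listed under the '## Documentation Used' heading."""
    match = SECTION_RE.search(output)
    if not match:
        return []
    paths = []
    for line in output[match.end():].splitlines():
        line = line.strip()
        if line.startswith("## "):
            break  # reached the next section
        if line.startswith("- "):
            # "- /path/to/file.md (reason)" -> "/path/to/file.md"
            paths.append(line[2:].split(" (", 1)[0].strip())
    return paths


def get_assert(output, context):
    """Pass only if at least one listed file is agentic documentation."""
    used = documentation_used(output)
    return any(hint in path for path in used for hint in DOC_HINTS)
```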

Prashanth684 · 2026-05-15T21:29:24Z

+    vars:
+      agent: cloud-provider-sme
+      prompt: |
+        We want to implement customer-managed encryption key support for


We should also ensure that any new feature it tries to develop a) is not already present and b) is hypothetical (if it comes up with a name, it must make sure that the API/CRD name does not already exist).
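One way to enforce the "must not already exist" part is a post-check on any API or CRD name the model proposes. The sketch is hypothetical: `name_in_repo` does a plain substring scan over selected text files under the repository root, skipping `.git`. This is crude compared with a real API lookup, but easy to run from a grader.

```python
import os


def name_in_repo(repo_root, name, exts=(".go", ".yaml", ".yml", ".md", ".json")):
    """Return True if `name` appears in any matching text file under repo_root."""
    for dirpath, dirnames, filenames in os.walk(repo_root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # prune VCS metadata
        for filename in filenames:
            if not filename.endswith(exts):
                continue
            path = os.path.join(dirpath, filename)
            # errors="ignore" so stray non-UTF-8 bytes do not abort the scan
            with open(path, encoding="utf-8", errors="ignore") as f:
                if name in f.read():
                    return True
    return False
```

A proposed name that returns `True` means the "hypothetical" feature already exists, and the test case should be regenerated.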

Prashanth684 · 2026-05-15T21:31:07Z

+   - description: "conventions/01-api-versioning"
+     vars:
+       prompt: |
+         Review: "We should create a new <RepoSpecificAPI> starting at v1."


Rather than asking whether it is correct, the prompt should just ask the LLM to carry out the change that contains the violation. We expect the LLM to say it shouldn't, based on the documentation guidelines.
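A generator could emit such cases with the request phrased as a task rather than a review question. The sketch below is hypothetical: `anti_pattern_case` builds one promptfoo test entry in the `description`/`vars`/`assert` shape seen in the config excerpts above, and attaches an `llm-rubric` that checks whether the model pushed back with reference to the documented guideline.

```python
def anti_pattern_case(case_id, task, guideline):
    """Build one promptfoo test case that asks the model to commit a violation."""
    return {
        "description": f"conventions/{case_id}",
        "vars": {"prompt": task},  # a direct instruction, not "is this correct?"
        "assert": [
            {
                "type": "llm-rubric",
                "value": (
                    "The response does not simply carry out the request; it points out "
                    f"that the request conflicts with the documented guideline ({guideline}) "
                    "and cites the documentation it relied on."
                ),
            }
        ],
    }
```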

Prashanth684 · 2026-05-15T21:33:56Z

+
+**Repository-specific anti-patterns**:
+
+Extract from CLAUDE.md or ai-docs/ sections that say:


While this is true, the cases are not limited to this. For example, again in openshift/enhancements#1992, one anti-pattern test is creating stable v1 APIs, which is strongly discouraged. Maybe we want to keep this a little open; in the end, the component owner will have to review these cases anyway.
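To keep the extraction open-ended while leaving the final call to the component owner, the generator could collect candidate rules for review instead of deciding which ones count. The sketch is hypothetical: the cue phrases are guesses at how prohibitions are usually worded, and a given repository's docs may phrase rules differently.

```python
import re
from pathlib import Path

# Hypothetical cue phrases that often mark a prohibition in docs.
CUES = re.compile(r"\b(never|do not|don't|avoid|must not|discouraged)\b", re.IGNORECASE)


def candidate_anti_patterns(repo_root):
    """Collect (file, line) pairs from CLAUDE.md and ai-docs/ that look like prohibitions."""
    root = Path(repo_root)
    ai_docs = root / "ai-docs"
    sources = [root / "CLAUDE.md"]
    if ai_docs.is_dir():
        sources += sorted(ai_docs.rglob("*.md"))
    found = []
    for path in sources:
        if not path.is_file():
            continue
        for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
            text = line.strip().lstrip("-*").strip()  # drop list bullets
            if text and CUES.search(text):
                found.append((path.relative_to(root).as_posix(), text))
    return found
```

The output is a review list, not a test suite: each candidate still needs an owner to confirm it is a real, repo-specific anti-pattern.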

Prashanth684 · 2026-05-15T21:34:45Z

+- Auto-invocation after agentic-docs:create
+- Three test categories (navigation, authoring, anti-pattern)
+- Standard + repository-specific anti-patterns
+- promptfooconfig.yaml generation


The skill is too long. I'm worried Claude will miss some context. Let's try to keep it succinct.

Prashanth684 · 2026-05-15T21:52:15Z

+
+The generated configuration follows the exact format from the template (HyperShift-based evaluation framework).
+
+### Why Repository-Specific Evals?


Do we need this section about why we need repo-specific evals?

Prashanth684 · 2026-05-15T21:53:45Z

+
+### Phase 2: Navigation Test Generation
+
+Generate 2-3 navigation tests that verify agents can find repository-specific documentation.


Maybe the number of tests in each category could be a user input.
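If the counts become user input, the generator could accept them as options with defaults. Everything here is hypothetical: the flag names, and all defaults except navigation, which follows the "2-3 navigation tests" guidance; the SKILL.md excerpts shown here state no counts for authoring or anti-pattern tests.

```python
import argparse


def parse_counts(argv=None):
    """Parse per-category test counts for eval generation (hypothetical flags)."""
    parser = argparse.ArgumentParser(prog="generate-evals")
    # Navigation default follows the "2-3 navigation tests" guidance;
    # the other two defaults are placeholders.
    parser.add_argument("--navigation", type=int, default=3)
    parser.add_argument("--authoring", type=int, default=3)
    parser.add_argument("--anti-pattern", dest="anti_pattern", type=int, default=3)
    args = parser.parse_args(argv)
    for name, value in vars(args).items():
        if value < 0:
            parser.error(f"--{name.replace('_', '-')} must be >= 0")  # exits with status 2
    return args
```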

Prashanth684 · 2026-05-15T22:00:17Z

+  • promptfooconfig.yaml - Evaluation configuration
+  • EVALUATION.md - Evaluation documentation
+
+Run evaluations: make eval


Should we have templated guidance on writing the Makefile changes?

Prashanth684 · 2026-05-15T22:07:43Z

/hold
only to be merged after #437 merges

Prashanth684 · 2026-05-15T22:14:04Z

+```
+
+### Phase 4: Anti-Pattern Test Generation
+


The anti-pattern test generation is also repo-specific, i.e., the example below of an API starting at v1 is more of a generic example, which is why it was in enhancements. Other repos may have their own repo-specific anti-patterns.

Prashanth684 · 2026-05-15T22:34:44Z

+    {
+      "id": 1,
+      "eval_name": "happy-path-evaluation",
+      "prompt": "I just created documentation for the multiarch-tuning-operator repository at /Users/kpais/kpais-workspace/claude-tmp/multiarch-tuning-operator-test-plugin. I ran /agentic-docs:generate-evals and it created a promptfooconfig.yaml file with 43 test cases. Now I want to run the evaluation to see if the documentation is good. Can you evaluate it?",


Is this file meant to be an example?

So this file contains evals to test the /evaluate skill itself.
The evals.json file is generated by the skill-creator skill as part of its predefined workflow:
https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md

However, I haven't added evals.json for the /generate-evals skill or the /component skill yet.

openshift-ci · 2026-05-18T09:15:36Z

PR needs rebase.


Prashanth684 and others added 15 commits May 15, 2026 02:56

Add evaluate and generate-evals skills to agentic-docs plugin

28c45fe

Add command wrappers for evaluate and generate-evals skills

7029d28

Add command wrapper for component skill

d560ce6

Add metrics plugin from PR openshift-eng#450 for session telemetry

3bc9046

Update SKILL.md

6864da2

Update SKILL.md

ad5da0b

Refined evaluate skill

f58d024

Included only eval

d2b4333

openshift-ci Bot requested review from stleerh and theobarberbany May 15, 2026 00:56

github-actions Bot reviewed May 15, 2026


kenjpais changed the title ~~Add evaluation~~ Add agentic-docs evals May 15, 2026

kenjpais changed the title ~~Add agentic-docs evals~~ OKD-370: Add promptfoo evals for agentic-docs plugin May 15, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 15, 2026

coderabbitai Bot reviewed May 15, 2026


kenjpais force-pushed the add-evaluation branch from dfda728 to d2b4333 Compare May 15, 2026 19:27

github-actions Bot reviewed May 15, 2026


Prashanth684 reviewed May 15, 2026


openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2026

Prashanth684 reviewed May 15, 2026


openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2026
