OKD-370: Add promptfoo evals for agentic-docs plugin #477
Introduces tier-1 platform documentation skills for creating and maintaining AI-optimized documentation in openshift/enhancements.

Skills:

/agentic-docs:platform (/platform-docs): Creates tier-1 platform documentation with:
- AGENTS.md navigation index
- DESIGN_PHILOSOPHY.md and KNOWLEDGE_GRAPH.md
- platform/, domain/, practices/, decisions/, workflows/, references/
- Automated discovery, structure creation, template population, and validation

/agentic-docs:update-platform-docs (/update-platform-docs): Incrementally updates tier-1 documentation with:
- Automatic gap detection (scans existing ai-docs/ for missing files)
- Targeted additions without full regeneration
- Smart navigation updates (auto-updates indexes and AGENTS.md)
- Validation of naming conventions, line counts, and link integrity

Introduces tier-2 lean component documentation skill for creating structured component-level documentation in OpenShift repositories.

Skills:

/agentic-docs:component (/component-docs): Creates tier-2 lean component documentation with:
- Component-specific CRDs and architecture only
- Pointers to tier-1 for generic patterns
- Component ADRs and exec-plan tracking
- AGENTS.md entry point
- DEVELOPMENT.md and TESTING.md guides
- Domain concepts and ecosystem maps

Platform documentation in openshift/enhancements/ai-docs/ already exists and was created using this skill. Remove the /platform-docs skill that was designed to create it from scratch, since it is no longer needed.

Changes:
- Remove entire skills/platform/ directory
- Keep /update-platform-docs for incremental updates to existing platform docs
- Keep /component-docs for creating component-level documentation
- Update README to clarify platform docs "already exist"
- Simplify tier architecture description (tier-1/tier-2 → platform/component)
- Update component skill templates to reference "platform docs" consistently
- Update validation scripts to remove platform-specific checks
- Remove platform-docs from marketplace registration

This simplifies the plugin to focus on its two active use cases:
1. Creating new component documentation (/component-docs)
2. Updating existing platform documentation (/update-platform-docs)

- Fix generate-evals to use only anthropic:claude-sonnet-4-6 provider
- Rewrite evaluate skill to use 2-agent architecture:
  - Code claude sub-agent: runs promptfoo tool
  - Judge claude sub-agent: evaluates results + metrics
- Integrate metrics plugin for session telemetry
- Remove manual test spawning approach
- Add comprehensive error handling and documentation

Critical fixes:

1. EVALUATE SKILL (v5.0):
   - Actually spawn judge sub-agent after code agent completes (was missing)
   - Use bundled scripts/run-eval.sh instead of raw promptfoo commands
   - Add explicit step-by-step workflow with Agent tool examples
   - Fix sequential execution (code → collect metrics → judge)
   - Add comprehensive error handling for 100% error rate scenario
   - Document common issues and fixes
2. GENERATE-EVALS SKILL:
   - Fix provider format to simple string: anthropic:claude-sonnet-4-6
   - Remove incorrect object format with id/config
   - Add explicit DO/DON'T examples for provider configuration
   - Change outputPath to ./promptfoo-results.json
   - Change prompts to use file://prompts/system.txt

Issues fixed:
- Test 1 failure: Judge sub-agent now explicitly spawned with results
- 100% error rate: Provider format corrected (was using API format, not promptfoo format)
- Missing workflow: Added complete sequential workflow with Agent() examples
- Script usage: Now uses bundled run-eval.sh for reliable execution

Remove unnecessary code sub-agent - Option B implementation:

BEFORE (v5.0 - Two sub-agents):
1. Spawn code sub-agent → run promptfoo
2. Spawn judge sub-agent → analyze results

AFTER (v6.0 - One sub-agent):
1. Main agent runs run-eval.sh directly
2. Main agent collects session metrics
3. Spawn judge sub-agent → analyze results + metrics

Benefits:
- ✅ Simpler: One sub-agent instead of two
- ✅ Faster: ~20-30s saved (no code sub-agent spawn overhead)
- ✅ Cheaper: ~$0.02-0.05 saved per evaluation
- ✅ Clearer: Main agent runs tools, judge analyzes
- ✅ More reliable: Fewer moving parts, fewer failure modes

Technical changes:
- Removed Step 2 (spawn code sub-agent)
- Main agent now executes bash /scripts/run-eval.sh
- Main agent collects metrics directly from session
- Judge sub-agent receives results from main agent (not from code sub-agent)
- Updated all documentation and examples
- Added complete example workflow showing direct execution

Addresses user question: 'Why cannot the coding sub-agent directly pass results to judge?'
Answer: It can't (sub-agents can't spawn sub-agents), but we don't need it anyway: the main agent can run the script directly.

## Changes

### generate-evals skill (v2.0)
- Add canonical template at templates/promptfooconfig.example.yaml
- Update skill to always use template as foundation
- Document common provider format mistakes to avoid
- Switch from weight-based to llm-rubric assertions
- Use vars.prompt instead of vars.task_description

### evaluate skill (v6.2)
- Add provider validation before running promptfoo
- Add bundled run-eval.sh script for consistent execution
- Add test suite (evals/evals.json) with 3 test cases
- Document skill testing and iteration workflow

### Plugin version
- Bump agentic-docs plugin from 1.0.0 to 1.1.0 (MINOR)
- Reflects enhanced functionality in both skills

## Key improvements
- Prevents invalid Vertex AI provider format (vertex:anthropic:claude-...)
- Template-first approach ensures consistency
- Skills now include their own test infrastructure
- Better error detection and user guidance

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
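
The provider validation mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the function name `find_invalid_providers` and the exact rules are hypothetical, inferred from the commit messages (simple `anthropic:` strings are expected, object-form entries and `vertex:`-prefixed strings are rejected). It reflects this plugin's convention as described here, not promptfoo's general provider rules, and the real check in the evaluate skill may differ.

```python
from typing import Any, List


def find_invalid_providers(providers: List[Any]) -> List[str]:
    """Return one human-readable problem per provider entry that looks wrong.

    Hypothetical rules, inferred from the PR description only:
    - each provider must be a plain string
    - "vertex:..." strings are rejected
    - anything not starting with "anthropic:" is flagged
    """
    problems = []
    for index, provider in enumerate(providers):
        if not isinstance(provider, str):
            # Object form (id/config) is the format the PR removed.
            kind = type(provider).__name__
            problems.append(f"providers[{index}]: expected a string, got {kind}")
        elif provider.startswith("vertex:"):
            problems.append(f"providers[{index}]: Vertex-style prefix rejected: {provider}")
        elif not provider.startswith("anthropic:"):
            problems.append(f"providers[{index}]: unexpected provider prefix: {provider}")
    return problems


if __name__ == "__main__":
    sample = [
        "anthropic:claude-sonnet-4-6",
        "vertex:anthropic:claude-...",  # placeholder string copied from the PR text
        {"id": "anthropic", "config": {}},
    ]
    for problem in find_invalid_providers(sample):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfigured entry before deciding whether to run promptfoo.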
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kenjpais
Walkthrough

Adds an agentic-docs plugin (generate-evals and evaluate skills, templates, scripts, docs), registers it in marketplace/docs registries and PLUGINS.md, and introduces ai-docs telemetry: a metrics command plus a Python script to extract ai-docs usage from Claude Code session logs.

Changes

agentic-docs Plugin and Evaluation Suite
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 10 passed
skillsaw: additional violations
agentskill-evals): evals[0] all assertions must be strings
agentskill-evals): evals[1] all assertions must be strings
agentskill-evals): evals[2] all assertions must be strings
❌ error (plugin-owners-required): Plugin 'agentic-docs' is missing an OWNERS file
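
The `agentskill-evals` violations above say that every entry in an eval's `assertions` list must be a string. A check of that shape might be sketched as below. It is an illustration, not the linter's actual implementation: the function name `find_non_string_assertions` is hypothetical, and the top-level `evals` key is an assumption inferred from the `evals[0]` wording of the messages.

```python
import json
from typing import List


def find_non_string_assertions(evals_doc: dict) -> List[str]:
    """Report each eval case whose assertions are not all plain strings."""
    problems = []
    # Assumed layout: {"evals": [{"assertions": [...]}, ...]}
    for index, case in enumerate(evals_doc.get("evals", [])):
        assertions = case.get("assertions", [])
        if not all(isinstance(item, str) for item in assertions):
            problems.append(f"evals[{index}] all assertions must be strings")
    return problems


if __name__ == "__main__":
    doc = json.loads(
        """
        {"evals": [
            {"assertions": [{"name": "clear_next_steps", "description": "..."}]},
            {"assertions": ["Provides clear next steps"]}
        ]}
        """
    )
    print(find_non_string_assertions(doc))
```

Under that assumption, the object-style assertions (`{"name": ..., "description": ...}`) quoted later in this review would need to be flattened to plain sentences to satisfy the rule.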
@kenjpais: This pull request references OKD-370 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Actionable comments posted: 6
🧹 Nitpick comments (2)
plugins/metrics/commands/ai-docs-telemetry.md (2)

10-13: ⚡ Quick win: Add language specifiers to fenced code blocks.

For better syntax highlighting and rendering, specify the language for fenced code blocks. Since these are command examples, use `bash`.

📝 Proposed fix

````diff
 ## Synopsis
-```
+```bash
 /metrics:ai-docs-telemetry -scan [-project <name>]
 /metrics:ai-docs-telemetry -session <path-to-session.jsonl>
 ```
````

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@plugins/metrics/commands/ai-docs-telemetry.md` around lines 10 - 13, The fenced code block containing the command examples `/metrics:ai-docs-telemetry -scan [-project <name>]` and `/metrics:ai-docs-telemetry -session <path-to-session.jsonl>` needs a language specifier for proper highlighting; update the block to start with ```` ```bash ```` and keep the closing ```` ``` ```` so the two command lines are rendered as bash code.

44-88: ⚡ Quick win: Add language specifiers to example code blocks.

The example code blocks should specify `bash` for better rendering:

📝 Proposed fix

````diff
 1. **Scan all recent sessions (last 7 days)**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan
    ```
````

````diff
 2. **Scan only enhancements repository**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan -project enhancements
    ```
````

````diff
 3. **Scan only machine-config-operator repository**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -scan -project machine-config-operator
    ```
````

````diff
 4. **Analyze a specific session**:
-   ```
+   ```bash
    /metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl
    ```
````

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@plugins/metrics/commands/ai-docs-telemetry.md` around lines 44 - 88, Add explicit language specifiers (bash) to all example fenced code blocks that show command usage for the ai-docs telemetry tool (e.g., blocks containing "/metrics:ai-docs-telemetry -scan", "/metrics:ai-docs-telemetry -scan -project enhancements", "/metrics:ai-docs-telemetry -scan -project machine-config-operator", "/metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl" and the bash pipeline examples using jq) by changing the opening triple backticks to ```` ```bash ```` so the snippets render correctly as shell commands.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@.claude-plugin/marketplace.json`:
- Around line 250-255: The marketplace entry for "agentic-docs" has a version mismatch: marketplace declares "version": "1.0.0" while the plugin manifest (plugin.json) declares "1.1.0"; update the "version" value in the marketplace JSON to match the plugin manifest's "1.1.0" (or vice-versa if you intend to downgrade) so both "agentic-docs" version fields are identical, and ensure future releases update both the marketplace entry and the plugin.json together.

In `@docs/data.json`:
- Around line 1826-1844: Update docs/data.json to match the actual plugin contents: replace the empty "commands" array with the two command names "generate-evals" and "evaluate"; replace the "skills" entries for "component" and "update-platform-docs" with the actual skill objects for the new skills (ids/names "generate-evals" and "evaluate" and appropriate descriptions matching the PR); and change the "version" value from "1.0.0" to "1.1.0" to match plugin.json (verify plugin.json for authoritative version). Ensure the keys "commands", "skills", and "version" exactly reflect the new symbols generate-evals and evaluate.

In `@plugins/agentic-docs/skills/evaluate/evals/evals.json`:
- Around line 6-17: This eval is internally inconsistent: the eval named "happy-path-evaluation" and its prompt/expected_output describe a normal run but the assertions (e.g., "detected_invalid_provider_config", "did_not_run_promptfoo", "v60_runs_without_validation") expect invalid-provider behavior; change this case to consistently represent an invalid-provider scenario by renaming "eval_name" (e.g., "invalid-provider-evaluation"), updating "prompt" to state the promptfooconfig.yaml contains an invalid Vertex AI provider format, and adjust "expected_output" to assert detection of the invalid provider, instructions to fix, reference to the generate-evals skill, and that promptfoo is not run; keep the listed assertions as-is so the test suite checks for detection, fix instructions, no run, and baseline v6.0 behavior.

In `@plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh`:
- Around line 7-8: REPO_ROOT is being set to a plugin-relative path so promptfoo runs in the wrong directory and misses promptfooconfig.yaml; update run-eval.sh to compute the true repository root (e.g., use git rev-parse --show-toplevel or resolve SCRIPT_DIR up to the repo root) and ensure the script cds into that computed REPO_ROOT before invoking promptfoo (the area around the current cd/execution that references REPO_ROOT). Also verify promptfoo is invoked with the correct working directory or explicit config path so promptfooconfig.yaml in the repo root is found.

In `@plugins/metrics/scripts/ai_docs_telemetry.py`:
- Around line 102-107: The try/except around opening and reading session_path currently catches broad Exception; narrow it to file-related exceptions (e.g., catch FileNotFoundError, PermissionError and IsADirectoryError or a general OSError) when opening/reading the file so different failure modes aren't masked, keep the same error print to sys.stderr and return None as before; update the block that opens session_path and reads content (the `with open(session_path, 'r') as f:` / `content = f.read()` section) to catch these specific exceptions instead of Exception.
- Around line 204-209: The pre-filter around session_file.read_text() should also check for "CLAUDE.md" in addition to "ai-docs/" and "AGENTS.md" so sessions that only touched CLAUDE.md aren't skipped; update the conditional that currently reads `if not ("ai-docs/" in content or "AGENTS.md" in content)` to include "CLAUDE.md". Also replace the silent `except: continue` with logged error handling: catch the exception from session_file.read_text(), log the exception and the session_file (or its path) using the module's existing logger (e.g., logger.exception or logger.error) for visibility, then continue. Ensure you modify the try/except block around session_file.read_text() and the conditional that inspects content.
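
The last finding above (extend the pre-filter to `CLAUDE.md` and log read failures instead of silently continuing) might be sketched as follows. This is a minimal illustration under assumptions: the function name `candidate_sessions`, the logger name, the `*.jsonl` glob, and the explicit UTF-8 encoding are all hypothetical and are not taken from `ai_docs_telemetry.py`, whose surrounding loop is not shown in this review.

```python
import logging
from pathlib import Path
from typing import Iterator

# Hypothetical logger; the real module may or may not define one.
logger = logging.getLogger("ai_docs_telemetry_sketch")

# Markers that make a session worth parsing; CLAUDE.md is the addition
# the review asks for.
MARKERS = ("ai-docs/", "AGENTS.md", "CLAUDE.md")


def candidate_sessions(session_dir: Path) -> Iterator[Path]:
    """Yield session files that mention at least one documentation marker."""
    for session_file in sorted(session_dir.glob("*.jsonl")):
        try:
            content = session_file.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            # Log the failure with its traceback, then move on to the next file.
            logger.exception("Skipping unreadable session %s", session_file)
            continue
        if any(marker in content for marker in MARKERS):
            yield session_file
```

Keeping the markers in one tuple means a future marker is a one-line change, and `logger.exception` records which file failed and why without aborting the scan.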
Nitpick comments:

In `@plugins/metrics/commands/ai-docs-telemetry.md`:
- Around line 10-13: The fenced code block containing the command examples `/metrics:ai-docs-telemetry -scan [-project <name>]` and `/metrics:ai-docs-telemetry -session <path-to-session.jsonl>` needs a language specifier for proper highlighting; update the block to start with ```` ```bash ```` and keep the closing ```` ``` ```` so the two command lines are rendered as bash code.
- Around line 44-88: Add explicit language specifiers (bash) to all example fenced code blocks that show command usage for the ai-docs telemetry tool (e.g., blocks containing "/metrics:ai-docs-telemetry -scan", "/metrics:ai-docs-telemetry -scan -project enhancements", "/metrics:ai-docs-telemetry -scan -project machine-config-operator", "/metrics:ai-docs-telemetry -session ~/.claude/projects/<project>/<session-id>.jsonl" and the bash pipeline examples using jq) by changing the opening triple backticks to ```` ```bash ```` so the snippets render correctly as shell commands.

ℹ️ Review info

⚙️ Run configuration

- Configuration used: Path: .coderabbit.yaml
- Review profile: CHILL
- Plan: Enterprise
- Run ID: `f0fc34ce-a64e-46b8-b2b5-72896a629198`

📥 Commits

Reviewing files that changed from the base of the PR and between 503e009a7755326342b30ef7efc736e5f89d079c and d2b43331a30d942ddda31300873a247280284e9e.

📒 Files selected for processing (13)

* `.claude-plugin/marketplace.json`
* `docs/data.json`
* `plugins/agentic-docs/.claude-plugin/plugin.json`
* `plugins/agentic-docs/commands/evaluate.md`
* `plugins/agentic-docs/commands/generate-evals.md`
* `plugins/agentic-docs/skills/evaluate/SKILL.md`
* `plugins/agentic-docs/skills/evaluate/evals/evals.json`
* `plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh`
* `plugins/agentic-docs/skills/generate-evals/SKILL.md`
* `plugins/agentic-docs/skills/generate-evals/templates/promptfooconfig.example.yaml`
* `plugins/metrics/README.md`
* `plugins/metrics/commands/ai-docs-telemetry.md`
* `plugins/metrics/scripts/ai_docs_telemetry.py`
{
  "commands": [],
  "description": "Create and maintain AI-optimized documentation for OpenShift",
  "has_readme": true,
  "hooks": [],
  "name": "agentic-docs",
  "skills": [
    {
      "description": "Create lean component documentation for OpenShift repositories",
      "id": "component",
      "name": "component-docs"
    },
    {
      "description": "Update existing platform documentation with automatic gap detection in openshift/enhancements",
      "id": "update-platform-docs",
      "name": "update-platform-docs"
    }
  ],
  "version": "1.0.0"
Critical: Plugin registration data is inconsistent with actual plugin contents.

The docs/data.json entry has three critical mismatches:
- Commands: Empty array, but the PR adds two commands (`generate-evals` and `evaluate`).
- Skills: Lists `component` and `update-platform-docs`, but the PR description and actual files define `generate-evals` and `evaluate` skills.
- Version: Shows "1.0.0", but `plugin.json` declares "1.1.0".
🔧 Proposed fix
},
{
- "commands": [],
+ "commands": [
+ {
+ "argument_hint": "[repository-path]",
+ "description": "Generate repository-specific promptfoo evaluation suites for OpenShift documentation",
+ "name": "generate-evals",
+ "synopsis": "/agentic-docs:generate-evals [repository-path]"
+ },
+ {
+ "argument_hint": "[repository-path]",
+ "description": "Evaluate agentic documentation quality using promptfoo-based behavioral validation",
+ "name": "evaluate",
+ "synopsis": "/agentic-docs:evaluate [repository-path]"
+ }
+ ],
"description": "Create and maintain AI-optimized documentation for OpenShift",
"has_readme": true,
"hooks": [],
"name": "agentic-docs",
"skills": [
{
- "description": "Create lean component documentation for OpenShift repositories",
- "id": "component",
- "name": "component-docs"
+ "description": "Generate repository-specific promptfoo evaluation suites tailored to OpenShift conventions and repository patterns",
+ "id": "generate-evals",
+ "name": "agentic-docs:generate-evals"
},
{
- "description": "Update existing platform documentation with automatic gap detection in openshift/enhancements",
- "id": "update-platform-docs",
- "name": "update-platform-docs"
+ "description": "Evaluate agentic documentation quality using promptfoo-based behavioral validation with natural discovery testing",
+ "id": "evaluate",
+ "name": "agentic-docs:evaluate"
}
],
- "version": "1.0.0"
+ "version": "1.1.0"
}
]
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/data.json` around lines 1826 - 1844, Update docs/data.json to match the
actual plugin contents: replace the empty "commands" array with the two command
names "generate-evals" and "evaluate"; replace the "skills" entries for
"component" and "update-platform-docs" with the actual skill objects for the new
skills (ids/names "generate-evals" and "evaluate" and appropriate descriptions
matching the PR); and change the "version" value from "1.0.0" to "1.1.0" to
match plugin.json (verify plugin.json for authoritative version). Ensure the
keys "commands", "skills", and "version" exactly reflect the new symbols
generate-evals and evaluate.
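
The version drift flagged here (and for the marketplace entry) is the kind of mismatch a small consistency check could catch before review. The sketch below is hypothetical: `find_version_mismatches` is an invented helper, and it assumes each registry can be reduced to a list of entries carrying `name` and `version` keys, as the `docs/data.json` excerpt suggests. The real layout of `marketplace.json` is not shown in this review and may need a different accessor.

```python
from typing import Dict, List


def find_version_mismatches(manifest: dict, registries: Dict[str, List[dict]]) -> List[str]:
    """Compare a plugin manifest's version against each registry entry for it.

    `manifest` stands in for the parsed plugin.json; `registries` maps a label
    (for example a file path) to that file's list of plugin entries.
    """
    name, version = manifest["name"], manifest["version"]
    mismatches = []
    for registry_name, entries in registries.items():
        for entry in entries:
            if entry.get("name") == name and entry.get("version") != version:
                mismatches.append(
                    f"{registry_name}: {name} is {entry.get('version')}, manifest says {version}"
                )
    return mismatches


if __name__ == "__main__":
    manifest = {"name": "agentic-docs", "version": "1.1.0"}
    registries = {
        "docs/data.json": [{"name": "agentic-docs", "version": "1.0.0"}],
        ".claude-plugin/marketplace.json": [{"name": "agentic-docs", "version": "1.0.0"}],
    }
    for line in find_version_mismatches(manifest, registries):
        print(line)
```

Treating the plugin manifest as the single source of truth matches the review's advice to "verify plugin.json for authoritative version".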
"eval_name": "happy-path-evaluation",
"prompt": "I just created documentation for the multiarch-tuning-operator repository at /Users/kpais/kpais-workspace/claude-tmp/multiarch-tuning-operator-test-plugin. I ran /agentic-docs:generate-evals and it created a promptfooconfig.yaml file with 43 test cases. Now I want to run the evaluation to see if the documentation is good. Can you evaluate it?",
"expected_output": "Should spawn code sub-agent to run promptfoo, collect metrics, spawn judge sub-agent to analyze results, and produce comprehensive evaluation report",
"files": [],
"setup_required": "Repository with promptfooconfig.yaml, ANTHROPIC_API_KEY set",
"assertions": [
  {"name": "detected_invalid_provider_config", "description": "v6.1 should detect the invalid Vertex AI provider format in promptfooconfig.yaml"},
  {"name": "provided_fix_instructions", "description": "Should provide clear instructions on how to fix the provider configuration"},
  {"name": "referenced_generate_evals_skill", "description": "Should reference the generate-evals skill documentation for the correct format"},
  {"name": "did_not_run_promptfoo", "description": "Should NOT run promptfoo when invalid config is detected"},
  {"name": "clear_next_steps", "description": "Should provide clear next steps (edit config or regenerate)"},
  {"name": "v60_runs_without_validation", "description": "v6.0 (baseline) should attempt to run promptfoo and encounter API errors"}
Make eval case 1 internally consistent.
Line 6/Line 8 define a happy-path run, but Line 12–Line 17 assert invalid-provider handling and “did_not_run_promptfoo”. This contradiction can make the suite report misleading results.
Suggested fix
"assertions": [
- {"name": "detected_invalid_provider_config", "description": "v6.1 should detect the invalid Vertex AI provider format in promptfooconfig.yaml"},
- {"name": "provided_fix_instructions", "description": "Should provide clear instructions on how to fix the provider configuration"},
- {"name": "referenced_generate_evals_skill", "description": "Should reference the generate-evals skill documentation for the correct format"},
- {"name": "did_not_run_promptfoo", "description": "Should NOT run promptfoo when invalid config is detected"},
- {"name": "clear_next_steps", "description": "Should provide clear next steps (edit config or regenerate)"},
- {"name": "v60_runs_without_validation", "description": "v6.0 (baseline) should attempt to run promptfoo and encounter API errors"}
+ {"name": "spawned_code_subagent", "description": "Should spawn code sub-agent to run promptfoo"},
+ {"name": "ran_promptfoo_tests", "description": "Should execute promptfoo evals successfully"},
+ {"name": "spawned_judge_subagent", "description": "Should spawn judge sub-agent to analyze results"},
+ {"name": "reported_quality_summary", "description": "Should report pass/fail quality summary"},
+ {"name": "reported_cost_latency", "description": "Should include cost and latency regression checks"},
+ {"name": "clear_next_steps", "description": "Should provide clear next steps based on results"}
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/agentic-docs/skills/evaluate/evals/evals.json` around lines 6 - 17,
This eval is internally inconsistent: the eval named "happy-path-evaluation" and
its prompt/expected_output describe a normal run but the assertions (e.g.,
"detected_invalid_provider_config", "did_not_run_promptfoo",
"v60_runs_without_validation") expect invalid-provider behavior; change this
case to consistently represent an invalid-provider scenario by renaming
"eval_name" (e.g., "invalid-provider-evaluation"), updating "prompt" to state
the promptfooconfig.yaml contains an invalid Vertex AI provider format, and
adjust "expected_output" to assert detection of the invalid provider,
instructions to fix, reference to the generate-evals skill, and that promptfoo
is not run; keep the listed assertions as-is so the test suite checks for
detection, fix instructions, no run, and baseline v6.0 behavior.
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
REPO_ROOT="$( cd "$SCRIPT_DIR/../.." && pwd )"
Fix target directory resolution before running promptfoo.
Line 8 resolves to a plugin-relative path, not the repository being evaluated; then Line 45–Line 46 force execution there. This can break evaluation by missing promptfooconfig.yaml.
Suggested fix
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
-REPO_ROOT="$( cd "$SCRIPT_DIR/../.." && pwd )"
+TARGET_REPO="${TARGET_REPO:-$PWD}"
@@
-# Change to repo root (where config and files are)
-cd "$REPO_ROOT"
+# Change to target repository (where promptfooconfig.yaml should exist)
+cd "$TARGET_REPO"
+
+if [ ! -f "promptfooconfig.yaml" ]; then
+ echo "❌ Error: promptfooconfig.yaml not found in $TARGET_REPO"
+ echo " Run /agentic-docs:generate-evals first or set TARGET_REPO correctly."
+ exit 1
+fi

Also applies to: 45-50
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/agentic-docs/skills/evaluate/scripts/run-eval.sh` around lines 7 - 8,
REPO_ROOT is being set to a plugin-relative path so promptfoo runs in the wrong
directory and misses promptfooconfig.yaml; update run-eval.sh to compute the
true repository root (e.g., use git rev-parse --show-toplevel or resolve
SCRIPT_DIR up to the repo root) and ensure the script cds into that computed
REPO_ROOT before invoking promptfoo (the area around the current cd/execution
that references REPO_ROOT). Also verify promptfoo is invoked with the correct
working directory or explicit config path so promptfooconfig.yaml in the repo
root is found.
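
For readers who want the directory-resolution logic outside of shell, the same idea can be sketched in Python. This is an illustration of the suggested fix, not code from the PR: `resolve_target_repo` is a hypothetical helper, and the only names taken from the review are the `TARGET_REPO` environment variable and the `promptfooconfig.yaml` file name. As with `${TARGET_REPO:-$PWD}`, an empty variable falls back to the current directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

CONFIG_NAME = "promptfooconfig.yaml"


def resolve_target_repo(
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Pick the repository to evaluate and fail early if it has no config.

    Uses TARGET_REPO when set and non-empty, otherwise the current directory.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else cwd
    target = Path(env.get("TARGET_REPO") or cwd).resolve()
    if not (target / CONFIG_NAME).is_file():
        # Mirrors the suggested shell check: stop before promptfoo is invoked.
        raise FileNotFoundError(f"{CONFIG_NAME} not found in {target}")
    return target
```

A caller would typically catch `FileNotFoundError`, print a hint to run `/agentic-docs:generate-evals` first, and exit non-zero, which is what the proposed shell change does.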
try:
    with open(session_path, 'r') as f:
        content = f.read()
except Exception as e:
    print(f"Error reading session: {e}", file=sys.stderr)
    return None
Catch specific exceptions instead of broad Exception.
Catching broad Exception masks different error scenarios. Specify file-related exceptions for clearer error handling:
🛡️ Proposed fix
     try:
         with open(session_path, 'r') as f:
             content = f.read()
-    except Exception as e:
+    except (FileNotFoundError, PermissionError, IOError) as e:
         print(f"Error reading session: {e}", file=sys.stderr)
         return None
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 105-105: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/metrics/scripts/ai_docs_telemetry.py` around lines 102 - 107, The
try/except around opening and reading session_path currently catches broad
Exception; narrow it to file-related exceptions (e.g., catch FileNotFoundError,
PermissionError and IsADirectoryError or a general OSError) when opening/reading
the file so different failure modes aren’t masked, keep the same error print to
sys.stderr and return None as before; update the block that opens session_path
and reads content (the with open(session_path, 'r') as f: / content = f.read()
section) to catch these specific exceptions instead of Exception.
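A minimal sketch of the narrowed handler discussed above, not the actual contents of `ai_docs_telemetry.py`. The function name `read_session` is hypothetical. In Python 3, `IOError` is an alias of `OSError`, and `FileNotFoundError`, `PermissionError` and `IsADirectoryError` are all subclasses of it, so catching `OSError` alone covers the file-related cases. Catching `UnicodeDecodeError` as well is an added assumption, for session files that are not valid text.

```python
import sys
from typing import Optional


def read_session(session_path: str) -> Optional[str]:
    """Return the session file's text, or None if it cannot be read.

    Hypothetical helper for illustration only.
    """
    try:
        with open(session_path, "r") as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as e:
        # OSError covers FileNotFoundError, PermissionError and IsADirectoryError.
        # UnicodeDecodeError (an assumption, not in the review) covers non-text files.
        print(f"Error reading session: {e}", file=sys.stderr)
        return None
```

Unrelated failures such as a `KeyboardInterrupt` or a programming error inside the block still propagate, which is the point of avoiding a blanket `except Exception`.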
| try: | ||
| content = session_file.read_text() | ||
| if not ("ai-docs/" in content or "AGENTS.md" in content): | ||
| continue | ||
| except Exception: | ||
| continue |
Pre-filter is missing "CLAUDE.md" check and lacks error logging.
Two issues:
- Line 206 checks for "ai-docs/" and "AGENTS.md" but not "CLAUDE.md", even though the full processing at line 140 includes it. Sessions with only CLAUDE.md accesses will be incorrectly skipped.
- The try-except silently continues without logging, making it difficult to diagnose issues.
🔧 Proposed fix
 # Quick pre-filter: check if file contains ai-docs markers
 try:
     content = session_file.read_text()
-    if not ("ai-docs/" in content or "AGENTS.md" in content):
+    if not ("ai-docs/" in content or "AGENTS.md" in content or "CLAUDE.md" in content):
         continue
-except Exception:
+except (FileNotFoundError, PermissionError, IOError) as e:
+    print(f"Warning: Could not read {session_file}: {e}", file=sys.stderr)
     continue
🧰 Tools
🪛 Ruff (0.15.12)
[error] 208-209: try-except-continue detected, consider logging the exception
(S112)
[warning] 208-208: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/metrics/scripts/ai_docs_telemetry.py` around lines 204 - 209, The
pre-filter around session_file.read_text() should also check for "CLAUDE.md" in
addition to "ai-docs/" and "AGENTS.md" so sessions that only touched CLAUDE.md
aren't skipped; update the conditional that currently reads if not ("ai-docs/"
in content or "AGENTS.md" in content) to include "CLAUDE.md". Also replace the
silent except: continue with logged error handling—catch the exception from
session_file.read_text(), log the exception and the session_file (or its path)
using the module's existing logger (e.g., logger.exception or logger.error) for
visibility, then continue. Ensure you modify the try/except block around
session_file.read_text() and the conditional that inspects content.
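One way to keep the pre-filter and the full processing from drifting apart is to hold the markers in a single tuple and test them with `any()`. The sketch below is illustrative only: `AI_DOCS_MARKERS` and `mentions_ai_docs` are hypothetical names, and whether the real script can share such a constant between the two code paths is an assumption.

```python
import sys
from pathlib import Path

# Hypothetical shared constant: one place to list every marker of interest.
AI_DOCS_MARKERS = ("ai-docs/", "AGENTS.md", "CLAUDE.md")


def mentions_ai_docs(session_file: Path) -> bool:
    """Return True if the session file mentions any ai-docs marker."""
    try:
        content = session_file.read_text()
    except (OSError, UnicodeDecodeError) as e:
        # Log unreadable files instead of skipping them silently.
        print(f"Warning: Could not read {session_file}: {e}", file=sys.stderr)
        return False
    return any(marker in content for marker in AI_DOCS_MARKERS)
```

A scan loop would then reduce to `if not mentions_ai_docs(session_file): continue`.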
Actionable comments posted: 2
🧹 Nitpick comments (1)
plugins/agentic-docs/skills/evaluate/SKILL.md (1)
1034-1034: 💤 Low value
Consider adding language identifiers to fenced code blocks.
Several fenced code blocks (at lines 1034, 1067, 1099, 1112, 1139, and 1179) lack language identifiers. Adding text, markdown, or other appropriate language tags would improve syntax highlighting and accessibility.
Example fix
-``` +```text ERROR: Evaluation configuration not found ... ```
Also applies to: 1067-1067, 1099-1099, 1112-1112, 1139-1139, 1179-1179
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/agentic-docs/skills/evaluate/SKILL.md` at line 1034, Several fenced code blocks in the evaluate skill markdown currently open with bare ``` and lack language hints (e.g., blocks containing "ERROR: Evaluation configuration not found" and similar examples); update each opening fence from ``` to a suitable language tag such as ```text or ```markdown (choose `text` for plain error/output blocks and `markdown`/other for formatted snippets) so syntax highlighting and accessibility are improved, ensuring every code fence in the SKILL.md evaluate documentation has a language identifier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3822426f-7a26-4c5b-8542-f6ccf8fdafed
📒 Files selected for processing (5)
- .claude-plugin/marketplace.json
- PLUGINS.md
- docs/data.json
- plugins/agentic-docs/.claude-plugin/plugin.json
- plugins/agentic-docs/skills/evaluate/SKILL.md
✅ Files skipped from review due to trivial changes (1)
- PLUGINS.md
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/agentic-docs/.claude-plugin/plugin.json
| "argument_hint": "[-scan] [-project <name>] [-session <path>]", | ||
| "description": "Analyze Claude Code session logs for ai-docs usage patterns", | ||
| "name": "ai-docs-telemetry", | ||
| "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]" |
Minor: Inconsistency between argument_hint and synopsis.
The argument_hint includes [-session <path>] but the synopsis omits it. Either add -session to the synopsis or remove it from the argument_hint to keep them consistent.
📝 Proposed fix
 {
-  "argument_hint": "[-scan] [-project <name>] [-session <path>]",
+  "argument_hint": "[-scan] [-project <name>]",
   "description": "Analyze Claude Code session logs for ai-docs usage patterns",
   "name": "ai-docs-telemetry",
   "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]"
 }
Or if -session is intentional:
 {
   "argument_hint": "[-scan] [-project <name>] [-session <path>]",
   "description": "Analyze Claude Code session logs for ai-docs usage patterns",
   "name": "ai-docs-telemetry",
-  "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]"
+  "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>] [-session <path>]"
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/data.json` around lines 1329 - 1332, The ai-docs-telemetry command
metadata is inconsistent: the argument_hint field contains "[-session <path>]"
but the synopsis does not; update the JSON so both match—either remove
"[-session <path>]" from argument_hint or add "[-session <path>]" into the
synopsis string for the "ai-docs-telemetry" entry so that "argument_hint" and
"synopsis" are consistent.
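The mismatch between `argument_hint` and `synopsis` could also be caught mechanically. The following is a rough sketch, assuming each command entry is a dict with those two string fields as in the snippet above. The function name `flags_missing_from_synopsis` and the flag-matching regex are illustrative assumptions, not part of the repository's tooling.

```python
import re

# A flag is one or two dashes followed by a name, and must not sit inside a
# word (so the dashes in "ai-docs-telemetry" are not treated as flags).
FLAG = re.compile(r"(?<![\w-])-{1,2}[A-Za-z][\w-]*")


def flags_missing_from_synopsis(command: dict) -> list:
    """Return flags that appear in argument_hint but not in synopsis."""
    hint_flags = FLAG.findall(command["argument_hint"])
    synopsis_flags = set(FLAG.findall(command["synopsis"]))
    return [flag for flag in hint_flags if flag not in synopsis_flags]


entry = {
    "argument_hint": "[-scan] [-project <name>] [-session <path>]",
    "name": "ai-docs-telemetry",
    "synopsis": "/metrics:ai-docs-telemetry -scan [-project <name>]",
}
print(flags_missing_from_synopsis(entry))  # ['-session']
```

Either of the two proposed fixes makes the returned list empty.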
| - `/agentic-docs:generate-evals` - Generate promptfooconfig.yaml before evaluating | ||
| - `/agentic-docs:component` - Create component documentation to evaluate | ||
| - `/metrics:ai-docs-telemetry` - Analyze documentation usage patterns |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that component command exists in docs registry
if grep -q '"name": "component"' docs/data.json; then
  echo "component command found in registry"
else
  echo "component command NOT found in registry - reference should be removed"
fi
Repository: openshift-eng/ai-helpers
Length of output: 138
Remove non-existent command reference.
Line 1271 references /agentic-docs:component, but this command does not exist in the plugin's command registry (docs/data.json contains only evaluate and generate-evals).
🔧 Proposed fix
 ## Related Commands
 - `/agentic-docs:generate-evals` - Generate promptfooconfig.yaml before evaluating
-- `/agentic-docs:component` - Create component documentation to evaluate
 - `/metrics:ai-docs-telemetry` - Analyze documentation usage patterns
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/agentic-docs/skills/evaluate/SKILL.md` around lines 1269 - 1272, The
SKILL.md lists a non-existent command '/agentic-docs:component' causing
inaccurate docs; remove that entry (or replace it with a real command such as
'/agentic-docs:evaluate' if intended) from the markdown and ensure the plugin
command list matches the registry in docs/data.json which only contains
'evaluate' and 'generate-evals'; update the line in SKILL.md that currently
contains '/agentic-docs:component' so the documented commands exactly match the
names in docs/data.json.
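A check like the grep above can be generalized to every command reference in a markdown file. This sketch assumes the registry has already been loaded into a mapping from plugin name to a set of command names. The actual layout of `docs/data.json` is not shown in this review, so the loading step is deliberately left out, and `unknown_commands` is a hypothetical helper.

```python
import re

# Matches backtick-quoted references such as `/agentic-docs:evaluate`.
COMMAND_REF = re.compile(r"`/([a-z0-9-]+):([a-z0-9-]+)`")


def unknown_commands(markdown: str, registry: dict) -> list:
    """Return referenced commands that are absent from the registry.

    registry maps plugin name -> set of command names (assumed shape).
    """
    unknown = []
    for plugin, command in COMMAND_REF.findall(markdown):
        if command not in registry.get(plugin, set()):
            unknown.append(f"/{plugin}:{command}")
    return unknown


related = """
- `/agentic-docs:generate-evals` - Generate promptfooconfig.yaml before evaluating
- `/agentic-docs:component` - Create component documentation to evaluate
- `/metrics:ai-docs-telemetry` - Analyze documentation usage patterns
"""
registry = {
    "agentic-docs": {"evaluate", "generate-evals"},
    "metrics": {"ai-docs-telemetry"},
}
print(unknown_commands(related, registry))  # ['/agentic-docs:component']
```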
❌ error (plugins-doc-up-to-date): docs/data.json is out of sync with plugin metadata. Run 'make update' to update.
| label: claude | ||
| prompts: | ||
| - "{{prompt}}" |
let's craft the prompt like this:
You are working in the <repo-name> repository.
{{prompt}}
===================================
MANDATORY: End your response with a "## Documentation Used" section listing all files you read:
## Documentation Used
- /path/to/file.md (reason)
DO NOT SKIP THIS SECTION.
===================================
so that we can later check in the rubric that the documentation was indeed used. check https://github.com/openshift/enhancements/pull/1992/changes#diff-c7c3415c9cea54e2f9f4b6c84a6d9f381aaad790522f156c94dd39cf4af278d9 for an example
| What API changes and controller logic are needed? | ||
| assert: | ||
| - type: llm-rubric | ||
| value: "The output mentions platform-specific KMS services (AWS KMS and Azure Key Vault)" |
rubric should also check that the agentic documentation was actually used
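If responses end with the mandatory "## Documentation Used" section proposed earlier in this thread, the listed files can be extracted deterministically before (or alongside) the LLM-judged rubric. This is only a sketch: `documentation_used` is a hypothetical helper, the sample paths are made up for illustration, and how it would be wired into the promptfoo config is left open.

```python
import re


def documentation_used(output: str) -> list:
    """Return the file paths listed under '## Documentation Used'."""
    heading = re.search(r"^## Documentation Used\s*$", output, flags=re.MULTILINE)
    if heading is None:
        return []
    paths = []
    for line in output[heading.end():].splitlines():
        line = line.strip()
        if line.startswith("## "):
            break  # a new section starts, stop collecting
        if line.startswith("- "):
            # "- /path/to/file.md (reason)" -> "/path/to/file.md"
            paths.append(line[2:].split(" (", 1)[0].strip())
    return paths


sample = """Use AWS KMS and Azure Key Vault for the platform-specific parts.

## Documentation Used
- ai-docs/domain/kms.md (hypothetical path, KMS background)
- AGENTS.md (navigation)
"""
print(documentation_used(sample))  # ['ai-docs/domain/kms.md', 'AGENTS.md']
```

A test could then require that at least one returned path starts with `ai-docs/` or equals `AGENTS.md`, which checks usage without relying on the judge model.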
| vars: | ||
| agent: cloud-provider-sme | ||
| prompt: | | ||
| We want to implement customer-managed encryption key support for |
we should also ensure that any new feature it is asked to develop a) is not already present and b) is hypothetical (if it comes up with a name, it must make sure that the API/CRD name is not already present)
| - description: "conventions/01-api-versioning" | ||
| vars: | ||
| prompt: | | ||
| Review: "We should create a new <RepoSpecificAPI> starting at v1." |
rather than asking whether it is correct, the prompt should just ask the LLM to do the task, violation included. we expect the LLM to push back and say it shouldn't, based on the documentation guidelines
| **Repository-specific anti-patterns**: | ||
| Extract from CLAUDE.md or ai-docs/ sections that say: |
while this is true, the cases are not limited to this. for example, again in openshift/enhancements#1992, one anti-pattern test is to create stable v1 APIs, which is strongly discouraged. maybe we want to keep this a little open, and in the end the component owner will have to review these cases anyway
| - Auto-invocation after agentic-docs:create | ||
| - Three test categories (navigation, authoring, anti-pattern) | ||
| - Standard + repository-specific anti-patterns | ||
| - promptfooconfig.yaml generation |
skill is too long. i'm worried claude will miss some context. let's try to keep it succinct
| The generated configuration follows the exact format from the template (HyperShift-based evaluation framework). | ||
| ### Why Repository-Specific Evals? |
do we need this section about why we need repo-specific evals?
| ### Phase 2: Navigation Test Generation | ||
| Generate 2-3 navigation tests that verify agents can find repository-specific documentation. |
maybe how many of each test is something that can be user input
| • promptfooconfig.yaml - Evaluation configuration | ||
| • EVALUATION.md - Evaluation documentation | ||
| Run evaluations: make eval |
should we have templated guidance on writing the makefile changes?
/hold
| ``` | ||
| ### Phase 4: Anti-Pattern Test Generation | ||
the anti-pattern test generation is also repo specific, i.e., the example below of an API starting at v1 is more of a generic example, which is why it was in enhancements. maybe other repos will have repo-specific anti-patterns
| { | ||
| "id": 1, | ||
| "eval_name": "happy-path-evaluation", | ||
| "prompt": "I just created documentation for the multiarch-tuning-operator repository at /Users/kpais/kpais-workspace/claude-tmp/multiarch-tuning-operator-test-plugin. I ran /agentic-docs:generate-evals and it created a promptfooconfig.yaml file with 43 test cases. Now I want to run the evaluation to see if the documentation is good. Can you evaluate it?", |
is this file meant to be an example?
So this file contains evals to test the /evaluate skill itself.
evals.json file is generated by the skill-creator skill as part of its predefined workflow:
https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md
However, I haven't added evals.json for /generate-evals skill or the /component skill yet.
PR needs rebase.
Summary
Add promptfoo-based eval workflows to the agentic-docs plugin via [PR #437].
Introduces two new skills:
- /agentic-docs:generate-evals - generate repository-specific promptfoo eval suites from templates
- /agentic-docs:evaluate - validate provider configuration and execute eval suites with automated analysis
Features
generate-evals
evaluate
These workflows add deterministic and LLM-judged validation for:
Assertion types used
- skill-used / not-skill-used
- icontains / not-icontains
- llm-rubric
- cost / latency
Test coverage
Test infrastructure: Both skills include their own test suites in evals/evals.json
Summary by CodeRabbit
New Features
- agentic-docs plugin for creating and maintaining AI-optimized documentation for OpenShift.
- /agentic-docs:evaluate to run comparative documentation evaluations and produce structured reports.
- /agentic-docs:generate-evals to generate repository-specific evaluation suites.
- /metrics:ai-docs-telemetry to analyze ai-docs usage from session logs.
Documentation