diff --git a/README.md b/README.md index b1453cd..ffa4f1f 100644 --- a/README.md +++ b/README.md @@ -72,7 +72,7 @@ Initialize the project environment and create necessary directory structures. | **[pdd](agent-sops/pdd.sop.md)** | Prompt-driven development methodology | Complex problem solving, architectural decisions, system design | | **[code-task-generator](agent-sops/code-task-generator.sop.md)** | Intelligent task breakdown and planning from requirements | Project planning, sprint preparation, requirement analysis | | **[code-assist](agent-sops/code-assist.sop.md)** | TDD-based code implementation with structured workflow | Feature development, bug fixes, refactoring | -| **[eval](agent-sops/eval.sop.md)** | Automated evaluation workflow for AI agents using [Strands Evals SDK](https://github.com/strands-agents/evals) | Evaluation planning, test data generation, evaluation execution, and result analysis | +| **[eval](agent-sops/eval.sop.md)** | Automated evaluation workflow for AI agents using [Strands Evals SDK](https://github.com/strands-agents/evals) | Evaluation planning, test data generation, evaluation execution, and result analysis ([usage guide](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/eval-sop/)) | ## Quick Start diff --git a/agent-sops/eval.sop.md b/agent-sops/eval.sop.md index 4798d55..942b01c 100644 --- a/agent-sops/eval.sop.md +++ b/agent-sops/eval.sop.md @@ -4,56 +4,23 @@ EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results. -## How Users Interact with EvalKit +## Parameters -Users interact with EvalKit through natural conversation, such as: +- **agent_path** (required): Path to the agent folder to evaluate (e.g., `./chatbot-agent`, `/path/to/my-agent`) +- **evaluation_focus** (optional): Specific aspects to evaluate (e.g., "response quality", "tool calling accuracy") +- **test_case_count** (optional, default: 3): Number of test cases to generate +- **output_format** (optional, default: "jsonl"): Format for test data output -- "Build an evaluation plan for my QA agent at /path/to/agent" -- "Generate test cases focusing on edge cases" -- "Run the evaluation and show me the results" -- "Analyze the evaluation results and suggest improvements" +**Constraints for parameter acquisition:** +- You MUST ask for the agent_path if not provided because evaluation cannot proceed without knowing which agent to evaluate +- You MUST support multiple input methods including: + - Direct input: Path provided directly in the conversation + - Relative paths: Paths relative to current working directory + - Absolute paths: Full system paths to agent location +- You SHOULD infer evaluation_focus from user's natural language description when not explicitly provided +- You MAY use intelligent defaults for optional parameters based on agent analysis -EvalKit understands the evaluation workflow and guides users through four phases: **Plan**, **Data**, **Eval**, and **Report**. 
- -## Evaluation Workflow - -### Phase 1: Planning - -**User Intent**: Create an evaluation strategy -**Example Requests**: - -- "Create an evaluation plan for my chatbot" -- "I need to evaluate my agent's tool calling accuracy" -- "Plan an evaluation for the agent at /path/to/agent" - -### Phase 2: Test Data Generation - -**User Intent**: Generate test cases -**Example Requests**: - -- "Generate test cases for the evaluation" -- "Create 10 test cases covering edge cases" -- "Add more test scenarios for error handling" - -### Phase 3: Evaluation Execution - -**User Intent**: Run the evaluation -**Example Requests**: - -- "Run the evaluation" -- "Execute the tests and show results" -- "Evaluate the agent with the test cases" - -### Phase 4: Results Analysis - -**User Intent**: Analyze results and get recommendations -**Example Requests**: - -- "Analyze the evaluation results" -- "What improvements should I make?" -- "Generate a report with recommendations" - -## Implementation Guidelines +## Steps ### 1. Setup and Initialization @@ -136,22 +103,9 @@ All evaluation artifacts MUST be created in the `eval/` folder at the same level 6. Report completion with evaluation plan file path, and suggest next step: "Would you like me to generate test cases based on this plan?" -### Planning Phase Guidelines - -#### Decision Guidelines - -When creating evaluation plans from a user prompt: - -1. **Prioritize user evaluation requests**: User input takes precedence over detected agent state - always honor specific user requirements and constraints -2. **Provide intelligent defaults**: When user input is minimal, use agent state analysis to suggest appropriate modules and implementation strategy -3. **Make informed guesses**: Use context, agent type patterns, and evaluation best practices to fill remaining gaps -4. **Enable design iteration**: Always include guidance for refining evaluation requests when defaults don't match user needs -5. **Think like an evaluator and architect**: Every requirement should be measurable and every technology choice should have clear rationale -6. **Make informed decisions**: Use context, agent type patterns, and evaluation best practices to make reasonable decisions without requiring user clarification - -### Evaluation Planning Guidelines +#### Evaluation Planning Guidelines -#### Design Principles +##### Design Principles **High-Level Design (What & Why)**: @@ -165,7 +119,7 @@ When creating evaluation plans from a user prompt: - Design practical file structure and execution approach - Choose integration patterns and configuration methods -#### Metrics Guidelines +##### Metrics Guidelines Evaluation metrics must be: @@ -173,7 +127,7 @@ Evaluation metrics must be: 2. **Verifiable**: Can be measured through actual agent execution 3. **Implementation-ready**: Clear enough to guide technical implementation -#### Architecture Principles +##### Architecture Principles **Key Principles**: @@ -184,7 +138,7 @@ Evaluation metrics must be: - **Framework-First**: Leverage existing evaluation frameworks before building custom solutions - **Modular Design**: Create reusable components that can be easily tested and maintained -#### Technology Selection Defaults +##### Technology Selection Defaults **Examples of reasonable defaults**: @@ -237,7 +191,7 @@ Evaluation metrics must be: 5. Report completion with test case count, coverage summary, and suggest next step: "Would you like me to run the evaluation with these test cases?" 
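For illustration, a minimal sketch of the generated artifact is shown below. It is not SDK-specific; the record fields (`input`, `expected_output`, `metadata`) are assumptions that mirror the Case structure referenced elsewhere in this SOP, so align them with the evaluation plan and the current Strands Evals SDK API before use.

```python
# Minimal sketch: write generated test cases to eval/test-cases.jsonl (the default output_format).
# The record fields (input, expected_output, metadata) are assumptions mirroring the Case
# structure used elsewhere in this SOP; adjust them to the evaluation plan before running.
import json
from pathlib import Path

test_cases = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Clear step-by-step password reset instructions",
        "metadata": {"scenario": "basic_support_query"},
    },
    {
        "input": "Cancel my most recent order and confirm the refund",
        "expected_output": "Agent calls the order-cancellation tool and confirms the refund",
        "metadata": {"scenario": "tool_calling"},
    },
    {
        "input": "asdf????",
        "expected_output": "Polite clarification request instead of a guessed answer",
        "metadata": {"scenario": "edge_case_unclear_input"},
    },
]

output_path = Path("eval") / "test-cases.jsonl"
output_path.parent.mkdir(parents=True, exist_ok=True)
with output_path.open("w", encoding="utf-8") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```

Keeping `expected_output` as a natural-language description rather than an exact string leaves room for the LLM-as-a-judge style scoring described later in this SOP.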
-### Data Generation Guidelines +#### Data Generation Guidelines 1. **Prioritize user-specific data requests**: User input takes precedence over the established evaluation plan - always honor specific user requirements and constraints. Update the evaluation plan if needed. @@ -289,11 +243,11 @@ Evaluation metrics must be: 5. Report completion with evaluation results summary and suggest next step: "Would you like me to analyze these results and provide recommendations?" -### Implementation Guidelines +#### Implementation Guidelines **CRITICAL: Always Create Minimal Working Version**: Implement the most basic version that works -#### Strands Evals SDK Integration +##### Strands Evals SDK Integration **CRITICAL REQUIREMENT - Getting Latest Documentation**: Before implementing evaluation code, you MUST retrieve the latest Strands Evals SDK documentation and API usage examples. This is NOT optional. You MUST NOT proceed with implementation without either context7 access or the source code. This ensures you're using the most current patterns and avoiding deprecated APIs. @@ -302,8 +256,7 @@ Before implementing evaluation code, you MUST retrieve the latest Strands Evals First, check if context7 MCP server is available by attempting to use it. If you receive an error indicating context7 is not available, proceed to Step 3. **Step 2: Primary Method - Using Context7 (If Available)**: - -1. Use context7 to get library documentation: "Get documentation for strands-evals focusing on Case, Experiment, and Evaluator classes" +1. Use context7 to get library documentation: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes" 2. Review the latest API patterns and examples 3. Implement evaluation code using the current API @@ -322,7 +275,7 @@ Please install the context7 MCP server in your coding assistant to access the la **Note**: If you're unsure how to install MCP servers in your coding assistant, please consult your assistant's support resources or choose Option 2 below (clone source code). -After installation, you'll be able to query: "Get documentation for strands-evals focusing on Case, Experiment, and Evaluator classes" +After installation, you'll be able to query: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes" **Option 2: Clone Strands Evals SDK Source Code** @@ -330,7 +283,7 @@ If you cannot install context7 MCP or prefer to work with source code directly: ```bash cd -git clone https://github.com/strands-agents/evals strands-evals-source +git clone https://github.com/strands-agents/evals strands-agents-evals-source ``` **IMPORTANT**: You MUST NOT proceed with implementation until the user has completed one of these options. Do NOT attempt to implement evaluation code using only the reference examples in Appendix C, as they may be outdated. @@ -343,8 +296,8 @@ After the user confirms they've completed one of the above options: 3. Implement evaluation code using the current API **If source code was cloned:** -1. Read the source files to understand the current API: `strands-evals-source/src/strands_evals/` -2. Check examples in the repository: `strands-evals-source/examples/` +1. Read the source files to understand the current API: `strands-agents-evals-source/src/strands_evals/` +2. Check examples in the repository: `strands-agents-evals-source/examples/` 3. Review API definitions and usage patterns 4. 
Implement evaluation code based on the actual source code @@ -355,33 +308,7 @@ After the user confirms they've completed one of the above options: - **Built-in evaluators**: OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator - **Direct execution**: Agent execution with evaluation, no separate trace collection needed -**Basic Pattern**: - -```python -from strands_evals import Case, Experiment, OutputEvaluator - -# Create test cases -cases = [ - Case( - input="test input", - expected_output="expected response", - metadata={"scenario": "basic_test"} - ) -] - -# Create experiment with evaluator -experiment = Experiment( - cases=cases, - evaluator=OutputEvaluator() -) - -# Run evaluation -results = experiment.run(agent_function) -``` - -#### Environment Setup Guidelines - -##### Update Existing Requirements +##### Environment Setup Guidelines 1. **Check Existing Requirements**: Verify requirements.txt exists in repository root @@ -394,7 +321,7 @@ results = experiment.run(agent_function) ```bash # Add Strands Evals SDK and related dependencies - grep -q "strands-evals" requirements.txt || echo "strands-evals>=1.0.0" >> requirements.txt + grep -q "strands-agents-evals" requirements.txt || echo "strands-agents-evals" >> requirements.txt # Add other evaluation-specific dependencies as needed based on evaluation plan ``` @@ -405,7 +332,7 @@ results = experiment.run(agent_function) uv pip install -r requirements.txt ``` -#### Common Pitfalls to Avoid +##### Common Pitfalls to Avoid - **Over-Engineering**: Don't add complexity before the basic version works - **Ignoring the Plan**: Follow the established evaluation plan structure and requirements @@ -512,9 +439,9 @@ results = experiment.run(agent_function) 9. Report completion with key findings and ask: "Would you like me to help implement any of these recommendations?" -### Analysis and Reporting Guidelines +#### Analysis and Reporting Guidelines -#### Analysis Principles +##### Analysis Principles - **Evidence-Based**: All insights must be supported by actual execution data - **Actionable**: Recommendations must be specific and implementable @@ -522,7 +449,7 @@ results = experiment.run(agent_function) - **Measurable**: Include expected outcomes and success metrics - **Realistic**: Consider implementation effort and constraints -#### Red Flags for Simulation +##### Red Flags for Simulation Always check for these indicators of simulated results: @@ -531,7 +458,7 @@ Always check for these indicators of simulated results: - Keywords like "simulated", "mocked", "fake" in results - Lack of natural variation in evaluation scores -#### Quality Standards for Recommendations +##### Quality Standards for Recommendations **Good Recommendations**: @@ -549,7 +476,7 @@ Always check for these indicators of simulated results: - Are too generic or theoretical - Ignore practical implementation challenges -#### Report Quality Standards +##### Report Quality Standards Ensure your advisory report: @@ -596,82 +523,57 @@ Finalize the evaluation and prepare deliverables. - You SHOULD create a summary of key findings and recommendations - You MAY suggest next steps for continuous evaluation -## Conversation Examples - -### Example 1: Complete Evaluation Flow - -**User**: "I need to evaluate my customer support chatbot at ./chatbot-agent. Focus on response quality and tool calling accuracy." - -**Assistant**: "I'll create an evaluation plan for your customer support chatbot focusing on response quality and tool calling accuracy. 
Let me analyze the agent at ./chatbot-agent..." - -[Creates eval/eval-plan.md] - -**Assistant**: "I've created an evaluation plan with: - -- 2 key metrics: Response Quality Score and Tool Call Accuracy -- 3 test scenarios covering common support queries -- Strands Evals SDK implementation approach - -Would you like me to generate test cases based on this plan?" - -**User**: "Yes, generate 5 test cases" - -**Assistant**: "I'll generate 5 test cases covering the scenarios in the evaluation plan..." - -[Creates eval/test-cases.jsonl] +## Examples -**Assistant**: "I've generated 5 test cases: +### Example Input -- 2 basic support queries -- 2 tool-calling scenarios -- 1 edge case +``` +agent_path: "./chatbot-agent" +evaluation_focus: "response quality and tool calling accuracy" +``` -Would you like me to run the evaluation with these test cases?" +### Example Output -**User**: "Run the evaluation" +**Complete Evaluation Flow:** -**Assistant**: "I'll implement and execute the evaluation using Strands Evals SDK..." +``` +Phase 1 - Planning: +User: "I need to evaluate my customer support chatbot at ./chatbot-agent. Focus on response quality and tool calling accuracy." +Assistant: "I'll create an evaluation plan for your customer support chatbot..." +[Creates eval/eval-plan.md with 2 key metrics and 3 test scenarios] + +Phase 2 - Data Generation: +User: "Yes, generate 5 test cases" +Assistant: "I'll generate 5 test cases covering the scenarios..." +[Creates eval/test-cases.jsonl with 2 basic queries, 2 tool-calling scenarios, 1 edge case] + +Phase 3 - Evaluation Execution: +User: "Run the evaluation" +Assistant: "I'll implement and execute the evaluation using Strands Evals SDK..." [Creates eval/run_evaluation.py, runs evaluation] +Results: Overall success rate: 80%, Response Quality: 4.2/5, Tool Call Accuracy: 75% -**Assistant**: "Evaluation completed! Results: - -- Overall success rate: 80% -- Response Quality: 4.2/5 -- Tool Call Accuracy: 75% - -Would you like me to analyze these results and provide recommendations?" - -**User**: "Yes, analyze the results" - -**Assistant**: "I'll analyze the evaluation results and generate recommendations..." - +Phase 4 - Analysis: +User: "Yes, analyze the results" +Assistant: "I'll analyze the evaluation results and generate recommendations..." [Creates eval/eval-report.md] +Key findings: Strong performance on basic queries (100% success), Tool calling needs improvement (25% failure rate) +``` -**Assistant**: "Analysis complete! Key findings: - -- Strong performance on basic queries (100% success) -- Tool calling needs improvement (25% failure rate) -- 3 critical recommendations prioritized - -The detailed report is in eval/eval-report.md. Would you like me to help implement any of these recommendations?" - -### Example 2: Iterative Refinement - -**User**: "Create an evaluation plan for my agent at ./my-agent" - -**Assistant**: [Creates initial plan] - -**User**: "Add more focus on error handling" +### Example: Iterative Refinement -**Assistant**: "I'll update the evaluation plan to include error handling metrics..." +``` +User: "Create an evaluation plan for my agent at ./my-agent" +Assistant: [Creates initial plan in eval/eval-plan.md] +User: "Add more focus on error handling" +Assistant: "I'll update the evaluation plan to include error handling metrics..." [Updates eval/eval-plan.md] -**User**: "Generate test cases with more edge cases" - -**Assistant**: "I'll generate test cases with additional edge case coverage..." 
- +User: "Generate test cases with more edge cases" +Assistant: "I'll generate test cases with additional edge case coverage..." [Updates eval/test-cases.jsonl] +``` ### Example Output Structure @@ -1063,511 +965,3 @@ ACTION REQUIRED: Identify limitations in the current evaluation approach and sug [Repeat structure for other evaluation improvement areas] ``` - -## Appendix C: Strands Evals SDK Reference and Code Examples - -**CRITICAL WARNING**: The code examples below are REFERENCE ONLY and may be outdated. You MUST NOT use these examples as your primary implementation guide. - -**REQUIRED**: Before implementing evaluation code, you MUST follow the documentation retrieval process described in Section 4 "Evaluation Implementation and Execution Phase" > "Strands Evals SDK Integration". Do NOT proceed with implementation until you have obtained current documentation through either context7 MCP or the cloned source code. - -### Core Evaluation Principles - -1. **Case-Based Testing**: Individual test cases with input, expected output, and metadata -2. **Experiment Framework**: Collection of cases with evaluator for running evaluations -3. **Built-in Evaluators**: OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator -4. **Direct Evaluation**: No separate trace collection, direct evaluation during execution - -### Basic Usage Pattern - -```python -# In eval/run_evaluation.py -import json -from strands_evals import Case, Experiment, OutputEvaluator - -def load_test_cases(file_path: str) -> List[Case]: - """Load test cases from JSONL file.""" - cases = [] - with open(file_path, 'r') as f: - for line in f: - test_data = json.loads(line) - case = Case( - input=test_data['input'], - expected_output=test_data.get('expected_output'), - metadata=test_data.get('metadata', {}) - ) - cases.append(case) - return cases - -def create_experiment(test_cases_file: str = 'test-cases.jsonl'): - """Create experiment with test cases and evaluator.""" - cases = load_test_cases(test_cases_file) - - experiment = Experiment( - cases=cases, - evaluator=OutputEvaluator() - ) - - return experiment -``` - -### Agent Integration Pattern - -```python -# In eval/run_evaluation.py -import json -import os -from datetime import datetime -from pathlib import Path - -def agent_wrapper(agent_function): - """Wrapper to integrate agent with Strands evaluation.""" - def wrapped_agent(case_input): - # Call the actual agent - result = agent_function(case_input) - return result - return wrapped_agent - -def run_evaluation(agent_function, output_dir: str = 'results'): - """Run complete evaluation pipeline.""" - # Create experiment - experiment = create_experiment() - - # Wrap agent for evaluation - wrapped_agent = agent_wrapper(agent_function) - - # Run evaluation - results = experiment.run(wrapped_agent) - - # Save results - timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - output_path = Path(output_dir) / timestamp - output_path.mkdir(parents=True, exist_ok=True) - - with open(output_path / 'results.json', 'w') as f: - json.dump(results, f, indent=2) - - return results, output_path - -def main(): - """Main evaluation entry point.""" - # Import your agent function - from your_agent import main_agent_function - - # Run evaluation - results, output_path = run_evaluation(main_agent_function) - - # Print summary - print(f"Evaluation completed. 
Results saved to: {output_path}") - print(f"Total cases: {len(results.get('cases', []))}") - - # Print metrics summary - if 'metrics' in results: - for metric_name, score in results['metrics'].items(): - print(f"{metric_name}: {score:.3f}") - -if __name__ == "__main__": - main() -``` - -### Key Implementation Points - -1. **Direct Integration**: Import and call agent functions directly, no trace collection needed -2. **Case Objects**: Use Case objects to structure test inputs, expected outputs, and metadata -3. **Experiment Framework**: Use Experiment class to manage test execution and evaluation -4. **Built-in Evaluators**: Leverage OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator -5. **Custom Evaluators**: Extend BaseEvaluator for domain-specific evaluation logic - -### Evaluator Reference - -**Note**: This section provides detailed information about Strands Evals SDK evaluators. While the API patterns shown here are important for understanding the framework, you MUST still obtain current documentation before implementation. - -Strands Evals SDK provides a set of evaluators that plug into the standard: - -- `Case[InputT, OutputT]` -- `Experiment[InputT, OutputT]` -- `user_task_function(case: Case) -> TaskOutput | dict | OutputT` - -All evaluators live under: - -```python -from strands_evals.evaluators import ( - OutputEvaluator, - Evaluator, # base class for custom evaluators -) -``` - -Most evaluators return one or more `EvaluationOutput` objects: - -```python -from strands_evals.types.evaluation import EvaluationOutput - -# Fields: -# score: float in [0.0, 1.0] -# test_pass: bool -# reason: str (judge / metric reasoning) -# label: Optional[str] (categorical label, e.g. "Yes", "Very helpful", "3/4 keywords") -``` - -Custom evaluators directly consume `EvaluationData[InputT, OutputT]`, which includes input/output, optional expected values, and optional trajectory / interactions. - ---- - -#### 1. Overview – Which Evaluator When? - -| Evaluator | Level | What it measures | Task function must return… | -| ------------------ | ------------ | ---------------------------------------------------------------- | ---------------------------------------- | -| `OutputEvaluator` | Output-level | Subjective quality vs rubric (helpfulness, safety, clarity, etc) | `output` (string or serializable output) | -| `Custom Evaluator` | Any | Your own metrics / judge logic | Anything accessible via `EvaluationData` | - ---- - -#### 2. OutputEvaluator (LLM-as-a-judge over outputs) - -**Namespace:** `strands_evals.evaluators.OutputEvaluator` - -##### 2.1. What it does - -`OutputEvaluator` runs a **judge LLM** over each `(input, output, optional expected_output)` and applies a **rubric** you provide. It’s the generic “LLM-as-a-judge” for arbitrary subjective criteria: - -- quality, correctness, completeness -- tone, style, safety -- policy or UX guideline compliance - -Use it when you want a flexible scoring mechanism without writing your own evaluator. - -##### 2.2. Key parameters - -From the docs: - -- `rubric: str` (required) - A natural-language description of: - - - What “good” looks like (criteria). - - How to map to scores (e.g., 0 / 0.5 / 1). - - Optional labels or categories (e.g., “pass”, “borderline”, “fail”). - -- `model: Union[Model, str, None] = None` - - - Judge model used by the evaluator. - - `None` ⇒ default Bedrock model configured in Strands Agents. - -- `system_prompt: str | None = None` - - - Overrides the built-in system prompt used to drive the judge. 
- - Use this to add domain-specific guidance (e.g., “You are a security reviewer…”). - -- `include_inputs: bool = True` - - - If `True`, the evaluator passes the input prompt into the judge context. - - Set `False` when you want to judge the output in isolation (e.g., style-only). - -##### 2.3. What it expects from your task - -Your `user_task_function` must return something that the evaluator can treat as: - -- `output` (string or serializable to string). -- (Optionally) `expected_output` if you want the rubric to consider a reference answer. - -In the simplest case: - -```python -from strands import Agent -from strands_evals import Case, Experiment -from strands_evals.evaluators import OutputEvaluator - -def task_fn(case: Case[str, str]) -> str: - agent = Agent( - system_prompt="You are a friendly, concise assistant." - ) - return str(agent(case.input)) - -cases = [ - Case[str, str]( - name="greeting", - input="Say hi to a new user.", - expected_output="A short, warm greeting." - ), -] - -evaluator = OutputEvaluator( - rubric=( - "Score the response on correctness, completeness, and tone. " - "Return 1.0 for excellent, 0.5 for partially acceptable, " - "0.0 for poor or off-topic answers." - ), - include_inputs=True, -) - -experiment = Experiment[str, str](cases=cases, evaluator=evaluator) -report = experiment.run_evaluations(task_fn) -report.run_display() -``` - -(This is logically equivalent to the docs' example but rephrased.) - -##### 2.4. Output format - -Each case yields an `EvaluationOutput`: - -- `score: float` in `[0.0, 1.0]` -- `test_pass: bool`, usually determined via a default threshold (≥ 0.5) -- `reason: str` – judge’s explanation -- `label: Optional[str]` – optional category string (you can encode discrete classes here) - -##### 2.5. Designing good rubrics - -Practical suggestions (building on the SDK docs): - -1. **Constrain the scale** - - - Recommend 2–4 discrete levels (e.g. `0.0`, `0.5`, `1.0`). - - Explicitly map each level to conditions, _and_ mention examples: - - - “1.0: fully correct, complete, well-structured answer with no policy issues.” - - “0.5: partially correct or missing some details, but still useful.” - - “0.0: off-topic, incorrect, or unsafe.” - -2. **Separate dimensions in the rubric, not the model** - - For multi-dimensional evaluations (e.g., helpfulness + safety + tone), either: - - - Run multiple experiments with different rubrics (simplest), or - - In one rubric, instruct the judge to compute: - - - `score_overall`, plus optional sub-dimension comments that go into `reason`. - -3. **Use `include_inputs=True` for context-dependent criteria** - - - If your rubric cares about whether the response _answered the question_, you want the judge to see the original question. - - If you only care about lexical/format constraints, you can set `include_inputs=False`. - -4. **Test the rubric with known cases** - - - Build a small set of “gold” examples where you know the desired score. - - Run the evaluator on them to QA the rubric and adjust wording. - ---- - -#### 3. Custom Evaluators (extending `Evaluator`) - -**Namespace:** `strands_evals.evaluators.Evaluator` (base) - -##### 3.1. When to write a custom evaluator - -Use a custom evaluator when: - -- None of the built-ins capture the metric you care about. -- You want non-LLM metrics (e.g., exact match, BLEU, latency, cost). -- You need to integrate external services (e.g., another scoring API). -- You want to evaluate at a custom “level” (per paragraph, per tool call, per session cluster, etc.). 
- -##### 3.2. Core API shape - -Custom evaluators subclass: - -```python -from typing_extensions import TypeVar -from strands_evals.evaluators import Evaluator -from strands_evals.types.evaluation import EvaluationData, EvaluationOutput - -InputT = TypeVar("InputT") -OutputT = TypeVar("OutputT") - -class MyEvaluator(Evaluator[InputT, OutputT]): - def __init__(self, ...): - super().__init__() - # store config - - def evaluate( - self, - evaluation_case: EvaluationData[InputT, OutputT], - ) -> list[EvaluationOutput]: - # sync logic - ... - - async def evaluate_async( - self, - evaluation_case: EvaluationData[InputT, OutputT], - ) -> list[EvaluationOutput]: - # async logic (can call evaluate for simple cases) - ... -``` - -`EvaluationData` gives you: input, actual/expected output, actual/expected trajectory, actual/expected interactions, etc. You decide which fields are relevant. - -##### 3.3. Example: pure metric-based evaluator - -Adapted from the docs' keyword example: - -```python -class KeywordCoverageEvaluator(Evaluator[InputT, OutputT]): - """ - Checks whether the output includes all required keywords. - """ - - def __init__(self, required_keywords: list[str], case_sensitive: bool = False): - super().__init__() - self.required_keywords = required_keywords - self.case_sensitive = case_sensitive - - def _normalize(self, text: str) -> str: - return text if self.case_sensitive else text.lower() - - def evaluate(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: - output_text = self._normalize(str(evaluation_case.actual_output)) - - if self.case_sensitive: - keywords = self.required_keywords - else: - keywords = [kw.lower() for kw in self.required_keywords] - - present = [kw for kw in keywords if kw in output_text] - missing = [kw for kw in keywords if kw not in output_text] - - if keywords: - score = len(present) / len(keywords) - else: - score = 1.0 # nothing to check - - passed = score == 1.0 - if passed: - reason = f"All required keywords present: {present}" - else: - reason = f"Missing keywords: {missing}; found: {present}" - - return [ - EvaluationOutput( - score=score, - test_pass=passed, - reason=reason, - label=f"{len(present)}/{len(keywords)} keywords", - ) - ] - - async def evaluate_async( - self, - evaluation_case: EvaluationData[InputT, OutputT], - ) -> list[EvaluationOutput]: - return self.evaluate(evaluation_case) -``` - -##### 3.4. Example: LLM-based custom evaluator - -You can also embed your own judge Agent inside the evaluator (e.g., to specialize tone / style checks): - -```python -from strands import Agent as StrandsAgent # to avoid name clash - -class ToneEvaluator(Evaluator[InputT, OutputT]): - """ - Uses a judge agent to check whether the response has the desired tone. - """ - - def __init__(self, expected_tone: str, model: str | None = None): - super().__init__() - self.expected_tone = expected_tone - self.model = model - - def _build_judge(self) -> StrandsAgent: - return StrandsAgent( - model=self.model, - system_prompt=( - f"You evaluate whether responses have a {self.expected_tone} tone. " - "Return an EvaluationOutput with score 1.0 for fully appropriate tone, " - "0.5 for mixed tone, and 0.0 if tone clearly does not match." - ), - ) - - def _make_prompt(self, data: EvaluationData[InputT, OutputT]) -> str: - return ( - f"Input:\n{data.input}\n\n" - f"Response:\n{data.actual_output}\n\n" - "Judge whether the tone matches the desired style." 
- ) - - def evaluate(self, data: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: - judge = self._build_judge() - prompt = self._make_prompt(data) - result = judge.structured_output(EvaluationOutput, prompt) - return [result] - - async def evaluate_async(self, data: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: - judge = self._build_judge() - prompt = self._make_prompt(data) - result = await judge.structured_output_async(EvaluationOutput, prompt) - return [result] -``` - -##### 3.5. Example: multi-level / per-tool evaluation - -You can also implement your own "levels", e.g., iterate through `actual_trajectory` and emit multiple `EvaluationOutput`s (similar to how ToolParameter/ToolSelection do it): - -```python -class PerToolLatencyEvaluator(Evaluator[InputT, OutputT]): - """ - Example: emits one EvaluationOutput per tool call based on latency. - """ - - def __init__(self, max_ms: float): - super().__init__() - self.max_ms = max_ms - - def evaluate(self, data: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: - results: list[EvaluationOutput] = [] - - if not data.actual_trajectory: - return [] - - for call in data.actual_trajectory: - # Assume telemetry tooling populates "duration_ms" - duration = call.get("duration_ms", 0.0) - score = 1.0 if duration <= self.max_ms else max(0.0, 1.0 - duration / (2 * self.max_ms)) - passed = duration <= self.max_ms - reason = f"Tool {call.get('name')} took {duration:.1f}ms (max allowed {self.max_ms}ms)." - - results.append( - EvaluationOutput( - score=score, - test_pass=passed, - reason=reason, - label="ok" if passed else "slow", - ) - ) - - return results - - async def evaluate_async(self, data: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: - return self.evaluate(data) -``` - -##### 3.6. Using a custom evaluator in an Experiment - -Usage is identical to built-ins: - -```python -from strands_evals import Case, Experiment - -cases = [ - Case[str, str]( - name="email-1", - input="Write a short professional email to a recruiter.", - ), -] - -evaluator = ToneEvaluator(expected_tone="professional") - -experiment = Experiment[str, str](cases=cases, evaluator=evaluator) -report = experiment.run_evaluations(task_fn) # your existing task function -report.run_display() -``` - -##### 3.7. Custom evaluator best practices - -Based on the SDK guidance: - -- Always subclass `Evaluator` and implement both `evaluate` and `evaluate_async`. -- Always return a **list** of `EvaluationOutput`. -- Keep scores in `[0.0, 1.0]` and document what your thresholds mean. -- Put detailed human-readable reasoning in `reason` – this is what you’ll debug with. -- Handle missing data gracefully (e.g., no trajectory, no expected_output). -- Think about level: per-case, per-turn, per-tool, per-interaction, or multi-level.