diff --git a/README.md b/README.md index 52035ad..57f6d38 100644 --- a/README.md +++ b/README.md @@ -1,45 +1,20 @@ -# Eval Engine Learning Lab +# Eval Engine -A minimal, practical evaluation framework for LLM systems and AI application flows. +A lightweight evaluation framework for LLM systems combining: -This project started as a learning lab for evaluating **non-deterministic AI outputs** using a layered approach, and is now being refactored toward a **boilerplate-friendly eval engine**. - -It combines: -- deterministic rules (critical gates) -- heuristic scoring -- LLM-as-judge (analytical signal) -- regression comparison +- deterministic evaluation (critical gates + heuristics) +- probabilistic evaluation (LLM-as-judge) +- regression detection across runs --- -## Why this exists - -Traditional QA assumes deterministic outputs and exact expected results. - -LLM systems and AI applications introduce: -- variable responses -- qualitative output quality -- probabilistic reasoning -- formatting inconsistency -- partial subjectivity in evaluation - -This repo demonstrates a practical evaluation architecture to handle that while preserving deterministic control over PASS/FAIL. - ---- +## Core Idea -## Core principle +LLM systems are non-deterministic. -> Heuristics decide PASS/FAIL. Judges provide analytical insight. +Traditional pass/fail testing is not enough. -This separation is intentional. - -- **Deterministic evaluation** remains the source of truth -- **LLM judges** help analyze quality, disagreement, and rubric clarity -- **Regression comparison** tracks stability across runs - ---- - -## Architecture +This project explores a layered evaluation approach: ```text Dataset @@ -54,290 +29,119 @@ Heuristic Scoring (deterministic) ↓ Regression Comparison ``` - --- -## Boilerplate direction -The framework is being refactored so that the evaluated system is no longer assumed to be just a single hardcoded LLM prompt. 
+## Example Packs -The intended boilerplate model is: - • task config → defines the evaluation task, dataset, rubric, and evaluation behavior - • system config → defines the system under test - • judge config → defines judge roles, prompts, and rubric anchors +The project now supports **multiple evaluation domains** via self-contained example packs. -This makes the framework reusable for: - • recommendation tasks - • support assistant evaluation - • agent workflow evaluation - • retrieval or RAG response evaluation - • other structured AI output evaluation tasks +### 1. Wine Recommendation (Reference Example) -For extension guidance, see: `docs/boilerplate_extension_guide.md` -⸻ +- recommendation-style evaluation +- structured list outputs +- qualitative scoring (taste, tone, diversity) +- baseline reference task -## Current reference example +examples/wine_recommendation/ -The current example task is: +--- -Wine recommendation evaluation +### 2. Retail Support (Multi-purpose Example) -This remains the reference task because it demonstrates: - • hard constraints - • qualitative scoring - • useful judge disagreement - • adversarial cases - • regression comparison +Demonstrates: -Wine is the example task, not the long-term framework identity. +- recommendation tasks +- support assistant evaluation +- retrieval-grounded responses (RAG-style) +- simple agent workflows (mock tools) +- structured output expectations -⸻ +examples/retail_support/ + +--- -## Repo structure +## Project Structure ```text . 
+│ +├── examples/ +│ ├── wine_recommendation/ +│ └── retail_support/ +│ ├── configs/ │ ├── tasks/ -│ │ └── wine.yaml │ ├── systems/ -│ │ └── openai_wine.yaml │ └── judges/ -│ └── default_ensemble.yaml │ -├── dataset.json -├── rubric.json +├── results// +├── baselines// +│ ├── runner.py ├── scorer.py -├── confidence.py -├── system_client.py -├── judge_client.py +├── task_loader.py +├── tool_simulator.py +├── schemas.py ├── regression_compare.py -│ -├── prompts/ -│ ├── system/ -│ │ └── wine_generator.txt -│ └── judges/ -│ ├── base_prompt.txt -│ ├── role_balanced.txt -│ ├── role_strict.txt -│ └── role_usefulness.txt -│ -├── baselines/ -├── results/ -├── tests/ -├── examples/ -│ -├── README.md -├── requirements.txt -├── .env.example -├── .gitignore -└── LICENSE - ``` +--- -## Configuration model - -1. Task config - -Defines: - • dataset path - • rubric path - • evaluation wiring - • mock response profile - • selected judge ensemble config - -Example: - • configs/tasks/wine.yaml - -2. System config - -Defines the system under test: - • provider - • model - • temperature - • mock mode support - -Example: - • configs/systems/openai_wine.yaml - -3. Judge config - -Defines judge behavior: - • model - • temperature - • role definitions - • base prompt - • rubric anchors - -Example: - • configs/judges/default_ensemble.yaml - - -## Quickstart - -### 1. 
Setup +## Running Evaluations +### Wine example ```bash -python3 -m venv .venv -source .venv/bin/activate -pip install -r requirements.txt -cp .env.example .env +python3 runner.py --task-config configs/tasks/wine.yaml --mode mock ``` - -Add your OpenAI key: -```env -OPENAI_API_KEY=your_key_here -``` -⸻ - -### Run examples - -Mock (no API required) +### Retail support example ```bash -python3 runner.py --mode mock +python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock ``` ⸻ -### OpenAI generation +## Writing Baselines ```bash -python3 runner.py --mode openai --model gpt-4o-mini --temperature 0.0 +python3 runner.py --task-config configs/tasks/wine.yaml --mode mock --write-baseline +python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock --write-baseline ``` ⸻ -### OpenAI + judge ensemble -```bash -python3 runner.py --mode openai --enable-judge -``` -⸻ +## Regression Comparison -### Write baseline +Using explicit paths: ```bash -python3 runner.py --mode openai --enable-judge --write-baseline +python3 regression_compare.py baselines/wine_recommendation/baseline_results.json results/wine_recommendation/latest_results.json ``` -⸻ - -### Use explicit config paths +Using task shortcut: ```bash -python3 runner.py \ - --mode openai \ - --task-config configs/tasks/wine.yaml \ - --system-config configs/systems/openai_wine.yaml \ - --judge-config configs/judges/default_ensemble.yaml \ - --enable-judge +python3 regression_compare.py --task wine_recommendation +python3 regression_compare.py --task retail_support ``` -⸻ - -## Output artifacts - -Each run generates: -• results/latest_results.json → latest evaluation snapshot -• results/run_.json → historical run -• results/report.csv → flat analysis table - -Each result includes: -• generator output -• heuristic evaluation -• optional judge evaluation -• run metadata -• compatibility fields for regression comparison - ---- -## Evaluation layers - -### Critical gates - 
-Hard constraints that must not be violated. - -Examples in the wine task: - • exact item count - • red wine constraint - • max price - -### Heuristic scoring - -Deterministic scoring across defined dimensions. - -Current wine rubric dimensions: - • tasting_clarity - • popularity_alignment - • regional_diversity - • language_tone - -### Judge ensemble - -LLM judges provide analytical signal only. - -Current roles: - • balanced evaluator - • strict evaluator - • usefulness evaluator - -Judges help surface: - • disagreement - • ambiguity - • rubric weaknesses - • heuristic blind spots ⸻ -## Regression comparison -Use regression comparison to check whether a latest run regressed against a baseline. - -Example: -```bash -python3 regression_compare.py baselines/baseline_results.json results/latest_results.json --max-drop 0.3 -``` -The comparator supports both: - • legacy flat result fields - • standardized nested evaluation result structure - ---- - -## Design principles - -• deterministic logic governs decisions -• judges are analytical, not operational -• disagreement is useful -• evaluation requires engineering discipline -• simplicity is intentional - ---- - -## Current reference example - -The current example task is: +## What This Project Demonstrates + • How to evaluate LLM outputs beyond simple correctness + • How to combine heuristics and LLM judges + • How to detect regressions in non-deterministic systems + • How to design evaluation datasets and rubrics + • How to structure reusable evaluation tasks -**Wine recommendation evaluation** - -This remains the reference task because it demonstrates: -- hard constraints -- qualitative scoring -- useful judge disagreement -- adversarial cases -- regression comparison +⸻ -Wine is the example task, not the long-term framework identity. +## Status -See also: `docs/examples/wine_recommendation.md` ---- +This is a V1 learning lab project. 
-## Roadmap direction +Focus: + • clarity over completeness + • simplicity over abstraction + • experimentation over production design -Near-term boilerplate refactor goals: - • keep wine as the example task - • make task/system/judge configuration first-class - • support reusable evaluation structure across domains - • preserve regression safety and offline mock mode +⸻ -Longer-term exploration: +## Future Directions • judge disagreement visualization - • judge reliability experiments - • cross-model comparison - • rubric refinement - • evaluation analytics - -For current refactor status, see: `docs/boilerplate_v1_status.md` - -## License - -MIT \ No newline at end of file + • evaluation analytics across runs + • cross-model judge comparison + • richer agent workflow evaluation + • dashboard / visualization layer \ No newline at end of file diff --git a/baselines/retail_support/.gitkeep b/baselines/retail_support/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/baselines/retail_support/README.md b/baselines/retail_support/README.md new file mode 100644 index 0000000..dc6c8c7 --- /dev/null +++ b/baselines/retail_support/README.md @@ -0,0 +1,7 @@ +# Retail Support Baselines + +This folder stores baseline result snapshots for the retail support example. + +Suggested naming: +- `baseline_results_mock.json` +- `baseline_results_openai_.json` \ No newline at end of file diff --git a/baselines/wine_recommendation/.gitkeep b/baselines/wine_recommendation/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/baselines/wine_recommendation/README.md b/baselines/wine_recommendation/README.md new file mode 100644 index 0000000..465301d --- /dev/null +++ b/baselines/wine_recommendation/README.md @@ -0,0 +1,7 @@ +# Wine Recommendation Baselines + +This folder stores baseline result snapshots for the wine recommendation example. 
+
+Suggested naming:
+- `baseline_results_mock.json`
+- `baseline_results_openai_.json`
\ No newline at end of file
diff --git a/configs/tasks/wine.yaml b/configs/tasks/wine.yaml
index 300e20f..56d87a6 100644
--- a/configs/tasks/wine.yaml
+++ b/configs/tasks/wine.yaml
@@ -1,8 +1,6 @@
 name: wine_recommendation
-
-dataset_path: dataset.json
-rubric_path: rubric.json
-
+dataset_path: examples/wine_recommendation/dataset.json
+rubric_path: examples/wine_recommendation/rubric.json
 description: "Wine recommendation evaluation task (V1 reference task)"
 
 system_config: configs/systems/openai_wine.yaml
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000..e69de29
diff --git a/examples/retail_support/README.md b/examples/retail_support/README.md
new file mode 100644
index 0000000..20afd94
--- /dev/null
+++ b/examples/retail_support/README.md
@@ -0,0 +1,36 @@
+# Retail Support Example
+
+This example demonstrates a multi-purpose evaluation setup for:
+
+- recommendation tasks
+- customer support responses
+- retrieval-grounded answers (RAG-style)
+- simple agent workflows (tool usage simulation)
+- structured output validation
+
+## Purpose
+
+This example is designed to showcase the flexibility of the eval engine beyond a single domain.
+
+It combines multiple real-world assistant behaviors into a single evaluation pack.
+
+## Task Categories
+
+- recommendation
+- support_policy
+- order_support
+- agent_workflow
+
+## Data Files
+
+- `dataset.json` — evaluation cases across multiple categories
+- `rubric.json` — scoring logic and critical gates
+- `knowledge_base.json` — policy and support documents
+- `catalog.json` — product data for recommendations
+- `orders.json` — mock order data
+- `tool_scenarios.json` — simulated tool responses
+- `expected_outputs.json` — deterministic evaluation hints
+
+## Status
+
+Initial scaffold only. Logic will be wired in upcoming commits.
\ No newline at end of file
diff --git a/examples/retail_support/catalog.json b/examples/retail_support/catalog.json
new file mode 100644
index 0000000..b1f9a31
--- /dev/null
+++ b/examples/retail_support/catalog.json
@@ -0,0 +1,56 @@
+[
+  {
+    "sku": "JKT_001",
+    "name": "NorthTrail RainShell",
+    "category": "jacket",
+    "price_eur": 129,
+    "waterproof": true,
+    "use_case": ["hiking", "commute", "rain"],
+    "notes": "Lightweight waterproof shell suited for wet and windy conditions."
+  },
+  {
+    "sku": "JKT_002",
+    "name": "Fjord Trek Pro",
+    "category": "jacket",
+    "price_eur": 179,
+    "waterproof": true,
+    "use_case": ["hiking", "mountain"],
+    "notes": "Durable trekking shell with stronger weather protection but above tighter budget ranges."
+  },
+  {
+    "sku": "JKT_003",
+    "name": "NordLite PackShell",
+    "category": "jacket",
+    "price_eur": 99,
+    "waterproof": true,
+    "use_case": ["hiking", "travel"],
+    "notes": "Packable waterproof jacket for occasional hikes and everyday use."
+  },
+  {
+    "sku": "JKT_004",
+    "name": "Harbor Softshell",
+    "category": "jacket",
+    "price_eur": 89,
+    "waterproof": false,
+    "use_case": ["commute", "casual"],
+    "notes": "Comfortable softshell with light weather resistance, not fully waterproof."
+  },
+  {
+    "sku": "TENT_001",
+    "name": "WindRidge 2",
+    "category": "tent",
+    "price_eur": 210,
+    "waterproof": true,
+    "use_case": ["camping", "windy_weather", "2_person"],
+    "notes": "Two-person tent with strong pole structure for windy conditions."
+  },
+  {
+    "sku": "STOVE_001",
+    "name": "TrekLite Stove",
+    "category": "stove",
+    "price_eur": 59,
+    "waterproof": false,
+    "use_case": ["camping", "cooking"],
+    "notes": "Compact backpacking stove with 2-year limited warranty."
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/dataset.json b/examples/retail_support/dataset.json
new file mode 100644
index 0000000..a0938d1
--- /dev/null
+++ b/examples/retail_support/dataset.json
@@ -0,0 +1,98 @@
+[
+  {
+    "id": "SUP_001",
+    "category": "support_policy",
+    "query": "Can I return hiking shoes if I wore them outside once?",
+    "context_refs": ["policy_returns_worn_items"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_002",
+    "category": "order_support",
+    "query": "Where is my order ORD-1002?",
+    "context_refs": [],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_003",
+    "category": "recommendation",
+    "query": "Recommend 3 waterproof jackets under 150 euros for hiking.",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_004",
+    "category": "agent_workflow",
+    "query": "Cancel my order ORD-1005 if it has not shipped yet.",
+    "context_refs": [],
+    "expected_tools": ["lookup_order", "cancel_order"],
+    "expected_schema": "action_result"
+  },
+  {
+    "id": "SUP_005",
+    "category": "support_policy",
+    "query": "How long do refunds usually take after you receive the returned item?",
+    "context_refs": ["policy_refunds"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_006",
+    "category": "order_support",
+    "query": "My order ORD-1008 has been processing for several days. What should I do?",
+    "context_refs": ["policy_shipping_delay"],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_007",
+    "category": "recommendation",
+    "query": "Recommend a jacket for rainy weather in Denmark with a budget of 100 euros.",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_008",
+    "category": "agent_workflow",
+    "query": "Cancel order ORD-1002 for me.",
+    "context_refs": [],
+    "expected_tools": ["lookup_order", "cancel_order"],
+    "expected_schema": "action_result"
+  },
+  {
+    "id": "SUP_009",
+    "category": "retrieval_grounded",
+    "query": "What warranty do you offer on the TrekLite Stove?",
+    "context_refs": ["policy_warranty"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_010",
+    "category": "recommendation",
+    "query": "Which tent would you suggest for 2 people in windy weather?",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_011",
+    "category": "order_support",
+    "query": "Can order ORD-1011 still be cancelled?",
+    "context_refs": ["policy_cancellation"],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_012",
+    "category": "support_policy",
+    "query": "I used the shoes outside and now I want a refund. Is that allowed if they are not faulty?",
+    "context_refs": ["policy_returns_worn_items"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/expected_outputs.json b/examples/retail_support/expected_outputs.json
new file mode 100644
index 0000000..d37c0fe
--- /dev/null
+++ b/examples/retail_support/expected_outputs.json
@@ -0,0 +1,34 @@
+{
+  "SUP_001": {
+    "must_include_any": ["not eligible", "used outdoors"],
+    "must_not_include": ["full refund guaranteed"],
+    "required_context_ids": ["policy_returns_worn_items"]
+  },
+  "SUP_002": {
+    "required_tools": ["lookup_order"]
+  },
+  "SUP_004": {
+    "required_tools": ["lookup_order", "cancel_order"],
+    "expected_action_success": false
+  },
+  "SUP_005": {
+    "must_include_any": ["5 to 7 business days"],
+    "required_context_ids": ["policy_refunds"]
+  },
+  "SUP_006": {
+    "must_include_any": ["contact support", "manual check"],
+    "required_context_ids": ["policy_shipping_delay"]
+  },
+  "SUP_008": {
+    "required_tools": ["lookup_order", "cancel_order"],
+    "expected_action_success": true
+  },
+  "SUP_009": {
+    "must_include_any": ["2-year", "warranty"],
+    "required_context_ids": ["policy_warranty"]
+  },
+  "SUP_011": {
+    "must_include_any": ["cannot be cancelled"],
+    "required_context_ids": ["policy_cancellation"]
+  }
+}
\ No newline at end of file
diff --git a/examples/retail_support/knowledge_base.json b/examples/retail_support/knowledge_base.json
new file mode 100644
index 0000000..3a79a04
--- /dev/null
+++ b/examples/retail_support/knowledge_base.json
@@ -0,0 +1,27 @@
+[
+  {
+    "id": "policy_returns_worn_items",
+    "title": "Returns for worn footwear",
+    "text": "Footwear used outdoors is not eligible for return unless the item is faulty."
+  },
+  {
+    "id": "policy_cancellation",
+    "title": "Order cancellation policy",
+    "text": "Orders can be cancelled only before the shipment status changes to shipped."
+  },
+  {
+    "id": "policy_refunds",
+    "title": "Refund processing policy",
+    "text": "Approved refunds are processed within 5 to 7 business days after the returned item is received and inspected."
+  },
+  {
+    "id": "policy_shipping_delay",
+    "title": "Shipping delay guidance",
+    "text": "If an order remains in processing for more than 3 business days, the customer should be advised to contact support for a manual check."
+  },
+  {
+    "id": "policy_warranty",
+    "title": "Warranty policy",
+    "text": "Outdoor stoves and technical gear include a 2-year limited warranty covering manufacturing defects only."
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/orders.json b/examples/retail_support/orders.json
new file mode 100644
index 0000000..72a58cb
--- /dev/null
+++ b/examples/retail_support/orders.json
@@ -0,0 +1,30 @@
+[
+  {
+    "order_id": "ORD-1002",
+    "status": "processing",
+    "items": ["JKT_001"],
+    "can_cancel": true,
+    "days_in_status": 2
+  },
+  {
+    "order_id": "ORD-1005",
+    "status": "shipped",
+    "items": ["JKT_003"],
+    "can_cancel": false,
+    "days_in_status": 1
+  },
+  {
+    "order_id": "ORD-1008",
+    "status": "processing",
+    "items": ["STOVE_001"],
+    "can_cancel": true,
+    "days_in_status": 5
+  },
+  {
+    "order_id": "ORD-1011",
+    "status": "delivered",
+    "items": ["TENT_001"],
+    "can_cancel": false,
+    "days_in_status": 0
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/rubric.json b/examples/retail_support/rubric.json
new file mode 100644
index 0000000..cbbcb62
--- /dev/null
+++ b/examples/retail_support/rubric.json
@@ -0,0 +1,21 @@
+{
+  "version": "v1.1",
+  "weights": {
+    "instruction_match": 0.25,
+    "grounding_accuracy": 0.25,
+    "tool_use_correctness": 0.20,
+    "resolution_helpfulness": 0.20,
+    "tone_clarity": 0.10
+  },
+  "thresholds": {
+    "pass": 4.0,
+    "warn": 3.0
+  },
+  "critical_gates": [
+    "no_policy_hallucination",
+    "no_false_action_claim",
+    "respect_constraints",
+    "valid_output_schema",
+    "no_invented_order_status"
+  ]
+}
\ No newline at end of file
diff --git a/examples/retail_support/task_config.yaml b/examples/retail_support/task_config.yaml
new file mode 100644
index 0000000..41a9470
--- /dev/null
+++ b/examples/retail_support/task_config.yaml
@@ -0,0 +1,30 @@
+task_name: retail_support
+task_type: mixed
+description: "Retail support evaluation task covering recommendation support retrieval and simple workflow cases"
+
+dataset_path: examples/retail_support/dataset.json
+rubric_path: examples/retail_support/rubric.json
+
+knowledge_base_path: examples/retail_support/knowledge_base.json
+catalog_path: examples/retail_support/catalog.json
+orders_path: examples/retail_support/orders.json
+tool_scenarios_path: examples/retail_support/tool_scenarios.json
+expected_outputs_path: examples/retail_support/expected_outputs.json
+
+system_config: configs/systems/openai_wine.yaml
+judge_config: configs/judges/default_ensemble.yaml
+
+mock_response_profile:
+  type: retail_support
+
+evaluation:
+  rubric:
+    source: task_local
+  thresholds:
+    source: rubric
+  judge_ensemble:
+    source: task_judge_config
+
+output_schema_mode: disabled
+tool_simulation_mode: disabled
+judge_mode: disabled
\ No newline at end of file
diff --git a/examples/retail_support/tool_scenarios.json b/examples/retail_support/tool_scenarios.json
new file mode 100644
index 0000000..0c6a6a3
--- /dev/null
+++ b/examples/retail_support/tool_scenarios.json
@@ -0,0 +1,46 @@
+{
+  "lookup_order": {
+    "ORD-1002": {
+      "order_id": "ORD-1002",
+      "status": "processing",
+      "can_cancel": true,
+      "days_in_status": 2
+    },
+    "ORD-1005": {
+      "order_id": "ORD-1005",
+      "status": "shipped",
+      "can_cancel": false,
+      "days_in_status": 1
+    },
+    "ORD-1008": {
+      "order_id": "ORD-1008",
+      "status": "processing",
+      "can_cancel": true,
+      "days_in_status": 5
+    },
+    "ORD-1011": {
+      "order_id": "ORD-1011",
+      "status": "delivered",
+      "can_cancel": false,
+      "days_in_status": 0
+    }
+  },
+  "cancel_order": {
+    "ORD-1002": {
+      "success": true,
+      "message": "Order cancelled successfully."
+    },
+    "ORD-1005": {
+      "success": false,
+      "message": "Order cannot be cancelled after shipment."
+    },
+    "ORD-1008": {
+      "success": true,
+      "message": "Order cancelled successfully."
+    },
+    "ORD-1011": {
+      "success": false,
+      "message": "Delivered orders cannot be cancelled."
+    }
+  }
+}
\ No newline at end of file
diff --git a/examples/wine_recommendation/README.md b/examples/wine_recommendation/README.md
new file mode 100644
index 0000000..27e0c1e
--- /dev/null
+++ b/examples/wine_recommendation/README.md
@@ -0,0 +1,19 @@
+# Wine Recommendation Example
+
+This is the current reference example for the eval engine.
+
+It demonstrates:
+- recommendation-style evaluation
+- critical gates for hard constraints
+- heuristic scoring across qualitative dimensions
+- optional LLM-as-judge analysis
+- regression comparison across runs
+
+## Files
+
+- `dataset.json` — evaluation cases for wine recommendation prompts
+- `rubric.json` — critical gates, weights, and thresholds for this task
+
+## Notes
+
+This example remains the baseline reference task while the repo evolves toward a reusable multi-example evaluation framework.
\ No newline at end of file diff --git a/dataset.json b/examples/wine_recommendation/dataset.json similarity index 100% rename from dataset.json rename to examples/wine_recommendation/dataset.json diff --git a/rubric.json b/examples/wine_recommendation/rubric.json similarity index 100% rename from rubric.json rename to examples/wine_recommendation/rubric.json diff --git a/regression_compare.py b/regression_compare.py index cee1827..ecfe49d 100644 --- a/regression_compare.py +++ b/regression_compare.py @@ -14,11 +14,9 @@ def load_json(path: str) -> Any: def extract_results(payload: Any) -> List[Dict[str, Any]]: if isinstance(payload, list): return payload - if isinstance(payload, dict): if "results" in payload and isinstance(payload["results"], list): return payload["results"] - raise TypeError("Expected results payload to be a list or a dict containing a 'results' list.") @@ -73,7 +71,7 @@ def compare_results( if missing_cases: print("ERROR: Missing cases in latest run:") for case_id in missing_cases: - print(f" - {case_id}") + print(f" - {case_id}") exit_code = 1 for case_id, base_row in baseline_map.items(): @@ -84,15 +82,14 @@ def compare_results( base_gate_pass = get_gate_pass(base_row) latest_gate_pass = get_gate_pass(latest_row) - if base_gate_pass and not latest_gate_pass: print(f"ERROR: Gate regression for {case_id}") exit_code = 1 base_score = get_weighted_score(base_row) latest_score = get_weighted_score(latest_row) - score_drop = round(base_score - latest_score, 2) + if score_drop > max_drop: print( f"ERROR: Score drop too large for {case_id} " @@ -102,8 +99,8 @@ def compare_results( base_verdict = get_verdict(base_row) latest_verdict = get_verdict(latest_row) - verdict_rank = {"FAIL": 0, "WARN": 1, "PASS": 2} + if verdict_rank.get(latest_verdict, -1) < verdict_rank.get(base_verdict, -1): print( f"ERROR: Verdict regression for {case_id} " @@ -117,10 +114,34 @@ def compare_results( return exit_code +def resolve_paths( + baseline_path: str | None, + 
latest_path: str | None, + task: str | None, +) -> tuple[str, str]: + if baseline_path and latest_path: + return baseline_path, latest_path + + if task: + resolved_baseline = baseline_path or f"baselines/{task}/baseline_results.json" + resolved_latest = latest_path or f"results/{task}/latest_results.json" + return resolved_baseline, resolved_latest + + raise ValueError( + "Provide both baseline_path and latest_path, or use --task to resolve example-scoped defaults." + ) + + def parse_args() -> argparse.Namespace: parser = argparse.ArgumentParser(description="Compare eval results against baseline.") - parser.add_argument("baseline_path", type=str, help="Path to baseline results JSON") - parser.add_argument("latest_path", type=str, help="Path to latest results JSON") + parser.add_argument("baseline_path", nargs="?", type=str, help="Path to baseline results JSON") + parser.add_argument("latest_path", nargs="?", type=str, help="Path to latest results JSON") + parser.add_argument( + "--task", + type=str, + default=None, + help="Task/example name used to resolve default paths under baselines// and results//", + ) parser.add_argument( "--max-drop", type=float, @@ -133,8 +154,14 @@ def parse_args() -> argparse.Namespace: def main() -> None: args = parse_args() - baseline_payload = load_json(args.baseline_path) - latest_payload = load_json(args.latest_path) + baseline_path, latest_path = resolve_paths( + baseline_path=args.baseline_path, + latest_path=args.latest_path, + task=args.task, + ) + + baseline_payload = load_json(baseline_path) + latest_payload = load_json(latest_path) baseline = extract_results(baseline_payload) latest = extract_results(latest_payload) diff --git a/runner.py b/runner.py index 7a8c645..cfcd65b 100644 --- a/runner.py +++ b/runner.py @@ -1,9 +1,4 @@ from __future__ import annotations -from email import parser -from dotenv import load_dotenv -import yaml - -load_dotenv() import argparse import csv @@ -11,15 +6,21 @@ import os from datetime import 
datetime, timezone from typing import Any, Dict, List +from pathlib import Path + +from dotenv import load_dotenv from confidence import compute_confidence -from system_client import build_system_client, SYSTEM_PROMPT_VERSION from judge_client import ( JUDGE_PROMPT_VERSION, judge_response_ensemble, load_judge_config, ) from scorer import evaluate_case, evalresult_to_flat_dict +from system_client import build_system_client, SYSTEM_PROMPT_VERSION +from task_loader import load_task_bundle, load_yaml + +load_dotenv() RUBRIC_VERSION = "v1.0" DATASET_VERSION = "v1.0" @@ -28,55 +29,50 @@ # ---------------------------- # Helpers # ---------------------------- - -def load_json(path: str) -> Any: - with open(path, "r", encoding="utf-8") as f: - return json.load(f) - - def ensure_dir(path: str) -> None: os.makedirs(path, exist_ok=True) -def load_yaml(path: str) -> Any: - with open(path, "r", encoding="utf-8") as f: - return yaml.safe_load(f) - -def get_task_rubric(task_config: Dict[str, Any]) -> Dict[str, Any]: - evaluation_cfg = task_config.get("evaluation", {}) - rubric_cfg = evaluation_cfg.get("rubric", {}) - - if rubric_cfg.get("source") != "task_local": - raise ValueError("Unsupported rubric source in task config.") - - rubric_path = task_config["rubric_path"] - return load_json(rubric_path) - def get_task_thresholds(task_config: Dict[str, Any], rubric: Dict[str, Any]) -> Dict[str, Any]: evaluation_cfg = task_config.get("evaluation", {}) thresholds_cfg = evaluation_cfg.get("thresholds", {}) - if thresholds_cfg.get("source") != "rubric": raise ValueError("Unsupported thresholds source in task config.") - return rubric["thresholds"] def get_task_judge_ensemble_config( - task_config: Dict[str, Any], - judge_config: Dict[str, Any], + task_config: Dict[str, Any], judge_config: Dict[str, Any] ) -> Dict[str, Any]: evaluation_cfg = task_config.get("evaluation", {}) judge_ensemble_cfg = evaluation_cfg.get("judge_ensemble", {}) - if judge_ensemble_cfg.get("source") != 
"task_judge_config": raise ValueError("Unsupported judge ensemble source in task config.") - return judge_config +def get_task_slug(task_config: Dict[str, Any], task_config_path: str) -> str: + return ( + task_config.get("name") + or task_config.get("task_name") + or Path(task_config_path).stem + ) + +def get_task_output_paths(task_slug: str) -> Dict[str, str]: + results_dir = f"results/{task_slug}" + baselines_dir = f"baselines/{task_slug}" + + return { + "results_dir": results_dir, + "baselines_dir": baselines_dir, + "latest_path": f"{results_dir}/latest_results.json", + "run_path_prefix": f"{results_dir}/run_", + "csv_path": f"{results_dir}/report.csv", + "baseline_path": f"{baselines_dir}/baseline_results.json", + } + def validate_mock_mode_support(system_config: Dict[str, Any]) -> None: mock_support = system_config.get("mock_support", {}) if not mock_support.get("enabled", False): raise ValueError("Selected system config does not support mock mode.") - + def build_result_record( eval_case: Dict[str, Any], response: str, @@ -130,10 +126,10 @@ def write_json(path: str, data: Any) -> None: with open(path, "w", encoding="utf-8") as f: json.dump(data, f, indent=2) - def write_csv(path: str, rows: List[Dict[str, Any]]) -> None: if not rows: return + keys = rows[0].keys() with open(path, "w", newline="", encoding="utf-8") as f: writer = csv.DictWriter(f, fieldnames=keys) @@ -147,6 +143,7 @@ def build_run_metadata( judge_config_path: str, ) -> Dict[str, Any]: judge_enabled = args.mode == "openai" and args.enable_judge + return { "timestamp_utc": datetime.now(timezone.utc).isoformat(), "mode": args.mode, @@ -161,9 +158,9 @@ def build_run_metadata( "task_config_path": task_config_path, "system_config_path": system_config_path, "judge_config_path": judge_config_path, + "task_config_path": task_config_path, } - def compute_judge_summary( judge_bundle: Dict[str, Any], rubric: Dict[str, Any], @@ -174,7 +171,6 @@ def compute_judge_summary( judge_individual = 
judge_bundle["individual_judges"] weights = rubric.get("weights", {}) - judge_weighted_score = round( sum(judge_scores[k] * weights.get(k, 0) for k in judge_scores), 2, @@ -186,7 +182,6 @@ def compute_judge_summary( ) max_stddev = max(judge_stddev.values()) if judge_stddev else 0.0 - if max_stddev < 0.5: agreement = "high" elif max_stddev < 0.8: @@ -203,48 +198,108 @@ def compute_judge_summary( "judge_agreement_level": agreement, } - # ---------------------------- # Mock model # ---------------------------- - def mock_system_respond(query: str, task_config: Dict[str, Any]) -> str: profile = task_config.get("mock_response_profile", {}) profile_type = profile.get("type") - q = query.lower() if profile_type == "wine_recommendation": triggers = profile.get("adversarial_triggers", []) if any(trigger.lower() in q for trigger in triggers): return """1. Château Margaux — Bordeaux, France — €120 - Elegant and complex with cassis, cedar, and fine tannins. - - 2. Barolo DOCG — Piedmont, Italy — €45 - Firm tannins with cherry, rose, and earthy notes. +Elegant and complex with cassis, cedar, and fine tannins. - 3. Rioja Reserva — Rioja, Spain — €30 - Smooth and balanced with vanilla, spice, and red fruit.""" +2. Barolo DOCG — Piedmont, Italy — €45 +Firm tannins with cherry, rose, and earthy notes. +3. Rioja Reserva — Rioja, Spain — €30 +Smooth and balanced with vanilla, spice, and red fruit.""" return """1. Bordeaux Blend — Bordeaux, France — €40 - Rich blackcurrant, oak, and spice. +Rich blackcurrant, oak, and spice. - 2. Chianti Classico — Tuscany, Italy — €25 - Cherry, herbs, and bright acidity. +2. Chianti Classico — Tuscany, Italy — €25 +Cherry, herbs, and bright acidity. - 3. Rioja Crianza — Rioja, Spain — €20 - Vanilla, red fruit, and soft tannins.""" +3. 
Rioja Crianza — Rioja, Spain — €20 +Vanilla, red fruit, and soft tannins.""" - raise ValueError(f"Unsupported mock response profile: {profile_type}") + if profile_type == "retail_support": + if "wore them outside" in q or "used the shoes outside" in q: + return ( + "The shoes are not eligible for return if they were used outdoors, " + "unless they are faulty." + ) + + if "where is my order ord-1002" in q: + return ( + "Order ORD-1002 is currently in processing status. " + "It has not shipped yet." + ) + + if "recommend 3 waterproof jackets under 150 euros" in q: + return """1. NorthTrail RainShell — €129 +Lightweight waterproof shell for hiking in wet weather. +2. NordLite PackShell — €99 +Packable waterproof jacket suitable for rain and light hiking. + +3. Harbor Softshell — €89 +Comfortable everyday jacket, but it is not fully waterproof.""" + + if "cancel my order ord-1005" in q: + return ( + "Order ORD-1005 cannot be cancelled because it has already shipped." + ) + + if "refunds usually take" in q: + return ( + "Approved refunds are usually processed within 5 to 7 business days " + "after the returned item is received and inspected." + ) + + if "order ord-1008 has been processing" in q: + return ( + "Order ORD-1008 is still processing. Since it has been in processing " + "for several days, you should contact support for a manual check." + ) + + if "budget of 100 euros" in q: + return """1. NordLite PackShell — €99 +A waterproof and packable option that fits the budget. + +2. Harbor Softshell — €89 +Budget-friendly, but not fully waterproof.""" + + if "cancel order ord-1002" in q: + return "Order ORD-1002 has been cancelled successfully." + + if "warranty do you offer on the treklite stove" in q: + return ( + "The TrekLite Stove includes a 2-year limited warranty covering " + "manufacturing defects." + ) + + if "tent would you suggest for 2 people in windy weather" in q: + return ( + "I suggest the WindRidge 2. 
It is a two-person tent designed for " + "windy conditions." + ) + + if "can order ord-1011 still be cancelled" in q: + return "Order ORD-1011 cannot be cancelled because it has already been delivered." + + return "I’m sorry, but I could not determine the correct retail support response." + + raise ValueError(f"Unsupported mock response profile: {profile_type}") # ---------------------------- # Main # ---------------------------- - def main(): parser = argparse.ArgumentParser(description="Eval Engine Runner") - parser.add_argument( "--task-config", type=str, @@ -257,22 +312,17 @@ def main(): default=None, help="Optional override path to system-under-test configuration file", ) - parser.add_argument( "--judge-config", type=str, default=None, help="Optional override path to judge ensemble configuration file", ) - parser.add_argument("--dataset", default="dataset.json") - parser.add_argument("--rubric", default="rubric.json") - parser.add_argument( "--mode", choices=["mock", "openai"], default="mock", ) - parser.add_argument("--model", type=str, default=None) parser.add_argument("--temperature", type=float, default=None) parser.add_argument( @@ -280,40 +330,46 @@ def main(): action="store_true", help="Run LLM judge ensemble (OpenAI mode only).", ) - parser.add_argument("--limit", type=int, default=None) - parser.add_argument( "--write-baseline", action="store_true", help="Write baseline to baselines/baseline_results.json", ) - args = parser.parse_args() - task_config = load_yaml(args.task_config) + task_bundle = load_task_bundle(args.task_config) + task_config = task_bundle["task_config"] + dataset = task_bundle["dataset"] + rubric = task_bundle["rubric"] + task_slug = get_task_slug(task_config, args.task_config) + paths = get_task_output_paths(task_slug) + system_config_path = args.system_config or task_config["system_config"] judge_config_path = args.judge_config or task_config["judge_config"] system_config = load_yaml(system_config_path) judge_config = 
load_judge_config(judge_config_path) task_judge_config = get_task_judge_ensemble_config(task_config, judge_config) - dataset = load_json(task_config["dataset_path"]) - rubric = get_task_rubric(task_config) thresholds = get_task_thresholds(task_config, rubric) + if args.mode == "mock": validate_mock_mode_support(system_config) if args.limit: dataset = dataset[: args.limit] - ensure_dir("results") - ensure_dir("baselines") + ensure_dir(paths["results_dir"]) + ensure_dir(paths["baselines_dir"]) - run_metadata = build_run_metadata(args, task_config_path=args.task_config, + run_metadata = build_run_metadata( + args, + task_config_path=args.task_config, system_config_path=system_config_path, - judge_config_path=judge_config_path,) + judge_config_path=judge_config_path, + ) + run_metadata["task_name"] = task_slug results: List[Dict[str, Any]] = [] system_client = None @@ -326,8 +382,8 @@ def main(): runtime_system_config = dict(system_config) runtime_system_config["model"] = effective_model runtime_system_config["temperature"] = effective_temperature - system_client = build_system_client(runtime_system_config) + for case in dataset: query = case["query"] @@ -370,30 +426,29 @@ def main(): results.append(result_record) # Write outputs - latest_path = "results/latest_results.json" + latest_path = paths["latest_path"] write_json(latest_path, {"metadata": run_metadata, "results": results}) timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - run_path = f"results/run_{timestamp}.json" + run_path = f'{paths["run_path_prefix"]}{timestamp}.json' write_json(run_path, {"metadata": run_metadata, "results": results}) - csv_path = "results/report.csv" + csv_path = paths["csv_path"] write_csv(csv_path, results) - print(f"\nSaved results:") + print("\nSaved results:") print(f" - {latest_path}") print(f" - {run_path}") print(f" - {csv_path}") # Baseline if args.write_baseline: - baseline_path = "baselines/baseline_results.json" + baseline_path = paths["baseline_path"] write_json( 
baseline_path, {"metadata": run_metadata, "results": results}, ) print(f"\nBaseline written to {baseline_path}") - if __name__ == "__main__": main() \ No newline at end of file diff --git a/schemas.py b/schemas.py new file mode 100644 index 0000000..0ce13c2 --- /dev/null +++ b/schemas.py @@ -0,0 +1,147 @@ +from __future__ import annotations + +from typing import Any, Dict, List + + +SUPPORT_ANSWER_SCHEMA: Dict[str, Any] = { + "type": "object", + "required": ["answer", "needs_human", "citations"], + "properties": { + "answer": {"type": "string"}, + "needs_human": {"type": "boolean"}, + "citations": { + "type": "array", + "items": {"type": "string"}, + }, + }, +} + +RECOMMENDATION_ANSWER_SCHEMA: Dict[str, Any] = { + "type": "object", + "required": ["answer", "recommendations"], + "properties": { + "answer": {"type": "string"}, + "recommendations": { + "type": "array", + "items": { + "type": "object", + "required": ["name"], + "properties": { + "name": {"type": "string"}, + "price_eur": {"type": ["number", "null"]}, + "reason": {"type": ["string", "null"]}, + }, + }, + }, + }, +} + +ACTION_RESULT_SCHEMA: Dict[str, Any] = { + "type": "object", + "required": ["answer", "action_taken", "action_success"], + "properties": { + "answer": {"type": "string"}, + "action_taken": {"type": "string"}, + "action_success": {"type": "boolean"}, + }, +} + +SCHEMA_REGISTRY: Dict[str, Dict[str, Any]] = { + "support_answer": SUPPORT_ANSWER_SCHEMA, + "recommendation_answer": RECOMMENDATION_ANSWER_SCHEMA, + "action_result": ACTION_RESULT_SCHEMA, +} + + +def get_schema(schema_name: str) -> Dict[str, Any] | None: + return SCHEMA_REGISTRY.get(schema_name) + + +def validate_structured_output(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]: + errors: List[str] = [] + + if schema.get("type") != "object": + return ["unsupported_schema_type"] + + if not isinstance(payload, dict): + return ["payload_not_object"] + + required = schema.get("required", []) + properties = 
schema.get("properties", {}) + + for field in required: + if field not in payload: + errors.append(f"missing_required_field:{field}") + + for field, rules in properties.items(): + if field not in payload: + continue + + value = payload[field] + allowed_types = rules.get("type") + + if allowed_types is not None: + if not _matches_type(value, allowed_types): + errors.append(f"invalid_type:{field}") + + if allowed_types == "array" and "items" in rules: + if isinstance(value, list): + item_rules = rules["items"] + for i, item in enumerate(value): + item_errors = _validate_item(item, item_rules) + for err in item_errors: + errors.append(f"{field}[{i}].{err}") + + return errors + + +def _validate_item(value: Any, rules: Dict[str, Any]) -> List[str]: + errors: List[str] = [] + item_type = rules.get("type") + + if item_type and not _matches_type(value, item_type): + errors.append("invalid_type") + return errors + + if item_type == "object": + required = rules.get("required", []) + properties = rules.get("properties", {}) + + if not isinstance(value, dict): + errors.append("not_object") + return errors + + for field in required: + if field not in value: + errors.append(f"missing_required_field:{field}") + + for field, field_rules in properties.items(): + if field not in value: + continue + if not _matches_type(value[field], field_rules.get("type")): + errors.append(f"invalid_type:{field}") + + return errors + + +def _matches_type(value: Any, expected_type: Any) -> bool: + if expected_type is None: + return True + + if isinstance(expected_type, list): + return any(_matches_type(value, t) for t in expected_type) + + if expected_type == "string": + return isinstance(value, str) + if expected_type == "boolean": + return isinstance(value, bool) + if expected_type == "number": + return isinstance(value, (int, float)) and not isinstance(value, bool) + if expected_type == "object": + return isinstance(value, dict) + if expected_type == "array": + return isinstance(value, list) + if 
expected_type == "null": + return value is None + + return False \ No newline at end of file diff --git a/scorer.py b/scorer.py index ef9b008..2bb8d1b 100644 --- a/scorer.py +++ b/scorer.py @@ -5,52 +5,71 @@ from typing import Any, Dict, List -DIMENSIONS = [ +WINE_DIMENSIONS = [ "tasting_clarity", "popularity_alignment", "regional_diversity", "language_tone", ] +RETAIL_DIMENSIONS = [ + "instruction_match", + "grounding_accuracy", + "tool_use_correctness", + "resolution_helpfulness", + "tone_clarity", +] + # ---------------------------- # Data structure # ---------------------------- - @dataclass class EvalResult: gate_pass: bool gate_reasons: List[str] - - tasting_clarity: float - popularity_alignment: float - regional_diversity: float - language_tone: float - + tasting_clarity: float | None + popularity_alignment: float | None + regional_diversity: float | None + language_tone: float | None weighted_score: float verdict: str - notes: Dict[str, Any] + scores: Dict[str, float] # ---------------------------- -# Parsing helpers +# Generic helpers # ---------------------------- +def infer_task_type(rubric: Dict[str, Any]) -> str: + weights = rubric.get("weights", {}) + if "instruction_match" in weights: + return "retail_support" + return "wine_recommendation" + + +def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float: + return round(sum(scores.get(k, 0.0) * weights.get(k, 0.0) for k in scores), 2) + + +def build_verdict(gate_pass: bool, weighted: float, thresholds: Dict[str, Any]) -> str: + if not gate_pass: + return "FAIL" + if weighted >= thresholds.get("pass", 4.0): + return "PASS" + if weighted >= thresholds.get("warn", 3.0): + return "WARN" + return "FAIL" + +# ---------------------------- +# Wine parsing helpers +# ---------------------------- def extract_items(text: str) -> List[str]: - """ - Extract numbered list items from response. - Expected format: - 1. Wine ... - 2. Wine ... 
- """ return re.findall(r"\d+\.\s(.+?)(?=\n\d+\.|\Z)", text, re.S) def extract_prices(items: List[str]) -> List[float]: - """ - Extract euro prices from each item. - """ prices = [] for it in items: m = re.search(r"€\s*(\d+(?:\.\d+)?)", it) @@ -60,9 +79,6 @@ def extract_prices(items: List[str]) -> List[float]: def distinct_regions(items: List[str]) -> int: - """ - Estimate distinct regions using text between dashes. - """ regions = set() for it in items: parts = re.split(r"—|-", it) @@ -72,9 +88,6 @@ def distinct_regions(items: List[str]) -> int: def contains_white_signal(text: str) -> bool: - """ - Detect obvious white wine indicators. - """ return bool( re.search( r"\b(chardonnay|sauvignon blanc|riesling|pinot grigio)\b", @@ -85,16 +98,13 @@ def contains_white_signal(text: str) -> bool: # ---------------------------- -# Scoring helpers +# Wine scoring helpers # ---------------------------- - def score_tasting_clarity(text: str) -> float: - """ - Score based on presence of tasting vocabulary. - """ words = re.findall(r"\b\w+\b", text.lower()) hits = sum( - w in { + w + in { "tannin", "tannins", "acidity", @@ -126,9 +136,6 @@ def score_tasting_clarity(text: str) -> float: def score_popularity_alignment(text: str) -> float: - """ - Proxy using presence of well-known regions. - """ known = [ "bordeaux", "burgundy", @@ -151,9 +158,6 @@ def score_popularity_alignment(text: str) -> float: def score_regional_diversity(n_regions: int) -> float: - """ - Score diversity based on distinct regions. - """ if n_regions >= 3: return 5 if n_regions == 2: @@ -162,12 +166,9 @@ def score_regional_diversity(n_regions: int) -> float: def score_language_tone(text: str) -> float: - """ - Rough proxy for tone quality based on sentence richness. 
- """ sentences = re.split(r"[.!?]+", text) avg_len = sum(len(s.split()) for s in sentences if s.strip()) / max( - 1, len(sentences) + 1, len([s for s in sentences if s.strip()]) ) if avg_len >= 12: return 5 @@ -180,19 +181,7 @@ def score_language_tone(text: str) -> float: return 1 -# ---------------------------- -# Core evaluation -# ---------------------------- - -def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult: - """ - Run deterministic evaluation: - - critical gates - - heuristic scoring - - weighted aggregation - - verdict assignment - """ - +def evaluate_wine_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult: items = extract_items(response) prices = extract_prices(items) n_regions = distinct_regions(items) @@ -216,50 +205,285 @@ def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResu gate_pass = False reasons.append("non_red_wine_detected") - t = score_tasting_clarity(response) - p = score_popularity_alignment(response) - r = score_regional_diversity(n_regions) - l = score_language_tone(response) - - weighted = round( - t * weights.get("tasting_clarity", 0) - + p * weights.get("popularity_alignment", 0) - + r * weights.get("regional_diversity", 0) - + l * weights.get("language_tone", 0), - 2, - ) + scores = { + "tasting_clarity": score_tasting_clarity(response), + "popularity_alignment": score_popularity_alignment(response), + "regional_diversity": score_regional_diversity(n_regions), + "language_tone": score_language_tone(response), + } - if not gate_pass: - verdict = "FAIL" - elif weighted >= thresholds.get("pass", 3.5): - verdict = "PASS" - elif weighted >= thresholds.get("warn", 3.0): - verdict = "WARN" - else: - verdict = "FAIL" + weighted = weighted_sum(scores, weights) + verdict = build_verdict(gate_pass, weighted, thresholds) return EvalResult( gate_pass=gate_pass, gate_reasons=reasons, - tasting_clarity=t, - popularity_alignment=p, - regional_diversity=r, - 
language_tone=l, + tasting_clarity=scores["tasting_clarity"], + popularity_alignment=scores["popularity_alignment"], + regional_diversity=scores["regional_diversity"], + language_tone=scores["language_tone"], weighted_score=weighted, verdict=verdict, notes={ + "task_type": "wine_recommendation", "item_count": len(items), "price_count": len(prices), "region_count": n_regions, }, + scores=scores, + ) + + +# ---------------------------- +# Retail helpers +# ---------------------------- +def normalize_text(text: str) -> str: + return re.sub(r"\s+", " ", text.strip().lower()) + + +def contains_any(text: str, phrases: List[str]) -> bool: + text_n = normalize_text(text) + return any(p.lower() in text_n for p in phrases) + + +def count_hits(text: str, phrases: List[str]) -> int: + text_n = normalize_text(text) + return sum(1 for p in phrases if p.lower() in text_n) + + +def extract_budget(query: str) -> float | None: + patterns = [ + r"under\s+(\d+(?:\.\d+)?)\s*euros?", + r"budget of\s+(\d+(?:\.\d+)?)\s*euros?", + r"(\d+(?:\.\d+)?)\s*euros?", + ] + q = query.lower() + for pattern in patterns: + match = re.search(pattern, q) + if match: + return float(match.group(1)) + return None + + +def price_mentions(text: str) -> List[float]: + matches = re.findall(r"€\s*(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?)\s*euros?", text.lower()) + values = [] + for m in matches: + raw = m[0] or m[1] + if raw: + values.append(float(raw)) + return values + + +def detect_recommendation_item_count(text: str) -> int: + numbered = extract_items(text) + if numbered: + return len(numbered) + + lines = [line.strip() for line in text.splitlines() if line.strip()] + bullet_lines = [line for line in lines if re.match(r"^[-*•]", line)] + return len(bullet_lines) + + +def score_instruction_match_retail(query: str, response: str) -> float: + q = query.lower() + score = 3.0 + + if "recommend" in q: + item_count = detect_recommendation_item_count(response) + if item_count >= 3: + score += 1.0 + elif item_count == 0: + 
score -= 1.5 + + if "where is my order" in q or "cancel" in q or "order " in q: + order_match = re.search(r"\bORD-\d+\b", query, re.I) + if order_match and order_match.group(0).lower() in response.lower(): + score += 1.0 + elif order_match: + score -= 1.0 + + if "refund" in q and contains_any(response, ["refund", "business days"]): + score += 0.5 + + return max(1.0, min(5.0, score)) + + +def score_grounding_accuracy_retail(query: str, response: str) -> float: + q = query.lower() + score = 2.5 + + if "wore" in q and "outside" in q: + if contains_any(response, ["not eligible", "unless faulty", "used outdoors"]): + score += 2.0 + + if "refund" in q: + if contains_any(response, ["5 to 7 business days", "5-7 business days"]): + score += 2.0 + + if "warranty" in q and "treklite stove" in q: + if contains_any(response, ["2-year", "manufacturing defects", "limited warranty"]): + score += 2.0 + + if "cancel" in q: + if contains_any(response, ["cannot be cancelled", "can be cancelled", "before shipment", "after shipment"]): + score += 1.5 + + if "processing for several days" in q: + if contains_any(response, ["contact support", "manual check", "3 business days"]): + score += 1.5 + + return max(1.0, min(5.0, score)) + + +def score_tool_use_correctness_retail(query: str, response: str) -> float: + q = query.lower() + + if "order " in q or "cancel" in q: + if contains_any(response, ["ord-", "status", "processing", "shipped", "cancelled", "cannot be cancelled"]): + return 4.0 + return 2.0 + + if "recommend" in q or "suggest" in q: + if detect_recommendation_item_count(response) >= 1: + return 4.0 + return 2.0 + + return 3.0 + + +def score_resolution_helpfulness_retail(query: str, response: str) -> float: + score = 2.5 + + helpful_terms = [ + "you can", + "you cannot", + "i recommend", + "contact support", + "manual check", + "under", + "waterproof", + "refund", + "warranty", + ] + hits = count_hits(response, helpful_terms) + + if hits >= 4: + score = 5.0 + elif hits >= 3: + 
score = 4.0 + elif hits >= 2: + score = 3.5 + elif hits >= 1: + score = 3.0 + + return max(1.0, min(5.0, score)) + + +def score_tone_clarity_retail(text: str) -> float: + sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()] + if not sentences: + return 1.0 + + avg_len = sum(len(s.split()) for s in sentences) / len(sentences) + + if avg_len >= 18: + return 5.0 + if avg_len >= 12: + return 4.0 + if avg_len >= 8: + return 3.0 + if avg_len >= 5: + return 2.0 + return 1.0 + + +def evaluate_retail_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult: + weights = rubric.get("weights", {}) + thresholds = rubric.get("thresholds", {}) + gate_names = rubric.get("critical_gates", []) + + gate_pass = True + reasons: List[str] = [] + + q = query.lower() + r = response.lower() + + if "no_false_action_claim" in gate_names: + if "cancel" in q and contains_any(r, ["cancelled successfully", "order cancelled"]) and contains_any( + q, ["ord-1005", "ord-1011"] + ): + gate_pass = False + reasons.append("false_action_claim") + + if "respect_constraints" in gate_names: + budget = extract_budget(query) + mentioned_prices = price_mentions(response) + if budget is not None and any(price > budget for price in mentioned_prices): + gate_pass = False + reasons.append("constraint_violation") + + if "no_policy_hallucination" in gate_names: + if "refund" in q and contains_any(r, ["same day", "24 hours", "instant refund"]): + gate_pass = False + reasons.append("policy_hallucination") + + if "no_invented_order_status" in gate_names: + if "where is my order" in q or "order " in q: + if contains_any(r, ["delivered yesterday", "out for delivery"]) and contains_any( + q, ["ord-1002", "ord-1008"] + ): + gate_pass = False + reasons.append("invented_order_status") + + if "valid_output_schema" in gate_names: + if not response.strip(): + gate_pass = False + reasons.append("empty_response") + + scores = { + "instruction_match": score_instruction_match_retail(query, response), 
+ "grounding_accuracy": score_grounding_accuracy_retail(query, response), + "tool_use_correctness": score_tool_use_correctness_retail(query, response), + "resolution_helpfulness": score_resolution_helpfulness_retail(query, response), + "tone_clarity": score_tone_clarity_retail(response), + } + + weighted = weighted_sum(scores, weights) + verdict = build_verdict(gate_pass, weighted, thresholds) + + return EvalResult( + gate_pass=gate_pass, + gate_reasons=reasons, + tasting_clarity=None, + popularity_alignment=None, + regional_diversity=None, + language_tone=None, + weighted_score=weighted, + verdict=verdict, + notes={ + "task_type": "retail_support", + "recommendation_item_count": detect_recommendation_item_count(response), + "mentioned_price_count": len(price_mentions(response)), + }, + scores=scores, ) +# ---------------------------- +# Core evaluation +# ---------------------------- +def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult: + task_type = infer_task_type(rubric) + + if task_type == "retail_support": + return evaluate_retail_case(query, response, rubric) + + return evaluate_wine_case(query, response, rubric) + + def evalresult_to_flat_dict(res: EvalResult) -> Dict[str, Any]: - """ - Convert EvalResult to flat dict for JSON/CSV output. 
- """ - return { + flat = { "gate_pass": res.gate_pass, "gate_reasons": ",".join(res.gate_reasons), "tasting_clarity": res.tasting_clarity, @@ -269,4 +493,9 @@ def evalresult_to_flat_dict(res: EvalResult) -> Dict[str, Any]: "weighted_score": res.weighted_score, "verdict": res.verdict, **res.notes, - } \ No newline at end of file + } + + for key, value in res.scores.items(): + flat[key] = value + + return flat \ No newline at end of file diff --git a/task_loader.py b/task_loader.py new file mode 100644 index 0000000..3c86545 --- /dev/null +++ b/task_loader.py @@ -0,0 +1,46 @@ +from __future__ import annotations + +import json +from pathlib import Path +from typing import Any, Dict + +import yaml + + +def load_json(path: str | Path) -> Any: + with open(path, "r", encoding="utf-8") as f: + return json.load(f) + + +def load_yaml(path: str | Path) -> Any: + with open(path, "r", encoding="utf-8") as f: + return yaml.safe_load(f) + + +def _optional_json(path: str | Path | None) -> Any | None: + if not path: + return None + + p = Path(path) + if not p.exists(): + return None + + return load_json(p) + + +def load_task_bundle(task_config_path: str) -> Dict[str, Any]: + task_config = load_yaml(task_config_path) + + bundle: Dict[str, Any] = { + "task_config_path": task_config_path, + "task_config": task_config, + "dataset": load_json(task_config["dataset_path"]), + "rubric": load_json(task_config["rubric_path"]), + "knowledge_base": _optional_json(task_config.get("knowledge_base_path")), + "catalog": _optional_json(task_config.get("catalog_path")), + "orders": _optional_json(task_config.get("orders_path")), + "tool_scenarios": _optional_json(task_config.get("tool_scenarios_path")), + "expected_outputs": _optional_json(task_config.get("expected_outputs_path")), + } + + return bundle \ No newline at end of file diff --git a/tool_simulator.py b/tool_simulator.py new file mode 100644 index 0000000..a301b94 --- /dev/null +++ b/tool_simulator.py @@ -0,0 +1,54 @@ +from __future__ 
import annotations + +from typing import Any, Dict, List + + +class ToolSimulator: + def __init__(self, tool_scenarios: Dict[str, Any] | None = None): + self.tool_scenarios = tool_scenarios or {} + + def run(self, tool_name: str, input_payload: Dict[str, Any]) -> Dict[str, Any]: + if tool_name not in self.tool_scenarios: + return { + "error": f"Unknown tool: {tool_name}", + "success": False, + } + + tool_data = self.tool_scenarios[tool_name] + + # Simple key-based lookup (order_id, etc.) + key = list(input_payload.values())[0] if input_payload else None + + if key in tool_data: + return { + "success": True, + "data": tool_data[key], + } + + return { + "success": False, + "error": f"No data found for input: {input_payload}", + } + + +def simulate_tool_sequence( + simulator: ToolSimulator, + steps: List[Dict[str, Any]], +) -> List[Dict[str, Any]]: + trace = [] + + for step in steps: + tool = step["tool"] + input_payload = step.get("input", {}) + + output = simulator.run(tool, input_payload) + + trace.append( + { + "tool": tool, + "input": input_payload, + "output": output, + } + ) + + return trace \ No newline at end of file
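A minimal usage sketch for the `ToolSimulator` and `simulate_tool_sequence` added in `tool_simulator.py` above. It assumes `tool_scenarios` maps each tool name to a dict keyed by the lookup value (for example an order ID), which is what the key-based lookup in `run` suggests. The `get_order_status` and `cancel_order` tool names and the scenario data are hypothetical and not taken from the repository. The class and function are condensed from the diff (using `Optional` instead of `| None`) so the snippet runs on its own.

```python
from typing import Any, Dict, List, Optional


class ToolSimulator:
    """Condensed copy of the class from tool_simulator.py in this diff."""

    def __init__(self, tool_scenarios: Optional[Dict[str, Any]] = None):
        self.tool_scenarios = tool_scenarios or {}

    def run(self, tool_name: str, input_payload: Dict[str, Any]) -> Dict[str, Any]:
        if tool_name not in self.tool_scenarios:
            return {"error": f"Unknown tool: {tool_name}", "success": False}

        tool_data = self.tool_scenarios[tool_name]

        # Only the first value in the payload is used as the lookup key.
        key = list(input_payload.values())[0] if input_payload else None

        if key in tool_data:
            return {"success": True, "data": tool_data[key]}

        return {
            "success": False,
            "error": f"No data found for input: {input_payload}",
        }


def simulate_tool_sequence(
    simulator: ToolSimulator,
    steps: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    trace = []
    for step in steps:
        tool = step["tool"]
        input_payload = step.get("input", {})
        output = simulator.run(tool, input_payload)
        trace.append({"tool": tool, "input": input_payload, "output": output})
    return trace


# Hypothetical scenario data: tool name -> lookup key -> canned result.
scenarios = {
    "get_order_status": {
        "ORD-1002": {"status": "processing"},
        "ORD-1005": {"status": "shipped"},
    }
}

# Hypothetical steps: one known order, one unknown order, one unknown tool.
steps = [
    {"tool": "get_order_status", "input": {"order_id": "ORD-1002"}},
    {"tool": "get_order_status", "input": {"order_id": "ORD-9999"}},
    {"tool": "cancel_order", "input": {"order_id": "ORD-1005"}},
]

trace = simulate_tool_sequence(ToolSimulator(scenarios), steps)
for entry in trace:
    print(entry["tool"], entry["input"], entry["output"]["success"])
```

Because `run` takes only the first payload value as the key, a payload with several fields depends on dict insertion order. Passing single-field payloads, as in this sketch, sidesteps that.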