diff --git a/README.md b/README.md
index 52035ad..57f6d38 100644
--- a/README.md
+++ b/README.md
@@ -1,45 +1,20 @@
-# Eval Engine Learning Lab
+# Eval Engine
 
-A minimal, practical evaluation framework for LLM systems and AI application flows.
+A lightweight evaluation framework for LLM systems combining:
 
-This project started as a learning lab for evaluating **non-deterministic AI outputs** using a layered approach, and is now being refactored toward a **boilerplate-friendly eval engine**.
-
-It combines:
-- deterministic rules (critical gates)
-- heuristic scoring
-- LLM-as-judge (analytical signal)
-- regression comparison
+- deterministic evaluation (critical gates + heuristics)
+- probabilistic evaluation (LLM-as-judge)
+- regression detection across runs
 
 ---
 
-## Why this exists
-
-Traditional QA assumes deterministic outputs and exact expected results.
-
-LLM systems and AI applications introduce:
-- variable responses
-- qualitative output quality
-- probabilistic reasoning
-- formatting inconsistency
-- partial subjectivity in evaluation
-
-This repo demonstrates a practical evaluation architecture to handle that while preserving deterministic control over PASS/FAIL.
-
----
+## Core Idea
 
-## Core principle
+LLM systems are non-deterministic.
 
-> Heuristics decide PASS/FAIL. Judges provide analytical insight.
+Exact-match pass/fail testing alone is not enough to evaluate them.
 
-This separation is intentional.
-
-- **Deterministic evaluation** remains the source of truth
-- **LLM judges** help analyze quality, disagreement, and rubric clarity
-- **Regression comparison** tracks stability across runs
-
----
-
-## Architecture
+This project explores a layered evaluation approach:
 
 ```text
 Dataset
@@ -54,290 +29,119 @@ Heuristic Scoring (deterministic)
 ↓
 Regression Comparison
 ```
-
 ---
-## Boilerplate direction
 
-The framework is being refactored so that the evaluated system is no longer assumed to be just a single hardcoded LLM prompt.
+## Example Packs
 
-The intended boilerplate model is:
-	•	task config → defines the evaluation task, dataset, rubric, and evaluation behavior
-	•	system config → defines the system under test
-	•	judge config → defines judge roles, prompts, and rubric anchors
+The project now supports **multiple evaluation domains** via self-contained example packs.
 
-This makes the framework reusable for:
-	•	recommendation tasks
-	•	support assistant evaluation
-	•	agent workflow evaluation
-	•	retrieval or RAG response evaluation
-	•	other structured AI output evaluation tasks
+### 1. Wine Recommendation (Reference Example)
 
-For extension guidance, see: `docs/boilerplate_extension_guide.md`
-⸻
+- recommendation-style evaluation
+- structured list outputs
+- qualitative scoring (taste, tone, diversity)
+- baseline reference task
 
-## Current reference example
+`examples/wine_recommendation/`
 
-The current example task is:
+---
 
-Wine recommendation evaluation
+### 2. Retail Support (Multi-purpose Example)
 
-This remains the reference task because it demonstrates:
-	•	hard constraints
-	•	qualitative scoring
-	•	useful judge disagreement
-	•	adversarial cases
-	•	regression comparison
+Demonstrates:
 
-Wine is the example task, not the long-term framework identity.
+- recommendation tasks
+- support assistant evaluation
+- retrieval-grounded responses (RAG-style)
+- simple agent workflows (mock tools)
+- structured output expectations
 
-⸻
+`examples/retail_support/`
+
+---
 
-## Repo structure
+## Project Structure
 
 ```text
 .
+│
+├── examples/
+│   ├── wine_recommendation/
+│   └── retail_support/
+│
 ├── configs/
 │   ├── tasks/
-│   │   └── wine.yaml
 │   ├── systems/
-│   │   └── openai_wine.yaml
 │   └── judges/
-│       └── default_ensemble.yaml
 │
-├── dataset.json
-├── rubric.json
+├── results/<task_name>/
+├── baselines/<task_name>/
+│
 ├── runner.py
 ├── scorer.py
-├── confidence.py
-├── system_client.py
-├── judge_client.py
+├── task_loader.py
+├── tool_simulator.py
+├── schemas.py
 ├── regression_compare.py
-│
-├── prompts/
-│   ├── system/
-│   │   └── wine_generator.txt
-│   └── judges/
-│       ├── base_prompt.txt
-│       ├── role_balanced.txt
-│       ├── role_strict.txt
-│       └── role_usefulness.txt
-│
-├── baselines/
-├── results/
-├── tests/
-├── examples/
-│
-├── README.md
-├── requirements.txt
-├── .env.example
-├── .gitignore
-└── LICENSE
-
 ```
+---
 
-## Configuration model
-
-1. Task config
-
-Defines:
-	•	dataset path
-	•	rubric path
-	•	evaluation wiring
-	•	mock response profile
-	•	selected judge ensemble config
-
-Example:
-	•	configs/tasks/wine.yaml
-
-2. System config
-
-Defines the system under test:
-	•	provider
-	•	model
-	•	temperature
-	•	mock mode support
-
-Example:
-	•	configs/systems/openai_wine.yaml
-
-3. Judge config
-
-Defines judge behavior:
-	•	model
-	•	temperature
-	•	role definitions
-	•	base prompt
-	•	rubric anchors
-
-Example:
-	•	configs/judges/default_ensemble.yaml
-
-
-## Quickstart
-
-### 1. Setup
+## Running Evaluations
 
+### Wine example
 ```bash
-python3 -m venv .venv
-source .venv/bin/activate
-pip install -r requirements.txt
-cp .env.example .env
+python3 runner.py --task-config configs/tasks/wine.yaml --mode mock
 ```
-
-Add your OpenAI key:
-```env
-OPENAI_API_KEY=your_key_here
-```
-⸻
-
-### Run examples
-
-Mock (no API required)
+### Retail support example
 ```bash
-python3 runner.py --mode mock
+python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock
 ```
-⸻
+---
 
-### OpenAI generation
+## Writing Baselines
 ```bash
-python3 runner.py --mode openai --model gpt-4o-mini --temperature 0.0
+python3 runner.py --task-config configs/tasks/wine.yaml --mode mock --write-baseline
+python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock --write-baseline
 ```
-⸻
+---
 
-### OpenAI + judge ensemble
-```bash
-python3 runner.py --mode openai --enable-judge
-```
-⸻
+## Regression Comparison
 
-### Write baseline
+Using explicit paths:
 ```bash
-python3 runner.py --mode openai --enable-judge --write-baseline
+python3 regression_compare.py baselines/wine_recommendation/baseline_results.json results/wine_recommendation/latest_results.json
 ```
-⸻
-
-### Use explicit config paths
+Using the `--task` shortcut, which resolves both paths from the task name:
 ```bash
-python3 runner.py \
-  --mode openai \
-  --task-config configs/tasks/wine.yaml \
-  --system-config configs/systems/openai_wine.yaml \
-  --judge-config configs/judges/default_ensemble.yaml \
-  --enable-judge
+python3 regression_compare.py --task wine_recommendation
+python3 regression_compare.py --task retail_support
 ```
-⸻
-
-## Output artifacts
-
-Each run generates:
-•	results/latest_results.json → latest evaluation snapshot
-•	results/run_<timestamp>.json → historical run
-•	results/report.csv → flat analysis table
-
-Each result includes:
-•	generator output
-•	heuristic evaluation
-•	optional judge evaluation
-•	run metadata
-•	compatibility fields for regression comparison
-
----
-## Evaluation layers
-
-### Critical gates
-
-Hard constraints that must not be violated.
-
-Examples in the wine task:
-	•	exact item count
-	•	red wine constraint
-	•	max price
-
-### Heuristic scoring
-
-Deterministic scoring across defined dimensions.
-
-Current wine rubric dimensions:
-	•	tasting_clarity
-	•	popularity_alignment
-	•	regional_diversity
-	•	language_tone
-
-### Judge ensemble
-
-LLM judges provide analytical signal only.
-
-Current roles:
-	•	balanced evaluator
-	•	strict evaluator
-	•	usefulness evaluator
-
-Judges help surface:
-	•	disagreement
-	•	ambiguity
-	•	rubric weaknesses
-	•	heuristic blind spots
 
-⸻
+---
 
-## Regression comparison
-Use regression comparison to check whether a latest run regressed against a baseline.
-
-Example:
-```bash
-python3 regression_compare.py baselines/baseline_results.json results/latest_results.json --max-drop 0.3
-```
-The comparator supports both:
-	•	legacy flat result fields
-	•	standardized nested evaluation result structure
-
----
-
-## Design principles
-
-•	deterministic logic governs decisions
-•	judges are analytical, not operational
-•	disagreement is useful
-•	evaluation requires engineering discipline
-•	simplicity is intentional
-
----
-
-## Current reference example
-
-The current example task is:
+## What This Project Demonstrates
+- How to evaluate LLM outputs beyond simple correctness
+- How to combine heuristics and LLM judges
+- How to detect regressions in non-deterministic systems
+- How to design evaluation datasets and rubrics
+- How to structure reusable evaluation tasks
 
-**Wine recommendation evaluation**
-
-This remains the reference task because it demonstrates:
-- hard constraints
-- qualitative scoring
-- useful judge disagreement
-- adversarial cases
-- regression comparison
+---
 
-Wine is the example task, not the long-term framework identity.
+## Status
 
-See also: `docs/examples/wine_recommendation.md`
----
+This is a V1 learning lab project.
 
-## Roadmap direction
+Focus:
+- clarity over completeness
+- simplicity over abstraction
+- experimentation over production design
 
-Near-term boilerplate refactor goals:
-	•	keep wine as the example task
-	•	make task/system/judge configuration first-class
-	•	support reusable evaluation structure across domains
-	•	preserve regression safety and offline mock mode
+---
 
-Longer-term exploration:
+## Future Directions
-	•	judge disagreement visualization
+- judge disagreement visualization
-	•	judge reliability experiments
-	•	cross-model comparison
-	•	rubric refinement
-	•	evaluation analytics
-
-For current refactor status, see: `docs/boilerplate_v1_status.md`
-
-## License
-
-MIT
\ No newline at end of file
+- evaluation analytics across runs
+- cross-model judge comparison
+- richer agent workflow evaluation
+- dashboard / visualization layer
\ No newline at end of file
diff --git a/baselines/retail_support/.gitkeep b/baselines/retail_support/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/baselines/retail_support/README.md b/baselines/retail_support/README.md
new file mode 100644
index 0000000..dc6c8c7
--- /dev/null
+++ b/baselines/retail_support/README.md
@@ -0,0 +1,7 @@
+# Retail Support Baselines
+
+This folder stores baseline result snapshots for the retail support example.
+
+Suggested naming for extra snapshots (the runner and the `--task` shortcut default to `baseline_results.json`):
+- `baseline_results_mock.json`
+- `baseline_results_openai_<model>.json`
\ No newline at end of file
diff --git a/baselines/wine_recommendation/.gitkeep b/baselines/wine_recommendation/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/baselines/wine_recommendation/README.md b/baselines/wine_recommendation/README.md
new file mode 100644
index 0000000..465301d
--- /dev/null
+++ b/baselines/wine_recommendation/README.md
@@ -0,0 +1,7 @@
+# Wine Recommendation Baselines
+
+This folder stores baseline result snapshots for the wine recommendation example.
+
+Suggested naming for extra snapshots (the runner and the `--task` shortcut default to `baseline_results.json`):
+- `baseline_results_mock.json`
+- `baseline_results_openai_<model>.json`
\ No newline at end of file
diff --git a/configs/tasks/wine.yaml b/configs/tasks/wine.yaml
index 300e20f..56d87a6 100644
--- a/configs/tasks/wine.yaml
+++ b/configs/tasks/wine.yaml
@@ -1,8 +1,6 @@
 name: wine_recommendation
-
-dataset_path: dataset.json
-rubric_path: rubric.json
-
+dataset_path: examples/wine_recommendation/dataset.json
+rubric_path: examples/wine_recommendation/rubric.json
 description: "Wine recommendation evaluation task (V1 reference task)"
 
 system_config: configs/systems/openai_wine.yaml
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000..e69de29
diff --git a/examples/retail_support/README.md b/examples/retail_support/README.md
new file mode 100644
index 0000000..20afd94
--- /dev/null
+++ b/examples/retail_support/README.md
@@ -0,0 +1,37 @@
+# Retail Support Example
+
+This example demonstrates a multi-purpose evaluation setup for:
+
+- recommendation tasks
+- customer support responses
+- retrieval-grounded answers (RAG-style)
+- simple agent workflows (tool usage simulation)
+- structured output validation
+
+## Purpose
+
+This example is designed to showcase the flexibility of the eval engine beyond a single domain.
+
+It combines multiple real-world assistant behaviors into a single evaluation pack.
+
+## Task Categories
+
+- recommendation
+- support_policy
+- order_support
+- agent_workflow
+- retrieval_grounded
+
+## Data Files
+
+- `dataset.json` — evaluation cases across multiple categories
+- `rubric.json` — scoring logic and critical gates
+- `knowledge_base.json` — policy and support documents
+- `catalog.json` — product data for recommendations
+- `orders.json` — mock order data
+- `tool_scenarios.json` — simulated tool responses
+- `expected_outputs.json` — deterministic evaluation hints
+
+## Status
+
+Initial scaffold only. Logic will be wired in upcoming commits.
\ No newline at end of file
diff --git a/examples/retail_support/catalog.json b/examples/retail_support/catalog.json
new file mode 100644
index 0000000..b1f9a31
--- /dev/null
+++ b/examples/retail_support/catalog.json
@@ -0,0 +1,56 @@
+[
+  {
+    "sku": "JKT_001",
+    "name": "NorthTrail RainShell",
+    "category": "jacket",
+    "price_eur": 129,
+    "waterproof": true,
+    "use_case": ["hiking", "commute", "rain"],
+    "notes": "Lightweight waterproof shell suited for wet and windy conditions."
+  },
+  {
+    "sku": "JKT_002",
+    "name": "Fjord Trek Pro",
+    "category": "jacket",
+    "price_eur": 179,
+    "waterproof": true,
+    "use_case": ["hiking", "mountain"],
+    "notes": "Durable trekking shell with stronger weather protection but above tighter budget ranges."
+  },
+  {
+    "sku": "JKT_003",
+    "name": "NordLite PackShell",
+    "category": "jacket",
+    "price_eur": 99,
+    "waterproof": true,
+    "use_case": ["hiking", "travel"],
+    "notes": "Packable waterproof jacket for occasional hikes and everyday use."
+  },
+  {
+    "sku": "JKT_004",
+    "name": "Harbor Softshell",
+    "category": "jacket",
+    "price_eur": 89,
+    "waterproof": false,
+    "use_case": ["commute", "casual"],
+    "notes": "Comfortable softshell with light weather resistance, not fully waterproof."
+  },
+  {
+    "sku": "TENT_001",
+    "name": "WindRidge 2",
+    "category": "tent",
+    "price_eur": 210,
+    "waterproof": true,
+    "use_case": ["camping", "windy_weather", "2_person"],
+    "notes": "Two-person tent with strong pole structure for windy conditions."
+  },
+  {
+    "sku": "STOVE_001",
+    "name": "TrekLite Stove",
+    "category": "stove",
+    "price_eur": 59,
+    "waterproof": false,
+    "use_case": ["camping", "cooking"],
+    "notes": "Compact backpacking stove with 2-year limited warranty."
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/dataset.json b/examples/retail_support/dataset.json
new file mode 100644
index 0000000..a0938d1
--- /dev/null
+++ b/examples/retail_support/dataset.json
@@ -0,0 +1,98 @@
+[
+  {
+    "id": "SUP_001",
+    "category": "support_policy",
+    "query": "Can I return hiking shoes if I wore them outside once?",
+    "context_refs": ["policy_returns_worn_items"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_002",
+    "category": "order_support",
+    "query": "Where is my order ORD-1002?",
+    "context_refs": [],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_003",
+    "category": "recommendation",
+    "query": "Recommend 3 waterproof jackets under 150 euros for hiking.",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_004",
+    "category": "agent_workflow",
+    "query": "Cancel my order ORD-1005 if it has not shipped yet.",
+    "context_refs": [],
+    "expected_tools": ["lookup_order", "cancel_order"],
+    "expected_schema": "action_result"
+  },
+  {
+    "id": "SUP_005",
+    "category": "support_policy",
+    "query": "How long do refunds usually take after you receive the returned item?",
+    "context_refs": ["policy_refunds"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_006",
+    "category": "order_support",
+    "query": "My order ORD-1008 has been processing for several days. What should I do?",
+    "context_refs": ["policy_shipping_delay"],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_007",
+    "category": "recommendation",
+    "query": "Recommend a jacket for rainy weather in Denmark with a budget of 100 euros.",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_008",
+    "category": "agent_workflow",
+    "query": "Cancel order ORD-1002 for me.",
+    "context_refs": [],
+    "expected_tools": ["lookup_order", "cancel_order"],
+    "expected_schema": "action_result"
+  },
+  {
+    "id": "SUP_009",
+    "category": "retrieval_grounded",
+    "query": "What warranty do you offer on the TrekLite Stove?",
+    "context_refs": ["policy_warranty"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_010",
+    "category": "recommendation",
+    "query": "Which tent would you suggest for 2 people in windy weather?",
+    "context_refs": [],
+    "expected_tools": ["search_catalog"],
+    "expected_schema": "recommendation_answer"
+  },
+  {
+    "id": "SUP_011",
+    "category": "order_support",
+    "query": "Can order ORD-1011 still be cancelled?",
+    "context_refs": ["policy_cancellation"],
+    "expected_tools": ["lookup_order"],
+    "expected_schema": "support_answer"
+  },
+  {
+    "id": "SUP_012",
+    "category": "support_policy",
+    "query": "I used the shoes outside and now I want a refund. Is that allowed if they are not faulty?",
+    "context_refs": ["policy_returns_worn_items"],
+    "expected_tools": [],
+    "expected_schema": "support_answer"
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/expected_outputs.json b/examples/retail_support/expected_outputs.json
new file mode 100644
index 0000000..d37c0fe
--- /dev/null
+++ b/examples/retail_support/expected_outputs.json
@@ -0,0 +1,34 @@
+{
+  "SUP_001": {
+    "must_include_any": ["not eligible", "used outdoors"],
+    "must_not_include": ["full refund guaranteed"],
+    "required_context_ids": ["policy_returns_worn_items"]
+  },
+  "SUP_002": {
+    "required_tools": ["lookup_order"]
+  },
+  "SUP_004": {
+    "required_tools": ["lookup_order", "cancel_order"],
+    "expected_action_success": false
+  },
+  "SUP_005": {
+    "must_include_any": ["5 to 7 business days"],
+    "required_context_ids": ["policy_refunds"]
+  },
+  "SUP_006": {
+    "must_include_any": ["contact support", "manual check"],
+    "required_context_ids": ["policy_shipping_delay"]
+  },
+  "SUP_008": {
+    "required_tools": ["lookup_order", "cancel_order"],
+    "expected_action_success": true
+  },
+  "SUP_009": {
+    "must_include_any": ["2-year", "warranty"],
+    "required_context_ids": ["policy_warranty"]
+  },
+  "SUP_011": {
+    "must_include_any": ["cannot be cancelled"],
+    "required_context_ids": ["policy_cancellation"]
+  }
+}
\ No newline at end of file
diff --git a/examples/retail_support/knowledge_base.json b/examples/retail_support/knowledge_base.json
new file mode 100644
index 0000000..3a79a04
--- /dev/null
+++ b/examples/retail_support/knowledge_base.json
@@ -0,0 +1,27 @@
+[
+  {
+    "id": "policy_returns_worn_items",
+    "title": "Returns for worn footwear",
+    "text": "Footwear used outdoors is not eligible for return unless the item is faulty."
+  },
+  {
+    "id": "policy_cancellation",
+    "title": "Order cancellation policy",
+    "text": "Orders can be cancelled only before the shipment status changes to shipped."
+  },
+  {
+    "id": "policy_refunds",
+    "title": "Refund processing policy",
+    "text": "Approved refunds are processed within 5 to 7 business days after the returned item is received and inspected."
+  },
+  {
+    "id": "policy_shipping_delay",
+    "title": "Shipping delay guidance",
+    "text": "If an order remains in processing for more than 3 business days, the customer should be advised to contact support for a manual check."
+  },
+  {
+    "id": "policy_warranty",
+    "title": "Warranty policy",
+    "text": "Outdoor stoves and technical gear include a 2-year limited warranty covering manufacturing defects only."
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/orders.json b/examples/retail_support/orders.json
new file mode 100644
index 0000000..72a58cb
--- /dev/null
+++ b/examples/retail_support/orders.json
@@ -0,0 +1,30 @@
+[
+  {
+    "order_id": "ORD-1002",
+    "status": "processing",
+    "items": ["JKT_001"],
+    "can_cancel": true,
+    "days_in_status": 2
+  },
+  {
+    "order_id": "ORD-1005",
+    "status": "shipped",
+    "items": ["JKT_003"],
+    "can_cancel": false,
+    "days_in_status": 1
+  },
+  {
+    "order_id": "ORD-1008",
+    "status": "processing",
+    "items": ["STOVE_001"],
+    "can_cancel": true,
+    "days_in_status": 5
+  },
+  {
+    "order_id": "ORD-1011",
+    "status": "delivered",
+    "items": ["TENT_001"],
+    "can_cancel": false,
+    "days_in_status": 0
+  }
+]
\ No newline at end of file
diff --git a/examples/retail_support/rubric.json b/examples/retail_support/rubric.json
new file mode 100644
index 0000000..cbbcb62
--- /dev/null
+++ b/examples/retail_support/rubric.json
@@ -0,0 +1,21 @@
+{
+  "version": "v1.1",
+  "weights": {
+    "instruction_match": 0.25,
+    "grounding_accuracy": 0.25,
+    "tool_use_correctness": 0.20,
+    "resolution_helpfulness": 0.20,
+    "tone_clarity": 0.10
+  },
+  "thresholds": {
+    "pass": 4.0,
+    "warn": 3.0
+  },
+  "critical_gates": [
+    "no_policy_hallucination",
+    "no_false_action_claim",
+    "respect_constraints",
+    "valid_output_schema",
+    "no_invented_order_status"
+  ]
+}
\ No newline at end of file
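
Review note on `rubric.json`: the weights sum to 1.0 and the thresholds are 4.0 (pass) and 3.0 (warn). `scorer.py` is not part of this diff, so the following is only a minimal sketch of how such a rubric could be applied. It assumes each dimension is scored on a 1-to-5 scale and that any failed critical gate forces FAIL. The function name `weighted_verdict` is hypothetical.

```python
from typing import Dict, Tuple

# Weights and thresholds copied from examples/retail_support/rubric.json.
RUBRIC = {
    "weights": {
        "instruction_match": 0.25,
        "grounding_accuracy": 0.25,
        "tool_use_correctness": 0.20,
        "resolution_helpfulness": 0.20,
        "tone_clarity": 0.10,
    },
    "thresholds": {"pass": 4.0, "warn": 3.0},
}


def weighted_verdict(
    scores: Dict[str, float],
    gates_passed: bool,
    rubric: Dict[str, Dict[str, float]] = RUBRIC,
) -> Tuple[float, str]:
    """Hypothetical helper: combine per-dimension scores into a verdict.

    Assumes a 1-to-5 score per dimension; the real scorer may differ.
    """
    weights = rubric["weights"]
    # Weighted sum, rounded so float noise cannot flip a threshold check.
    total = round(sum(weights[dim] * scores[dim] for dim in weights), 2)

    # Assumption: a failed critical gate overrides the numeric score.
    if not gates_passed:
        return total, "FAIL"
    if total >= rubric["thresholds"]["pass"]:
        return total, "PASS"
    if total >= rubric["thresholds"]["warn"]:
        return total, "WARN"
    return total, "FAIL"
```

The verdict labels match the `FAIL` / `WARN` / `PASS` ranking used in `regression_compare.py`.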
diff --git a/examples/retail_support/task_config.yaml b/examples/retail_support/task_config.yaml
new file mode 100644
index 0000000..41a9470
--- /dev/null
+++ b/examples/retail_support/task_config.yaml
@@ -0,0 +1,30 @@
+task_name: retail_support
+task_type: mixed
+description: "Retail support evaluation task covering recommendation, support, retrieval, and simple workflow cases"
+
+dataset_path: examples/retail_support/dataset.json
+rubric_path: examples/retail_support/rubric.json
+
+knowledge_base_path: examples/retail_support/knowledge_base.json
+catalog_path: examples/retail_support/catalog.json
+orders_path: examples/retail_support/orders.json
+tool_scenarios_path: examples/retail_support/tool_scenarios.json
+expected_outputs_path: examples/retail_support/expected_outputs.json
+
+system_config: configs/systems/openai_wine.yaml
+judge_config: configs/judges/default_ensemble.yaml
+
+mock_response_profile:
+  type: retail_support
+
+evaluation:
+  rubric:
+    source: task_local
+  thresholds:
+    source: rubric
+  judge_ensemble:
+    source: task_judge_config
+
+output_schema_mode: disabled
+tool_simulation_mode: disabled
+judge_mode: disabled
\ No newline at end of file
diff --git a/examples/retail_support/tool_scenarios.json b/examples/retail_support/tool_scenarios.json
new file mode 100644
index 0000000..0c6a6a3
--- /dev/null
+++ b/examples/retail_support/tool_scenarios.json
@@ -0,0 +1,46 @@
+{
+  "lookup_order": {
+    "ORD-1002": {
+      "order_id": "ORD-1002",
+      "status": "processing",
+      "can_cancel": true,
+      "days_in_status": 2
+    },
+    "ORD-1005": {
+      "order_id": "ORD-1005",
+      "status": "shipped",
+      "can_cancel": false,
+      "days_in_status": 1
+    },
+    "ORD-1008": {
+      "order_id": "ORD-1008",
+      "status": "processing",
+      "can_cancel": true,
+      "days_in_status": 5
+    },
+    "ORD-1011": {
+      "order_id": "ORD-1011",
+      "status": "delivered",
+      "can_cancel": false,
+      "days_in_status": 0
+    }
+  },
+  "cancel_order": {
+    "ORD-1002": {
+      "success": true,
+      "message": "Order cancelled successfully."
+    },
+    "ORD-1005": {
+      "success": false,
+      "message": "Order cannot be cancelled after shipment."
+    },
+    "ORD-1008": {
+      "success": true,
+      "message": "Order cancelled successfully."
+    },
+    "ORD-1011": {
+      "success": false,
+      "message": "Delivered orders cannot be cancelled."
+    }
+  }
+}
\ No newline at end of file
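
Review note on `tool_scenarios.json`: `tool_simulator.py` is not included in this diff and `tool_simulation_mode` is still `disabled`, so the snippet below is only a sketch of how the fixture could be consumed. The helpers `simulate_tool` and `run_cancel_workflow` are hypothetical names, and the embedded data is a two-order subset of the fixture.

```python
from typing import Any, Dict, List, Tuple

# Subset of examples/retail_support/tool_scenarios.json, embedded for the sketch.
TOOL_SCENARIOS: Dict[str, Dict[str, Dict[str, Any]]] = {
    "lookup_order": {
        "ORD-1002": {"order_id": "ORD-1002", "status": "processing", "can_cancel": True},
        "ORD-1005": {"order_id": "ORD-1005", "status": "shipped", "can_cancel": False},
    },
    "cancel_order": {
        "ORD-1002": {"success": True, "message": "Order cancelled successfully."},
        "ORD-1005": {"success": False, "message": "Order cannot be cancelled after shipment."},
    },
}


def simulate_tool(tool_name: str, order_id: str) -> Dict[str, Any]:
    """Hypothetical lookup: return the canned response for a tool call."""
    try:
        return TOOL_SCENARIOS[tool_name][order_id]
    except KeyError:
        # Unknown tool or order: report failure instead of inventing data.
        return {"success": False, "message": f"No scenario for {tool_name}({order_id})"}


def run_cancel_workflow(order_id: str) -> Tuple[List[str], bool]:
    """Call lookup_order then cancel_order, recording the tool trace."""
    calls: List[str] = []
    simulate_tool("lookup_order", order_id)
    calls.append("lookup_order")
    result = simulate_tool("cancel_order", order_id)
    calls.append("cancel_order")
    return calls, bool(result.get("success", False))
```

With this subset, `ORD-1005` (shipped) yields a failed cancellation and `ORD-1002` (processing) a successful one, which lines up with `expected_action_success` for `SUP_004` and `SUP_008` in `expected_outputs.json`.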
diff --git a/examples/wine_recommendation/README.md b/examples/wine_recommendation/README.md
new file mode 100644
index 0000000..27e0c1e
--- /dev/null
+++ b/examples/wine_recommendation/README.md
@@ -0,0 +1,19 @@
+# Wine Recommendation Example
+
+This is the current reference example for the eval engine.
+
+It demonstrates:
+- recommendation-style evaluation
+- critical gates for hard constraints
+- heuristic scoring across qualitative dimensions
+- optional LLM-as-judge analysis
+- regression comparison across runs
+
+## Files
+
+- `dataset.json` — evaluation cases for wine recommendation prompts
+- `rubric.json` — critical gates, weights, and thresholds for this task
+
+## Notes
+
+This example remains the baseline reference task while the repo evolves toward a reusable multi-example evaluation framework.
\ No newline at end of file
diff --git a/dataset.json b/examples/wine_recommendation/dataset.json
similarity index 100%
rename from dataset.json
rename to examples/wine_recommendation/dataset.json
diff --git a/rubric.json b/examples/wine_recommendation/rubric.json
similarity index 100%
rename from rubric.json
rename to examples/wine_recommendation/rubric.json
diff --git a/regression_compare.py b/regression_compare.py
index cee1827..ecfe49d 100644
--- a/regression_compare.py
+++ b/regression_compare.py
@@ -14,11 +14,9 @@ def load_json(path: str) -> Any:
 def extract_results(payload: Any) -> List[Dict[str, Any]]:
     if isinstance(payload, list):
         return payload
-
     if isinstance(payload, dict):
         if "results" in payload and isinstance(payload["results"], list):
             return payload["results"]
-
     raise TypeError("Expected results payload to be a list or a dict containing a 'results' list.")
 
 
@@ -73,7 +71,7 @@ def compare_results(
     if missing_cases:
         print("ERROR: Missing cases in latest run:")
         for case_id in missing_cases:
-            print(f"  - {case_id}")
+            print(f" - {case_id}")
         exit_code = 1
 
     for case_id, base_row in baseline_map.items():
@@ -84,15 +82,14 @@ def compare_results(
 
         base_gate_pass = get_gate_pass(base_row)
         latest_gate_pass = get_gate_pass(latest_row)
-
         if base_gate_pass and not latest_gate_pass:
             print(f"ERROR: Gate regression for {case_id}")
             exit_code = 1
 
         base_score = get_weighted_score(base_row)
         latest_score = get_weighted_score(latest_row)
-
         score_drop = round(base_score - latest_score, 2)
+
         if score_drop > max_drop:
             print(
                 f"ERROR: Score drop too large for {case_id} "
@@ -102,8 +99,8 @@ def compare_results(
 
         base_verdict = get_verdict(base_row)
         latest_verdict = get_verdict(latest_row)
-
         verdict_rank = {"FAIL": 0, "WARN": 1, "PASS": 2}
+
         if verdict_rank.get(latest_verdict, -1) < verdict_rank.get(base_verdict, -1):
             print(
                 f"ERROR: Verdict regression for {case_id} "
@@ -117,10 +114,34 @@ def compare_results(
     return exit_code
 
 
+def resolve_paths(
+    baseline_path: str | None,
+    latest_path: str | None,
+    task: str | None,
+) -> tuple[str, str]:
+    if baseline_path and latest_path:
+        return baseline_path, latest_path
+
+    if task:
+        resolved_baseline = baseline_path or f"baselines/{task}/baseline_results.json"
+        resolved_latest = latest_path or f"results/{task}/latest_results.json"
+        return resolved_baseline, resolved_latest
+
+    raise ValueError(
+        "Provide both baseline_path and latest_path, or use --task to resolve example-scoped defaults."
+    )
+
+
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description="Compare eval results against baseline.")
-    parser.add_argument("baseline_path", type=str, help="Path to baseline results JSON")
-    parser.add_argument("latest_path", type=str, help="Path to latest results JSON")
+    parser.add_argument("baseline_path", nargs="?", type=str, help="Path to baseline results JSON")
+    parser.add_argument("latest_path", nargs="?", type=str, help="Path to latest results JSON")
+    parser.add_argument(
+        "--task",
+        type=str,
+        default=None,
+        help="Task/example name used to resolve default paths under baselines/<task>/ and results/<task>/",
+    )
     parser.add_argument(
         "--max-drop",
         type=float,
@@ -133,8 +154,14 @@ def parse_args() -> argparse.Namespace:
 def main() -> None:
     args = parse_args()
 
-    baseline_payload = load_json(args.baseline_path)
-    latest_payload = load_json(args.latest_path)
+    baseline_path, latest_path = resolve_paths(
+        baseline_path=args.baseline_path,
+        latest_path=args.latest_path,
+        task=args.task,
+    )
+
+    baseline_payload = load_json(baseline_path)
+    latest_payload = load_json(latest_path)
 
     baseline = extract_results(baseline_payload)
     latest = extract_results(latest_payload)
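
Review note on `resolve_paths`: the function below is copied from the hunk above so its behavior can be checked in isolation. Only the `from __future__` import and the comments are added here; the import is an assumption to keep the `str | None` annotations valid on Python versions before 3.10.

```python
from __future__ import annotations


def resolve_paths(
    baseline_path: str | None,
    latest_path: str | None,
    task: str | None,
) -> tuple[str, str]:
    # Two explicit paths always win, even when --task is also given.
    if baseline_path and latest_path:
        return baseline_path, latest_path

    # With --task, any missing path falls back to the example-scoped default.
    if task:
        resolved_baseline = baseline_path or f"baselines/{task}/baseline_results.json"
        resolved_latest = latest_path or f"results/{task}/latest_results.json"
        return resolved_baseline, resolved_latest

    # Neither a full pair of paths nor a task name: nothing to resolve.
    raise ValueError(
        "Provide both baseline_path and latest_path, or use --task to resolve example-scoped defaults."
    )
```

For example, `resolve_paths(None, None, "retail_support")` returns the pair `baselines/retail_support/baseline_results.json` and `results/retail_support/latest_results.json`. A single positional path without `--task` raises `ValueError`, which `main()` does not catch, so the user sees a traceback rather than an argparse usage message.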
diff --git a/runner.py b/runner.py
index 7a8c645..cfcd65b 100644
--- a/runner.py
+++ b/runner.py
@@ -1,9 +1,4 @@
 from __future__ import annotations
-from email import parser
-from dotenv import load_dotenv
-import yaml
-
-load_dotenv()
 
 import argparse
 import csv
@@ -11,15 +6,21 @@
 import os
 from datetime import datetime, timezone
 from typing import Any, Dict, List
+from pathlib import Path
+
+from dotenv import load_dotenv
 
 from confidence import compute_confidence
-from system_client import build_system_client, SYSTEM_PROMPT_VERSION
 from judge_client import (
     JUDGE_PROMPT_VERSION,
     judge_response_ensemble,
     load_judge_config,
 )
 from scorer import evaluate_case, evalresult_to_flat_dict
+from system_client import build_system_client, SYSTEM_PROMPT_VERSION
+from task_loader import load_task_bundle, load_yaml
+
+load_dotenv()
 
 RUBRIC_VERSION = "v1.0"
 DATASET_VERSION = "v1.0"
@@ -28,55 +29,50 @@
 # ----------------------------
 # Helpers
 # ----------------------------
-
-def load_json(path: str) -> Any:
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
-
 def ensure_dir(path: str) -> None:
     os.makedirs(path, exist_ok=True)
 
-def load_yaml(path: str) -> Any:
-    with open(path, "r", encoding="utf-8") as f:
-        return yaml.safe_load(f)
-    
-def get_task_rubric(task_config: Dict[str, Any]) -> Dict[str, Any]:
-    evaluation_cfg = task_config.get("evaluation", {})
-    rubric_cfg = evaluation_cfg.get("rubric", {})
-
-    if rubric_cfg.get("source") != "task_local":
-        raise ValueError("Unsupported rubric source in task config.")
-
-    rubric_path = task_config["rubric_path"]
-    return load_json(rubric_path)
-
 def get_task_thresholds(task_config: Dict[str, Any], rubric: Dict[str, Any]) -> Dict[str, Any]:
     evaluation_cfg = task_config.get("evaluation", {})
     thresholds_cfg = evaluation_cfg.get("thresholds", {})
-
     if thresholds_cfg.get("source") != "rubric":
         raise ValueError("Unsupported thresholds source in task config.")
-
     return rubric["thresholds"]
 
 def get_task_judge_ensemble_config(
-    task_config: Dict[str, Any],
-    judge_config: Dict[str, Any],
+    task_config: Dict[str, Any], judge_config: Dict[str, Any]
 ) -> Dict[str, Any]:
     evaluation_cfg = task_config.get("evaluation", {})
     judge_ensemble_cfg = evaluation_cfg.get("judge_ensemble", {})
-
     if judge_ensemble_cfg.get("source") != "task_judge_config":
         raise ValueError("Unsupported judge ensemble source in task config.")
-
     return judge_config
 
+def get_task_slug(task_config: Dict[str, Any], task_config_path: str) -> str:
+    return (
+        task_config.get("name")
+        or task_config.get("task_name")
+        or Path(task_config_path).stem
+    )
+
+def get_task_output_paths(task_slug: str) -> Dict[str, str]:
+    results_dir = f"results/{task_slug}"
+    baselines_dir = f"baselines/{task_slug}"
+
+    return {
+        "results_dir": results_dir,
+        "baselines_dir": baselines_dir,
+        "latest_path": f"{results_dir}/latest_results.json",
+        "run_path_prefix": f"{results_dir}/run_",
+        "csv_path": f"{results_dir}/report.csv",
+        "baseline_path": f"{baselines_dir}/baseline_results.json",
+    }
+
 def validate_mock_mode_support(system_config: Dict[str, Any]) -> None:
     mock_support = system_config.get("mock_support", {})
     if not mock_support.get("enabled", False):
         raise ValueError("Selected system config does not support mock mode.")
-    
+
 def build_result_record(
     eval_case: Dict[str, Any],
     response: str,
@@ -130,10 +126,10 @@ def write_json(path: str, data: Any) -> None:
     with open(path, "w", encoding="utf-8") as f:
         json.dump(data, f, indent=2)
 
-
 def write_csv(path: str, rows: List[Dict[str, Any]]) -> None:
     if not rows:
         return
+
     keys = rows[0].keys()
     with open(path, "w", newline="", encoding="utf-8") as f:
         writer = csv.DictWriter(f, fieldnames=keys)
@@ -147,6 +143,7 @@ def build_run_metadata(
     judge_config_path: str,
 ) -> Dict[str, Any]:
     judge_enabled = args.mode == "openai" and args.enable_judge
+
     return {
         "timestamp_utc": datetime.now(timezone.utc).isoformat(),
         "mode": args.mode,
@@ -161,9 +158,8 @@ def build_run_metadata(
         "task_config_path": task_config_path,
         "system_config_path": system_config_path,
         "judge_config_path": judge_config_path,
     }
 
-
 def compute_judge_summary(
     judge_bundle: Dict[str, Any],
     rubric: Dict[str, Any],
@@ -174,7 +171,6 @@ def compute_judge_summary(
     judge_individual = judge_bundle["individual_judges"]
 
     weights = rubric.get("weights", {})
-
     judge_weighted_score = round(
         sum(judge_scores[k] * weights.get(k, 0) for k in judge_scores),
         2,
@@ -186,7 +182,6 @@ def compute_judge_summary(
     )
 
     max_stddev = max(judge_stddev.values()) if judge_stddev else 0.0
-
     if max_stddev < 0.5:
         agreement = "high"
     elif max_stddev < 0.8:
@@ -203,48 +198,108 @@ def compute_judge_summary(
         "judge_agreement_level": agreement,
     }
 
-
 # ----------------------------
 # Mock model
 # ----------------------------
-
 def mock_system_respond(query: str, task_config: Dict[str, Any]) -> str:
     profile = task_config.get("mock_response_profile", {})
     profile_type = profile.get("type")
-
     q = query.lower()
 
     if profile_type == "wine_recommendation":
         triggers = profile.get("adversarial_triggers", [])
         if any(trigger.lower() in q for trigger in triggers):
             return """1. Château Margaux — Bordeaux, France — €120
-                Elegant and complex with cassis, cedar, and fine tannins.
-
-                2. Barolo DOCG — Piedmont, Italy — €45
-                Firm tannins with cherry, rose, and earthy notes.
+Elegant and complex with cassis, cedar, and fine tannins.
 
-                3. Rioja Reserva — Rioja, Spain — €30
-                Smooth and balanced with vanilla, spice, and red fruit."""
+2. Barolo DOCG — Piedmont, Italy — €45
+Firm tannins with cherry, rose, and earthy notes.
 
+3. Rioja Reserva — Rioja, Spain — €30
+Smooth and balanced with vanilla, spice, and red fruit."""
         return """1. Bordeaux Blend — Bordeaux, France — €40
-            Rich blackcurrant, oak, and spice.
+Rich blackcurrant, oak, and spice.
 
-            2. Chianti Classico — Tuscany, Italy — €25
-            Cherry, herbs, and bright acidity.
+2. Chianti Classico — Tuscany, Italy — €25
+Cherry, herbs, and bright acidity.
 
-            3. Rioja Crianza — Rioja, Spain — €20
-            Vanilla, red fruit, and soft tannins."""
+3. Rioja Crianza — Rioja, Spain — €20
+Vanilla, red fruit, and soft tannins."""
 
-    raise ValueError(f"Unsupported mock response profile: {profile_type}")
+    if profile_type == "retail_support":
+        if "wore them outside" in q or "used the shoes outside" in q:
+            return (
+                "The shoes are not eligible for return if they were used outdoors, "
+                "unless they are faulty."
+            )
+
+        if "where is my order ord-1002" in q:
+            return (
+                "Order ORD-1002 is currently in processing status. "
+                "It has not shipped yet."
+            )
+
+        if "recommend 3 waterproof jackets under 150 euros" in q:
+            return """1. NorthTrail RainShell — €129
+Lightweight waterproof shell for hiking in wet weather.
 
+2. NordLite PackShell — €99
+Packable waterproof jacket suitable for rain and light hiking.
+
+3. Harbor Softshell — €89
+Comfortable everyday jacket, but it is not fully waterproof."""
+
+        if "cancel my order ord-1005" in q:
+            return (
+                "Order ORD-1005 cannot be cancelled because it has already shipped."
+            )
+
+        if "refunds usually take" in q:
+            return (
+                "Approved refunds are usually processed within 5 to 7 business days "
+                "after the returned item is received and inspected."
+            )
+
+        if "order ord-1008 has been processing" in q:
+            return (
+                "Order ORD-1008 is still processing. Since it has been in processing "
+                "for several days, you should contact support for a manual check."
+            )
+
+        if "budget of 100 euros" in q:
+            return """1. NordLite PackShell — €99
+A waterproof and packable option that fits the budget.
+
+2. Harbor Softshell — €89
+Budget-friendly, but not fully waterproof."""
+
+        if "cancel order ord-1002" in q:
+            return "Order ORD-1002 has been cancelled successfully."
+
+        if "warranty do you offer on the treklite stove" in q:
+            return (
+                "The TrekLite Stove includes a 2-year limited warranty covering "
+                "manufacturing defects."
+            )
+
+        if "tent would you suggest for 2 people in windy weather" in q:
+            return (
+                "I suggest the WindRidge 2. It is a two-person tent designed for "
+                "windy conditions."
+            )
+
+        if "can order ord-1011 still be cancelled" in q:
+            return "Order ORD-1011 cannot be cancelled because it has already been delivered."
+
+        return "I’m sorry, but I could not determine the correct retail support response."
+
+    raise ValueError(f"Unsupported mock response profile: {profile_type}")
 
 # ----------------------------
 # Main
 # ----------------------------
-
 def main():
     parser = argparse.ArgumentParser(description="Eval Engine Runner")
-
     parser.add_argument(
         "--task-config",
         type=str,
@@ -257,22 +312,17 @@ def main():
         default=None,
         help="Optional override path to system-under-test configuration file",
     )
-
     parser.add_argument(
         "--judge-config",
         type=str,
         default=None,
         help="Optional override path to judge ensemble configuration file",
     )
-    parser.add_argument("--dataset", default="dataset.json")
-    parser.add_argument("--rubric", default="rubric.json")
-
     parser.add_argument(
         "--mode",
         choices=["mock", "openai"],
         default="mock",
     )
-
     parser.add_argument("--model", type=str, default=None)
     parser.add_argument("--temperature", type=float, default=None)
     parser.add_argument(
@@ -280,40 +330,46 @@ def main():
         action="store_true",
         help="Run LLM judge ensemble (OpenAI mode only).",
     )
-
     parser.add_argument("--limit", type=int, default=None)
-
     parser.add_argument(
         "--write-baseline",
         action="store_true",
         help="Write baseline to baselines/baseline_results.json",
     )
-
     args = parser.parse_args()
 
-    task_config = load_yaml(args.task_config)
+    task_bundle = load_task_bundle(args.task_config)
+    task_config = task_bundle["task_config"]
+    dataset = task_bundle["dataset"]
+    rubric = task_bundle["rubric"]
+    task_slug = get_task_slug(task_config, args.task_config)
+    paths = get_task_output_paths(task_slug)
+
     system_config_path = args.system_config or task_config["system_config"]
     judge_config_path = args.judge_config or task_config["judge_config"]
 
     system_config = load_yaml(system_config_path)
     judge_config = load_judge_config(judge_config_path)
     task_judge_config = get_task_judge_ensemble_config(task_config, judge_config)
-    dataset = load_json(task_config["dataset_path"])
-    rubric = get_task_rubric(task_config)
     thresholds = get_task_thresholds(task_config, rubric)
+
     if args.mode == "mock":
         validate_mock_mode_support(system_config)
 
     if args.limit:
         dataset = dataset[: args.limit]
 
-    ensure_dir("results")
-    ensure_dir("baselines")
+    ensure_dir(paths["results_dir"])
+    ensure_dir(paths["baselines_dir"])
 
-    run_metadata = build_run_metadata(args, task_config_path=args.task_config,
+    run_metadata = build_run_metadata(
+        args,
+        task_config_path=args.task_config,
         system_config_path=system_config_path,
-        judge_config_path=judge_config_path,)
+        judge_config_path=judge_config_path,
+    )
 
+    run_metadata["task_name"] = task_slug
     results: List[Dict[str, Any]] = []
     system_client = None
 
@@ -326,8 +382,8 @@ def main():
         runtime_system_config = dict(system_config)
         runtime_system_config["model"] = effective_model
         runtime_system_config["temperature"] = effective_temperature
-
         system_client = build_system_client(runtime_system_config)
+
     for case in dataset:
         query = case["query"]
 
@@ -370,30 +426,29 @@ def main():
         results.append(result_record)
 
     # Write outputs
-    latest_path = "results/latest_results.json"
+    latest_path = paths["latest_path"]
     write_json(latest_path, {"metadata": run_metadata, "results": results})
 
     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    run_path = f"results/run_{timestamp}.json"
+    run_path = f'{paths["run_path_prefix"]}{timestamp}.json'
     write_json(run_path, {"metadata": run_metadata, "results": results})
 
-    csv_path = "results/report.csv"
+    csv_path = paths["csv_path"]
     write_csv(csv_path, results)
 
-    print(f"\nSaved results:")
+    print("\nSaved results:")
     print(f" - {latest_path}")
     print(f" - {run_path}")
     print(f" - {csv_path}")
 
     # Baseline
     if args.write_baseline:
-        baseline_path = "baselines/baseline_results.json"
+        baseline_path = paths["baseline_path"]
         write_json(
             baseline_path,
             {"metadata": run_metadata, "results": results},
         )
         print(f"\nBaseline written to {baseline_path}")
 
-
 if __name__ == "__main__":
     main()
\ No newline at end of file
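
Review note: the runner change above moves all outputs under a per-task directory. The sketch below copies the two new helpers from the diff to show which paths a run would write to. The config path `tasks/retail_support.yaml` is a hypothetical example and does not come from the diff.

```python
from pathlib import Path
from typing import Any, Dict


def get_task_slug(task_config: Dict[str, Any], task_config_path: str) -> str:
    # Prefer an explicit name; fall back to the config file's stem.
    return (
        task_config.get("name")
        or task_config.get("task_name")
        or Path(task_config_path).stem
    )


def get_task_output_paths(task_slug: str) -> Dict[str, str]:
    results_dir = f"results/{task_slug}"
    baselines_dir = f"baselines/{task_slug}"
    return {
        "results_dir": results_dir,
        "baselines_dir": baselines_dir,
        "latest_path": f"{results_dir}/latest_results.json",
        "run_path_prefix": f"{results_dir}/run_",
        "csv_path": f"{results_dir}/report.csv",
        "baseline_path": f"{baselines_dir}/baseline_results.json",
    }


# Hypothetical task config with no "name" or "task_name" key,
# so the slug falls back to the file stem.
slug = get_task_slug({}, "tasks/retail_support.yaml")
paths = get_task_output_paths(slug)

print(slug)                    # retail_support
print(paths["latest_path"])    # results/retail_support/latest_results.json
print(paths["baseline_path"])  # baselines/retail_support/baseline_results.json
```

One consequence worth checking in review: the `--write-baseline` help text still mentions `baselines/baseline_results.json`, while the baseline is now written to the per-task path shown above.
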
diff --git a/schemas.py b/schemas.py
new file mode 100644
index 0000000..0ce13c2
--- /dev/null
+++ b/schemas.py
@@ -0,0 +1,147 @@
+from __future__ import annotations
+
+from typing import Any, Dict, List
+
+
+SUPPORT_ANSWER_SCHEMA: Dict[str, Any] = {
+    "type": "object",
+    "required": ["answer", "needs_human", "citations"],
+    "properties": {
+        "answer": {"type": "string"},
+        "needs_human": {"type": "boolean"},
+        "citations": {
+            "type": "array",
+            "items": {"type": "string"},
+        },
+    },
+}
+
+RECOMMENDATION_ANSWER_SCHEMA: Dict[str, Any] = {
+    "type": "object",
+    "required": ["answer", "recommendations"],
+    "properties": {
+        "answer": {"type": "string"},
+        "recommendations": {
+            "type": "array",
+            "items": {
+                "type": "object",
+                "required": ["name"],
+                "properties": {
+                    "name": {"type": "string"},
+                    "price_eur": {"type": ["number", "null"]},
+                    "reason": {"type": ["string", "null"]},
+                },
+            },
+        },
+    },
+}
+
+ACTION_RESULT_SCHEMA: Dict[str, Any] = {
+    "type": "object",
+    "required": ["answer", "action_taken", "action_success"],
+    "properties": {
+        "answer": {"type": "string"},
+        "action_taken": {"type": "string"},
+        "action_success": {"type": "boolean"},
+    },
+}
+
+SCHEMA_REGISTRY: Dict[str, Dict[str, Any]] = {
+    "support_answer": SUPPORT_ANSWER_SCHEMA,
+    "recommendation_answer": RECOMMENDATION_ANSWER_SCHEMA,
+    "action_result": ACTION_RESULT_SCHEMA,
+}
+
+
+def get_schema(schema_name: str) -> Dict[str, Any] | None:
+    return SCHEMA_REGISTRY.get(schema_name)
+
+
+def validate_structured_output(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
+    errors: List[str] = []
+
+    if schema.get("type") != "object":
+        return ["unsupported_schema_type"]
+
+    if not isinstance(payload, dict):
+        return ["payload_not_object"]
+
+    required = schema.get("required", [])
+    properties = schema.get("properties", {})
+
+    for field in required:
+        if field not in payload:
+            errors.append(f"missing_required_field:{field}")
+
+    for field, rules in properties.items():
+        if field not in payload:
+            continue
+
+        value = payload[field]
+        allowed_types = rules.get("type")
+
+        if allowed_types is not None:
+            if not _matches_type(value, allowed_types):
+                errors.append(f"invalid_type:{field}")
+
+        if allowed_types == "array" and "items" in rules:
+            if isinstance(value, list):
+                item_rules = rules["items"]
+                for i, item in enumerate(value):
+                    item_errors = _validate_item(item, item_rules)
+                    for err in item_errors:
+                        errors.append(f"{field}[{i}].{err}")
+
+    return errors
+
+
+def _validate_item(value: Any, rules: Dict[str, Any]) -> List[str]:
+    errors: List[str] = []
+    item_type = rules.get("type")
+
+    if item_type and not _matches_type(value, item_type):
+        errors.append("invalid_type")
+        return errors
+
+    if item_type == "object":
+        required = rules.get("required", [])
+        properties = rules.get("properties", {})
+
+        if not isinstance(value, dict):
+            errors.append("not_object")
+            return errors
+
+        for field in required:
+            if field not in value:
+                errors.append(f"missing_required_field:{field}")
+
+        for field, field_rules in properties.items():
+            if field not in value:
+                continue
+            if not _matches_type(value[field], field_rules.get("type")):
+                errors.append(f"invalid_type:{field}")
+
+    return errors
+
+
+def _matches_type(value: Any, expected_type: Any) -> bool:
+    if expected_type is None:
+        return True
+
+    if isinstance(expected_type, list):
+        return any(_matches_type(value, t) for t in expected_type)
+
+    if expected_type == "string":
+        return isinstance(value, str)
+    if expected_type == "boolean":
+        return isinstance(value, bool)
+    if expected_type == "number":
+        return isinstance(value, (int, float)) and not isinstance(value, bool)
+    if expected_type == "object":
+        return isinstance(value, dict)
+    if expected_type == "array":
+        return isinstance(value, list)
+    if expected_type == "null":
+        return value is None
+
+    return False
\ No newline at end of file
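
Review note: the type check in `schemas.py` has two behaviours that are easy to miss, so here is a small sketch of them. `matches_type` is a standalone copy of `_matches_type` from the diff, renamed only so the snippet runs on its own. The sample values are illustrative assumptions.

```python
from typing import Any


def matches_type(value: Any, expected_type: Any) -> bool:
    """Standalone copy of `_matches_type` from the schemas.py diff."""
    if expected_type is None:
        return True
    if isinstance(expected_type, list):
        # A list of type names acts as a union, e.g. ["number", "null"].
        return any(matches_type(value, t) for t in expected_type)
    if expected_type == "string":
        return isinstance(value, str)
    if expected_type == "boolean":
        return isinstance(value, bool)
    if expected_type == "number":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected_type == "object":
        return isinstance(value, dict)
    if expected_type == "array":
        return isinstance(value, list)
    if expected_type == "null":
        return value is None
    return False


print(matches_type(129, "number"))              # True
print(matches_type(True, "number"))             # False: booleans are not numbers here
print(matches_type(None, ["number", "null"]))   # True: union with "null"
print(matches_type("129", ["number", "null"]))  # False: numeric strings are rejected
print(matches_type(1, "integer"))               # False: unknown type names never match
```

The last line suggests a possible follow-up: a schema that uses a type name outside this set (such as `"integer"`) would report `invalid_type` for every value rather than signalling an unsupported schema.
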
diff --git a/scorer.py b/scorer.py
index ef9b008..2bb8d1b 100644
--- a/scorer.py
+++ b/scorer.py
@@ -5,52 +5,71 @@
 from typing import Any, Dict, List
 
 
-DIMENSIONS = [
+WINE_DIMENSIONS = [
     "tasting_clarity",
     "popularity_alignment",
     "regional_diversity",
     "language_tone",
 ]
 
+RETAIL_DIMENSIONS = [
+    "instruction_match",
+    "grounding_accuracy",
+    "tool_use_correctness",
+    "resolution_helpfulness",
+    "tone_clarity",
+]
+
 
 # ----------------------------
 # Data structure
 # ----------------------------
-
 @dataclass
 class EvalResult:
     gate_pass: bool
     gate_reasons: List[str]
-
-    tasting_clarity: float
-    popularity_alignment: float
-    regional_diversity: float
-    language_tone: float
-
+    tasting_clarity: float | None
+    popularity_alignment: float | None
+    regional_diversity: float | None
+    language_tone: float | None
     weighted_score: float
     verdict: str
-
     notes: Dict[str, Any]
+    scores: Dict[str, float]
 
 
 # ----------------------------
-# Parsing helpers
+# Generic helpers
 # ----------------------------
+def infer_task_type(rubric: Dict[str, Any]) -> str:
+    weights = rubric.get("weights", {})
+    if "instruction_match" in weights:
+        return "retail_support"
+    return "wine_recommendation"
+
+
+def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float:
+    return round(sum(scores.get(k, 0.0) * weights.get(k, 0.0) for k in scores), 2)
+
+
+def build_verdict(gate_pass: bool, weighted: float, thresholds: Dict[str, Any]) -> str:
+    if not gate_pass:
+        return "FAIL"
+    if weighted >= thresholds.get("pass", 4.0):
+        return "PASS"
+    if weighted >= thresholds.get("warn", 3.0):
+        return "WARN"
+    return "FAIL"
 
+
+# ----------------------------
+# Wine parsing helpers
+# ----------------------------
 def extract_items(text: str) -> List[str]:
-    """
-    Extract numbered list items from response.
-    Expected format:
-    1. Wine ...
-    2. Wine ...
-    """
     return re.findall(r"\d+\.\s(.+?)(?=\n\d+\.|\Z)", text, re.S)
 
 
 def extract_prices(items: List[str]) -> List[float]:
-    """
-    Extract euro prices from each item.
-    """
     prices = []
     for it in items:
         m = re.search(r"€\s*(\d+(?:\.\d+)?)", it)
@@ -60,9 +79,6 @@ def extract_prices(items: List[str]) -> List[float]:
 
 
 def distinct_regions(items: List[str]) -> int:
-    """
-    Estimate distinct regions using text between dashes.
-    """
     regions = set()
     for it in items:
         parts = re.split(r"—|-", it)
@@ -72,9 +88,6 @@ def distinct_regions(items: List[str]) -> int:
 
 
 def contains_white_signal(text: str) -> bool:
-    """
-    Detect obvious white wine indicators.
-    """
     return bool(
         re.search(
             r"\b(chardonnay|sauvignon blanc|riesling|pinot grigio)\b",
@@ -85,16 +98,13 @@ def contains_white_signal(text: str) -> bool:
 
 
 # ----------------------------
-# Scoring helpers
+# Wine scoring helpers
 # ----------------------------
-
 def score_tasting_clarity(text: str) -> float:
-    """
-    Score based on presence of tasting vocabulary.
-    """
     words = re.findall(r"\b\w+\b", text.lower())
     hits = sum(
-        w in {
+        w
+        in {
             "tannin",
             "tannins",
             "acidity",
@@ -126,9 +136,6 @@ def score_tasting_clarity(text: str) -> float:
 
 
 def score_popularity_alignment(text: str) -> float:
-    """
-    Proxy using presence of well-known regions.
-    """
     known = [
         "bordeaux",
         "burgundy",
@@ -151,9 +158,6 @@ def score_popularity_alignment(text: str) -> float:
 
 
 def score_regional_diversity(n_regions: int) -> float:
-    """
-    Score diversity based on distinct regions.
-    """
     if n_regions >= 3:
         return 5
     if n_regions == 2:
@@ -162,12 +166,9 @@ def score_regional_diversity(n_regions: int) -> float:
 
 
 def score_language_tone(text: str) -> float:
-    """
-    Rough proxy for tone quality based on sentence richness.
-    """
     sentences = re.split(r"[.!?]+", text)
     avg_len = sum(len(s.split()) for s in sentences if s.strip()) / max(
-        1, len(sentences)
+        1, len([s for s in sentences if s.strip()])
     )
     if avg_len >= 12:
         return 5
@@ -180,19 +181,7 @@ def score_language_tone(text: str) -> float:
     return 1
 
 
-# ----------------------------
-# Core evaluation
-# ----------------------------
-
-def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult:
-    """
-    Run deterministic evaluation:
-    - critical gates
-    - heuristic scoring
-    - weighted aggregation
-    - verdict assignment
-    """
-
+def evaluate_wine_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult:
     items = extract_items(response)
     prices = extract_prices(items)
     n_regions = distinct_regions(items)
@@ -216,50 +205,285 @@ def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResu
         gate_pass = False
         reasons.append("non_red_wine_detected")
 
-    t = score_tasting_clarity(response)
-    p = score_popularity_alignment(response)
-    r = score_regional_diversity(n_regions)
-    l = score_language_tone(response)
-
-    weighted = round(
-        t * weights.get("tasting_clarity", 0)
-        + p * weights.get("popularity_alignment", 0)
-        + r * weights.get("regional_diversity", 0)
-        + l * weights.get("language_tone", 0),
-        2,
-    )
+    scores = {
+        "tasting_clarity": score_tasting_clarity(response),
+        "popularity_alignment": score_popularity_alignment(response),
+        "regional_diversity": score_regional_diversity(n_regions),
+        "language_tone": score_language_tone(response),
+    }
 
-    if not gate_pass:
-        verdict = "FAIL"
-    elif weighted >= thresholds.get("pass", 3.5):
-        verdict = "PASS"
-    elif weighted >= thresholds.get("warn", 3.0):
-        verdict = "WARN"
-    else:
-        verdict = "FAIL"
+    weighted = weighted_sum(scores, weights)
+    verdict = build_verdict(gate_pass, weighted, thresholds)
 
     return EvalResult(
         gate_pass=gate_pass,
         gate_reasons=reasons,
-        tasting_clarity=t,
-        popularity_alignment=p,
-        regional_diversity=r,
-        language_tone=l,
+        tasting_clarity=scores["tasting_clarity"],
+        popularity_alignment=scores["popularity_alignment"],
+        regional_diversity=scores["regional_diversity"],
+        language_tone=scores["language_tone"],
         weighted_score=weighted,
         verdict=verdict,
         notes={
+            "task_type": "wine_recommendation",
             "item_count": len(items),
             "price_count": len(prices),
             "region_count": n_regions,
         },
+        scores=scores,
+    )
+
+
+# ----------------------------
+# Retail helpers
+# ----------------------------
+def normalize_text(text: str) -> str:
+    return re.sub(r"\s+", " ", text.strip().lower())
+
+
+def contains_any(text: str, phrases: List[str]) -> bool:
+    text_n = normalize_text(text)
+    return any(p.lower() in text_n for p in phrases)
+
+
+def count_hits(text: str, phrases: List[str]) -> int:
+    text_n = normalize_text(text)
+    return sum(1 for p in phrases if p.lower() in text_n)
+
+
+def extract_budget(query: str) -> float | None:
+    patterns = [
+        r"under\s+(\d+(?:\.\d+)?)\s*euros?",
+        r"budget of\s+(\d+(?:\.\d+)?)\s*euros?",
+        r"(\d+(?:\.\d+)?)\s*euros?",
+    ]
+    q = query.lower()
+    for pattern in patterns:
+        match = re.search(pattern, q)
+        if match:
+            return float(match.group(1))
+    return None
+
+
+def price_mentions(text: str) -> List[float]:
+    matches = re.findall(r"€\s*(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?)\s*euros?", text.lower())
+    values = []
+    for m in matches:
+        raw = m[0] or m[1]
+        if raw:
+            values.append(float(raw))
+    return values
+
+
+def detect_recommendation_item_count(text: str) -> int:
+    numbered = extract_items(text)
+    if numbered:
+        return len(numbered)
+
+    lines = [line.strip() for line in text.splitlines() if line.strip()]
+    bullet_lines = [line for line in lines if re.match(r"^[-*•]", line)]
+    return len(bullet_lines)
+
+
+def score_instruction_match_retail(query: str, response: str) -> float:
+    q = query.lower()
+    score = 3.0
+
+    if "recommend" in q:
+        item_count = detect_recommendation_item_count(response)
+        if item_count >= 3:
+            score += 1.0
+        elif item_count == 0:
+            score -= 1.5
+
+    if "where is my order" in q or "cancel" in q or "order " in q:
+        order_match = re.search(r"\bORD-\d+\b", query, re.I)
+        if order_match and order_match.group(0).lower() in response.lower():
+            score += 1.0
+        elif order_match:
+            score -= 1.0
+
+    if "refund" in q and contains_any(response, ["refund", "business days"]):
+        score += 0.5
+
+    return max(1.0, min(5.0, score))
+
+
+def score_grounding_accuracy_retail(query: str, response: str) -> float:
+    q = query.lower()
+    score = 2.5
+
+    if "wore" in q and "outside" in q:
+        if contains_any(response, ["not eligible", "unless faulty", "used outdoors"]):
+            score += 2.0
+
+    if "refund" in q:
+        if contains_any(response, ["5 to 7 business days", "5-7 business days"]):
+            score += 2.0
+
+    if "warranty" in q and "treklite stove" in q:
+        if contains_any(response, ["2-year", "manufacturing defects", "limited warranty"]):
+            score += 2.0
+
+    if "cancel" in q:
+        if contains_any(response, ["cannot be cancelled", "can be cancelled", "before shipment", "after shipment"]):
+            score += 1.5
+
+    if "processing for several days" in q:
+        if contains_any(response, ["contact support", "manual check", "3 business days"]):
+            score += 1.5
+
+    return max(1.0, min(5.0, score))
+
+
+def score_tool_use_correctness_retail(query: str, response: str) -> float:
+    q = query.lower()
+
+    if "order " in q or "cancel" in q:
+        if contains_any(response, ["ord-", "status", "processing", "shipped", "cancelled", "cannot be cancelled"]):
+            return 4.0
+        return 2.0
+
+    if "recommend" in q or "suggest" in q:
+        if detect_recommendation_item_count(response) >= 1:
+            return 4.0
+        return 2.0
+
+    return 3.0
+
+
+def score_resolution_helpfulness_retail(query: str, response: str) -> float:
+    score = 2.5
+
+    helpful_terms = [
+        "you can",
+        "you cannot",
+        "i recommend",
+        "contact support",
+        "manual check",
+        "under",
+        "waterproof",
+        "refund",
+        "warranty",
+    ]
+    hits = count_hits(response, helpful_terms)
+
+    if hits >= 4:
+        score = 5.0
+    elif hits >= 3:
+        score = 4.0
+    elif hits >= 2:
+        score = 3.5
+    elif hits >= 1:
+        score = 3.0
+
+    return max(1.0, min(5.0, score))
+
+
+def score_tone_clarity_retail(text: str) -> float:
+    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
+    if not sentences:
+        return 1.0
+
+    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
+
+    if avg_len >= 18:
+        return 5.0
+    if avg_len >= 12:
+        return 4.0
+    if avg_len >= 8:
+        return 3.0
+    if avg_len >= 5:
+        return 2.0
+    return 1.0
+
+
+def evaluate_retail_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult:
+    weights = rubric.get("weights", {})
+    thresholds = rubric.get("thresholds", {})
+    gate_names = rubric.get("critical_gates", [])
+
+    gate_pass = True
+    reasons: List[str] = []
+
+    q = query.lower()
+    r = response.lower()
+
+    if "no_false_action_claim" in gate_names:
+        if "cancel" in q and contains_any(r, ["cancelled successfully", "order cancelled"]) and contains_any(
+            q, ["ord-1005", "ord-1011"]
+        ):
+            gate_pass = False
+            reasons.append("false_action_claim")
+
+    if "respect_constraints" in gate_names:
+        budget = extract_budget(query)
+        mentioned_prices = price_mentions(response)
+        if budget is not None and any(price > budget for price in mentioned_prices):
+            gate_pass = False
+            reasons.append("constraint_violation")
+
+    if "no_policy_hallucination" in gate_names:
+        if "refund" in q and contains_any(r, ["same day", "24 hours", "instant refund"]):
+            gate_pass = False
+            reasons.append("policy_hallucination")
+
+    if "no_invented_order_status" in gate_names:
+        if "where is my order" in q or "order " in q:
+            if contains_any(r, ["delivered yesterday", "out for delivery"]) and contains_any(
+                q, ["ord-1002", "ord-1008"]
+            ):
+                gate_pass = False
+                reasons.append("invented_order_status")
+
+    if "valid_output_schema" in gate_names:
+        if not response.strip():
+            gate_pass = False
+            reasons.append("empty_response")
+
+    scores = {
+        "instruction_match": score_instruction_match_retail(query, response),
+        "grounding_accuracy": score_grounding_accuracy_retail(query, response),
+        "tool_use_correctness": score_tool_use_correctness_retail(query, response),
+        "resolution_helpfulness": score_resolution_helpfulness_retail(query, response),
+        "tone_clarity": score_tone_clarity_retail(response),
+    }
+
+    weighted = weighted_sum(scores, weights)
+    verdict = build_verdict(gate_pass, weighted, thresholds)
+
+    return EvalResult(
+        gate_pass=gate_pass,
+        gate_reasons=reasons,
+        tasting_clarity=None,
+        popularity_alignment=None,
+        regional_diversity=None,
+        language_tone=None,
+        weighted_score=weighted,
+        verdict=verdict,
+        notes={
+            "task_type": "retail_support",
+            "recommendation_item_count": detect_recommendation_item_count(response),
+            "mentioned_price_count": len(price_mentions(response)),
+        },
+        scores=scores,
     )
 
 
+# ----------------------------
+# Core evaluation
+# ----------------------------
+def evaluate_case(query: str, response: str, rubric: Dict[str, Any]) -> EvalResult:
+    task_type = infer_task_type(rubric)
+
+    if task_type == "retail_support":
+        return evaluate_retail_case(query, response, rubric)
+
+    return evaluate_wine_case(query, response, rubric)
+
+
 def evalresult_to_flat_dict(res: EvalResult) -> Dict[str, Any]:
-    """
-    Convert EvalResult to flat dict for JSON/CSV output.
-    """
-    return {
+    flat = {
         "gate_pass": res.gate_pass,
         "gate_reasons": ",".join(res.gate_reasons),
         "tasting_clarity": res.tasting_clarity,
@@ -269,4 +493,9 @@ def evalresult_to_flat_dict(res: EvalResult) -> Dict[str, Any]:
         "weighted_score": res.weighted_score,
         "verdict": res.verdict,
         **res.notes,
-    }
\ No newline at end of file
+    }
+
+    for key, value in res.scores.items():
+        flat[key] = value
+
+    return flat
\ No newline at end of file
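
Review note: the refactor routes both task types through `weighted_sum` and `build_verdict`. The sketch below copies those two helpers from the diff and runs them on made-up scores and weights (the dimension names match the retail rubric, but the numbers are assumptions chosen for illustration).

```python
from typing import Any, Dict


def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Dimensions missing from `weights` contribute 0 to the total.
    return round(sum(scores.get(k, 0.0) * weights.get(k, 0.0) for k in scores), 2)


def build_verdict(gate_pass: bool, weighted: float, thresholds: Dict[str, Any]) -> str:
    # A failed critical gate overrides any score.
    if not gate_pass:
        return "FAIL"
    if weighted >= thresholds.get("pass", 4.0):
        return "PASS"
    if weighted >= thresholds.get("warn", 3.0):
        return "WARN"
    return "FAIL"


# Illustrative values only; real weights and thresholds come from the rubric file.
weights = {"instruction_match": 0.5, "grounding_accuracy": 0.5}
scores = {"instruction_match": 4.0, "grounding_accuracy": 3.0}

total = weighted_sum(scores, weights)
print(total)                             # 3.5
print(build_verdict(True, total, {}))    # WARN (defaults: pass=4.0, warn=3.0)
print(build_verdict(True, total, {"pass": 3.5, "warn": 3.0}))  # PASS
print(build_verdict(False, 5.0, {}))     # FAIL (gate failure wins)
```

Note that the default `pass` threshold in `build_verdict` is 4.0, whereas the removed wine-specific code defaulted to 3.5. A rubric that omits `thresholds.pass` may therefore produce different verdicts after this change, as the second and third calls show.
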
diff --git a/task_loader.py b/task_loader.py
new file mode 100644
index 0000000..3c86545
--- /dev/null
+++ b/task_loader.py
@@ -0,0 +1,46 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any, Dict
+
+import yaml
+
+
+def load_json(path: str | Path) -> Any:
+    with open(path, "r", encoding="utf-8") as f:
+        return json.load(f)
+
+
+def load_yaml(path: str | Path) -> Any:
+    with open(path, "r", encoding="utf-8") as f:
+        return yaml.safe_load(f)
+
+
+def _optional_json(path: str | Path | None) -> Any | None:
+    if not path:
+        return None
+
+    p = Path(path)
+    if not p.exists():
+        return None
+
+    return load_json(p)
+
+
+def load_task_bundle(task_config_path: str) -> Dict[str, Any]:
+    task_config = load_yaml(task_config_path)
+
+    bundle: Dict[str, Any] = {
+        "task_config_path": task_config_path,
+        "task_config": task_config,
+        "dataset": load_json(task_config["dataset_path"]),
+        "rubric": load_json(task_config["rubric_path"]),
+        "knowledge_base": _optional_json(task_config.get("knowledge_base_path")),
+        "catalog": _optional_json(task_config.get("catalog_path")),
+        "orders": _optional_json(task_config.get("orders_path")),
+        "tool_scenarios": _optional_json(task_config.get("tool_scenarios_path")),
+        "expected_outputs": _optional_json(task_config.get("expected_outputs_path")),
+    }
+
+    return bundle
\ No newline at end of file
diff --git a/tool_simulator.py b/tool_simulator.py
new file mode 100644
index 0000000..a301b94
--- /dev/null
+++ b/tool_simulator.py
@@ -0,0 +1,54 @@
+from __future__ import annotations
+
+from typing import Any, Dict, List
+
+
+class ToolSimulator:
+    def __init__(self, tool_scenarios: Dict[str, Any] | None = None):
+        self.tool_scenarios = tool_scenarios or {}
+
+    def run(self, tool_name: str, input_payload: Dict[str, Any]) -> Dict[str, Any]:
+        if tool_name not in self.tool_scenarios:
+            return {
+                "error": f"Unknown tool: {tool_name}",
+                "success": False,
+            }
+
+        tool_data = self.tool_scenarios[tool_name]
+
+        # Simple key-based lookup: the first payload value (order_id, etc.) is the key.
+        key = list(input_payload.values())[0] if input_payload else None
+
+        if key in tool_data:
+            return {
+                "success": True,
+                "data": tool_data[key],
+            }
+
+        return {
+            "success": False,
+            "error": f"No data found for input: {input_payload}",
+        }
+
+
+def simulate_tool_sequence(
+    simulator: ToolSimulator,
+    steps: List[Dict[str, Any]],
+) -> List[Dict[str, Any]]:
+    trace: List[Dict[str, Any]] = []
+
+    for step in steps:
+        tool = step["tool"]
+        input_payload = step.get("input", {})
+
+        output = simulator.run(tool, input_payload)
+
+        trace.append(
+            {
+                "tool": tool,
+                "input": input_payload,
+                "output": output,
+            }
+        )
+
+    return trace
\ No newline at end of file