A three-layer evaluation harness for running the tax-analyzer skill at scale, built on the Claude Agent SDK.
Inspired by the composable skill evaluation philosophy: skills are portable, governed behavioral units. The eval harness wraps them with deterministic guardrails, semantic quality checks, and agent-based cross-verification.
Prerequisites: Python 3.10+, Node.js 18+, uv
git clone https://github.com/singlarity-seekers/tax-eval-harness.git
cd tax-eval-harness
# Install dependencies
uv sync
# Set your API key (get one from console.anthropic.com)
export ANTHROPIC_API_KEY="sk-ant-..."

uv run tax-eval inspect

Expected output:
Skill name: tax-analyzer
SKILL.md: 4929 chars
References: ['document-parsing.md', 'tax-rules.md']
Scripts: ['generate_tax_summary.py']
# Single run (one client folder)
TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval single

# Batch run (all client folders under BATCH_DIR)
BATCH_DIR=/path/to/clients OUTPUT_DIR=/path/to/output uv run tax-eval batch

# Custom run
TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval custom

The tax-analyzer skill lives in .claude/skills/tax-analyzer/ and is discovered by name:
from tax_eval import TaxAnalyzerAgent

agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_name="tax-analyzer",  # discovers from .claude/skills/
)

The skill contains everything the agent needs: instructions (SKILL.md), domain knowledge (references/), and helper scripts (scripts/). At runtime, SkillLoader reads these and assembles them into a system prompt.
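The assembly step can be sketched roughly as follows. This is an illustrative stand-in only: `load_skill` is a hypothetical function, not the real `SkillLoader` API, and it assumes the conventional skill layout (SKILL.md plus references/ and scripts/ subfolders).

```python
import tempfile
from pathlib import Path


def load_skill(skill_dir: Path) -> str:
    """Assemble a system prompt from a skill folder (hypothetical sketch)."""
    # SKILL.md is the core instruction document
    parts = [(skill_dir / "SKILL.md").read_text()]
    # Inline each reference document as an extra prompt section
    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            parts.append(f"## Reference: {ref.name}\n{ref.read_text()}")
    # Scripts are listed by name rather than inlined
    scripts = skill_dir / "scripts"
    if scripts.is_dir():
        names = ", ".join(p.name for p in sorted(scripts.glob("*.py")))
        parts.append(f"Helper scripts available: {names}")
    return "\n\n".join(parts)


# Demonstrate on a throwaway skill folder
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "tax-analyzer"
    (skill / "references").mkdir(parents=True)
    (skill / "scripts").mkdir()
    (skill / "SKILL.md").write_text("# tax-analyzer\nAnalyze tax documents.")
    (skill / "references" / "tax-rules.md").write_text("2025 bracket notes")
    (skill / "scripts" / "generate_tax_summary.py").write_text("# helper")
    prompt = load_skill(skill)
```

The real loader also handles `.skill` archives and agent-specific prompt framing; the sketch only shows the basic concatenation idea.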
For portable distribution, skills can be packaged as .skill zip archives:
agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_file="tax-analyzer.skill",  # portable zip
)

Every agent run is wrapped with three layers of evaluation hooks:
Layer 1 — Deterministic (PreToolUse / PostToolUse): Hard security boundaries that never change. Blocks dangerous shell commands, prevents PII/SSN leaks in output files, restricts file writes to the output directory, and flags suspiciously large dollar amounts.
Layer 2 — Semantic (Stop): LLM-evaluated quality checks using Haiku. Assesses completeness (all income sources found?), methodology soundness (correct brackets, no double-counting ESPP/W-2 Box 14), and calculation quality (CPA-level review).
Layer 3 — Agent-Based (post-run): An independent verification agent that reads the Excel output, recomputes totals in Python, and produces a JSON verification report.
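The spirit of a Layer 1 check can be sketched as plain deterministic functions. This is illustrative only: the function names and `(allowed, reason)` return shape are assumptions, and the actual hook signatures come from the Claude Agent SDK, not this sketch.

```python
import re

# Illustrative patterns: destructive shell commands and piped remote scripts
DANGEROUS = re.compile(r"\brm\s+-rf\b|\bcurl\b.*\|\s*sh\b")
# US Social Security Number shape: 3-2-4 digits
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_bash_command(command: str) -> tuple[bool, str]:
    """PreToolUse-style gate: refuse dangerous shell commands."""
    if DANGEROUS.search(command):
        return False, "blocked: dangerous shell command"
    return True, "ok"


def check_output_text(text: str) -> tuple[bool, str]:
    """PostToolUse-style gate: refuse file writes containing SSN patterns."""
    if SSN.search(text):
        return False, "blocked: possible SSN in output"
    return True, "ok"


print(check_bash_command("rm -rf /"))  # → (False, 'blocked: dangerous shell command')
```

Because these checks are pure functions with no model in the loop, they behave identically on every run, which is what makes Layer 1 a hard boundary rather than a judgment call.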
Custom hooks can be layered on top of the base three without modifying them:
from tax_eval import EvalReport, build_eval_hooks
from tax_eval.run_example import add_pre_tool_hook, add_stop_hook

report = EvalReport(case_id="my_case")
hooks_dict, agent_hooks = build_eval_hooks(report, "/output")

# Add your own checks
add_pre_tool_hook(hooks_dict, "Bash", my_security_hook)
add_stop_hook(hooks_dict, my_quality_hook)

tax-eval-harness/
├── .claude/skills/tax-analyzer/ # Skill (version-controlled)
│ ├── SKILL.md
│ ├── references/
│ └── scripts/
├── src/tax_eval/ # Python package
│ ├── skill_loader.py # SkillLoader — discover/unpack skills
│ ├── eval_hooks.py # Three-layer eval hooks
│ ├── tax_agent.py # TaxAnalyzerAgent + BatchTaxAnalyzer
│ └── run_example.py # CLI entry point
├── pyproject.toml # uv project config
├── CLAUDE.md # Project overview for AI assistants
└── AGENTS.md # Agent architecture details
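The portable `.skill` files mentioned earlier are zip archives of this skill directory. A minimal packing sketch using only the standard library (`pack_skill` is hypothetical, not part of the package, and the exact archive layout the real loader expects is an assumption here):

```python
import tempfile
import zipfile
from pathlib import Path


def pack_skill(skill_dir: Path, out_file: Path) -> Path:
    """Zip a skill folder (SKILL.md, references/, scripts/) into an archive."""
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(skill_dir.rglob("*")):
            if path.is_file():
                # Store entries relative to the skill folder's parent so the
                # archive unpacks to tax-analyzer/SKILL.md, etc.
                zf.write(path, path.relative_to(skill_dir.parent))
    return out_file


# Demonstrate on a throwaway skill folder
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "tax-analyzer"
    (skill / "references").mkdir(parents=True)
    (skill / "SKILL.md").write_text("# tax-analyzer")
    (skill / "references" / "tax-rules.md").write_text("rules")
    archive = pack_skill(skill, Path(tmp) / "tax-analyzer.skill")
    names = zipfile.ZipFile(archive).namelist()
```

Keeping the skill zip-packable is what makes it portable: the same folder that lives in version control can be handed to another project as a single file.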
| Variable | Required | Default | Description |
|---|---|---|---|
| ANTHROPIC_API_KEY | For agent runs | — | API key from console.anthropic.com |
| SKILL_NAME | No | tax-analyzer | Skill to discover from .claude/skills/ |
| TAX_FOLDER | For agent runs | ./sample_tax_docs | Path to tax documents |
| OUTPUT_DIR | For agent runs | ./output | Path for output files |
| BATCH_DIR | For batch mode | ./clients | Base directory containing client folders |
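Resolving these settings with their documented defaults can be sketched in a few lines (the `resolve_config` helper is hypothetical; the package's actual config code is not shown here):

```python
import os

# Documented defaults from the table above; ANTHROPIC_API_KEY has no default
DEFAULTS = {
    "SKILL_NAME": "tax-analyzer",
    "TAX_FOLDER": "./sample_tax_docs",
    "OUTPUT_DIR": "./output",
    "BATCH_DIR": "./clients",
}


def resolve_config() -> dict[str, str]:
    """Read each variable from the environment, falling back to its default."""
    return {name: os.environ.get(name, default) for name, default in DEFAULTS.items()}
```

A run then only needs to override the variables that differ from the defaults, e.g. `TAX_FOLDER` and `OUTPUT_DIR`.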
uv sync --extra dev # Install dev dependencies
uv run pytest # Run tests
uv run ruff check src/ # Lint
uv run ruff format src/ # Format

MIT