Skip to content

singlarity-seekers/tax-eval-harness

Repository files navigation

tax-eval-harness

A three-layer evaluation harness for running the tax-analyzer skill at scale, built on the Claude Agent SDK.

Inspired by the composable skill evaluation philosophy: skills are portable, governed behavioral units. The eval harness wraps them with deterministic guardrails, semantic quality checks, and agent-based cross-verification.

Setup

Prerequisites: Python 3.10+, Node.js 18+, uv

git clone https://github.com/singlarity-seekers/tax-eval-harness.git
cd tax-eval-harness

# Install dependencies
uv sync

# Set your API key (get one from console.anthropic.com)
export ANTHROPIC_API_KEY="sk-ant-..."

Quick Start

Verify the skill loads (no API key needed)

uv run tax-eval inspect

Expected output:

Skill name:    tax-analyzer
SKILL.md:      4929 chars
References:    ['document-parsing.md', 'tax-rules.md']
Scripts:       ['generate_tax_summary.py']

Run a single tax analysis

TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval single

Run batch processing

BATCH_DIR=/path/to/clients OUTPUT_DIR=/path/to/output uv run tax-eval batch

Run with custom eval hooks

TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval custom

How It Works

Skill Discovery

The tax-analyzer skill lives in .claude/skills/tax-analyzer/ and is discovered by name:

from tax_eval import TaxAnalyzerAgent

agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_name="tax-analyzer",        # discovers from .claude/skills/
)

The skill contains everything the agent needs: instructions (SKILL.md), domain knowledge (references/), and helper scripts (scripts/). At runtime, SkillLoader reads these and assembles them into a system prompt.

For portable distribution, skills can be packaged as .skill zip archives:

agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_file="tax-analyzer.skill",  # portable zip
)

Three-Layer Evaluation

Every agent run is wrapped with three layers of evaluation hooks:

Layer 1 — Deterministic (PreToolUse / PostToolUse): Hard security boundaries that never change. Blocks dangerous shell commands, prevents PII/SSN leaks in output files, restricts file writes to the output directory, and flags suspiciously large dollar amounts.

Layer 2 — Semantic (Stop): LLM-evaluated quality checks using Haiku. Assesses completeness (all income sources found?), methodology soundness (correct brackets, no double-counting ESPP/W-2 Box 14), and calculation quality (CPA-level review).

Layer 3 — Agent-Based (post-run): An independent verification agent that reads the Excel output, recomputes totals in Python, and produces a JSON verification report.

Composability

Custom hooks can be layered on top of the base three without modifying them:

from tax_eval import EvalReport, build_eval_hooks
from tax_eval.run_example import add_pre_tool_hook, add_stop_hook

report = EvalReport(case_id="my_case")
hooks_dict, agent_hooks = build_eval_hooks(report, "/output")

# Add your own checks
add_pre_tool_hook(hooks_dict, "Bash", my_security_hook)
add_stop_hook(hooks_dict, my_quality_hook)

Project Structure

tax-eval-harness/
├── .claude/skills/tax-analyzer/     # Skill (version-controlled)
│   ├── SKILL.md
│   ├── references/
│   └── scripts/
├── src/tax_eval/                    # Python package
│   ├── skill_loader.py              # SkillLoader — discover/unpack skills
│   ├── eval_hooks.py                # Three-layer eval hooks
│   ├── tax_agent.py                 # TaxAnalyzerAgent + BatchTaxAnalyzer
│   └── run_example.py               # CLI entry point
├── pyproject.toml                   # uv project config
├── CLAUDE.md                        # Project overview for AI assistants
└── AGENTS.md                        # Agent architecture details

Environment Variables

Variable Required Default Description
ANTHROPIC_API_KEY For agent runs API key from console.anthropic.com
SKILL_NAME No tax-analyzer Skill to discover from .claude/skills/
TAX_FOLDER For agent runs ./sample_tax_docs Path to tax documents
OUTPUT_DIR For agent runs ./output Path for output files
BATCH_DIR For batch mode ./clients Base directory containing client folders

Development

uv sync --extra dev          # Install dev dependencies
uv run pytest                # Run tests
uv run ruff check src/       # Lint
uv run ruff format src/      # Format

License

MIT

About

A three-layer evaluation harness for running the `tax-analyzer` skill at scale, built on the Claude Agent SDK

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages