A three-layer evaluation harness for running the tax-analyzer skill at scale, built on the Claude Agent SDK.
Inspired by the composable skill evaluation philosophy: skills are portable, governed behavioral units. The eval harness wraps them with deterministic guardrails, semantic quality checks, and agent-based cross-verification.
Prerequisites: Python 3.10+, Node.js 18+, uv
git clone https://github.com/singlarity-seekers/tax-eval-harness.git
cd tax-eval-harness
# Install dependencies
uv sync
# Set your API key (get one from console.anthropic.com)
export ANTHROPIC_API_KEY="sk-ant-..."

uv run tax-eval inspect

Expected output:
Skill name: tax-analyzer
SKILL.md: 4929 chars
References: ['document-parsing.md', 'tax-rules.md']
Scripts: ['generate_tax_summary.py']
# Single run (one client folder)
TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval single

# Batch run (all client folders under BATCH_DIR)
BATCH_DIR=/path/to/clients OUTPUT_DIR=/path/to/output uv run tax-eval batch

# Custom run
TAX_FOLDER=/path/to/2025_Tax OUTPUT_DIR=/path/to/output uv run tax-eval custom

The tax-analyzer skill lives in .claude/skills/tax-analyzer/ and is discovered by name:
from tax_eval import TaxAnalyzerAgent

agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_name="tax-analyzer",  # discovers from .claude/skills/
)

The skill contains everything the agent needs: instructions (SKILL.md), domain knowledge (references/), and helper scripts (scripts/). At runtime, SkillLoader reads these and assembles them into a system prompt.
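The assembly step can be sketched roughly as follows. This is an illustrative stand-in only: `load_skill` is a hypothetical function, not the real `SkillLoader` API, and it assumes the conventional skill layout (SKILL.md plus references/ and scripts/ subfolders).

```python
import tempfile
from pathlib import Path


def load_skill(skill_dir: Path) -> str:
    """Assemble a system prompt from a skill folder (hypothetical sketch)."""
    # SKILL.md is the core instruction document
    parts = [(skill_dir / "SKILL.md").read_text()]
    # Inline each reference document as an extra prompt section
    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            parts.append(f"## Reference: {ref.name}\n{ref.read_text()}")
    # Scripts are listed by name rather than inlined
    scripts = skill_dir / "scripts"
    if scripts.is_dir():
        names = ", ".join(p.name for p in sorted(scripts.glob("*.py")))
        parts.append(f"Helper scripts available: {names}")
    return "\n\n".join(parts)


# Demonstrate on a throwaway skill folder
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "tax-analyzer"
    (skill / "references").mkdir(parents=True)
    (skill / "scripts").mkdir()
    (skill / "SKILL.md").write_text("# tax-analyzer\nAnalyze tax documents.")
    (skill / "references" / "tax-rules.md").write_text("2025 bracket notes")
    (skill / "scripts" / "generate_tax_summary.py").write_text("# helper")
    prompt = load_skill(skill)
```

The real loader also handles `.skill` archives and agent-specific prompt framing; the sketch only shows the basic concatenation idea.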
For portable distribution, skills can be packaged as .skill zip archives:
agent = TaxAnalyzerAgent(
    tax_folder="/path/to/2025_Tax",
    output_dir="/path/to/output",
    skill_file="tax-analyzer.skill",  # portable zip
)

Every agent run is wrapped with three layers of evaluation hooks:
Layer 1 — Deterministic (PreToolUse / PostToolUse): Hard security boundaries that never change. Blocks dangerous shell commands, prevents PII/SSN leaks in output files, restricts file writes to the output directory, and flags suspiciously large dollar amounts.
Layer 2 — Semantic (Stop): LLM-evaluated quality checks using Haiku. Assesses completeness (all income sources found?), methodology soundness (correct brackets, no double-counting ESPP/W-2 Box 14), and calculation quality (CPA-level review).
Layer 3 — Agent-Based (post-run): An independent verification agent that reads the Excel output, recomputes totals in Python, and produces a JSON verification report.
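The spirit of a Layer 1 check can be sketched as plain deterministic functions. This is illustrative only: the function names and `(allowed, reason)` return shape are assumptions, and the actual hook signatures come from the Claude Agent SDK, not this sketch.

```python
import re

# Illustrative patterns: destructive shell commands and piped remote scripts
DANGEROUS = re.compile(r"\brm\s+-rf\b|\bcurl\b.*\|\s*sh\b")
# US Social Security Number shape: 3-2-4 digits
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_bash_command(command: str) -> tuple[bool, str]:
    """PreToolUse-style gate: refuse dangerous shell commands."""
    if DANGEROUS.search(command):
        return False, "blocked: dangerous shell command"
    return True, "ok"


def check_output_text(text: str) -> tuple[bool, str]:
    """PostToolUse-style gate: refuse file writes containing SSN patterns."""
    if SSN.search(text):
        return False, "blocked: possible SSN in output"
    return True, "ok"


print(check_bash_command("rm -rf /"))  # → (False, 'blocked: dangerous shell command')
```

Because these checks are pure functions with no model in the loop, they behave identically on every run, which is what makes Layer 1 a hard boundary rather than a judgment call.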
Custom hooks can be layered on top of the base three without modifying them:
from tax_eval import EvalReport, build_eval_hooks
from tax_eval.run_example import add_pre_tool_hook, add_stop_hook

report = EvalReport(case_id="my_case")
hooks_dict, agent_hooks = build_eval_hooks(report, "/output")

# Add your own checks
add_pre_tool_hook(hooks_dict, "Bash", my_security_hook)
add_stop_hook(hooks_dict, my_quality_hook)

tax-eval-harness/
├── .claude/skills/tax-analyzer/ # Skill (version-controlled)
│ ├── SKILL.md
│ ├── references/
│ └── scripts/
├── src/tax_eval/ # Python package
│ ├── skill_loader.py # SkillLoader — discover/unpack skills
│ ├── eval_hooks.py # Three-layer eval hooks
│ ├── tax_agent.py # TaxAnalyzerAgent + BatchTaxAnalyzer
│ └── run_example.py # CLI entry point
├── pyproject.toml # uv project config
├── CLAUDE.md # Project overview for AI assistants
└── AGENTS.md # Agent architecture details
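The portable `.skill` files mentioned earlier are zip archives of this skill directory. A minimal packing sketch using only the standard library (`pack_skill` is hypothetical, not part of the package, and the exact archive layout the real loader expects is an assumption here):

```python
import tempfile
import zipfile
from pathlib import Path


def pack_skill(skill_dir: Path, out_file: Path) -> Path:
    """Zip a skill folder (SKILL.md, references/, scripts/) into an archive."""
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(skill_dir.rglob("*")):
            if path.is_file():
                # Store entries relative to the skill folder's parent so the
                # archive unpacks to tax-analyzer/SKILL.md, etc.
                zf.write(path, path.relative_to(skill_dir.parent))
    return out_file


# Demonstrate on a throwaway skill folder
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "tax-analyzer"
    (skill / "references").mkdir(parents=True)
    (skill / "SKILL.md").write_text("# tax-analyzer")
    (skill / "references" / "tax-rules.md").write_text("rules")
    archive = pack_skill(skill, Path(tmp) / "tax-analyzer.skill")
    names = zipfile.ZipFile(archive).namelist()
```

Keeping the skill zip-packable is what makes it portable: the same folder that lives in version control can be handed to another project as a single file.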
| Variable | Required | Default | Description |
|---|---|---|---|
| ANTHROPIC_API_KEY | For agent runs | — | API key from console.anthropic.com |
| SKILL_NAME | No | tax-analyzer | Skill to discover from .claude/skills/ |
| TAX_FOLDER | For agent runs | ./sample_tax_docs | Path to tax documents |
| OUTPUT_DIR | For agent runs | ./output | Path for output files |
| BATCH_DIR | For batch mode | ./clients | Base directory containing client folders |
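Resolving these settings with their documented defaults can be sketched in a few lines (the `resolve_config` helper is hypothetical; the package's actual config code is not shown here):

```python
import os

# Documented defaults from the table above; ANTHROPIC_API_KEY has no default
DEFAULTS = {
    "SKILL_NAME": "tax-analyzer",
    "TAX_FOLDER": "./sample_tax_docs",
    "OUTPUT_DIR": "./output",
    "BATCH_DIR": "./clients",
}


def resolve_config() -> dict[str, str]:
    """Read each variable from the environment, falling back to its default."""
    return {name: os.environ.get(name, default) for name, default in DEFAULTS.items()}
```

A run then only needs to override the variables that differ from the defaults, e.g. `TAX_FOLDER` and `OUTPUT_DIR`.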
uv sync --extra dev # Install dev dependencies
uv run pytest # Run tests
uv run ruff check src/ # Lint
uv run ruff format src/ # Format

MIT