An async Python implementation of the Claude Code agent harness. ~1,890 lines, 14 commits, every subsystem readable in one sitting.
Anthropic's leaked source revealed that ~98.4% of Claude Code is harness, ~1.6% is model interaction. claude-code-mini is the smallest faithful reproduction of that ratio: agent loop, real filesystem tools, prompt/semantic/exact caching, multi-stage context compaction, allow/ask/deny permissions, Pre/PostToolUse hooks, slash commands, and subprocess subagents — all async, all in src/claude_code_mini/.
You can run it as a daily-driver REPL on your laptop. You can also git log your way through 14 commits and watch the architecture get built one feature at a time.
uv sync
cp .env.example .env # add an API key, or skip if using Ollama
uv run ccm --tools real --yolo
claude-code-mini (provider=openai, model=gpt-4o-mini, tools=real)
[1] > what's the biggest python file under src and why?
The biggest file is src/claude_code_mini/harness.py at 244 lines; it owns
the 7-step request cascade (cache → context → LLM → permission → hook →
tool → format → cache writeback) plus the microcompact integration.
[2] > /cost
calls=4 in=8316 out=184 cached_read=1024 cost=$0.00136
[3] > /exit
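
The `/cost` line above can be sanity-checked by hand. The sketch below is not taken from tokens.py: it assumes gpt-4o-mini list prices of $0.15 per million input tokens and $0.60 per million output tokens, and it assumes cached reads are billed at the full input rate. Under those assumptions the arithmetic happens to reproduce the figure in the transcript; `estimate_cost` is a hypothetical helper name.

```python
# Assumed gpt-4o-mini prices in USD per million tokens (not read from the repo).
PRICE_IN_PER_M = 0.15
PRICE_OUT_PER_M = 0.60


def estimate_cost(tokens_in: int, tokens_out: int) -> float:
    """Estimate the dollar cost of a session, ignoring any cache discount."""
    return (tokens_in * PRICE_IN_PER_M + tokens_out * PRICE_OUT_PER_M) / 1_000_000


# Token counts copied from the /cost output: in=8316 out=184
cost = estimate_cost(tokens_in=8316, tokens_out=184)
print(f"cost=${cost:.5f}")  # prints cost=$0.00136
```

If a provider discounts cached input tokens, the real bill would be slightly lower than this estimate.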
Walking an interviewer through your own ~1.9 KLOC reproduction of the Claude Code architecture earns a kind of credibility that no LeetCode loop produces. The project targets three audiences:
- AI/ML infra engineers who want a working reference for prompt caching, context compaction, permission engines, and hook protocols — without reading 512K LOC of TypeScript.
- Learners who want to read agent-harness internals end-to-end. Each of the 14 commits is small enough to digest in 5 minutes; together they're the architectural narrative.
- Daily-driver users who want a hackable, async, single-file-ish Claude Code clone they can extend. Built-in providers: OpenAI, Anthropic, and Ollama for fully-local runs.
This is not trying to be production Claude Code. Scope tradeoffs are documented in the What's missing section.
| Subsystem | What it does | Source |
|---|---|---|
| Async harness | 7-step cascade: exact cache → semantic cache → context assembly → LLM → permissions/hooks/tool → format → cache writeback | harness.py |
| Three-tier cache | Exact match (SHA-256), semantic (cosine + threshold), Anthropic prompt cache (cache_control markers) — composable middleware with per-tier hit-rate metrics | caching.py |
| Real tools | Bash (with timeout + truncation), Read (line numbers + offset/limit), Glob, Grep (path:line:match) | tools/ |
| Context compaction | Stage 1 (Budget Reduce), Stage 2 (type-aware Snip), Stage 3 (LLM-summarize old chunks) | context.py |
| Permission engine | allow / ask / deny patterns (Bash(git:*), Read(./src/**)); ordered deny → allow → ask → default; settings.json | permissions.py |
| Hook system | PreToolUse + PostToolUse; shell command over JSON stdin/stdout; pre can block, post can rewrite | hooks.py |
| Slash commands + REPL | /help /tools /cost /cache /clear /compact /save; markdown commands from .claude/commands/*.md | slash.py, cli.py |
| Subagent isolation | asyncio.create_subprocess_exec; child runs its own harness, prints one-line JSON summary | subagent.py |
| CLAUDE.md loader | Walk-up + project/user-global merge; cache-friendly priority-1 placement | claude_md.py |
| Token & cost accounting | tiktoken counters, LLMCallRecord, per-call CSV report, dollar costs for 5 models | tokens.py |
| JSON ↔ TOON | Token-Oriented Object Notation for tabular tool outputs; ~40% token savings | formats.py |
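
The permission row describes an ordered lookup: deny, then allow, then ask, then a default. A minimal sketch of that ordering follows. It is not the code in permissions.py: the `rule_matches` and `decide` helpers, the flat settings layout, the `"ask"` default, and the reading of `:*` as a command-prefix wildcard are all assumptions made for illustration.

```python
from fnmatch import fnmatch


def rule_matches(rule: str, tool: str, arg: str) -> bool:
    """Match a rule such as 'Bash(git:*)' or 'Read(./src/**)' against a tool call."""
    name, _, rest = rule.partition("(")
    if name != tool:
        return False
    pattern = rest[:-1] if rest.endswith(")") else rest
    if not pattern:
        return True  # a bare rule like 'Bash' matches every Bash call
    if pattern.endswith(":*"):
        # Assumed semantics: 'git:*' matches 'git' alone or 'git <anything>'.
        prefix = pattern[:-2]
        return arg == prefix or arg.startswith(prefix + " ")
    return fnmatch(arg, pattern)  # fnmatch's '*' also crosses '/' separators


def decide(settings: dict, tool: str, arg: str, default: str = "ask") -> str:
    """Check deny rules first, then allow, then ask; fall back to the default."""
    for verdict in ("deny", "allow", "ask"):
        if any(rule_matches(rule, tool, arg) for rule in settings.get(verdict, [])):
            return verdict
    return default


# Hypothetical settings; the real settings.json schema may differ.
settings = {
    "deny": ["Bash(rm:*)"],
    "allow": ["Bash(git:*)", "Read(./src/**)"],
    "ask": ["Bash"],
}
print(decide(settings, "Bash", "git status"))        # allow
print(decide(settings, "Bash", "rm -rf build"))      # deny
print(decide(settings, "Read", "./src/pkg/mod.py"))  # allow
print(decide(settings, "Bash", "make test"))         # ask
```

Checking deny first means a broad allow pattern can never override an explicit block, which is the usual reason for this ordering.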
git clone https://github.com/zyziyun/claude-code-mini
cd claude-code-mini
uv sync # installs deps + dev extras
cp .env.example .env # add OPENAI_API_KEY and/or ANTHROPIC_API_KEY
uv run pytest -q # 65/65 tests should pass

uv run ccm # demo tools (no filesystem access)
uv run ccm --tools real --yolo # Bash/Read/Glob/Grep, auto-allow
uv run ccm --provider anthropic # Claude Sonnet 4.6
uv run ccm --provider ollama --tools real # local model, no API key needed

uv run python -m claude_code_mini.demo \
--provider openai --tools real --yolo \
--query "what is the highest-token file under src and why" \
--report runs/report.csv

uv run python -m benchmarks.compare_formats # JSON vs TOON
uv run python -m benchmarks.cache_hitrate # 3-tier cache, 20 queries
uv run python -m benchmarks.microcompact # 50-turn context savings
uv run python -m benchmarks.subagent_isolation # parent-context isolation

| Provider | Setup | Notes |
|---|---|---|
| OpenAI | OPENAI_API_KEY in .env | Tool calling, auto-prompt-cache (≥1024 token prefix) |
| Anthropic | ANTHROPIC_API_KEY in .env | Tool calling, explicit prompt cache with cache_control |
| Ollama | ollama serve + ollama pull qwen2.5-coder:7b | OpenAI-compatible endpoint at http://localhost:11434/v1; no API key required |
Provider is one flag — the harness, tools, permissions, and hooks all run identically.
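
One way to keep the provider a single flag is to reduce each provider to a small settings record that the rest of the harness never inspects. The sketch below is an illustration only, not the repo's llm.py: `ProviderConfig`, `PROVIDERS`, and `resolve_api_key` are hypothetical names, and only the environment variable names and the Ollama URL come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    base_url: Optional[str]     # None means "use the SDK's default endpoint"
    api_key_env: Optional[str]  # None means no API key is required


PROVIDERS = {
    "openai": ProviderConfig(base_url=None, api_key_env="OPENAI_API_KEY"),
    "anthropic": ProviderConfig(base_url=None, api_key_env="ANTHROPIC_API_KEY"),
    "ollama": ProviderConfig(base_url="http://localhost:11434/v1", api_key_env=None),
}


def resolve_api_key(provider: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the key for a provider, or a placeholder for keyless local runs."""
    cfg = PROVIDERS[provider]
    if cfg.api_key_env is None:
        return "ollama"  # placeholder value; the local endpoint does not check it
    key = env.get(cfg.api_key_env)
    if not key:
        raise RuntimeError(f"set {cfg.api_key_env} in .env to use --provider {provider}")
    return key


print(PROVIDERS["ollama"].base_url, resolve_api_key("ollama", env={}))
```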
user query
│
┌────────────▼─────────────┐
(0) │ slash dispatcher │ /compact /clear /tools /cost /help
└────────────┬─────────────┘
┌────────────▼─────────────┐
(1) │ ExactMatchCache │── hit ──► return
└────────────┬─────────────┘
┌────────────▼─────────────┐
(2) │ SemanticCache (cos) │── hit ──► return
└────────────┬─────────────┘
┌────────────▼─────────────┐
(3) │ assemble_context │ Stages 1+2 (+3 microcompact)
└────────────┬─────────────┘
┌────────────▼─────────────┐
(4) │ async LLM call │ AsyncOpenAI / AsyncAnthropic / Ollama
│ + Anthropic cache_ctrl │
└────────────┬─────────────┘
┌────────────▼─────────────┐
(5) │ permission engine │── deny ──► block
│ PreToolUse hook │── block ──► skip
│ tool.execute() │ async
│ PostToolUse hook │── rewrite ──► replace output
└────────────┬─────────────┘
┌────────────▼─────────────┐
(6) │ format (json / toon) │
└────────────┬─────────────┘
┌────────────▼─────────────┐
(7) │ write back to caches │
└────────────┬─────────────┘
▼
final text
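
Steps (1), (2), and (7) of the diagram can be sketched as a two-tier lookup: a SHA-256 exact-match table checked first, then a cosine-similarity scan with a threshold. This is a simplified illustration, not the CacheStack in caching.py. The `TwoTierCache` class, the 0.9 threshold, and the letter-frequency `toy_embed` function are assumptions; a real semantic tier would use a learned embedding model.

```python
import hashlib
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> list:
    """Toy stand-in for an embedding model: a 26-dimensional letter-count vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class TwoTierCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}     # sha256 hex digest -> answer
        self.semantic = []  # list of (embedding, answer) pairs

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[tuple]:
        """Return (tier, answer) on a hit, or None on a miss."""
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return ("exact", hit)
        vec = self.embed(query)
        scored = [(cosine(vec, emb), answer) for emb, answer in self.semantic]
        if scored:
            score, answer = max(scored, key=lambda pair: pair[0])
            if score >= self.threshold:
                return ("semantic", answer)
        return None

    def put(self, query: str, answer: str) -> None:
        """Write back to both tiers after a fresh LLM answer."""
        self.exact[self._key(query)] = answer
        self.semantic.append((self.embed(query), answer))


cache = TwoTierCache(toy_embed)
cache.put("list python files", "a.py b.py")
print(cache.get("list python files"))   # ('exact', 'a.py b.py')
print(cache.get("List python files!"))  # ('semantic', 'a.py b.py')
print(cache.get("zzz"))                 # None
```

Checking the exact tier first keeps the common repeat-query path free of any embedding work.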
uv run pytest -q → 65/65 pass in ~1s. Every subsystem has unit coverage; live API calls are mocked.
Benchmarks ship deterministic offline runs (no API key needed) plus optional --live flags:
| Benchmark | Headline number |
|---|---|
| JSON vs TOON (5-case eval) | -44.1% tokens with TOON; wins on every case |
| 3-tier cache (20-query synthetic workload) | 45% cost saved; 6 exact + 3 semantic hits |
| Microcompact (50-turn session, 20K-token budget) | 66% average / 88% peak context savings vs Stage-1-only |
| Subagent isolation (4-turn mock task) | 98.9% parent-context isolation — 21-token summary vs 1,972 inline tokens |
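
To see where the TOON savings in the first row come from, here is a simplified encoder for a uniform list of flat records. It approximates the tabular layout described by the TOON spec and omits quoting and escaping rules; it is not the code in formats.py, and `to_toon` is a hypothetical name. The comparison below counts characters, not tokens, so it only hints at the token-level result.

```python
import json


def to_toon(name: str, rows: list) -> str:
    """Encode a non-empty list of flat dicts with identical keys as a TOON-style table."""
    if not rows:
        raise ValueError("need at least one row")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    cols = ",".join(fields)
    # Field names are written once in the header instead of once per row.
    header = f"{name}[{len(rows)}]{{{cols}}}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


# Made-up rows for illustration.
rows = [{"path": "a.py", "lines": 10}, {"path": "b.py", "lines": 20}]
print(to_toon("files", rows))
print(len(json.dumps(rows)), "chars as JSON vs", len(to_toon("files", rows)), "as TOON")
```

The saving grows with the number of rows, because JSON repeats every key in every record while the table header states them once.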
Read the 14 commits in order and you've read the whole codebase:
| # | Commit | Adds |
|---|---|---|
| 1 | chore: initial async scaffold for agent harness | tokens, formats, context, caching, llm, demo tools, harness, demo CLI |
| 2 | feat(tokens): write per-call CSV cost report | --report PATH + LLMCallRecord writer |
| 3 | feat(benchmarks): add JSON vs TOON A/B with fixed eval set | 5-case eval + comparison script |
| 4 | refactor(caching): split into exact/semantic/prompt middleware with hit-rate metrics | CacheLayer protocol, CacheStack, per-tier metrics |
| 5 | feat(context): add Stage 3 microcompact for long conversations | LLM-based summarization gated by watermark |
| 6 | feat(subagent): isolate sub-task in subprocess via asyncio.create_subprocess_exec | parent-context isolation |
| 7 | feat(tools): add Bash, Read, Glob, Grep with timeouts and output truncation | real filesystem tools |
| 8 | feat(claude-md): hierarchical CLAUDE.md loader with user-global merge | project context injection |
| 9 | feat(permissions): allow/ask/deny engine with settings.json patterns | safety layer |
| 10 | feat(hooks): PreToolUse and PostToolUse over JSON stdin/stdout | extensibility layer |
| 11 | feat(cli): slash commands and interactive REPL via 'ccm' | UX layer |
| 12 | feat(llm): add Ollama provider via OpenAI-compatible endpoint | local model support |
| 13 | chore: merge ollama provider support | merge |
| 14 | feat(env): auto-load .env from project root via python-dotenv | DX polish |
Each commit is independently buildable (git checkout <sha> && uv run pytest -q).
| Feature | claude-code-mini | Real Claude Code |
|---|---|---|
| Agent loop | Async ReAct, 1 main loop | nO main + 13 sub-loops |
| Tools | 7 (4 real: Bash/Read/Glob/Grep) | 15+ (Edit, MultiEdit, NotebookEdit, Task, TodoWrite, WebSearch, WebFetch, ...) |
| Permissions | allow / ask / deny + patterns | + 7 modes + ML classifier (Auto Mode) |
| Hooks | PreToolUse, PostToolUse | 27 event types + matcher system |
| Compaction | Stages 1 + 2 + 3 | All 5 (incl. Auto-Compact via forked subagent) |
| Streaming | none | SSE token-by-token |
| MCP client | none | Full host: stdio + HTTP + SSE transports |
| Plan mode | none | Independent read-only mode + plan artifact |
| Skills | none | YAML frontmatter + progressive disclosure |
| Sandbox | permission engine | macOS sandbox profile + path-traversal guard |
| Code size | ~1,890 LOC | ~512,000 LOC TypeScript |
Each is a focused 4–8 hour addition that maps to one row above:
- hw11-edit-tool: EditTool + MultiEditTool with diff preview and atomic apply.
- hw12-mcp-client: MCP host over stdio; registry integration so external servers register tools at startup.
- hw13-streaming: SSE streaming on the LLM call; surface partial output in the REPL.
- hw14-plan-mode: read-only mode with Plan artifact + transition logic.
- hw15-skills: SkillsLoader with YAML frontmatter + progressive disclosure breakpoints.
PRs welcome.
- VILA-Lab "Dive into Claude Code" (arxiv 2604.14228) — academic 5-stage compaction analysis.
- TOON format spec — open standard for tabular LLM outputs.
- Codewithmukesh, Anatomy of a Claude Code Session — turn-by-turn cost breakdown.
- Fareed Khan, Building Claude Code with Harness Engineering — ~250-line reproducible harness.
- ProjectDiscovery's caching writeup — 7%→84% prompt-cache hit-rate case study.
The benchmark numbers and "98.4% harness" framing come from these sources, verified against the claude-code-mini reproductions where applicable.
MIT (see LICENSE — to be added).