claude-code-mini

An async Python implementation of the Claude Code agent harness. ~1,890 lines, 14 commits, every subsystem readable in one sitting.

Analyses of Anthropic's leaked source report that ~98.4% of Claude Code is harness and only ~1.6% is model interaction. claude-code-mini aims to be the smallest faithful reproduction of that split: agent loop, real filesystem tools, prompt/semantic/exact caching, multi-stage context compaction, allow/ask/deny permissions, Pre/PostToolUse hooks, slash commands, and subprocess subagents, all async and all in src/claude_code_mini/.

You can run it as a daily-driver REPL on your laptop. You can also git log your way through 14 commits and watch the architecture get built one feature at a time.

uv sync
cp .env.example .env  # add an API key, or skip if using Ollama
uv run ccm --tools real --yolo

claude-code-mini (provider=openai, model=gpt-4o-mini, tools=real)
[1] > what's the biggest python file under src and why?
The biggest file is src/claude_code_mini/harness.py at 244 lines; it owns
the 7-step request cascade (cache → context → LLM → permission → hook →
tool → format → cache writeback) plus the microcompact integration.
[2] > /cost
calls=4  in=8316  out=184  cached_read=1024  cost=$0.00136
[3] > /exit
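
The `/cost` line above can be sanity-checked by hand. The sketch below is an assumption-heavy illustration: the per-million-token prices are the commonly published gpt-4o-mini list prices (not read from this repository), cached reads are assumed to be billed inside the `in` total at the full input rate, and `estimate_cost` is a hypothetical helper rather than the API in `tokens.py`.

```python
# Assumed list prices in USD per 1M tokens (not taken from the repo).
PRICE_PER_MILLION = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}


def estimate_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Return an estimated dollar cost for a session's token totals."""
    price = PRICE_PER_MILLION[model]
    total = tokens_in * price["input"] + tokens_out * price["output"]
    return total / 1_000_000


# Totals from the /cost output above: in=8316, out=184.
cost = estimate_cost("gpt-4o-mini", 8316, 184)
print(f"cost=${cost:.5f}")  # prints cost=$0.00136
```

Under these assumptions the arithmetic is 8316 × 0.15 + 184 × 0.60 = 1357.8 micro-dollars, which rounds to the $0.00136 shown in the transcript.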

Why this exists

Walking an interviewer through your own ~1.9 KLOC reproduction of the Claude Code architecture earns a kind of credibility that no LeetCode loop produces. The project targets three audiences:

AI/ML infra engineers who want a working reference for prompt caching, context compaction, permission engines, and hook protocols — without reading 512K LOC of TypeScript.
Learners who want to read agent-harness internals end-to-end. Each of the 14 commits is small enough to digest in 5 minutes; together they're the architectural narrative.
Daily-driver users who want a hackable, async, single-file-ish Claude Code clone they can extend. Built-in providers: OpenAI, Anthropic, and Ollama for fully-local runs.

This is not trying to be production Claude Code. Scope tradeoffs are documented in the What's missing section.

Features

| Subsystem | What it does | Source |
| --- | --- | --- |
| Async harness | 7-step cascade: caches → context assembly → LLM → permissions → hooks → tool → format → cache writeback | `harness.py` |
| Three-tier cache | Exact match (SHA-256), semantic (cosine + threshold), Anthropic prompt cache (cache_control markers); composable middleware with per-tier hit-rate metrics | `caching.py` |
| Real tools | Bash (with timeout + truncation), Read (line numbers + offset/limit), Glob, Grep (path:line:match) | `tools/` |
| Context compaction | Stage 1 (Budget Reduce), Stage 2 (type-aware Snip), Stage 3 (LLM-summarize old chunks) | `context.py` |
| Permission engine | allow / ask / deny patterns (`Bash(git:)`, `Read(./src/*)`); ordered `deny → allow → ask → default`; settings.json | `permissions.py` |
| Hook system | PreToolUse + PostToolUse; shell command over JSON stdin/stdout; pre can block, post can rewrite | `hooks.py` |
| Slash commands + REPL | `/help /tools /cost /cache /clear /compact /save`; markdown commands from `.claude/commands/*.md` | `slash.py`, `cli.py` |
| Subagent isolation | `asyncio.create_subprocess_exec`; child runs its own harness, prints one-line JSON summary | `subagent.py` |
| CLAUDE.md loader | Walk-up + project/user-global merge; cache-friendly priority-1 placement | `claude_md.py` |
| Token & cost accounting | tiktoken counters, `LLMCallRecord`, per-call CSV report, dollar costs for 5 models | `tokens.py` |
| JSON ↔ TOON | Token-Oriented Object Notation for tabular tool outputs; ~40% token savings | `formats.py` |
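
To make the `deny → allow → ask → default` ordering in the permission row concrete, here is a minimal sketch. It is not the code in `permissions.py`: the `PermissionRules` class, the `decide` function, the `Tool(pattern)` parsing, and the use of `fnmatch` globbing are all assumptions made for illustration, and the example patterns are invented rather than copied from a real settings.json.

```python
import fnmatch
import re
from dataclasses import dataclass, field

# Rule strings look like "Tool(pattern)"; glob semantics are an assumption.
_RULE = re.compile(r"^(?P<tool>\w+)\((?P<pattern>.*)\)$")


@dataclass
class PermissionRules:
    allow: list[str] = field(default_factory=list)
    ask: list[str] = field(default_factory=list)
    deny: list[str] = field(default_factory=list)
    default: str = "ask"


def _matches(rule: str, tool: str, argument: str) -> bool:
    """True when the rule names this tool and its glob matches the argument."""
    parsed = _RULE.match(rule)
    if parsed is None or parsed["tool"] != tool:
        return False
    return fnmatch.fnmatchcase(argument, parsed["pattern"])


def decide(rules: PermissionRules, tool: str, argument: str) -> str:
    """Check deny first, then allow, then ask; otherwise use the default."""
    ordered = (("deny", rules.deny), ("allow", rules.allow), ("ask", rules.ask))
    for verdict, patterns in ordered:
        if any(_matches(rule, tool, argument) for rule in patterns):
            return verdict
    return rules.default


rules = PermissionRules(
    allow=["Read(./src/*)", "Bash(git *)"],
    deny=["Bash(git push*)"],
)
print(decide(rules, "Read", "./src/harness.py"))  # allow
print(decide(rules, "Bash", "git push origin"))   # deny (deny beats allow)
print(decide(rules, "Bash", "rm -rf build"))      # ask (nothing matched)
```

Checking deny before allow means a broad allow pattern can never override a narrower deny, which is the point of fixing the evaluation order.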

Quick start

git clone https://github.com/zyziyun/claude-code-mini
cd claude-code-mini
uv sync                       # installs deps + dev extras
cp .env.example .env          # add OPENAI_API_KEY and/or ANTHROPIC_API_KEY
uv run pytest -q              # 65/65 tests should pass

Use it as a REPL

uv run ccm                                # demo tools (no filesystem access)
uv run ccm --tools real --yolo            # Bash/Read/Glob/Grep, auto-allow
uv run ccm --provider anthropic           # Claude Sonnet 4.6
uv run ccm --provider ollama --tools real # local model, no API key needed

Use it one-shot

uv run python -m claude_code_mini.demo \
    --provider openai --tools real --yolo \
    --query "what is the highest-token file under src and why" \
    --report runs/report.csv

Run the benchmarks

uv run python -m benchmarks.compare_formats     # JSON vs TOON
uv run python -m benchmarks.cache_hitrate       # 3-tier cache, 20 queries
uv run python -m benchmarks.microcompact        # 50-turn context savings
uv run python -m benchmarks.subagent_isolation  # parent-context isolation

Provider support

| Provider | Setup | Notes |
| --- | --- | --- |
| OpenAI | `OPENAI_API_KEY` in `.env` | Tool calling, auto-prompt-cache (≥1024 token prefix) |
| Anthropic | `ANTHROPIC_API_KEY` in `.env` | Tool calling, explicit prompt cache with `cache_control` |
| Ollama | `ollama serve` + `ollama pull qwen2.5-coder:7b` | OpenAI-compatible endpoint at `http://localhost:11434/v1`; no API key required |

Provider is one flag — the harness, tools, permissions, and hooks all run identically.
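
One way to keep the provider a single flag is to resolve it to a small config record before any client is built. The sketch below only encodes what the table states (the two environment variable names and the Ollama endpoint); `ProviderConfig`, `PROVIDERS`, and `resolve_provider` are hypothetical names, not the project's `llm.py` API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    api_key_env: Optional[str]  # env var holding the key; None if no key is needed
    base_url: Optional[str]     # None means "use the SDK's default endpoint"


PROVIDERS = {
    "openai": ProviderConfig("openai", "OPENAI_API_KEY", None),
    "anthropic": ProviderConfig("anthropic", "ANTHROPIC_API_KEY", None),
    "ollama": ProviderConfig("ollama", None, "http://localhost:11434/v1"),
}


def resolve_provider(name: str, env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Look up a provider and fail early if its API key is missing."""
    env = os.environ if env is None else env
    if name not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    config = PROVIDERS[name]
    if config.api_key_env and not env.get(config.api_key_env):
        raise RuntimeError(f"{config.api_key_env} is not set")
    return config


# Ollama resolves without any key in the environment.
print(resolve_provider("ollama", env={}))
```

Everything downstream of this lookup can then stay provider-agnostic, which matches the claim that the harness, tools, permissions, and hooks run identically.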

Architecture

                 user query
                     │
        ┌────────────▼─────────────┐
   (0)  │   slash dispatcher        │   /compact /clear /tools /cost /help
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (1)  │   ExactMatchCache         │── hit ──► return
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (2)  │   SemanticCache (cos)     │── hit ──► return
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (3)  │   assemble_context        │   Stages 1+2 (+3 microcompact)
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (4)  │   async LLM call          │   AsyncOpenAI / AsyncAnthropic / Ollama
        │   + Anthropic cache_ctrl  │
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (5)  │   permission engine       │── deny ──► block
        │   PreToolUse hook         │── block ──► skip
        │   tool.execute()          │   async
        │   PostToolUse hook        │── rewrite ──► replace output
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (6)  │   format (json / toon)    │
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (7)  │   write back to caches    │
        └────────────┬─────────────┘
                     ▼
                 final text
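
Steps (1), (2), and (7) of the diagram can be sketched as a small cascade: look up the exact-match tier, then the semantic tier, and only on a double miss call the model and write the answer back. This is a simplified illustration, not the code in `caching.py`: the bag-of-words `embed` function stands in for a real embedding model, the 0.8 threshold is arbitrary, and `answer` and `fake_llm` are hypothetical names.

```python
import asyncio
import hashlib
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real setup would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ExactMatchCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = answer


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []

    def get(self, query: str):
        vector = embed(query)
        scored = [(cosine(vector, stored), answer) for stored, answer in self._entries]
        if not scored:
            return None
        score, answer = max(scored, key=lambda pair: pair[0])
        return answer if score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((embed(query), answer))


async def answer(query, exact, semantic, call_llm) -> str:
    for tier in (exact, semantic):          # steps (1) and (2)
        hit = tier.get(query)
        if hit is not None:
            return hit
    result = await call_llm(query)          # steps (3) to (6), collapsed
    exact.put(query, result)                # step (7): write back to both tiers
    semantic.put(query, result)
    return result


async def main() -> int:
    llm_calls = []

    async def fake_llm(query: str) -> str:
        llm_calls.append(query)
        return f"answer to: {query}"

    exact, semantic = ExactMatchCache(), SemanticCache(threshold=0.8)
    await answer("list python files under src", exact, semantic, fake_llm)      # miss
    await answer("list python files under src", exact, semantic, fake_llm)      # exact hit
    await answer("list the python files under src", exact, semantic, fake_llm)  # semantic hit
    return len(llm_calls)


print(asyncio.run(main()))  # prints 1
```

The third query differs by one word, so its SHA-256 key misses but its cosine similarity to the stored query (about 0.91) clears the threshold, and the model is called only once across the three requests.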

Tests & benchmarks

uv run pytest -q → 65/65 pass in ~1s. Every subsystem has unit coverage; live API calls are mocked.

Benchmarks ship deterministic offline runs (no API key needed) plus optional --live flags:

| Benchmark | Headline number |
| --- | --- |
| JSON vs TOON (5-case eval) | -44.1% tokens with TOON; wins on every case |
| 3-tier cache (20-query synthetic workload) | 45% cost saved; 6 exact + 3 semantic hits |
| Microcompact (50-turn session, 20K-token budget) | 66% average / 88% peak context savings vs Stage-1-only |
| Subagent isolation (4-turn mock task) | 98.9% parent-context isolation (21-token summary vs 1,972 inline tokens) |
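
The JSON vs TOON row is easier to picture with a concrete encoding. The sketch below is a simplified TOON-style encoder for uniform, flat rows: it declares the field names once in a header and then emits one comma-separated line per row. It is not the code in `formats.py`; `to_toon` is a hypothetical helper, it skips the quoting and escaping rules of the full spec, and it compares character counts as a rough proxy because real token counts depend on the tokenizer.

```python
import json


def to_toon(name: str, rows: list[dict]) -> str:
    """Encode a uniform list of flat dicts as a TOON-style table.

    Simplified: assumes every row has the same keys in the same order and
    that no value needs quoting (no commas, colons, or newlines).
    """
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


matches = [
    {"path": "src/a.py", "line": 12, "text": "import os"},
    {"path": "src/b.py", "line": 40, "text": "import re"},
    {"path": "src/c.py", "line": 7, "text": "import sys"},
]

as_json = json.dumps({"matches": matches})
as_toon = to_toon("matches", matches)

print(as_toon)
print(len(as_toon) < len(as_json))  # prints True
```

The saving comes from not repeating the keys on every row, so it grows with the number of rows and shrinks for nested or irregular data.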

Commit history is the architecture

Read the 14 commits in order and you've read the whole codebase:

| # | Commit | Adds |
| --- | --- | --- |
| 1 | `chore: initial async scaffold for agent harness` | tokens, formats, context, caching, llm, demo tools, harness, demo CLI |
| 2 | `feat(tokens): write per-call CSV cost report` | `--report PATH` + `LLMCallRecord` writer |
| 3 | `feat(benchmarks): add JSON vs TOON A/B with fixed eval set` | 5-case eval + comparison script |
| 4 | `refactor(caching): split into exact/semantic/prompt middleware with hit-rate metrics` | `CacheLayer` protocol, `CacheStack`, per-tier metrics |
| 5 | `feat(context): add Stage 3 microcompact for long conversations` | LLM-based summarization gated by watermark |
| 6 | `feat(subagent): isolate sub-task in subprocess via asyncio.create_subprocess_exec` | parent-context isolation |
| 7 | `feat(tools): add Bash, Read, Glob, Grep with timeouts and output truncation` | real filesystem tools |
| 8 | `feat(claude-md): hierarchical CLAUDE.md loader with user-global merge` | project context injection |
| 9 | `feat(permissions): allow/ask/deny engine with settings.json patterns` | safety layer |
| 10 | `feat(hooks): PreToolUse and PostToolUse over JSON stdin/stdout` | extensibility layer |
| 11 | `feat(cli): slash commands and interactive REPL via 'ccm'` | UX layer |
| 12 | `feat(llm): add Ollama provider via OpenAI-compatible endpoint` | local model support |
| 13 | `chore: merge ollama provider support` | merge |
| 14 | `feat(env): auto-load .env from project root via python-dotenv` | DX polish |

Each commit is independently buildable (git checkout <sha> && uv run pytest -q).
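
As an example of what a single commit adds, the subprocess isolation from commit 6 can be sketched as follows. This is a minimal illustration rather than the code in `subagent.py`: the inline `CHILD_CODE` program stands in for a child that would run its own harness, and `run_subagent`, the summary fields, and the 30-second timeout are assumptions.

```python
import asyncio
import json
import sys

# Stand-in child program; the real child would run its own harness.
CHILD_CODE = r"""
import json, sys
task = sys.argv[1]
for step in range(3):
    print(f"[child] working on {task!r}, step {step}")  # verbose intermediate output
print(json.dumps({"task": task, "status": "ok", "steps": 3}))  # one-line summary
"""


async def run_subagent(task: str, timeout: float = 30.0) -> dict:
    """Run the child in a separate process and return only its JSON summary."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE, task,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(f"subagent failed: {stderr.decode().strip()}")
    # Everything except the final line stays out of the parent's context.
    last_line = stdout.decode().strip().splitlines()[-1]
    return json.loads(last_line)


summary = asyncio.run(run_subagent("count TODO comments"))
print(summary)
```

Because the parent keeps only the last line, the child's intermediate output never enters the parent's context. That is the effect the benchmark table reports as 21 summary tokens versus 1,972 inline tokens, or 1 − 21/1972 ≈ 98.9% isolation.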

What's missing vs. real Claude Code

| Feature | claude-code-mini | Real Claude Code |
| --- | --- | --- |
| Agent loop | Async ReAct, 1 main loop | nO main + 13 sub-loops |
| Tools | 7 (4 real: Bash/Read/Glob/Grep) | 15+ (Edit, MultiEdit, NotebookEdit, Task, TodoWrite, WebSearch, WebFetch, ...) |
| Permissions | allow / ask / deny + patterns | + 7 modes + ML classifier (Auto Mode) |
| Hooks | PreToolUse, PostToolUse | 27 event types + matcher system |
| Compaction | Stages 1 + 2 + 3 | All 5 (incl. Auto-Compact via forked subagent) |
| Streaming | none | SSE token-by-token |
| MCP client | none | Full host: stdio + HTTP + SSE transports |
| Plan mode | none | Independent read-only mode + plan artifact |
| Skills | none | YAML frontmatter + progressive disclosure |
| Sandbox | permission engine | macOS sandbox profile + path-traversal guard |
| Code size | ~1,890 LOC | ~512,000 LOC TypeScript |

Roadmap — `hw11..hw15` branches

Each is a focused 4–8 hour addition that maps to one row above:

hw11-edit-tool — EditTool + MultiEditTool with diff preview and atomic apply.
hw12-mcp-client — MCP host over stdio; registry integration so external servers register tools at startup.
hw13-streaming — SSE streaming on the LLM call; surface partial output in the REPL.
hw14-plan-mode — read-only mode with Plan artifact + transition logic.
hw15-skills — SkillsLoader with YAML frontmatter + progressive disclosure breakpoints.

PRs welcome.

Credits & inspiration

VILA-Lab "Dive into Claude Code" (arxiv 2604.14228) — academic 5-stage compaction analysis.
TOON format spec — open standard for tabular LLM outputs.
Codewithmukesh, Anatomy of a Claude Code Session — turn-by-turn cost breakdown.
Fareed Khan, Building Claude Code with Harness Engineering — ~250-line reproducible harness.
ProjectDiscovery's caching writeup — 7%→84% prompt-cache hit-rate case study.

The benchmark numbers and "98.4% harness" framing come from these sources, verified against the claude-code-mini reproductions where applicable.

License

MIT (see LICENSE — to be added).
