
zyziyun/claude-code-mini


claude-code-mini

An async Python implementation of the Claude Code agent harness. ~1,890 lines, 14 commits, every subsystem readable in one sitting.

Python 3.11+ License: MIT Tests

Anthropic's leaked source revealed that ~98.4% of Claude Code is harness, ~1.6% is model interaction. claude-code-mini is the smallest faithful reproduction of that ratio: agent loop, real filesystem tools, prompt/semantic/exact caching, multi-stage context compaction, allow/ask/deny permissions, Pre/PostToolUse hooks, slash commands, and subprocess subagents — all async, all in src/claude_code_mini/.

You can run it as a daily-driver REPL on your laptop. You can also git log your way through 14 commits and watch the architecture get built one feature at a time.

uv sync
cp .env.example .env  # add an API key, or skip if using Ollama
uv run ccm --tools real --yolo
claude-code-mini (provider=openai, model=gpt-4o-mini, tools=real)
[1] > what's the biggest python file under src and why?
The biggest file is src/claude_code_mini/harness.py at 244 lines; it owns
the 7-step request cascade (cache → context → LLM → permission → hook →
tool → format → cache writeback) plus the microcompact integration.
[2] > /cost
calls=4  in=8316  out=184  cached_read=1024  cost=$0.00136
[3] > /exit
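The /cost line above can be reproduced with plain per-token arithmetic. The sketch below assumes rates of $0.15 per million input tokens and $0.60 per million output tokens; those rates and the `estimate_cost` helper are illustrative assumptions, not values read from the repo, and the sketch does not model any discount for cached reads.

```python
# Assumed per-million-token rates in USD; illustrative, not taken from the repo.
INPUT_RATE_PER_M = 0.15
OUTPUT_RATE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a session from its token totals."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


# Token totals copied from the sample /cost output.
cost = estimate_cost(input_tokens=8316, output_tokens=184)
print(f"cost=${cost:.5f}")  # cost=$0.00136
```

Under these assumed rates the total comes to $0.0013578, which rounds to the figure shown in the session.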

Why this exists

Walking an interviewer through your own ~1.9 KLOC reproduction of the Claude Code architecture is a level of credibility no leetcode loop produces. Three audiences:

  • AI/ML infra engineers who want a working reference for prompt caching, context compaction, permission engines, and hook protocols — without reading 512K LOC of TypeScript.
  • Learners who want to read agent-harness internals end-to-end. Each of the 14 commits is small enough to digest in 5 minutes; together they're the architectural narrative.
  • Daily-driver users who want a hackable, async, single-file-ish Claude Code clone they can extend. Built-in providers: OpenAI, Anthropic, and Ollama for fully-local runs.

This is not trying to be production Claude Code. Scope tradeoffs are documented in the What's missing section.


Features

| Subsystem | What it does | Source |
| --- | --- | --- |
| Async harness | 7-step cascade: caches → context assembly → LLM → permissions → hooks → tool → format → cache writeback | harness.py |
| Three-tier cache | Exact match (SHA-256), semantic (cosine + threshold), Anthropic prompt cache (cache_control markers); composable middleware with per-tier hit-rate metrics | caching.py |
| Real tools | Bash (with timeout + truncation), Read (line numbers + offset/limit), Glob, Grep (path:line:match) | tools/ |
| Context compaction | Stage 1 (Budget Reduce), Stage 2 (type-aware Snip), Stage 3 (LLM-summarize old chunks) | context.py |
| Permission engine | allow / ask / deny patterns (Bash(git:*), Read(./src/**)); ordered deny → allow → ask → default; settings.json | permissions.py |
| Hook system | PreToolUse + PostToolUse; shell command over JSON stdin/stdout; pre can block, post can rewrite | hooks.py |
| Slash commands + REPL | /help /tools /cost /cache /clear /compact /save; markdown commands from .claude/commands/*.md | slash.py, cli.py |
| Subagent isolation | asyncio.create_subprocess_exec; child runs its own harness, prints one-line JSON summary | subagent.py |
| CLAUDE.md loader | Walk-up + project/user-global merge; cache-friendly priority-1 placement | claude_md.py |
| Token & cost accounting | tiktoken counters, LLMCallRecord, per-call CSV report, dollar costs for 5 models | tokens.py |
| JSON ↔ TOON | Token-Oriented Object Notation for tabular tool outputs; ~40% token savings | formats.py |
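The permission ordering listed in the table (deny → allow → ask → default) can be sketched in a few lines. This is a minimal illustration, not the code in permissions.py: the `Rules` class, the `matches` helper, and the reading of `Bash(git:*)` as a command-prefix match are all assumptions, and `fnmatch` treats `**` the same as `*`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


def matches(pattern: str, tool: str, arg: str) -> bool:
    """Match a 'Tool(spec)' pattern against a tool call (simplified syntax)."""
    name, _, rest = pattern.partition("(")
    if name != tool:
        return False
    if not rest:  # bare tool name such as "Bash" matches every call to that tool
        return True
    spec = rest.removesuffix(")")
    if spec.endswith(":*"):  # Bash(git:*) -> command is "git" or starts with "git "
        prefix = spec[:-2]
        return arg == prefix or arg.startswith(prefix + " ")
    # fnmatch's '*' also crosses '/', so './src/**' covers nested paths here.
    return fnmatchcase(arg, spec)


@dataclass
class Rules:
    deny: list[str] = field(default_factory=list)
    allow: list[str] = field(default_factory=list)
    ask: list[str] = field(default_factory=list)
    default: str = "ask"

    def decide(self, tool: str, arg: str) -> str:
        # First matching list wins, checked in the order deny -> allow -> ask.
        for verdict in ("deny", "allow", "ask"):
            if any(matches(p, tool, arg) for p in getattr(self, verdict)):
                return verdict
        return self.default


rules = Rules(deny=["Bash(rm:*)"], allow=["Bash(git:*)", "Read(./src/**)"])
print(rules.decide("Bash", "git status"))        # allow
print(rules.decide("Bash", "rm -rf build"))      # deny
print(rules.decide("Read", "./src/pkg/mod.py"))  # allow
print(rules.decide("Read", "/etc/passwd"))       # ask (falls through to default)
```

Checking deny first means a broad allow pattern can never override a narrower deny rule.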

Quick start

git clone https://github.com/zyziyun/claude-code-mini
cd claude-code-mini
uv sync                       # installs deps + dev extras
cp .env.example .env          # add OPENAI_API_KEY and/or ANTHROPIC_API_KEY
uv run pytest -q              # 65/65 tests should pass

Use it as a REPL

uv run ccm                                # demo tools (no filesystem access)
uv run ccm --tools real --yolo            # Bash/Read/Glob/Grep, auto-allow
uv run ccm --provider anthropic           # Claude Sonnet 4.6
uv run ccm --provider ollama --tools real # local model, no API key needed

Use it one-shot

uv run python -m claude_code_mini.demo \
    --provider openai --tools real --yolo \
    --query "what is the highest-token file under src and why" \
    --report runs/report.csv

Run the benchmarks

uv run python -m benchmarks.compare_formats     # JSON vs TOON
uv run python -m benchmarks.cache_hitrate       # 3-tier cache, 20 queries
uv run python -m benchmarks.microcompact        # 50-turn context savings
uv run python -m benchmarks.subagent_isolation  # parent-context isolation
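For a feel of what the JSON vs TOON comparison measures, here is a rough sketch of a TOON-style tabular encoding. It is a simplified encoder written for illustration, not the project's formats.py: `to_toon_table` is a hypothetical name, quoting and escaping are omitted, and it compares character counts rather than tokens.

```python
import json


def to_toon_table(name: str, rows: list[dict]) -> str:
    """Encode a uniform list of flat dicts as a TOON-style tabular block.

    Simplified: no quoting or escaping, so values must not contain
    commas or newlines.
    """
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    # Header declares the row count and field names once, e.g. matches[2]{path,line,match}:
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


rows = [
    {"path": "src/a.py", "line": 3, "match": "import os"},
    {"path": "src/b.py", "line": 10, "match": "import sys"},
]
toon = to_toon_table("matches", rows)
print(toon)
print(len(json.dumps(rows)), "chars as JSON vs", len(toon), "chars as TOON")
```

The saving comes from stating the keys once in the header instead of repeating them in every row, so it grows with the number of rows.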

Provider support

| Provider | Setup | Notes |
| --- | --- | --- |
| OpenAI | OPENAI_API_KEY in .env | Tool calling, auto-prompt-cache (≥1024 token prefix) |
| Anthropic | ANTHROPIC_API_KEY in .env | Tool calling, explicit prompt cache with cache_control |
| Ollama | ollama serve + ollama pull qwen2.5-coder:7b | OpenAI-compatible endpoint at http://localhost:11434/v1; no API key required |

Provider is one flag — the harness, tools, permissions, and hooks all run identically.


Architecture

                 user query
                     │
        ┌────────────▼─────────────┐
   (0)  │   slash dispatcher        │   /compact /clear /tools /cost /help
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (1)  │   ExactMatchCache         │── hit ──► return
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (2)  │   SemanticCache (cos)     │── hit ──► return
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (3)  │   assemble_context        │   Stages 1+2 (+3 microcompact)
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (4)  │   async LLM call          │   AsyncOpenAI / AsyncAnthropic / Ollama
        │   + Anthropic cache_ctrl  │
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (5)  │   permission engine       │── deny ──► block
        │   PreToolUse hook         │── block ──► skip
        │   tool.execute()          │   async
        │   PostToolUse hook        │── rewrite ──► replace output
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (6)  │   format (json / toon)    │
        └────────────┬─────────────┘
        ┌────────────▼─────────────┐
   (7)  │   write back to caches    │
        └────────────┬─────────────┘
                     ▼
                 final text
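Step (5) runs hooks as shell commands that read JSON on stdin and answer on stdout. A minimal sketch of the PreToolUse half is below. The payload fields (`event`, `tool`, `input`) and the `decision` reply are assumed for illustration and may differ from the schema in hooks.py; the hook itself is a stand-in Python one-liner so the example has no external dependencies.

```python
import asyncio
import json
import sys

# Stand-in hook: blocks any Bash command that contains "rm ".
HOOK_SCRIPT = """
import json, sys
event = json.load(sys.stdin)
blocked = event["tool"] == "Bash" and "rm " in event["input"].get("command", "")
print(json.dumps({"decision": "block" if blocked else "allow"}))
"""


async def run_pre_hook(argv: list[str], tool: str, tool_input: dict) -> bool:
    """Send the tool call to the hook process; return True if the call may proceed."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    payload = json.dumps({"event": "PreToolUse", "tool": tool, "input": tool_input})
    stdout, _ = await proc.communicate(payload.encode())
    reply = json.loads(stdout or b"{}")  # empty output is treated as "no objection"
    return reply.get("decision") != "block"


async def main() -> None:
    hook = [sys.executable, "-c", HOOK_SCRIPT]
    print(await run_pre_hook(hook, "Bash", {"command": "git status"}))    # True
    print(await run_pre_hook(hook, "Bash", {"command": "rm -rf build"}))  # False


if __name__ == "__main__":
    asyncio.run(main())
```

A PostToolUse hook would follow the same shape, with the tool output added to the payload and the reply optionally carrying replacement text.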

Tests & benchmarks

uv run pytest -q: 65/65 pass in ~1s. Every subsystem has unit coverage; live API calls are mocked.

Benchmarks ship deterministic offline runs (no API key needed) plus optional --live flags:

| Benchmark | Headline number |
| --- | --- |
| JSON vs TOON (5-case eval) | -44.1% tokens with TOON; wins on every case |
| 3-tier cache (20-query synthetic workload) | 45% cost saved; 6 exact + 3 semantic hits |
| Microcompact (50-turn session, 20K-token budget) | 66% average / 88% peak context savings vs Stage-1-only |
| Subagent isolation (4-turn mock task) | 98.9% parent-context isolation (21-token summary vs 1,972 inline tokens) |

Commit history is the architecture

Read the 14 commits in order and you've read the whole codebase:

| # | Commit | Adds |
| --- | --- | --- |
| 1 | chore: initial async scaffold for agent harness | tokens, formats, context, caching, llm, demo tools, harness, demo CLI |
| 2 | feat(tokens): write per-call CSV cost report | --report PATH + LLMCallRecord writer |
| 3 | feat(benchmarks): add JSON vs TOON A/B with fixed eval set | 5-case eval + comparison script |
| 4 | refactor(caching): split into exact/semantic/prompt middleware with hit-rate metrics | CacheLayer protocol, CacheStack, per-tier metrics |
| 5 | feat(context): add Stage 3 microcompact for long conversations | LLM-based summarization gated by watermark |
| 6 | feat(subagent): isolate sub-task in subprocess via asyncio.create_subprocess_exec | parent-context isolation |
| 7 | feat(tools): add Bash, Read, Glob, Grep with timeouts and output truncation | real filesystem tools |
| 8 | feat(claude-md): hierarchical CLAUDE.md loader with user-global merge | project context injection |
| 9 | feat(permissions): allow/ask/deny engine with settings.json patterns | safety layer |
| 10 | feat(hooks): PreToolUse and PostToolUse over JSON stdin/stdout | extensibility layer |
| 11 | feat(cli): slash commands and interactive REPL via 'ccm' | UX layer |
| 12 | feat(llm): add Ollama provider via OpenAI-compatible endpoint | local model support |
| 13 | chore: merge ollama provider support | merge |
| 14 | feat(env): auto-load .env from project root via python-dotenv | DX polish |

Each commit is independently buildable (git checkout <sha> && uv run pytest -q).
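As a taste of what one of these commits contains, the subprocess isolation from commit 6 can be sketched as follows. This is a simplified illustration rather than the code in subagent.py: the child is a stand-in script that fakes a long transcript instead of running a harness, and the summary fields (`task`, `steps`, `result`) are made up for the example.

```python
import asyncio
import json
import sys

# Stand-in child. A real child would run its own harness; this one builds a
# long transcript that never leaves the child process.
CHILD = """
import json, sys
task = sys.argv[1]
transcript = ["step %d for %s" % (i, task) for i in range(200)]
print(json.dumps({"task": task, "steps": len(transcript), "result": "done"}))
"""


async def run_subagent(task: str, timeout: float = 30.0) -> dict:
    """Run a sub-task in a child process and return only its one-line JSON summary."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD, task,
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()  # do not leave a runaway child behind
        await proc.wait()
        raise
    lines = stdout.decode().strip().splitlines()
    return json.loads(lines[-1])  # the last line is the summary


if __name__ == "__main__":
    print(asyncio.run(run_subagent("count files")))
```

The parent's context grows only by the size of the summary, however many steps the child took to produce it.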


What's missing vs. real Claude Code

| Feature | claude-code-mini | Real Claude Code |
| --- | --- | --- |
| Agent loop | Async ReAct, 1 main loop | nO main + 13 sub-loops |
| Tools | 7 (4 real: Bash/Read/Glob/Grep) | 15+ (Edit, MultiEdit, NotebookEdit, Task, TodoWrite, WebSearch, WebFetch, ...) |
| Permissions | allow / ask / deny + patterns | + 7 modes + ML classifier (Auto Mode) |
| Hooks | PreToolUse, PostToolUse | 27 event types + matcher system |
| Compaction | Stages 1 + 2 + 3 | All 5 (incl. Auto-Compact via forked subagent) |
| Streaming | none | SSE token-by-token |
| MCP client | none | Full host: stdio + HTTP + SSE transports |
| Plan mode | none | Independent read-only mode + plan artifact |
| Skills | none | YAML frontmatter + progressive disclosure |
| Sandbox | permission engine | macOS sandbox profile + path-traversal guard |
| Code size | ~1,890 LOC | ~512,000 LOC TypeScript |

Roadmap — hw11..hw15 branches

Each is a focused 4–8 hour addition that maps to one row above:

  • hw11-edit-tool: EditTool + MultiEditTool with diff preview and atomic apply.
  • hw12-mcp-client: MCP host over stdio; registry integration so external servers register tools at startup.
  • hw13-streaming: SSE streaming on the LLM call; surface partial output in the REPL.
  • hw14-plan-mode: read-only mode with Plan artifact + transition logic.
  • hw15-skills: SkillsLoader with YAML frontmatter + progressive disclosure breakpoints.

PRs welcome.


Credits & inspiration

  • VILA-Lab "Dive into Claude Code" (arxiv 2604.14228) — academic 5-stage compaction analysis.
  • TOON format spec — open standard for tabular LLM outputs.
  • Codewithmukesh, Anatomy of a Claude Code Session — turn-by-turn cost breakdown.
  • Fareed Khan, Building Claude Code with Harness Engineering — ~250-line reproducible harness.
  • ProjectDiscovery's caching writeup — 7%→84% prompt-cache hit-rate case study.

The benchmark numbers and "98.4% harness" framing come from these sources, verified against the claude-code-mini reproductions where applicable.


License

MIT (see LICENSE — to be added).

About

Async Python implementation of the Claude Code agent harness — real Bash/Read/Glob/Grep tools, three-tier caching, multi-stage compaction, allow/ask/deny permissions, Pre/PostToolUse hooks, slash commands, subagent isolation. OpenAI + Anthropic + Ollama. ~1.9K LOC, 14 commits.
