A production-grade multi-agent AI system that takes a plain-English coding task, writes Python code, runs it in a sandbox, self-critiques the result, fixes errors automatically, and streams the answer back — with optional human-in-the-loop review at every step.
Built with LangGraph · Groq · Gemini · DeepSeek · OpenAI · Anthropic · FastAPI · Streamlit.
| | Copilot / Cursor | This project |
|---|---|---|
| Suggests code | ✅ | ✅ |
| Actually runs it | ❌ | ✅ |
| Self-corrects on failure | ❌ | ✅ |
| Human approves before deploy | ❌ | ✅ optional (HITL toggle — on or off) |
| Runs on private infra | ❌ | ✅ |
| Customizable agents | ❌ | ✅ |
| REST API for integration | ❌ | ✅ |
| Cost tracking per call | ❌ | ✅ |
The gap is execution + self-correction + control. That's what enterprises pay for.
Banks, healthcare, and data teams can't send sensitive code or data to external tools. They build this architecture internally — planner → executor → critic → human review — on their own infrastructure. This project demonstrates exactly that pattern.
Step 1 — Enter the task
Step 2 — Planner creates steps, Executor writes and runs the code
Step 3 — Full K-means implementation visible in the code panel
Step 4 — Plot generated automatically inside the sandboxed subprocess
Step 5 — HITL: human inspects code + plot before approving
Step 6 — Approve — agent completes, final answer streamed
Step 7 — Final cluster plot: 3 clusters, centroids marked X
User input (plain English task)
│
▼
┌─────────────────┐
│ Planner │ 1 LLM call → JSON array of 3-5 steps
└────────┬────────┘
│ plan[]
▼
┌──────────────────────────────────────────────┐
│ Executor — ReAct loop │
│ │
│ ┌──────────┐ tool_call ┌──────────────┐ │
│ │ Agent │ ──────────► │ Tools │ │
│ │ (LLM) │ ◄────────── │ │ │
│ └──────────┘ tool_result │ python_repl │ │
│ │ │ read_file │ │
│ (loops until LLM │ list_dir │ │
│ calls no more tools) └──────────────┘ │
│ ▼ │
│ final_answer + code_runs[] │
└──────────────────────────────────────────────┘
│
│ ◄── [HITL ON: graph pauses here]
│ Human sees: code + output + plot
│ Human clicks: Approve or Request revision
│
▼
┌─────────────────┐
│ Critic │ 1 LLM call → APPROVE or REVISE + feedback
└────────┬────────┘
│
├── REVISE ──► back to Executor (max 3 attempts)
│ (critic feedback injected as context)
│ *** self-correction loop ***
│
└── APPROVE ──► Summarize → save to history → END
Final answer + plot rendered in UI
Planner — 1 LLM call. Receives the task + last 3 conversation history entries. Returns a JSON array of 3–5 concrete steps. No tools, pure reasoning. Falls back to a default plan if the LLM fails.
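The Planner's "strict JSON array or default plan" behaviour can be sketched roughly as follows. This is a minimal illustration, not the project's code: `parse_plan` and `DEFAULT_PLAN` are hypothetical names, and rejecting arrays outside the 3 to 5 step range (rather than truncating them) is an assumption.

```python
import json

# Hypothetical default plan; the real fallback text is not shown in this README.
DEFAULT_PLAN = [
    "Understand the task and its inputs",
    "Write the Python code",
    "Run the code and check the output",
]


def parse_plan(raw: str) -> list[str]:
    """Return 3-5 plan steps parsed from the LLM reply, or the default plan."""
    try:
        steps = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return list(DEFAULT_PLAN)
    if not isinstance(steps, list):
        return list(DEFAULT_PLAN)
    # Keep only non-empty string steps.
    steps = [s.strip() for s in steps if isinstance(s, str) and s.strip()]
    if not 3 <= len(steps) <= 5:
        return list(DEFAULT_PLAN)
    return steps
```

Any reply that is not a well-formed array of the expected size falls through to the default, so the Executor always receives a usable plan.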
Executor — ReAct loop. The LLM decides which tool to call, reads the output, then either calls another tool or stops and writes the final answer. Each python_repl call runs in a fresh sandboxed subprocess — no state, imports, or variables carry over between calls. Loop ends when the LLM produces a response with no tool calls.
HITL pause (optional) — when Human-in-the-loop toggle is ON, the graph pauses here before the Critic runs. You see the full code, stdout output, and any generated plots. Click Approve to accept, or type feedback and click Request revision to send it back.
Critic — 1 LLM call. Scores the executor's output and returns APPROVE or REVISE with written feedback.
Self-correction loop — if REVISE, the critic's feedback is injected as context into the next Executor run. This repeats up to MAX_FIX_ATTEMPTS times (default: 3). After 3 failed attempts, the best result so far is accepted automatically — the agent never gets stuck forever.
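The routing decision behind this loop might look like the sketch below. The function name `route_after_critic` and the node names it returns are assumptions for illustration; only `verdict`, `fix_attempts`, and `MAX_FIX_ATTEMPTS` come from the text above.

```python
MAX_FIX_ATTEMPTS = 3  # default stated above


def route_after_critic(state: dict) -> str:
    """Pick the next node after the Critic has set a verdict."""
    if state.get("verdict") == "APPROVE":
        return "summarize"
    # REVISE, but the retry budget is spent: accept the best result so far.
    if state.get("fix_attempts", 0) >= MAX_FIX_ATTEMPTS:
        return "summarize"
    return "executor"
```

Because the attempt counter is checked before looping back, the graph terminates even if the Critic never approves.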
Summarize — on APPROVE, a short summary of the task + answer is appended to history using operator.add. The next task's Planner receives this as context, enabling natural follow-up conversations.
CodingState:
task str # user's original request
plan List[str] # planner output
code_runs List[dict] # each {code, output} pair from executor
verdict str # APPROVE | REVISE
critique str # critic's feedback
fix_attempts int # how many REVISE loops so far
final_answer str # executor's written explanation
errors List[str] # any caught exceptions
history List[dict] # operator.add: conversation memory across turns
`history` uses an `operator.add` reducer: each completed task appends a summary. The next task's Planner receives the last 3 entries as context. This enables natural follow-ups:
Task 1: "Implement binary search"
Task 2: "Now add unit tests for it" ← agent remembers task 1
Task 3: "Make it work with strings too" ← agent remembers both
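As a rough illustration of why `operator.add` gives append-only memory: applied to two lists it concatenates them, so a one-item update becomes an append. The helper names and the shape of each history entry below are assumptions, not the project's schema.

```python
import operator


def append_summary(history: list[dict], task: str, answer: str) -> list[dict]:
    """Merge a one-item update into history the way an operator.add reducer would."""
    update = [{"task": task, "summary": answer[:200]}]
    return operator.add(history, update)  # list + list -> concatenation


def planner_context(history: list[dict], last_n: int = 3) -> list[dict]:
    """Return the most recent entries that the Planner receives as context."""
    return history[-last_n:]


history: list[dict] = []
history = append_summary(history, "Implement binary search", "Wrote binary_search()")
history = append_summary(history, "Now add unit tests for it", "Added 5 tests")
```

After the two calls above, `planner_context(history)` returns both entries, which is what lets the second task refer to "it".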
- One LLM call, no tools
- Returns a strict JSON array of 3–5 steps
- Receives last 3 conversation history entries as context
- Failure fallback: returns a sensible default plan
- LangGraph sub-graph: Agent node ↔ ToolNode
- Agent decides which tool to call, reads output, loops until done
- `recursion_limit = 2 × (max_fix_attempts + 3) + 1` prevents infinite loops
- Repetition guard: `_trim_repetitive()` caps answer at 3000 chars and stops at first repeated line
- If critic sends REVISE feedback, it's injected as context into the next executor run
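The two guards in this list can be sketched as below. The recursion-limit formula is taken from the text; `trim_repetitive` is only a guess at what `_trim_repetitive()` does (the real implementation is not shown), assuming that blank lines are allowed to repeat.

```python
def recursion_limit(max_fix_attempts: int) -> int:
    """Recursion limit from the formula above: 2 x (max_fix_attempts + 3) + 1."""
    return 2 * (max_fix_attempts + 3) + 1


def trim_repetitive(text: str, max_chars: int = 3000) -> str:
    """Stop at the first repeated non-blank line, then cap the length."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            break  # the model has started repeating itself
        if key:
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)[:max_chars]
```

With the default `MAX_FIX_ATTEMPTS=3`, the formula gives a limit of 13.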
- Receives: task + plan + all code runs + outputs
- Returns: `{"verdict": "APPROVE"|"REVISE", "feedback": "..."}`
- Failure fallback: auto-APPROVE if code ran with output, REVISE if no code ran
- Special rule: plot/chart tasks approved when output contains a saved file path
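A minimal sketch of the Critic's parse-or-fallback logic, under stated assumptions: `parse_verdict` is a hypothetical name, the fallback feedback strings are invented, and treating "code ran but printed nothing" as REVISE is a guess, since the text only covers the two cases listed above.

```python
import json


def parse_verdict(raw: str, code_runs: list[dict]) -> dict:
    """Parse the critic's JSON reply; fall back to a rule if it is unusable."""
    try:
        data = json.loads(raw)
        verdict = str(data["verdict"]).upper()
        if verdict in {"APPROVE", "REVISE"}:
            return {"verdict": verdict, "feedback": str(data.get("feedback", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        pass  # malformed reply: use the rule-based fallback below

    ran_with_output = any(str(run.get("output") or "").strip() for run in code_runs)
    if ran_with_output:
        return {"verdict": "APPROVE", "feedback": "Critic unavailable; code ran with output."}
    return {"verdict": "REVISE", "feedback": "Critic unavailable; no usable code run."}
```

The fallback means a malformed critic reply degrades to a simple rule instead of raising inside the graph.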
- `interrupt_before=["critic"]`: graph suspends before critic runs
- UI shows: code, stdout, generated plots
- Human clicks Approve → injects `verdict=APPROVE` via `update_state(as_node="critic")`
- Human types feedback + clicks Request revision → injects `verdict=REVISE` + feedback
- Graph resumes from critic node with `graph.invoke(None, config)`
- Each run: fresh `subprocess` + isolated `tempfile` working directory
- `python -I` flag: ignores user site-packages and `PYTHONSTARTUP`
- Env vars injected: `MPLBACKEND=Agg`, `MPLCONFIGDIR`, `HOME`, `USERPROFILE`, `SANDBOX_OUTPUT_DIR`
- POSIX only: `RLIMIT_CPU=20s`, `RLIMIT_AS=512MB` via `resource.setrlimit`
- Timeout: configurable via `EXEC_TIMEOUT` (default 15s); process killed on overrun
- Output capped at 10,000 chars to prevent memory issues
- SHA-256 result cache (128-entry FIFO): avoids re-running identical code
- Plot rescue: copies `*.png/jpg/svg` from tmpdir to `outputs/plots/` before cleanup
- Temp directory deleted in `finally` block, so no files are left over
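The core of such a sandbox (fresh subprocess, isolated temp directory, `-I` flag, timeout, output cap, cleanup in `finally`) can be sketched with the standard library alone. `run_sandboxed` and its return shape are assumptions for illustration; the env-var injection, POSIX limits, result cache, and plot rescue from the list above are left out to keep the sketch short.

```python
import shutil
import subprocess
import sys
import tempfile

MAX_OUTPUT_CHARS = 10_000  # output cap stated above


def run_sandboxed(code: str, timeout: float = 15.0) -> dict:
    """Run code in a fresh interpreter inside its own temporary directory."""
    workdir = tempfile.mkdtemp(prefix="sandbox_")
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
        output = (proc.stdout + proc.stderr)[:MAX_OUTPUT_CHARS]
        return {"ok": proc.returncode == 0, "output": output}
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": f"Timed out after {timeout}s"}
    finally:
        # Always remove the working directory, even on timeout or error.
        shutil.rmtree(workdir, ignore_errors=True)
```

Since every call starts a new interpreter, running `x = 1` and then `print(x)` in two separate calls fails with a `NameError` in the second, which is the "no state carries over" behaviour described earlier.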
| Feature | Implementation |
|---|---|
| Process isolation | Each code run = separate subprocess, killed after timeout |
| Isolated filesystem | tempfile.mkdtemp — each run gets its own temp dir, deleted after |
| Isolated imports | python -I flag ignores user site-packages |
| Memory cap (POSIX) | RLIMIT_AS = 512 MB via resource.setrlimit |
| CPU cap (POSIX) | RLIMIT_CPU = 20s hard limit |
| Output cap | Stdout/stderr capped at 10,000 chars |
| API authentication | X-API-Key header (checked against API_KEY) required on all API endpoints |
| Production key check | ENVIRONMENT=production with API_KEY=dev-key raises ConfigError at startup |
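The production key check in the last row could be implemented along these lines. `check_api_key` is a hypothetical helper and the error messages are invented; `ConfigError`, `ENVIRONMENT`, `API_KEY`, and `dev-key` are the names used in this README. Rejecting an empty key is an additional assumption.

```python
class ConfigError(Exception):
    """Raised at startup when the environment is misconfigured."""


def check_api_key(env: dict) -> str:
    """Validate API_KEY against ENVIRONMENT and return the key."""
    environment = env.get("ENVIRONMENT", "development")
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise ConfigError("API_KEY is not set")
    if environment == "production" and api_key == "dev-key":
        raise ConfigError("API_KEY=dev-key is not allowed when ENVIRONMENT=production")
    return api_key
```

In a real application the argument would typically be `os.environ`; taking a plain mapping keeps the check easy to test.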
| Feature | Implementation |
|---|---|
| Auto self-correction | Critic → Executor loop, up to MAX_FIX_ATTEMPTS (default 3) |
| Critic fallback | Auto-approves if code ran, auto-revises if no code ran — never crashes |
| Redis fallback | _REDIS_UNAVAILABLE sentinel — fails once, switches to in-memory, no retry storm |
| LLM retry + backoff | max_retries=3 on all providers, Gemini has exponential backoff wrapper |
| Repetition guard | _trim_repetitive() — prevents wall-of-text from small models |
| `max_tokens=2048` | Hard cap on all providers; prevents runaway generation |
| Recursion limit | 2 × (max_fix_attempts + 3) + 1 — prevents infinite ReAct loops |
| Config validation | get_settings() raises ConfigError with clear message on missing/invalid env vars |
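The "fails once, switches to in-memory" behaviour from the Redis fallback row can be illustrated with a sentinel object. This is a sketch under assumptions: the `JobStore` class, its `connect` callable, and the `connect_attempts` counter are hypothetical, and no real Redis client is used; only the `_REDIS_UNAVAILABLE` name comes from the table.

```python
_REDIS_UNAVAILABLE = object()  # sentinel: "we tried once and it failed"


class JobStore:
    """Key-value store that falls back to a dict after one failed connect."""

    def __init__(self, connect):
        self._connect = connect      # callable returning a client with get/set
        self._client = None          # None means "not tried yet"
        self._memory: dict = {}
        self.connect_attempts = 0

    def _backend(self):
        if self._client is None:
            self.connect_attempts += 1
            try:
                self._client = self._connect()
            except Exception:
                # Remember the failure so later calls do not retry.
                self._client = _REDIS_UNAVAILABLE
        return self._client

    def set(self, key, value):
        backend = self._backend()
        if backend is _REDIS_UNAVAILABLE:
            self._memory[key] = value
        else:
            backend.set(key, value)

    def get(self, key):
        backend = self._backend()
        if backend is _REDIS_UNAVAILABLE:
            return self._memory.get(key)
        return backend.get(key)
```

Distinguishing "not tried yet" (`None`) from "tried and failed" (the sentinel) is what prevents a reconnect attempt on every request.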
| Feature | Implementation |
|---|---|
| LangSmith tracing | Auto-wired via os.environ when LANGCHAIN_API_KEY set — traces every node |
| Token cost tracking | CostTracker(BaseCallbackHandler) on every LLM call — logs to costs.jsonl |
| Agent progress streaming | stream_mode="values" — UI updates after each node, not just at end |
| Session history | InMemorySaver checkpointer — full state replay per thread_id |
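Per-call cost logging to a JSONL file can be sketched without LangChain, as below. This is not the project's `CostTracker`: `log_call`, the record fields, and the `usd_per_1k_tokens` parameter are all assumptions, and a real callback handler would pull token counts from the LLM response instead of taking them as arguments.

```python
import json
import time
from pathlib import Path


def log_call(path: Path, model: str, prompt_tokens: int,
             completion_tokens: int, usd_per_1k_tokens: float = 0.0) -> dict:
    """Append one JSON record per LLM call and return it."""
    total = prompt_tokens + completion_tokens
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": total,
        # Price is supplied by the caller; no real provider pricing is assumed.
        "cost_usd": round(total / 1000 * usd_per_1k_tokens, 6),
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

One JSON object per line keeps the log append-only and easy to aggregate later.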
| Feature | Implementation |
|---|---|
| REST API | FastAPI + Uvicorn — submit jobs, poll results, HITL review |
| Async job queue | Celery with Redis broker — decouples submission from execution |
| Paginated job listing | GET /jobs?limit=20&offset=0 — handles large job histories |
| Thread-safe graph cache | @lru_cache(maxsize=1) — graph compiled once, reused across requests |
| Multi-provider LLM | Switch with one env var — no vendor lock-in |
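The "compiled once, reused across requests" graph cache relies on `functools.lru_cache`. The sketch below uses a plain dict as a stand-in for the compiled LangGraph graph, and `get_graph` and `BUILD_COUNT` are hypothetical names.

```python
from functools import lru_cache

BUILD_COUNT = 0  # only here to show how often the builder actually runs


@lru_cache(maxsize=1)
def get_graph() -> dict:
    """Build the (stand-in) graph on first call; return the cached one afterwards."""
    global BUILD_COUNT
    BUILD_COUNT += 1
    # Placeholder for an expensive StateGraph build-and-compile step.
    return {"nodes": ["planner", "executor", "critic", "summarize"]}


first = get_graph()
second = get_graph()
```

Here `first is second` holds and the builder ran once, so each request handler can call `get_graph()` freely.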
git clone https://github.com/YOUR_USERNAME/coding-agent.git
cd coding-agent
pip install -r requirements.txt
cp .env.example .env
Minimum required:
GROQ_API_KEY=your_key_here # free at console.groq.com
.\run.ps1 # Streamlit UI → http://localhost:8501
.\run.ps1 api # FastAPI → http://localhost:8000/docs
.\run.ps1 test # pytest (52 tests)
.\run.ps1 cli "implement binary search" # one-shot CLI
# ── LLM Provider ──────────────────────────────────────────────────────────────
LLM_PROVIDER=groq # groq | openai | anthropic | deepseek | gemini
# Groq — free at console.groq.com
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
# Google Gemini — free at aistudio.google.com (1500 req/day)
# LLM_PROVIDER=gemini
# GEMINI_API_KEY=AIza...
# GEMINI_MODEL=gemini-2.0-flash
# OpenAI
# LLM_PROVIDER=openai
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o
# Anthropic
# LLM_PROVIDER=anthropic
# ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_MODEL=claude-opus-4-8
# DeepSeek — OpenAI-compatible API
# LLM_PROVIDER=deepseek
# DEEPSEEK_API_KEY=sk-...
# DEEPSEEK_MODEL=deepseek-chat
# ── Agent Behaviour ───────────────────────────────────────────────────────────
MAX_FIX_ATTEMPTS=3 # max critic→executor loops before accepting best attempt
EXEC_TIMEOUT=15 # sandbox subprocess wall-clock timeout (seconds)
# ── API ───────────────────────────────────────────────────────────────────────
ENVIRONMENT=development # set to "production" to enforce strong API_KEY
API_KEY=dev-key
RATE_LIMIT=10/minute
# ── Redis (optional) ──────────────────────────────────────────────────────────
# REDIS_URL=redis://localhost:6379/0 # enables Celery queue + distributed rate limit
# ── LangSmith Tracing (optional) ─────────────────────────────────────────────
# LANGCHAIN_API_KEY=ls__... # set this to enable full pipeline tracing
coding-agent/
├── app.py # Streamlit UI — streaming, HITL, plots, cost display
├── api.py # FastAPI — job CRUD, pagination, HITL review endpoint
├── cli.py # CLI — interactive + one-shot modes
├── run.ps1 # One-click launcher
├── requirements.txt
├── .env.example
│
├── src/
│ ├── agents/
│ │ ├── planner.py # LLM → JSON plan, history context
│ │ ├── executor.py # ReAct loop, tool calls, _trim_repetitive()
│ │ └── critic.py # LLM-as-judge: APPROVE/REVISE, fallback logic
│ ├── graph/
│ │ └── workflow.py # LangGraph StateGraph, HITL wiring, resume_task()
│ ├── tools/
│ │ ├── sandbox.py # Subprocess sandbox, cache, plot rescue, POSIX limits
│ │ ├── repl.py # python_repl LangChain tool
│ │ └── file_tools.py # read_file + list_dir tools
│ ├── llm.py # 5-provider LLM factory, Gemini backoff wrapper
│ ├── config.py # Settings dataclass, dotenv, production key check
│ ├── costs.py # CostTracker callback, costs.jsonl log
│ ├── streaming.py # Token streaming via astream_events v2
│ ├── job_store.py # Redis + in-memory fallback, pagination
│ ├── ratelimit.py # Configurable rate limiting middleware
│ ├── tasks.py # Celery task definitions
│ ├── worker.py # Celery worker entry point
│ └── prompts.py # System prompts for all three agents
│
├── evals/
│ └── llm_judge.py # LLM-as-judge scoring 0–3
│
├── tests/ # 52 pytest tests
│
├── scripts/
│ ├── demo.py # 5 master-prompt test cases
│ └── run_eval.py # End-to-end eval runner
│
├── docs/
│ └── screenshots/ # Demo screenshots
│
└── outputs/
├── plots/ # Rescued matplotlib plots
└── transcripts/ # Agent run transcripts
# Submit job
POST /jobs
{"task": "implement binary search"}
# List jobs (paginated)
GET /jobs?limit=20&offset=0
# Get single job
GET /jobs/{job_id}
# HITL review
POST /jobs/{job_id}/review
{"action": "approve"}
{"action": "revise", "feedback": "add error handling"}
All endpoints require an `X-API-Key: your_key` header.
Interactive docs → http://localhost:8000/docs
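A client for these endpoints needs little more than a JSON body and the `X-API-Key` header. The sketch below only builds the requests with the standard library and does not send them; `build_request` is a hypothetical helper, `job_id` is a placeholder, and the response schema is not documented here, so nothing about it is assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_request(method: str, path: str, api_key: str, payload=None):
    """Build an authenticated request for the job API (not sent here)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"X-API-Key": api_key}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


submit = build_request("POST", "/jobs", "dev-key", {"task": "implement binary search"})
listing = build_request("GET", "/jobs?limit=20&offset=0", "dev-key")
job_id = "JOB_ID"  # placeholder: use the id returned by the submit call
review = build_request("POST", f"/jobs/{job_id}/review", "dev-key",
                       {"action": "revise", "feedback": "add error handling"})

# To send one (requires the API to be running):
# with urllib.request.urlopen(submit) as resp:
#     print(resp.read().decode())
```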
| Task | Demonstrates |
|---|---|
| `Implement K-means from scratch, 150 points, 3 clusters, plot with centroids X` | ML + plot + HITL |
| `Generate 100 random numbers and plot a histogram` | Matplotlib pipeline |
| `Implement binary search and test on 20 numbers` | Algorithm + self-test |
| `Fix this: def divide(a,b): return a/b, then print(divide(10,0))` | Bug fix loop |
| `Write a prime checker` | Basic task |
| (follow-up) `Now make it handle floats` | Conversation memory |
| Layer | Technology | Purpose |
|---|---|---|
| Agent framework | LangGraph 0.2+ | StateGraph, HITL, checkpointing, streaming |
| LLM | Groq / Gemini / OpenAI / Anthropic / DeepSeek | Pluggable via factory |
| UI | Streamlit | Streaming UI, HITL controls, plot rendering |
| API | FastAPI + Uvicorn | REST endpoints, auth, rate limiting |
| Job store | Redis + in-memory fallback | Distributed job tracking |
| Code execution | Python subprocess | Sandboxed, isolated, timeout-enforced |
| Resource limits | POSIX `resource.setrlimit` | CPU + memory cap on Linux/Mac |
| Plot rendering | Matplotlib Agg + PIL | Headless, RGBA fix, base64 HTML |
| Async queue | Celery + Redis | Decoupled task execution |
| Observability | LangSmith | Full pipeline tracing |
| Cost tracking | LangChain callbacks | Per-call token logging |
| Evals | LLM-as-judge (0–3) | Automated quality scoring |
| Tests | pytest | 52 tests |
MIT






