Coding Assistant Agent

A production-grade multi-agent AI system that takes a plain-English coding task, writes Python code, runs it in a sandbox, self-critiques the result, fixes errors automatically, and streams the answer back — with optional human-in-the-loop review at every step.

Built with LangGraph · Groq · Gemini · DeepSeek · OpenAI · Anthropic · FastAPI · Streamlit.

Why this — we already have Copilot / Cursor?

Capability	Copilot / Cursor	This project
Suggests code	✅	✅
Actually runs it	❌	✅
Self-corrects on failure	❌	✅
Human approves before deploy	❌	✅ optional (HITL toggle — on or off)
Runs on private infra	❌	✅
Customizable agents	❌	✅
REST API for integration	❌	✅
Cost tracking per call	❌	✅

The gap is execution + self-correction + control. That's what enterprises pay for.

Banks, healthcare, and data teams can't send sensitive code or data to external tools. They build this architecture internally — planner → executor → critic → human review — on their own infrastructure. This project demonstrates exactly that pattern.

Live Demo — K-Means Clustering (full HITL flow)

Step 1 — Enter the task

Step 2 — Planner creates steps, Executor writes and runs the code

Step 3 — Full K-means implementation visible in the code panel

Step 4 — Plot generated automatically inside the sandboxed subprocess

Step 5 — HITL: human inspects code + plot before approving

Step 6 — Approve — agent completes, final answer streamed

Step 7 — Final cluster plot: 3 clusters, centroids marked X

Architecture

User input (plain English task)
          │
          ▼
┌─────────────────┐
│    Planner      │  1 LLM call → JSON array of 3-5 steps
└────────┬────────┘
         │ plan[]
         ▼
┌──────────────────────────────────────────────┐
│          Executor  — ReAct loop              │
│                                              │
│  ┌──────────┐  tool_call   ┌──────────────┐  │
│  │  Agent   │ ──────────►  │    Tools     │  │
│  │  (LLM)   │ ◄──────────  │              │  │
│  └──────────┘ tool_result  │ python_repl  │  │
│       │                    │ read_file    │  │
│  (loops until LLM          │ list_dir     │  │
│   calls no more tools)     └──────────────┘  │
│       ▼                                      │
│  final_answer + code_runs[]                  │
└──────────────────────────────────────────────┘
         │
         │  ◄── [HITL ON: graph pauses here]
         │       Human sees: code + output + plot
         │       Human clicks: Approve or Request revision
         │
         ▼
┌─────────────────┐
│     Critic      │  1 LLM call → APPROVE or REVISE + feedback
└────────┬────────┘
         │
         ├── REVISE ──► back to Executor  (max 3 attempts)
         │              (critic feedback injected as context)
         │              *** self-correction loop ***
         │
         └── APPROVE ──► Summarize → save to history → END
                         Final answer + plot rendered in UI

Planner — 1 LLM call. Receives the task + last 3 conversation history entries. Returns a JSON array of 3–5 concrete steps. No tools, pure reasoning. Falls back to a default plan if the LLM fails.

Executor — ReAct loop. The LLM decides which tool to call, reads the output, then either calls another tool or stops and writes the final answer. Each python_repl call runs in a fresh sandboxed subprocess — no state, imports, or variables carry over between calls. Loop ends when the LLM produces a response with no tool calls.

HITL pause (optional) — when the Human-in-the-loop toggle is ON, the graph pauses before the Critic runs. You see the full code, the stdout output, and any generated plots. Click Approve to accept, or type feedback and click Request revision to send the work back to the Executor.

Critic — 1 LLM call. Scores the executor's output and returns APPROVE or REVISE with written feedback.

Self-correction loop — if REVISE, the critic's feedback is injected as context into the next Executor run. This repeats up to MAX_FIX_ATTEMPTS times (default: 3). Once that limit is reached, the best result so far is accepted automatically, so the agent never loops forever.
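The routing decision described above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the function name `route_after_critic` and the node names `"executor"` and `"summarize"` are assumptions chosen to match the diagram.

```python
MAX_FIX_ATTEMPTS = 3  # mirrors the documented default


def route_after_critic(state: dict) -> str:
    """Pick the next node from the critic's verdict (hypothetical node names)."""
    if state.get("verdict") == "APPROVE":
        return "summarize"
    if state.get("fix_attempts", 0) >= MAX_FIX_ATTEMPTS:
        # Attempt budget exhausted: accept the best result so far.
        return "summarize"
    # REVISE with budget remaining: send the work back to the executor.
    return "executor"
```

In a LangGraph workflow a function like this would typically be used as the condition of a conditional edge leaving the critic node.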

Summarize — on APPROVE, a short summary of the task + answer is appended to history using operator.add. The next task's Planner receives this as context, enabling natural follow-up conversations.

State machine (LangGraph `StateGraph`)

CodingState:
  task          str           # user's original request
  plan          List[str]     # planner output
  code_runs     List[dict]    # each {code, output} pair from executor
  verdict       str           # APPROVE | REVISE
  critique      str           # critic's feedback
  fix_attempts  int           # how many REVISE loops so far
  final_answer  str           # executor's written explanation
  errors        List[str]     # any caught exceptions
  history       List[dict]    # operator.add — conversation memory across turns
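As a rough sketch, the schema above could be declared as a `TypedDict`, with the reducer attached to `history` through `Annotated`, which is the usual LangGraph convention. The field names come from the listing above; the exact declaration in `workflow.py` may differ.

```python
import operator
from typing import Annotated, List, TypedDict


class CodingState(TypedDict):
    task: str               # user's original request
    plan: List[str]         # planner output
    code_runs: List[dict]   # each {code, output} pair from the executor
    verdict: str            # "APPROVE" or "REVISE"
    critique: str           # critic's feedback
    fix_attempts: int       # how many REVISE loops so far
    final_answer: str       # executor's written explanation
    errors: List[str]       # any caught exceptions
    # The reducer tells the graph to append updates instead of overwriting.
    history: Annotated[List[dict], operator.add]
```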

Conversation memory

history uses an operator.add reducer: each completed task appends a summary. The next task's Planner receives the last 3 entries as context, which enables natural follow-ups:

Task 1: "Implement binary search"
Task 2: "Now add unit tests for it"      ← agent remembers task 1
Task 3: "Make it work with strings too"  ← agent remembers both
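The effect of the reducer and the three-entry window can be shown in plain Python. The helper names `merge_history` and `planner_context` and the shape of each history entry are assumptions made for this illustration.

```python
import operator


def merge_history(existing: list, update: list) -> list:
    """What an operator.add reducer does: concatenate, never overwrite."""
    return operator.add(existing, update)


def planner_context(history: list, n: int = 3) -> list:
    """Return the most recent n summaries for the planner prompt."""
    return history[-n:]


history = []
for task in ["binary search", "unit tests", "string support", "benchmarks"]:
    # Hypothetical entry shape; the real summaries may carry more fields.
    history = merge_history(history, [{"task": task}])

recent = planner_context(history)
```

After four tasks, `recent` holds only the last three summaries, so the oldest task has dropped out of the planner's context.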

How Each Component Works

Planner

One LLM call, no tools
Returns a strict JSON array of 3–5 steps
Receives last 3 conversation history entries as context
Failure fallback: returns a sensible default plan
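A minimal sketch of the parse-or-fall-back behaviour listed above. The function name `parse_plan`, the contents of `DEFAULT_PLAN`, and the choice to fall back when the step count is outside 3 to 5 are all assumptions; the real planner may handle these cases differently.

```python
import json
from typing import List

# Hypothetical default plan; the project's actual fallback text is not shown.
DEFAULT_PLAN = [
    "Understand the task and its inputs",
    "Write the Python code",
    "Run the code and check the output",
]


def parse_plan(raw: str) -> List[str]:
    """Extract a JSON array of 3-5 steps from an LLM reply, else fall back."""
    try:
        # Tolerate chatter around the array by slicing from '[' to the last ']'.
        start, end = raw.index("["), raw.rindex("]") + 1
        steps = json.loads(raw[start:end])
    except ValueError:  # covers a missing bracket and json.JSONDecodeError
        return list(DEFAULT_PLAN)
    if not isinstance(steps, list):
        return list(DEFAULT_PLAN)
    steps = [s.strip() for s in steps if isinstance(s, str) and s.strip()]
    if not 3 <= len(steps) <= 5:
        return list(DEFAULT_PLAN)
    return steps
```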

Executor (ReAct loop)

LangGraph sub-graph: Agent node ↔ ToolNode
Agent decides which tool to call, reads output, loops until done
recursion_limit = 2 × (max_fix_attempts + 3) + 1 prevents infinite loops
Repetition guard: _trim_repetitive() caps answer at 3000 chars and stops at first repeated line
If critic sends REVISE feedback, it's injected as context into the next executor run
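The two guards above can be approximated as follows. The recursion-limit formula is taken from the list; `trim_repetitive` is a hypothetical stand-in for `_trim_repetitive()`, and details such as ignoring blank lines when detecting repeats are assumptions.

```python
def recursion_limit(max_fix_attempts: int) -> int:
    """Documented formula: 2 x (max_fix_attempts + 3) + 1."""
    return 2 * (max_fix_attempts + 3) + 1


def trim_repetitive(text: str, max_chars: int = 3000) -> str:
    """Stop at the first repeated non-blank line, then cap the length."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            break  # the model has started repeating itself
        if key:
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)[:max_chars]
```

With the default `MAX_FIX_ATTEMPTS=3`, the formula gives a limit of 13 graph steps.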

Critic (LLM-as-judge)

Receives: task + plan + all code runs + outputs
Returns: {"verdict": "APPROVE"|"REVISE", "feedback": "..."}
Failure fallback: auto-APPROVE if code ran with output, REVISE if no code ran
Special rule: plot/chart tasks approved when output contains a saved file path
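One way the verdict parsing and failure fallback could look, as a sketch only. `parse_verdict` and the fallback feedback strings are hypothetical, and treating "code ran but printed nothing" as REVISE is an assumption, since the list above does not cover that case.

```python
import json
from typing import List


def parse_verdict(raw: str, code_runs: List[dict]) -> dict:
    """Parse the critic's JSON reply; fall back to a rule if it is unusable."""
    try:
        data = json.loads(raw)
        verdict = str(data["verdict"]).upper()
        if verdict in ("APPROVE", "REVISE"):
            return {"verdict": verdict, "feedback": str(data.get("feedback", ""))}
    except (ValueError, KeyError, TypeError, AttributeError):
        pass  # malformed reply: use the rule-based fallback below
    if any(run.get("output") for run in code_runs):
        return {"verdict": "APPROVE", "feedback": "fallback: code ran with output"}
    return {"verdict": "REVISE", "feedback": "fallback: no code output to judge"}
```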

HITL (Human-in-the-loop)

interrupt_before=["critic"] — graph suspends before critic runs
UI shows: code, stdout, generated plots
Human clicks Approve → injects verdict=APPROVE via update_state(as_node="critic")
Human types feedback + clicks Request revision → injects verdict=REVISE + feedback
Graph resumes from critic node with graph.invoke(None, config)

Sandbox (Code execution)

Each run: fresh subprocess + isolated tempfile working directory
python -I flag (isolated mode): ignores all PYTHON* environment variables, including PYTHONSTARTUP, and the user site-packages directory
Env vars injected: MPLBACKEND=Agg, MPLCONFIGDIR, HOME, USERPROFILE, SANDBOX_OUTPUT_DIR
POSIX only: RLIMIT_CPU=20s, RLIMIT_AS=512MB via resource.setrlimit
Timeout: configurable via EXEC_TIMEOUT (default 15s) — process killed on overrun
Output capped at 10,000 chars to prevent memory issues
SHA-256 result cache (128-entry FIFO) — avoids re-running identical code
Plot rescue: copies *.png/jpg/svg from tmpdir to outputs/plots/ before cleanup
Temp directory deleted in finally block — no leftover files

Production-Ready Features

Security

Feature	Implementation
Process isolation	Each code run = separate subprocess, killed after timeout
Isolated filesystem	`tempfile.mkdtemp` — each run gets its own temp dir, deleted after
Isolated imports	`python -I` flag ignores user site-packages
Memory cap (POSIX)	`RLIMIT_AS = 512 MB` via `resource.setrlimit`
CPU cap (POSIX)	`RLIMIT_CPU = 20s` hard limit
Output cap	Stdout/stderr capped at 10,000 chars
API authentication	`X-API-Key` header required on all API endpoints, checked against the configured `API_KEY`
Production key check	`ENVIRONMENT=production` with `API_KEY=dev-key` raises `ConfigError` at startup
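The production key check in the last row might look roughly like this. `ConfigError` is named in the table; the function `check_api_key`, its signature, and the error message are assumptions, and the project's `get_settings()` presumably validates more than this one rule.

```python
import os


class ConfigError(Exception):
    """Raised at startup when the configuration is unsafe or invalid."""


def check_api_key(env=None) -> str:
    """Reject the development key when ENVIRONMENT=production."""
    env = os.environ if env is None else env
    environment = env.get("ENVIRONMENT", "development")
    api_key = env.get("API_KEY", "dev-key")
    if environment == "production" and api_key == "dev-key":
        raise ConfigError("API_KEY must be changed from 'dev-key' in production")
    return api_key
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.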

Reliability

Feature	Implementation
Auto self-correction	Critic → Executor loop, up to `MAX_FIX_ATTEMPTS` (default 3)
Critic fallback	Auto-approves if code ran, auto-revises if no code ran — never crashes
Redis fallback	`_REDIS_UNAVAILABLE` sentinel — fails once, switches to in-memory, no retry storm
LLM retry + backoff	`max_retries=3` on all providers; Gemini also has an exponential backoff wrapper
Repetition guard	`_trim_repetitive()` — prevents wall-of-text from small models
`max_tokens=2048`	Hard cap on all providers — prevents runaway generation
Recursion limit	`2 × (max_fix_attempts + 3) + 1` — prevents infinite ReAct loops
Config validation	`get_settings()` raises `ConfigError` with clear message on missing/invalid env vars
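For the retry row, an exponential backoff wrapper can be sketched generically. This is not the project's Gemini wrapper: `with_backoff`, its parameters, and the doubling schedule starting at one second are assumptions for illustration.

```python
import time


def with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so that tests can record the delays instead of actually waiting.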

Observability

Feature	Implementation
LangSmith tracing	Auto-wired via `os.environ` when `LANGCHAIN_API_KEY` set — traces every node
Token cost tracking	`CostTracker(BaseCallbackHandler)` on every LLM call — logs to `costs.jsonl`
Agent progress streaming	`stream_mode="values"` — UI updates after each node, not just at end
Session history	`InMemorySaver` checkpointer — full state replay per `thread_id`

Scalability

Feature	Implementation
REST API	FastAPI + Uvicorn — submit jobs, poll results, HITL review
Async job queue	Celery with Redis broker — decouples submission from execution
Paginated job listing	`GET /jobs?limit=20&offset=0` — handles large job histories
Thread-safe graph cache	`@lru_cache(maxsize=1)` — graph compiled once, reused across requests
Multi-provider LLM	Switch with one env var — no vendor lock-in

Quick Start

1. Clone and install

git clone https://github.com/YOUR_USERNAME/coding-agent.git
cd coding-agent
pip install -r requirements.txt

2. Configure

cp .env.example .env

Minimum required:

GROQ_API_KEY=your_key_here   # free at console.groq.com

3. Run

.\run.ps1                              # Streamlit UI  → http://localhost:8501
.\run.ps1 api                          # FastAPI       → http://localhost:8000/docs
.\run.ps1 test                         # pytest (52 tests)
.\run.ps1 cli "implement binary search"  # one-shot CLI

Configuration

# ── LLM Provider ──────────────────────────────────────────────────────────────
LLM_PROVIDER=groq          # groq | openai | anthropic | deepseek | gemini

# Groq — free at console.groq.com
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile

# Google Gemini — free at aistudio.google.com (1500 req/day)
# LLM_PROVIDER=gemini
# GEMINI_API_KEY=AIza...
# GEMINI_MODEL=gemini-2.0-flash

# OpenAI
# LLM_PROVIDER=openai
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o

# Anthropic
# LLM_PROVIDER=anthropic
# ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_MODEL=claude-opus-4-8

# DeepSeek — OpenAI-compatible API
# LLM_PROVIDER=deepseek
# DEEPSEEK_API_KEY=sk-...
# DEEPSEEK_MODEL=deepseek-chat

# ── Agent Behaviour ───────────────────────────────────────────────────────────
MAX_FIX_ATTEMPTS=3    # max critic→executor loops before accepting best attempt
EXEC_TIMEOUT=15       # sandbox subprocess wall-clock timeout (seconds)

# ── API ───────────────────────────────────────────────────────────────────────
ENVIRONMENT=development   # set to "production" to enforce strong API_KEY
API_KEY=dev-key
RATE_LIMIT=10/minute

# ── Redis (optional) ──────────────────────────────────────────────────────────
# REDIS_URL=redis://localhost:6379/0   # enables Celery queue + distributed rate limit

# ── LangSmith Tracing (optional) ─────────────────────────────────────────────
# LANGCHAIN_API_KEY=ls__...   # set this to enable full pipeline tracing

Project Structure

coding-agent/
├── app.py                    # Streamlit UI — streaming, HITL, plots, cost display
├── api.py                    # FastAPI — job CRUD, pagination, HITL review endpoint
├── cli.py                    # CLI — interactive + one-shot modes
├── run.ps1                   # One-click launcher
├── requirements.txt
├── .env.example
│
├── src/
│   ├── agents/
│   │   ├── planner.py        # LLM → JSON plan, history context
│   │   ├── executor.py       # ReAct loop, tool calls, _trim_repetitive()
│   │   └── critic.py         # LLM-as-judge: APPROVE/REVISE, fallback logic
│   ├── graph/
│   │   └── workflow.py       # LangGraph StateGraph, HITL wiring, resume_task()
│   ├── tools/
│   │   ├── sandbox.py        # Subprocess sandbox, cache, plot rescue, POSIX limits
│   │   ├── repl.py           # python_repl LangChain tool
│   │   └── file_tools.py     # read_file + list_dir tools
│   ├── llm.py                # 5-provider LLM factory, Gemini backoff wrapper
│   ├── config.py             # Settings dataclass, dotenv, production key check
│   ├── costs.py              # CostTracker callback, costs.jsonl log
│   ├── streaming.py          # Token streaming via astream_events v2
│   ├── job_store.py          # Redis + in-memory fallback, pagination
│   ├── ratelimit.py          # Configurable rate limiting middleware
│   ├── tasks.py              # Celery task definitions
│   ├── worker.py             # Celery worker entry point
│   └── prompts.py            # System prompts for all three agents
│
├── evals/
│   └── llm_judge.py          # LLM-as-judge scoring 0–3
│
├── tests/                    # 52 pytest tests
│
├── scripts/
│   ├── demo.py               # 5 master-prompt test cases
│   └── run_eval.py           # End-to-end eval runner
│
├── docs/
│   └── screenshots/          # Demo screenshots
│
└── outputs/
    ├── plots/                # Rescued matplotlib plots
    └── transcripts/          # Agent run transcripts

REST API

# Submit job
POST   /jobs
       {"task": "implement binary search"}

# List jobs (paginated)
GET    /jobs?limit=20&offset=0

# Get single job
GET    /jobs/{job_id}

# HITL review
POST   /jobs/{job_id}/review
       {"action": "approve"}
       {"action": "revise", "feedback": "add error handling"}

All endpoints require X-API-Key: your_key header. Interactive docs → http://localhost:8000/docs
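A small client sketch for the endpoints above, using only the standard library. The paths, payloads, and `X-API-Key` header are taken from this section; the helper names and `BASE_URL` are assumptions, and the response format is not shown because it is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local development server


def _post(path, payload, api_key):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def build_submit_request(task, api_key):
    """POST /jobs with the plain-English task."""
    return _post("/jobs", {"task": task}, api_key)


def build_review_request(job_id, api_key, action, feedback=None):
    """POST /jobs/{job_id}/review with approve or revise."""
    payload = {"action": action}
    if feedback is not None:
        payload["feedback"] = feedback
    return _post(f"/jobs/{job_id}/review", payload, api_key)
```

With the API running, a request built this way can be sent with `urllib.request.urlopen(req)`.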

Example Tasks

Task	Demonstrates
`Implement K-means from scratch, 150 points, 3 clusters, plot with centroids X`	ML + plot + HITL
`Generate 100 random numbers and plot a histogram`	Matplotlib pipeline
`Implement binary search and test on 20 numbers`	Algorithm + self-test
`Fix this: def divide(a,b): return a/b — print(divide(10,0))`	Bug fix loop
`Write a prime checker`	Basic task
(follow-up) `Now make it handle floats`	Conversation memory

Tech Stack

Layer	Technology	Purpose
Agent framework	LangGraph 0.2+	StateGraph, HITL, checkpointing, streaming
LLM	Groq / Gemini / OpenAI / Anthropic / DeepSeek	Pluggable via factory
UI	Streamlit	Streaming UI, HITL controls, plot rendering
API	FastAPI + Uvicorn	REST endpoints, auth, rate limiting
Job store	Redis + in-memory fallback	Distributed job tracking
Code execution	Python subprocess	Sandboxed, isolated, timeout-enforced
Resource limits	POSIX `resource.setrlimit`	CPU + memory cap on Linux/Mac
Plot rendering	Matplotlib Agg + PIL	Headless, RGBA fix, base64 HTML
Async queue	Celery + Redis	Decoupled task execution
Observability	LangSmith	Full pipeline tracing
Cost tracking	LangChain callbacks	Per-call token logging
Evals	LLM-as-judge (0–3)	Automated quality scoring
Tests	pytest	52 tests

License

MIT
