dharavathramdas101/coding-agent

Coding Assistant Agent

A production-grade multi-agent AI system that takes a plain-English coding task, writes Python code, runs it in a sandbox, self-critiques the result, fixes errors automatically, and streams the answer back, with optional human-in-the-loop review before the result is accepted.

Built with LangGraph · Groq · Gemini · DeepSeek · OpenAI · Anthropic · FastAPI · Streamlit.


Why this when we already have Copilot / Cursor?

Copilot / Cursor This project
Suggests code
Actually runs it
Self-corrects on failure
Human approves before deploy ✅ optional (HITL toggle — on or off)
Runs on private infra
Customizable agents
REST API for integration
Cost tracking per call

The gap is execution + self-correction + control. That's what enterprises pay for.

Banks, healthcare, and data teams can't send sensitive code or data to external tools. They build this architecture internally — planner → executor → critic → human review — on their own infrastructure. This project demonstrates exactly that pattern.


Live Demo: K-Means Clustering (full HITL flow)

Step 1: Enter the task

[Screenshot: UI overview]

Step 2: Planner creates steps; Executor writes and runs the code

[Screenshot: Executor output]

Step 3: Full K-means implementation visible in the code panel

[Screenshot: K-means code]

Step 4: Plot generated automatically inside the sandboxed subprocess

[Screenshot: Plot generated]

Step 5: HITL, human inspects code + plot before approving

[Screenshot: HITL approve panel]

Step 6: Approve; the agent completes and the final answer is streamed

[Screenshot: Approved]

Step 7: Final cluster plot with 3 clusters, centroids marked X

[Screenshot: Final plot]


Architecture

User input (plain English task)
          │
          ▼
┌─────────────────┐
│    Planner      │  1 LLM call → JSON array of 3-5 steps
└────────┬────────┘
         │ plan[]
         ▼
┌──────────────────────────────────────────────┐
│          Executor  — ReAct loop              │
│                                              │
│  ┌──────────┐  tool_call   ┌──────────────┐  │
│  │  Agent   │ ──────────►  │    Tools     │  │
│  │  (LLM)   │ ◄──────────  │              │  │
│  └──────────┘ tool_result  │ python_repl  │  │
│       │                    │ read_file    │  │
│  (loops until LLM          │ list_dir     │  │
│   calls no more tools)     └──────────────┘  │
│       ▼                                      │
│  final_answer + code_runs[]                  │
└──────────────────────────────────────────────┘
         │
         │  ◄── [HITL ON: graph pauses here]
         │       Human sees: code + output + plot
         │       Human clicks: Approve or Request revision
         │
         ▼
┌─────────────────┐
│     Critic      │  1 LLM call → APPROVE or REVISE + feedback
└────────┬────────┘
         │
         ├── REVISE ──► back to Executor  (max 3 attempts)
         │              (critic feedback injected as context)
         │              *** self-correction loop ***
         │
         └── APPROVE ──► Summarize → save to history → END
                         Final answer + plot rendered in UI

Planner: one LLM call. It receives the task plus the last 3 conversation history entries and returns a JSON array of 3–5 concrete steps. It uses no tools, only reasoning, and falls back to a default plan if the LLM call fails.

Executor: a ReAct loop. The LLM decides which tool to call, reads the output, then either calls another tool or stops and writes the final answer. Each python_repl call runs in a fresh sandboxed subprocess, so no state, imports, or variables carry over between calls. The loop ends when the LLM produces a response with no tool calls.

HITL pause (optional): when the Human-in-the-loop toggle is ON, the graph pauses after the Executor and before the Critic runs. You see the full code, stdout output, and any generated plots. Click Approve to accept, or type feedback and click Request revision to send it back to the Executor.

Critic: one LLM call. It evaluates the executor's output and returns APPROVE or REVISE with written feedback.

Self-correction loop: if the verdict is REVISE, the critic's feedback is injected as context into the next Executor run. This repeats up to MAX_FIX_ATTEMPTS times (default: 3). Once that limit is reached, the best result so far is accepted automatically, so the agent cannot loop forever.

Summarize: on APPROVE, a short summary of the task and answer is appended to history via the operator.add reducer. The next task's Planner receives this as context, which enables natural follow-up conversations.
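
The routing decision after the Critic can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the function name `route_after_critic` and the node names `"summarize"` and `"executor"` are assumptions chosen to match the diagram above.

```python
MAX_FIX_ATTEMPTS = 3  # mirrors the documented default


def route_after_critic(state: dict, max_fix_attempts: int = MAX_FIX_ATTEMPTS) -> str:
    """Return the name of the next node (hypothetical node names)."""
    if state.get("verdict") == "APPROVE":
        return "summarize"
    if state.get("fix_attempts", 0) >= max_fix_attempts:
        # Attempt budget exhausted: accept the best result so far.
        return "summarize"
    # REVISE with budget left: run the Executor again with the critique.
    return "executor"
```

A function of this shape is the kind of callable a LangGraph conditional edge expects; keeping it free of side effects makes the loop bound easy to unit test.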


State machine (LangGraph StateGraph)

CodingState:
  task          str           # user's original request
  plan          List[str]     # planner output
  code_runs     List[dict]    # each {code, output} pair from executor
  verdict       str           # APPROVE | REVISE
  critique      str           # critic's feedback
  fix_attempts  int           # how many REVISE loops so far
  final_answer  str           # executor's written explanation
  errors        List[str]     # any caught exceptions
  history       List[dict]    # operator.add — conversation memory across turns
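
For readers who prefer Python types, the state above might be declared roughly as follows. This is a sketch inferred from the field list, not a copy of the project's source; only the `history` field carries a reducer, expressed with `Annotated` as LangGraph state schemas commonly do.

```python
import operator
from typing import Annotated, List, TypedDict, get_type_hints


class CodingState(TypedDict):
    task: str                # user's original request
    plan: List[str]          # planner output
    code_runs: List[dict]    # each {code, output} pair from executor
    verdict: str             # "APPROVE" or "REVISE"
    critique: str            # critic's feedback
    fix_attempts: int        # how many REVISE loops so far
    final_answer: str        # executor's written explanation
    errors: List[str]        # any caught exceptions
    history: Annotated[List[dict], operator.add]  # appended to, never overwritten


# The reducer is stored as metadata on the annotation.
hints = get_type_hints(CodingState, include_extras=True)
reducer = hints["history"].__metadata__[0]

# With operator.add, a node's update is concatenated onto the existing list.
merged = reducer([{"task": "binary search"}], [{"task": "unit tests"}])
```

Fields without a reducer are simply replaced by the latest value a node returns.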

Conversation memory

history uses an operator.add reducer: each completed task appends a summary. The next task's Planner receives the last 3 entries as context, which enables natural follow-ups:

Task 1: "Implement binary search"
Task 2: "Now add unit tests for it"      ← agent remembers task 1
Task 3: "Make it work with strings too"  ← agent remembers both
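
Selecting the context for the next Planner call could look like the sketch below. The helper name `planner_context` and the `task` / `summary` keys are hypothetical; the README only states that the last 3 history entries are passed along.

```python
def planner_context(history: list[dict], max_entries: int = 3) -> str:
    """Format the most recent history entries as bullet lines for the prompt."""
    if max_entries <= 0:
        return ""
    recent = history[-max_entries:]
    # 'task' and 'summary' are assumed keys, used here for illustration only.
    return "\n".join(f"- {entry['task']}: {entry['summary']}" for entry in recent)


history = [
    {"task": "Implement binary search", "summary": "iterative version written"},
    {"task": "Now add unit tests for it", "summary": "tests added"},
    {"task": "Make it work with strings too", "summary": "generic comparison"},
    {"task": "Write a prime checker", "summary": "trial division"},
]
context = planner_context(history)
```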

How Each Component Works

Planner

  • One LLM call, no tools
  • Returns a strict JSON array of 3–5 steps
  • Receives last 3 conversation history entries as context
  • Failure fallback: returns a sensible default plan
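
The "strict JSON array with fallback" behaviour described above can be sketched as follows. The function name `parse_plan` and the contents of `DEFAULT_PLAN` are assumptions for illustration; the project's real default plan is not shown in this README.

```python
import json

# Hypothetical default plan, used whenever the LLM reply is unusable.
DEFAULT_PLAN = [
    "Understand the task and its inputs",
    "Write the Python code",
    "Run the code and check the output",
]


def parse_plan(raw: str) -> list[str]:
    """Parse the planner reply; fall back to DEFAULT_PLAN on any problem."""
    try:
        steps = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return list(DEFAULT_PLAN)
    # Must be a list of non-empty strings.
    if not isinstance(steps, list) or not all(
        isinstance(step, str) and step.strip() for step in steps
    ):
        return list(DEFAULT_PLAN)
    # The documented contract is 3 to 5 steps.
    if not 3 <= len(steps) <= 5:
        return list(DEFAULT_PLAN)
    return [step.strip() for step in steps]
```

Returning a copy of the default plan keeps callers from mutating the shared constant.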

Executor (ReAct loop)

  • LangGraph sub-graph: Agent node ↔ ToolNode
  • Agent decides which tool to call, reads output, loops until done
  • recursion_limit = 2 × (max_fix_attempts + 3) + 1 prevents infinite loops
  • Repetition guard: _trim_repetitive() caps answer at 3000 chars and stops at first repeated line
  • If critic sends REVISE feedback, it's injected as context into the next executor run
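
The two guards above are simple enough to sketch directly. The recursion limit follows the formula as stated; `trim_repetitive` is a guess at what `_trim_repetitive()` does based on its one-line description, and ignoring blank lines when detecting repeats is an assumption of this sketch.

```python
def recursion_limit(max_fix_attempts: int = 3) -> int:
    """Formula from the README: 2 x (max_fix_attempts + 3) + 1."""
    return 2 * (max_fix_attempts + 3) + 1


def trim_repetitive(text: str, max_chars: int = 3000) -> str:
    """Cut the answer at the first repeated non-blank line, then cap its length."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            break  # first repeated line: drop it and everything after it
        if key:
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)[:max_chars]
```

With the default of 3 fix attempts the limit works out to 13, which would typically be passed to LangGraph through the run config's `recursion_limit` key.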

Critic (LLM-as-judge)

  • Receives: task + plan + all code runs + outputs
  • Returns: {"verdict": "APPROVE"|"REVISE", "feedback": "..."}
  • Failure fallback: auto-APPROVE if code ran with output, REVISE if no code ran
  • Special rule: plot/chart tasks approved when output contains a saved file path

HITL (Human-in-the-loop)

  • interrupt_before=["critic"] — graph suspends before critic runs
  • UI shows: code, stdout, generated plots
  • Human clicks Approve → injects verdict=APPROVE via update_state(as_node="critic")
  • Human types feedback + clicks Request revision → injects verdict=REVISE + feedback
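  • Graph resumes with graph.invoke(None, config); because the verdict was written as_node="critic", execution continues from the critic's outgoing edge instead of running the critic LLM call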
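
The translation from a human decision to the state values injected at the critic node can be sketched as a small function. `review_to_update` is a hypothetical helper, and rejecting an empty revision request is an assumption of this sketch; the returned dict is the kind of payload one would hand to `update_state(config, values, as_node="critic")` as described above.

```python
def review_to_update(action: str, feedback: str = "") -> dict:
    """Map a reviewer action to the state fields the critic would normally set."""
    if action == "approve":
        return {"verdict": "APPROVE", "critique": ""}
    if action == "revise":
        if not feedback.strip():
            # Assumption: a revision without feedback gives the Executor nothing to act on.
            raise ValueError("a revision request needs feedback")
        return {"verdict": "REVISE", "critique": feedback.strip()}
    raise ValueError(f"unknown review action: {action!r}")
```

Keeping this mapping separate from the graph calls means the UI and the REST review endpoint can share one code path.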

Sandbox (Code execution)

  • Each run: fresh subprocess + isolated tempfile working directory
  • python -I flag (isolated mode): ignores PYTHON* environment variables, including PYTHONSTARTUP, and the user site-packages directory
  • Env vars injected: MPLBACKEND=Agg, MPLCONFIGDIR, HOME, USERPROFILE, SANDBOX_OUTPUT_DIR
  • POSIX only: RLIMIT_CPU=20s, RLIMIT_AS=512MB via resource.setrlimit
  • Timeout: configurable via EXEC_TIMEOUT (default 15s) — process killed on overrun
  • Output capped at 10,000 chars to prevent memory issues
  • SHA-256 result cache (128-entry FIFO) — avoids re-running identical code
  • Plot rescue: copies *.png/jpg/svg from tmpdir to outputs/plots/ before cleanup
  • Temp directory deleted in finally block — no leftover files

Production-Ready Features

Security

Feature | Implementation
Process isolation | Each code run is a separate subprocess, killed after the timeout
Isolated filesystem | tempfile.mkdtemp: each run gets its own temp dir, deleted afterwards
Isolated imports | python -I flag ignores user site-packages
Memory cap (POSIX) | RLIMIT_AS = 512 MB via resource.setrlimit
CPU cap (POSIX) | RLIMIT_CPU = 20s hard limit
Output cap | Stdout/stderr capped at 10,000 chars
API authentication | X-API-Key header required on all API endpoints, checked against API_KEY
Production key check | ENVIRONMENT=production with API_KEY=dev-key raises ConfigError at startup
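
The production key check in the last row might be implemented along these lines. This is a sketch: `load_settings` and its explicit `env` argument are stand-ins for the project's `get_settings()`, whose real signature is not shown here, and the `Settings` fields are reduced to the two that matter for the check.

```python
from dataclasses import dataclass


class ConfigError(Exception):
    """Raised at startup when the environment is misconfigured."""


@dataclass(frozen=True)
class Settings:
    environment: str
    api_key: str


def load_settings(env: dict) -> Settings:
    """Validate the two security-relevant settings (hypothetical helper)."""
    environment = env.get("ENVIRONMENT", "development")
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise ConfigError("API_KEY is not set")
    if environment == "production" and api_key == "dev-key":
        raise ConfigError("API_KEY=dev-key is not allowed when ENVIRONMENT=production")
    return Settings(environment=environment, api_key=api_key)
```

In a real application the mapping passed in would usually be `os.environ`; taking it as a parameter keeps the check easy to test.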

Reliability

Feature | Implementation
Auto self-correction | Critic → Executor loop, up to MAX_FIX_ATTEMPTS (default 3)
Critic fallback | Auto-approves if code ran, auto-revises if no code ran; never crashes
Redis fallback | _REDIS_UNAVAILABLE sentinel: fails once, switches to in-memory, no retry storm
LLM retry + backoff | max_retries=3 on all providers; Gemini has an exponential backoff wrapper
Repetition guard | _trim_repetitive() prevents wall-of-text output from small models
max_tokens=2048 | Hard cap on all providers; prevents runaway generation
Recursion limit | 2 × (max_fix_attempts + 3) + 1; prevents infinite ReAct loops
Config validation | get_settings() raises ConfigError with a clear message on missing/invalid env vars
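
The "fails once, switches to in-memory" pattern from the Redis fallback row can be sketched with a sentinel object. Everything here is illustrative: the `JobStore` shape, the `connect` factory, and the `connect_attempts` counter are assumptions, and only a failed connection is handled, not errors on later operations.

```python
_REDIS_UNAVAILABLE = object()  # sentinel: connection was tried and failed


class JobStore:
    def __init__(self, connect):
        self._connect = connect      # hypothetical factory returning a Redis-like client
        self._client = None          # None means "not tried yet"
        self._memory: dict = {}
        self.connect_attempts = 0

    def _redis(self):
        if self._client is _REDIS_UNAVAILABLE:
            return None              # do not retry: avoids a retry storm
        if self._client is None:
            self.connect_attempts += 1
            try:
                self._client = self._connect()
            except Exception:
                self._client = _REDIS_UNAVAILABLE
                return None
        return self._client

    def put(self, job_id: str, payload: str) -> None:
        client = self._redis()
        if client is None:
            self._memory[job_id] = payload
        else:
            client.set(job_id, payload)

    def get(self, job_id: str):
        client = self._redis()
        if client is None:
            return self._memory.get(job_id)
        return client.get(job_id)
```

The trade-off is that in-memory jobs are lost on restart and are not shared between processes, so the fallback suits local development more than a multi-worker deployment.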

Observability

Feature | Implementation
LangSmith tracing | Auto-wired via os.environ when LANGCHAIN_API_KEY is set; traces every node
Token cost tracking | CostTracker(BaseCallbackHandler) on every LLM call; logs to costs.jsonl
Agent progress streaming | stream_mode="values": UI updates after each node, not just at the end
Session history | InMemorySaver checkpointer; full state replay per thread_id
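
Per-call cost logging to a JSON Lines file can be sketched without any LangChain dependency. The helper names, the record fields, and the per-1k-token prices passed in are all hypothetical; the project's CostTracker callback and its real pricing table are not reproduced here.

```python
import json
from pathlib import Path


def log_cost(path, model: str, prompt_tokens: int, completion_tokens: int,
             usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Append one record to the JSONL log and return the cost of this call."""
    usd = (prompt_tokens / 1000) * usd_per_1k_prompt \
        + (completion_tokens / 1000) * usd_per_1k_completion
    record = {
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "usd": usd,
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return usd


def total_cost(path) -> float:
    """Sum the 'usd' field over every record in the log."""
    with Path(path).open(encoding="utf-8") as fh:
        return sum(json.loads(line)["usd"] for line in fh if line.strip())
```

One JSON object per line keeps the log append-only and easy to tail or load into a dataframe later.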

Scalability

Feature | Implementation
REST API | FastAPI + Uvicorn: submit jobs, poll results, HITL review
Async job queue | Celery with Redis broker; decouples submission from execution
Paginated job listing | GET /jobs?limit=20&offset=0; handles large job histories
Thread-safe graph cache | @lru_cache(maxsize=1): graph compiled once, reused across requests
Multi-provider LLM | Switch with one env var; no vendor lock-in

Quick Start

1. Clone and install

git clone https://github.com/YOUR_USERNAME/coding-agent.git
cd coding-agent
pip install -r requirements.txt

2. Configure

cp .env.example .env

Minimum required:

GROQ_API_KEY=your_key_here   # free at console.groq.com

3. Run

.\run.ps1                              # Streamlit UI  → http://localhost:8501
.\run.ps1 api                          # FastAPI       → http://localhost:8000/docs
.\run.ps1 test                         # pytest (52 tests)
.\run.ps1 cli "implement binary search"  # one-shot CLI

Configuration

# ── LLM Provider ──────────────────────────────────────────────────────────────
LLM_PROVIDER=groq          # groq | openai | anthropic | deepseek | gemini

# Groq — free at console.groq.com
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile

# Google Gemini — free at aistudio.google.com (1500 req/day)
# LLM_PROVIDER=gemini
# GEMINI_API_KEY=AIza...
# GEMINI_MODEL=gemini-2.0-flash

# OpenAI
# LLM_PROVIDER=openai
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o

# Anthropic
# LLM_PROVIDER=anthropic
# ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_MODEL=claude-opus-4-8

# DeepSeek — OpenAI-compatible API
# LLM_PROVIDER=deepseek
# DEEPSEEK_API_KEY=sk-...
# DEEPSEEK_MODEL=deepseek-chat

# ── Agent Behaviour ───────────────────────────────────────────────────────────
MAX_FIX_ATTEMPTS=3    # max critic→executor loops before accepting best attempt
EXEC_TIMEOUT=15       # sandbox subprocess wall-clock timeout (seconds)

# ── API ───────────────────────────────────────────────────────────────────────
ENVIRONMENT=development   # set to "production" to enforce strong API_KEY
API_KEY=dev-key
RATE_LIMIT=10/minute

# ── Redis (optional) ──────────────────────────────────────────────────────────
# REDIS_URL=redis://localhost:6379/0   # enables Celery queue + distributed rate limit

# ── LangSmith Tracing (optional) ─────────────────────────────────────────────
# LANGCHAIN_API_KEY=ls__...   # set this to enable full pipeline tracing

Project Structure

coding-agent/
├── app.py                    # Streamlit UI — streaming, HITL, plots, cost display
├── api.py                    # FastAPI — job CRUD, pagination, HITL review endpoint
├── cli.py                    # CLI — interactive + one-shot modes
├── run.ps1                   # One-click launcher
├── requirements.txt
├── .env.example
│
├── src/
│   ├── agents/
│   │   ├── planner.py        # LLM → JSON plan, history context
│   │   ├── executor.py       # ReAct loop, tool calls, _trim_repetitive()
│   │   └── critic.py         # LLM-as-judge: APPROVE/REVISE, fallback logic
│   ├── graph/
│   │   └── workflow.py       # LangGraph StateGraph, HITL wiring, resume_task()
│   ├── tools/
│   │   ├── sandbox.py        # Subprocess sandbox, cache, plot rescue, POSIX limits
│   │   ├── repl.py           # python_repl LangChain tool
│   │   └── file_tools.py     # read_file + list_dir tools
│   ├── llm.py                # 5-provider LLM factory, Gemini backoff wrapper
│   ├── config.py             # Settings dataclass, dotenv, production key check
│   ├── costs.py              # CostTracker callback, costs.jsonl log
│   ├── streaming.py          # Token streaming via astream_events v2
│   ├── job_store.py          # Redis + in-memory fallback, pagination
│   ├── ratelimit.py          # Configurable rate limiting middleware
│   ├── tasks.py              # Celery task definitions
│   ├── worker.py             # Celery worker entry point
│   └── prompts.py            # System prompts for all three agents
│
├── evals/
│   └── llm_judge.py          # LLM-as-judge scoring 0–3
│
├── tests/                    # 52 pytest tests
│
├── scripts/
│   ├── demo.py               # 5 master-prompt test cases
│   └── run_eval.py           # End-to-end eval runner
│
├── docs/
│   └── screenshots/          # Demo screenshots
│
└── outputs/
    ├── plots/                # Rescued matplotlib plots
    └── transcripts/          # Agent run transcripts

REST API

# Submit job
POST   /jobs
       {"task": "implement binary search"}

# List jobs (paginated)
GET    /jobs?limit=20&offset=0

# Get single job
GET    /jobs/{job_id}

# HITL review
POST   /jobs/{job_id}/review
       {"action": "approve"}
       {"action": "revise", "feedback": "add error handling"}

All endpoints require X-API-Key: your_key header. Interactive docs → http://localhost:8000/docs
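
A client for the endpoints above can be sketched with the standard library. The helper `build_request` is hypothetical, the base URL assumes the local development server from Quick Start, and the response format is not shown because this README does not document it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumes the local dev server


def build_request(method: str, path: str, api_key: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; send it with urllib.request.urlopen()."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"X-API-Key": api_key}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def review_request(job_id: str, api_key: str, action: str,
                   feedback: Optional[str] = None) -> urllib.request.Request:
    """Build the HITL review call for a given job id."""
    body = {"action": action}
    if feedback is not None:
        body["feedback"] = feedback
    return build_request("POST", f"/jobs/{job_id}/review", api_key, body)


submit = build_request("POST", "/jobs", "dev-key", {"task": "implement binary search"})
listing = build_request("GET", "/jobs?limit=20&offset=0", "dev-key")
```

Building the request separately from sending it makes the headers and payload easy to inspect before anything goes over the network.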


Example Tasks

Task | Demonstrates
Implement K-means from scratch, 150 points, 3 clusters, plot with centroids X | ML + plot + HITL
Generate 100 random numbers and plot a histogram | Matplotlib pipeline
Implement binary search and test on 20 numbers | Algorithm + self-test
Fix this: def divide(a,b): return a/b (called as print(divide(10,0))) | Bug fix loop
Write a prime checker | Basic task
(follow-up) Now make it handle floats | Conversation memory

Tech Stack

Layer | Technology | Purpose
Agent framework | LangGraph 0.2+ | StateGraph, HITL, checkpointing, streaming
LLM | Groq / Gemini / OpenAI / Anthropic / DeepSeek | Pluggable via factory
UI | Streamlit | Streaming UI, HITL controls, plot rendering
API | FastAPI + Uvicorn | REST endpoints, auth, rate limiting
Job store | Redis + in-memory fallback | Distributed job tracking
Code execution | Python subprocess | Sandboxed, isolated, timeout-enforced
Resource limits | POSIX resource.setrlimit | CPU + memory cap on Linux/Mac
Plot rendering | Matplotlib Agg + PIL | Headless, RGBA fix, base64 HTML
Async queue | Celery + Redis | Decoupled task execution
Observability | LangSmith | Full pipeline tracing
Cost tracking | LangChain callbacks | Per-call token logging
Evals | LLM-as-judge (0–3) | Automated quality scoring
Tests | pytest | 52 tests

License

MIT
