diff --git a/docs/SANDBOX.md b/docs/SANDBOX.md
new file mode 100644
index 0000000..28ef188
--- /dev/null
+++ b/docs/SANDBOX.md
@@ -0,0 +1,225 @@
+# Sandbox Runtime
+
+GitPilot executes code in a configurable sandbox so the chat **▶ Run** button,
+the agent's autonomous build/test loop, and the HTTP API all share one
+runtime contract. Three backends ship in the box.
+
+## Backends
+
+| Backend          | Isolation                    | Use it when                                            |
+| ---------------- | ---------------------------- | ------------------------------------------------------ |
+| `subprocess`     | host process, cwd jail       | **Default.** Runs simple snippets locally.             |
+| `matrixlab`      | Docker container per snippet | Enterprise — untrusted code, multi-tenant, audit-able. |
+| `off`            | none (pass-through)          | Local dev only. No jail; equivalent to host shell.     |
+
+`subprocess` is the safe default so a fresh install runs hello-world without
+any setup. Operators pick `matrixlab` from **Settings → Sandbox runtime** for
+isolated, ephemeral, resource-limited execution.
+
+## Precedence
+
+Resolution order at every sandbox call:
+
+```
+explicit  >  GITPILOT_SANDBOX env  >  ~/.gitpilot/settings.json  >  "subprocess"
+```
+
+When an env var shadows the persisted choice, `GET /api/sandbox/status`
+returns `env_override: "GITPILOT_SANDBOX"` and the Settings panel renders an
+**env override** badge so the user understands why their UI selection isn't
+taking effect.
+
+## How the three surfaces share one path
+
+```
+┌─────────────────────┐      ┌──────────────────────┐
+│ Chat ▶ Run button   │      │ Agent run_in_sandbox │
+│ Chat run_command    │      │ Agent run_command    │
+└──────────┬──────────┘      └──────────┬───────────┘
+           │                            │
+           └──────────┬─────────────────┘
+                      ▼
+           ┌──────────────────────┐
+           │ POST /api/sandbox/run│      same backend, same policy,
+           │   {language, code}   │      same error envelope
+           └──────────┬───────────┘
+                      │
+        ┌─────────────┼──────────────┐
+        ▼             ▼              ▼
+   SubprocessSandbox  NullSandbox  MatrixLabSandbox ──► POST /code/run
+       (default)       (off)                              on the Runner
+```
+
+- The **frontend ▶ Run button** in chat (`frontend/components/RunnableCodeBlock.jsx`)
+  POSTs the fenced snippet to `/api/sandbox/run`.
+- The **agent's `run_in_sandbox` tool** is the same HTTP call wrapped as a
+  CrewAI tool, so a single binding governs both human and autonomous runs.
+- The **agent's `run_command` tool** routes through the same endpoint:
+  `bash` → `language=bash, code=<command>` against the configured backend.
+
+## Configuration
+
+### From the UI
+
+`Settings → Sandbox runtime` shows a radio (Local / MatrixLab / Pass-through)
+plus a MatrixLab card with URL, bearer token (write-only — saved tokens
+display as bullets), default image, network egress toggle, timeout, and a
+**Test connection** button.
+
+### From the environment
+
+| Var                                       | Effect                                                            |
+| ----------------------------------------- | ----------------------------------------------------------------- |
+| `GITPILOT_SANDBOX`                        | Pins backend (`subprocess` \| `matrixlab` \| `off`)               |
+| `GITPILOT_MATRIXLAB_URL`                  | MatrixLab Runner base URL (default `http://localhost:8000`)       |
+| `GITPILOT_MATRIXLAB_TOKEN`                | Bearer token sent on every request                                |
+| `GITPILOT_MATRIXLAB_IMAGE`                | Default image override (e.g. `matrix-lab-sandbox-python:latest`)  |
+| `GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE`     | Set to `1` to enable the Install / Start / Stop buttons           |
+
+### From `settings.json`
+
+```json
+{
+  "sandbox": {
+    "backend": "matrixlab",
+    "matrixlab_url": "http://localhost:8000",
+    "matrixlab_token": "",
+    "matrixlab_image": "",
+    "allow_network": false,
+    "timeout_sec": 120
+  }
+}
+```
+
+Secrets never round-trip to the browser: `GET /api/settings` returns
+`has_token: true|false` instead of the token itself.
+
+## HTTP API
+
+### `GET /api/sandbox/status`
+
+Returns the live backend, reachability of the configured MatrixLab Runner,
+and `env_override` if an env var is shadowing the persisted choice.
+
+### `PUT /api/sandbox/config`
+
+Updates any subset of the persisted `SandboxSettings`. Unknown backend
+values return `400` (only `subprocess`, `matrixlab`, `off` accepted).
+
+### `POST /api/sandbox/run`
+
+```jsonc
+// request
+{ "language": "python", "code": "print(2 + 2)", "timeout_sec": 60 }
+
+// response
+{
+  "backend": "matrixlab",
+  "language": "python",
+  "command": "python <snippet>",
+  "exit_code": 0,
+  "stdout": "4\n",
+  "stderr": "",
+  "duration_ms": 1868,
+  "truncated": false,
+  "timed_out": false,
+  "sandbox_id": "63baa623-…"   // assigned by MatrixLab when backend=matrixlab
+}
+```
+
+Supported languages: `python` (`py`), `javascript` (`js`/`node`), `bash`
+(`sh`/`shell`). Unknown languages return `400`. Snippets run in an
+ephemeral tempdir (not the workspace) so file-system side effects don't
+pollute the repo.
+
+### MatrixLab lifecycle
+
+`GET /api/sandbox/matrixlab/lifecycle` reports `installed` (Docker image
+present), `running` (URL reachable), `docker_available`, and
+`lifecycle_enabled` (the env-flag gate). Always safe to call — pure
+inspection.
+
+The mutating endpoints below are gated behind
+`GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE=1`. Without the flag they return
+`403` and never silently execute Docker on behalf of a browser POST.
+
+| Method | Path                              | Action                                  |
+| ------ | --------------------------------- | --------------------------------------- |
+| `POST` | `/api/sandbox/matrixlab/install`  | `docker pull` runner + sandbox images   |
+| `POST` | `/api/sandbox/matrixlab/start`    | `docker run -d` (idempotent by name)    |
+| `POST` | `/api/sandbox/matrixlab/stop`     | `docker stop gitpilot-matrixlab`        |
+
+Each response carries the full `steps` transcript (`cmd`, `exit_code`,
+`stdout`, `stderr`, `duration_ms` per step) so failures are debuggable
+without SSH'ing to the host.
+
+## Error retrieval
+
+The point of running through a sandbox is that failures come back as
+structured signals, not opaque silence. Every backend returns:
+
+- `exit_code` — non-zero on failure; `-1` for "could not launch"
+- `stderr` — full traceback / compiler diagnostic, verbatim
+- `timed_out` — `true` when the runner killed the process
+- `truncated` — `true` when output was clipped at the policy cap
+
+This is what makes autonomous loops productive: the agent can read a
+`SyntaxError`, plan the fix, and re-run. Claude Code, Codex, and Cursor
+use the same pattern.
+
+Example trace through `run_in_sandbox(language="python", code="raise ValueError('boom')")`:
+
+```
+Sandbox: MatrixLab
+Command: python <snippet>
+Exit code: 1
+Duration: 440 ms
+--- stderr ---
+Traceback (most recent call last):
+  File "/workspace/main.py", line 1, in <module>
+    raise ValueError('boom')
+ValueError: boom
+sandbox_id: db3e427d-…
+```
+
+## Resource policy
+
+`SandboxPolicy` enforces:
+
+- **Wall-clock timeout** — caller-supplied or `timeout_sec` default (120s,
+  clamped to 600s)
+- **Output cap** — 512 KB per stream; sets `truncated: true` when hit
+- **Network** — `allow_network: false` strips proxy env vars on
+  `subprocess`; rejected at egress on `matrixlab`
+- **Secret stripping** — `GITHUB_TOKEN`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`,
+  `WATSONX_API_KEY`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` are never
+  forwarded into the sandbox process
+- **Destructive patterns** — `rm -rf /`, `mkfs`, `dd if=/dev/zero`,
+  `:(){ :|:& };:`, `shutdown -h|-r` blocked before launch
+
+## Quick start
+
+1. `make install && make run` — defaults to `subprocess`, hello-world works.
+2. Switch to MatrixLab once you need real isolation:
+   ```bash
+   curl -X PUT http://localhost:8765/api/sandbox/config \
+     -H 'content-type: application/json' \
+     -d '{"backend": "matrixlab", "matrixlab_url": "http://localhost:8000"}'
+   ```
+   …or click the radio in **Settings → Sandbox runtime**.
+3. Run a snippet:
+   ```bash
+   curl -X POST http://localhost:8765/api/sandbox/run \
+     -H 'content-type: application/json' \
+     -d '{"language": "python", "code": "print(2 + 2)"}'
+   ```
+
+## See also
+
+- `gitpilot/sandbox.py` — backend abstraction (`NullSandbox`,
+  `SubprocessSandbox`, `MatrixLabSandbox`) + `SandboxPolicy`
+- `gitpilot/sandbox_api.py` — HTTP surface, lifecycle endpoints
+- `gitpilot/local_tools.py` — agent `run_command` + `run_in_sandbox` tools
+- `frontend/components/SettingsModal.jsx` — Sandbox runtime panel
+- `frontend/components/RunnableCodeBlock.jsx` — chat ▶ Run button
+- `tests/test_sandbox.py`, `tests/test_sandbox_api.py` — 28 unit tests
diff --git a/frontend/App.jsx b/frontend/App.jsx
index 693b5d2..72c9711 100644
--- a/frontend/App.jsx
+++ b/frontend/App.jsx
@@ -402,6 +402,47 @@ export default function App() {
    * first, ChatPanel would see an empty messages array, then our async
    * hydration would complete but ChatPanel wouldn't re-sync.
    */
+  // Resolve the branch we should jump to when reopening a session.
+  // Preference order:
+  //   1. session.repos[i].branch for the active_repo (multi-repo)
+  //   2. session.branch (legacy single-repo field)
+  // Returns ``null`` when nothing is recorded.
+  const resolveSessionBranch = (session) => {
+    if (!session) return null;
+    if (Array.isArray(session.repos) && session.repos.length > 0) {
+      const target =
+        session.repos.find(
+          (r) => session.active_repo && r?.full_name === session.active_repo,
+        ) || session.repos[0];
+      if (target?.branch) return target.branch;
+    }
+    return session.branch || null;
+  };
+
+  // Probe whether a branch still exists on GitHub.  We deliberately
+  // reuse the existing tree endpoint instead of adding a new one: a
+  // 2xx means the ref resolves, anything else (most importantly 404)
+  // means the branch is gone or otherwise unreachable.  A network
+  // error also returns false, so the caller falls back to the default
+  // branch; it cannot tell a transient blip from a deleted branch.
+  const probeBranchExists = async (repoFullName, branch) => {
+    if (!repoFullName || !branch) return false;
+    try {
+      const token = localStorage.getItem("github_token");
+      const headers = {};
+      if (token) headers["Authorization"] = `Bearer ${token}`;
+      const res = await fetch(
+        apiUrl(
+          `/api/repos/${repoFullName}/tree?ref=${encodeURIComponent(branch)}`,
+        ),
+        { headers },
+      );
+      return res.ok;
+    } catch {
+      return false;
+    }
+  };
+
   const handleSelectSession = useCallback(async (session) => {
     // 1. Fetch persisted messages first
     const messages = await fetchSessionMessages(session.id);
@@ -418,11 +459,31 @@ export default function App() {
     // 3. NOW activate the session — ChatPanel's sync effect will read
     //    the hydrated messages from chatBySession[session.id]
     setActiveSessionId(session.id);
-    if (session.branch && session.branch !== currentBranch) {
-      handleBranchChange(session.branch);
+
+    // 4. Jump to the branch this session last published to, but verify
+    //    it still exists on GitHub first.  When the branch was deleted
+    //    (rebased away, merged-and-pruned, …) fall back to the
+    //    repository's default branch and tell the user what happened —
+    //    silently landing on the default would mask data loss.
+    const target = resolveSessionBranch(session);
+    if (target && target !== currentBranch) {
+      const repoFullName =
+        session.repo ||
+        (Array.isArray(session.repos) && session.repos[0]?.full_name);
+      const exists = await probeBranchExists(repoFullName, target);
+      if (exists) {
+        handleBranchChange(target);
+      } else {
+        const fallback = defaultBranch || "main";
+        showToast(
+          "Branch not found",
+          `'${target}' was not found on GitHub. Switched to ${fallback}.`,
+        );
+        if (fallback !== currentBranch) handleBranchChange(fallback);
+      }
     }
   // eslint-disable-next-line react-hooks/exhaustive-deps
-  }, [fetchSessionMessages, currentBranch]);
+  }, [fetchSessionMessages, currentBranch, defaultBranch]);
 
   const handleDeleteSession = useCallback(
     (deletedId) => {
diff --git a/frontend/components/AssistantMessage.jsx b/frontend/components/AssistantMessage.jsx
index 9ec8c00..ec75621 100644
--- a/frontend/components/AssistantMessage.jsx
+++ b/frontend/components/AssistantMessage.jsx
@@ -1,5 +1,6 @@
 import React from "react";
 import PlanView from "./PlanView.jsx";
+import RunnableCodeBlock, { splitFences } from "./RunnableCodeBlock.jsx";
 
 export default function AssistantMessage({ answer, plan, executionLog, planStatus }) {
   // ``planStatus`` is optional metadata about the lifecycle of the plan
@@ -82,13 +83,22 @@ export default function AssistantMessage({ answer, plan, executionLog, planStatu
 
   return (
     <div className="chat-message-ai" style={styles.container}>
-      {/* Answer section */}
+      {/* Answer section.  ``splitFences`` cuts the answer at fenced code
+          blocks so each runnable snippet gets its own RunnableCodeBlock
+          (with a per-block Run button); the surrounding prose still
+          renders as the existing pre-wrapped paragraph. */}
       <section style={styles.section}>
         <header style={styles.header}>
           <h3 style={styles.title}>Answer</h3>
         </header>
         <div style={styles.content}>
-          <p style={{ margin: 0 }}>{answer}</p>
+          {splitFences(answer).map((seg, i) =>
+            seg.type === "code" ? (
+              <RunnableCodeBlock key={i} language={seg.language} code={seg.code} />
+            ) : (
+              <p key={i} style={{ margin: "0 0 8px" }}>{seg.value}</p>
+            )
+          )}
         </div>
       </section>
 
diff --git a/frontend/components/ChatPanel.jsx b/frontend/components/ChatPanel.jsx
index c66d274..4ccce56 100644
--- a/frontend/components/ChatPanel.jsx
+++ b/frontend/components/ChatPanel.jsx
@@ -3,6 +3,7 @@ import React, { useEffect, useRef, useState } from "react";
 import AssistantMessage from "./AssistantMessage.jsx";
 import ThinkingIndicator from "./ThinkingIndicator.jsx";
 import ContextMeter from "./ContextMeter.jsx";
+import TasksPanel from "./TasksPanel.jsx";
 import DiffStats from "./DiffStats.jsx";
 import DiffViewer from "./DiffViewer.jsx";
 import CreatePRButton from "./CreatePRButton.jsx";
@@ -35,6 +36,10 @@ export default function ChatPanel({
   const [loadingPlan, setLoadingPlan] = useState(false);
   const [executing, setExecuting] = useState(false);
   const [status, setStatus] = useState("");
+  // Batch B9: populated when a plan containing an INDEX step is
+  // rejected.  Lets us render a small "Run with grep instead?" prompt
+  // so the user doesn't have to retype the goal.
+  const [retryAfterIndexReject, setRetryAfterIndexReject] = useState(null);
 
   // Claude-Code-on-Web: WebSocket streaming + diff + PR
   const [wsConnected, setWsConnected] = useState(false);
@@ -255,16 +260,27 @@ export default function ChatPanel({
     if (m.executionLog) meta.executionLog = m.executionLog;
     if (m.diff)         meta.diff         = m.diff;
     if (m.actions)      meta.actions      = m.actions;
+    // Informational plans (READ-only answers to "what does X do?" style
+    // questions) carry no Approve/Reject controls — pin the flag so the
+    // session reload re-renders the same shape.
+    if (m.informational) meta.informational = true;
     return Object.keys(meta).length > 0 ? meta : null;
   };
 
-  const send = async () => {
-    if (!repo || !goal.trim()) return;
+  const send = async (overrides = {}) => {
+    if (!repo) return;
+    // Allow callers (e.g. the "Retry with grep" button on a rejected
+    // INDEX plan) to drive send() with a fixed goal and a router flag.
+    const overrideGoal = overrides.goal;
+    const force_no_rag = Boolean(overrides.force_no_rag);
+    const sourceText = overrideGoal != null ? overrideGoal : goal;
+    if (!sourceText || !sourceText.trim()) return;
 
-    const text = goal.trim();
+    const text = sourceText.trim();
 
-    // Clear input immediately (Claude Code behavior)
-    setGoal("");
+    // Clear input immediately (Claude Code behavior) — but only when
+    // the user typed; programmatic retries leave the input alone.
+    if (overrideGoal == null) setGoal("");
     // Reset textarea height
     const ta = document.querySelector(".chat-input");
     if (ta) ta.style.height = "40px";
@@ -319,6 +335,13 @@ export default function ChatPanel({
             repo_name: repo.name,
             goal: text,
             branch_name: effectiveBranch,
+            // Lets the backend record this plan as a Task on the
+            // session so the right-sidebar Tasks panel can trace it.
+            session_id: sid,
+            // Batch B9 — set on the "Retry with grep" path after the
+            // user rejects an INDEX-plan.  Tells the router to
+            // suppress RAG / INDEX recommendations.
+            force_no_rag,
           }),
           signal: planController.signal,
         });
@@ -349,29 +372,56 @@ export default function ChatPanel({
         throw new Error(detail || "Failed to generate plan");
       }
 
-      // Guard: a plan with no executable file actions is not a plan we
-      // can approve.  This happens when the planner/explorer agents
-      // refused (tool-loop hallucination or a real safety refusal) and
-      // CrewAI returned a schema-valid but empty payload.  Without
-      // this guard the Approve & execute / Reject plan buttons would
-      // render against a payload that can't actually be executed.
+      // Classify the plan into one of three kinds so we can render the
+      // right shape — not just "valid or banner":
+      //
+      // * executable    — at least one CREATE/MODIFY/DELETE → plan card
+      //                   with Approve & execute / Reject controls.
+      // * informational — every file is READ (or no files at all on a
+      //                   step that still has a meaningful description)
+      //                   AND the summary is a real answer, not the
+      //                   placeholder.  This is what happens when the
+      //                   user asks "what do you think about this
+      //                   project?" — the planner correctly READs the
+      //                   relevant files and the summary IS the answer.
+      //                   Render the summary as a normal assistant
+      //                   message; do not show plan controls.
+      // * empty         — no steps OR no actionable signal at all →
+      //                   honest failure banner.
+      //
+      // Before this classifier the second case was treated as the
+      // third, surfacing "I couldn't produce a plan" on perfectly
+      // valid READ-only plans.
       const planSteps = Array.isArray(data?.steps)
         ? data.steps
         : Array.isArray(data?.plan?.steps)
         ? data.plan.steps
         : [];
-      const hasExecutableFiles = planSteps.some(
+      const PLACEHOLDER_SUMMARY = "Here is the proposed plan for your request.";
+      const summary =
+        data.plan?.summary || data.summary || data.message || PLACEHOLDER_SUMMARY;
+      const hasExecutable = planSteps.some(
         (s) =>
           Array.isArray(s?.files) &&
           s.files.some((f) => ["CREATE", "MODIFY", "DELETE"].includes(f?.action)),
       );
-
-      // Extract summary from nested plan structure or top-level
-      const summary =
-        data.plan?.summary || data.summary || data.message ||
-        "Here is the proposed plan for your request.";
-
-      if (hasExecutableFiles) {
+      const isReadOnly =
+        planSteps.length > 0 &&
+        !hasExecutable &&
+        planSteps.every(
+          (s) =>
+            !Array.isArray(s?.files) ||
+            s.files.length === 0 ||
+            s.files.every((f) => f?.action === "READ"),
+        );
+      const hasRealSummary = Boolean(summary) && summary !== PLACEHOLDER_SUMMARY;
+      const planKind = hasExecutable
+        ? "executable"
+        : isReadOnly && hasRealSummary
+        ? "informational"
+        : "empty";
+
+      if (planKind === "executable") {
         setPlan(data);
         const assistantMsg = {
           from: "ai",
@@ -382,18 +432,30 @@ export default function ChatPanel({
         };
         setMessages((prev) => [...prev, assistantMsg]);
         persistMessage(sid, "assistant", summary, pickAssistantMetadata(assistantMsg));
+      } else if (planKind === "informational") {
+        // The summary is the answer.  No plan card, no Approve/Reject —
+        // there is nothing to execute.  We deliberately do NOT attach
+        // ``plan: data`` here so AssistantMessage renders this turn
+        // exactly like a chat reply.
+        setPlan(null);
+        const assistantMsg = {
+          from: "ai",
+          role: "assistant",
+          answer: summary,
+          content: summary,
+          informational: true,
+        };
+        setMessages((prev) => [...prev, assistantMsg]);
+        persistMessage(sid, "assistant", summary, pickAssistantMetadata(assistantMsg));
       } else {
-        // No executable steps — surface a clear failure to the user
-        // instead of half-rendering a plan card and dangling buttons.
-        // The most common cause is the explorer/planner agent loop
-        // (CrewAI same-input limiter blocks repeat tool calls, the
-        // agent panics and "refuses").  Encourage a retry rather than
-        // letting the user click Approve on nothing.
+        // empty — be honest about what we know.  The earlier wording
+        // ("got stuck reading the same file twice") was a guess from
+        // an older bug; for the cases that actually still hit this
+        // branch the real signal is just "no actionable steps".
         setPlan(null);
         const failureText =
-          "I couldn't produce a plan for that request. The agent may have " +
-          "got stuck reading the same file twice. Try rephrasing, or " +
-          "switch to a stronger model in Settings → Provider.";
+          "The model returned an empty plan. Try rephrasing more concretely, " +
+          "or pick a stronger model in Settings → Provider.";
         const failureMsg = {
           from: "ai",
           role: "system",
@@ -401,7 +463,7 @@ export default function ChatPanel({
         };
         setMessages((prev) => [...prev, failureMsg]);
         persistMessage(sid, "system", failureText);
-        setStatus("No executable plan produced.");
+        setStatus("No actionable plan produced.");
         return;
       }
     } catch (err) {
@@ -432,6 +494,17 @@ export default function ChatPanel({
   // ---------------------------------------------------------------------------
   const rejectPlan = () => {
     if (!plan || executing) return;
+
+    // Batch B9 — if the rejected plan contained an INDEX step, the
+    // user is implicitly saying "I don't want to build the semantic
+    // index right now".  Stash the original goal so we can offer a
+    // one-click "retry with grep" path on the next render.
+    const hadIndexStep = Array.isArray(plan?.steps) &&
+      plan.steps.some((s) =>
+        Array.isArray(s?.files) && s.files.some((f) => f?.action === "INDEX"),
+      );
+    const rejectedGoal = plan?.goal || "";
+
     setPlan(null);
     setStatus("Plan rejected. No files were changed.");
 
@@ -445,6 +518,12 @@ export default function ChatPanel({
     if (sessionId) {
       persistMessage(sessionId, "system", rejectionMsg.content);
     }
+
+    if (hadIndexStep && rejectedGoal) {
+      setRetryAfterIndexReject({ goal: rejectedGoal });
+    } else {
+      setRetryAfterIndexReject(null);
+    }
   };
 
   const execute = async () => {
@@ -471,6 +550,10 @@ export default function ChatPanel({
           repo_name: repo.name,
           plan,
           branch_name,
+          // Lets the backend persist the new branch on the session
+          // record so reopening this session lands on the published
+          // branch, not the one it was created on.
+          session_id: sessionId,
         }),
       });
 
@@ -778,6 +861,54 @@ export default function ChatPanel({
         <div ref={messagesEndRef} />
       </div>
 
+      {/* Batch B9: post-Reject "retry with grep" prompt.  Renders
+          only when the user rejected a plan that contained an
+          INDEX action.  One click re-issues the same goal with
+          force_no_rag so the router falls back to grep. */}
+      {retryAfterIndexReject && !loadingPlan && (
+        <div
+          style={{
+            padding: "10px 16px",
+            borderTop: "1px solid #27272A",
+            background: "rgba(217, 92, 61, 0.06)",
+            display: "flex",
+            alignItems: "center",
+            justifyContent: "space-between",
+            gap: 12,
+            flexWrap: "wrap",
+          }}
+        >
+          <span style={{ fontSize: 13, color: "#D4D4D8" }}>
+            Index skipped.  Run the same goal with grep instead?
+          </span>
+          <span style={{ display: "inline-flex", gap: 8 }}>
+            <button
+              type="button"
+              className="chat-btn primary"
+              onClick={() => {
+                const g = retryAfterIndexReject.goal;
+                setRetryAfterIndexReject(null);
+                send({ goal: g, force_no_rag: true });
+              }}
+            >
+              Yes, use grep
+            </button>
+            <button
+              type="button"
+              className="chat-btn ghost"
+              onClick={() => setRetryAfterIndexReject(null)}
+              style={{
+                color: "#9CA3AF",
+                borderColor: "rgba(156, 163, 175, 0.35)",
+                background: "transparent",
+              }}
+            >
+              No, dismiss
+            </button>
+          </span>
+        </div>
+      )}
+
       {/* Diff stats bar (when agent has made changes) */}
       {diffData && (
         <div style={{
@@ -924,7 +1055,10 @@ export default function ChatPanel({
               </span>
             )}
           </span>
-          <ContextMeter sessionId={sessionId} />
+          <span style={{ display: "inline-flex", alignItems: "center", gap: 6 }}>
+            <TasksPanel sessionId={sessionId} />
+            <ContextMeter sessionId={sessionId} />
+          </span>
         </div>
       </div>
 
diff --git a/frontend/components/PlanView.jsx b/frontend/components/PlanView.jsx
index a67efb2..b543f82 100644
--- a/frontend/components/PlanView.jsx
+++ b/frontend/components/PlanView.jsx
@@ -4,7 +4,7 @@ export default function PlanView({ plan }) {
   if (!plan) return null;
 
   // Calculate totals for each action type
-  const totals = { CREATE: 0, MODIFY: 0, DELETE: 0 };
+  const totals = { CREATE: 0, MODIFY: 0, DELETE: 0, INDEX: 0 };
   plan.steps.forEach((step) => {
     step.files.forEach((file) => {
       totals[file.action] = (totals[file.action] || 0) + 1;
@@ -75,6 +75,25 @@ export default function PlanView({ plan }) {
       color: theme.dangerText,
       borderColor: "rgba(239, 68, 68, 0.2)",
     },
+    totalIndex: {
+      // GitPilot orange — the same brand colour the rest of the app
+      // uses for "infrastructure / one-time" actions.  Visually
+      // distinct from CREATE / MODIFY / DELETE so users know this
+      // step doesn't write code.
+      backgroundColor: "rgba(217, 92, 61, 0.10)",
+      color: "#D95C3D",
+      borderColor: "rgba(217, 92, 61, 0.25)",
+    },
+    indexNotice: {
+      marginTop: "8px",
+      fontSize: "12px",
+      color: "#D95C3D",
+      backgroundColor: "rgba(217, 92, 61, 0.05)",
+      padding: "8px 12px",
+      borderRadius: "6px",
+      border: "1px solid rgba(217, 92, 61, 0.15)",
+      lineHeight: "1.5",
+    },
     stepsList: {
       listStyle: "none",
       padding: 0,
@@ -161,6 +180,7 @@ export default function PlanView({ plan }) {
       case "CREATE": return styles.totalCreate;
       case "MODIFY": return styles.totalModify;
       case "DELETE": return styles.totalDelete;
+      case "INDEX":  return styles.totalIndex;
       default: return {};
     }
   };
@@ -190,6 +210,11 @@ export default function PlanView({ plan }) {
             {totals.DELETE} to delete
           </span>
         )}
+        {totals.INDEX > 0 && (
+          <span style={{ ...styles.totalBadge, ...styles.totalIndex }}>
+            {totals.INDEX === 1 ? "1 setup step" : `${totals.INDEX} setup steps`}
+          </span>
+        )}
       </div>
 
       {/* Steps List */}
@@ -210,12 +235,30 @@ export default function PlanView({ plan }) {
                     <span style={{ ...styles.actionBadge, ...getActionStyle(file.action) }}>
                       {file.action}
                     </span>
-                    <span style={styles.path}>{file.path}</span>
+                    <span style={styles.path}>
+                      {file.action === "INDEX"
+                        ? "Build semantic index for this repo"
+                        : file.path}
+                    </span>
                   </div>
                 ))}
               </div>
             )}
 
+            {/* B9: explain the INDEX step's cost so users can make an
+                informed decision before clicking Approve. */}
+            {s.files && s.files.some((f) => f.action === "INDEX") && (
+              <div style={styles.indexNotice}>
+                📦 One-time semantic index build.
+                Embeds every file locally with MiniLM-L6-v2 (~80 MB
+                model on first run, ~30 s wall time for a typical
+                repo, ~12 MB on disk).  No cloud calls.  Makes future
+                "find / where / how" queries instant.  Click{" "}
+                <strong>Reject plan</strong> to skip — you'll be
+                offered the grep fallback.
+              </div>
+            )}
+
             {/* Risks */}
             {s.risks && (
               <div style={styles.risks}>
diff --git a/frontend/components/RunnableCodeBlock.jsx b/frontend/components/RunnableCodeBlock.jsx
new file mode 100644
index 0000000..4e03d66
--- /dev/null
+++ b/frontend/components/RunnableCodeBlock.jsx
@@ -0,0 +1,283 @@
+import React, { useState } from "react";
+
+// Languages the Run button supports.  Anything not in this set still
+// renders as a normal code block (no button) — keeps the visual contract
+// honest: if there's a button, the snippet really is executable.
+const RUNNABLE = new Set([
+  "python", "py",
+  "javascript", "js", "node",
+  "bash", "sh", "shell",
+]);
+
+// Friendly badge text per backend, surfaced so the user always knows
+// which sandbox actually ran their code.  Mirrors the labels in
+// SettingsModal so the two views agree.
+const BACKEND_LABELS = {
+  subprocess: "Local",
+  matrixlab: "MatrixLab",
+  off: "Pass-through",
+};
+
+// Map "py" → "python" etc. so the badge always shows the canonical
+// language name rather than whatever alias the LLM tagged the fence
+// with.
+const LANG_DISPLAY = {
+  py: "python",
+  js: "javascript",
+  node: "javascript",
+  sh: "bash",
+  shell: "bash",
+};
+
+/** A single fenced code block with a per-block Run button. */
+export default function RunnableCodeBlock({ language, code }) {
+  const lang = (language || "").trim().toLowerCase();
+  const canRun = RUNNABLE.has(lang);
+  const [busy, setBusy] = useState(false);
+  const [result, setResult] = useState(null);
+  const [error, setError] = useState(null);
+  const display = LANG_DISPLAY[lang] || lang || "text";
+
+  const onRun = async () => {
+    setBusy(true);
+    setResult(null);
+    setError(null);
+    try {
+      const res = await fetch("/api/sandbox/run", {
+        method: "POST",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify({ language: lang, code }),
+      });
+      const data = await res.json();
+      if (!res.ok) {
+        setError(data.detail || `HTTP ${res.status}`);
+        return;
+      }
+      setResult(data);
+    } catch (err) {
+      setError(err.message || "Run failed");
+    } finally {
+      setBusy(false);
+    }
+  };
+
+  const copy = () => {
+    if (navigator?.clipboard) navigator.clipboard.writeText(code).catch(() => {});
+  };
+
+  return (
+    <div style={styles.wrap}>
+      <div style={styles.head}>
+        <span style={styles.lang}>{display}</span>
+        <div style={styles.headRight}>
+          <button type="button" style={styles.iconBtn} onClick={copy} title="Copy code">
+            Copy
+          </button>
+          {canRun && (
+            <button
+              type="button"
+              style={{ ...styles.runBtn, opacity: busy ? 0.6 : 1 }}
+              onClick={onRun}
+              disabled={busy}
+              title="Execute this snippet in the configured sandbox"
+            >
+              {busy ? "Running…" : "▶ Run"}
+            </button>
+          )}
+        </div>
+      </div>
+      <pre style={styles.code}>{code}</pre>
+
+      {(result || error) && (
+        <div style={styles.output}>
+          <div style={styles.outputHead}>
+            <span style={styles.outputLabel}>Output</span>
+            {result && (
+              <span style={styles.metaRow}>
+                <span style={result.exit_code === 0 ? styles.okPill : styles.failPill}>
+                  exit {result.exit_code}
+                </span>
+                <span style={styles.backendPill}>
+                  {BACKEND_LABELS[result.backend] || result.backend}
+                </span>
+                {typeof result.duration_ms === "number" && (
+                  <span style={styles.dim}>{result.duration_ms} ms</span>
+                )}
+                {result.timed_out && <span style={styles.failPill}>timed out</span>}
+                {result.truncated && <span style={styles.warnPill}>truncated</span>}
+              </span>
+            )}
+          </div>
+          {error && <pre style={styles.stderr}>{error}</pre>}
+          {result?.stdout && <pre style={styles.stdout}>{result.stdout}</pre>}
+          {result?.stderr && <pre style={styles.stderr}>{result.stderr}</pre>}
+          {result && !result.stdout && !result.stderr && (
+            <div style={styles.dim}>(no output)</div>
+          )}
+        </div>
+      )}
+    </div>
+  );
+}
+
+/** Split a markdown-ish string into text and fenced-code segments.
+ *
+ * Returned shape: ``[{type: 'text', value} | {type: 'code', language, code}]``.
+ *
+ * Kept deliberately small — full markdown rendering is out of scope; this
+ * only needs to recognise ```lang fences so the Run button can attach to
+ * code blocks the model emits. */
+export function splitFences(input) {
+  if (!input) return [];
+  const out = [];
+  const re = /```([a-zA-Z0-9_+-]*)\s*\n([\s\S]*?)```/g;
+  let last = 0;
+  let m;
+  while ((m = re.exec(input)) !== null) {
+    if (m.index > last) {
+      out.push({ type: "text", value: input.slice(last, m.index) });
+    }
+    out.push({ type: "code", language: m[1] || "", code: m[2].replace(/\s+$/, "") });
+    last = m.index + m[0].length;
+  }
+  if (last < input.length) {
+    out.push({ type: "text", value: input.slice(last) });
+  }
+  return out;
+}
+
+const styles = {
+  wrap: {
+    margin: "8px 0",
+    background: "#09090B",
+    border: "1px solid #27272A",
+    borderRadius: 8,
+    overflow: "hidden",
+    fontFamily: '-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif',
+  },
+  head: {
+    display: "flex",
+    alignItems: "center",
+    justifyContent: "space-between",
+    padding: "6px 12px",
+    background: "#18181B",
+    borderBottom: "1px solid #27272A",
+    fontSize: 11,
+  },
+  headRight: { display: "flex", gap: 6, alignItems: "center" },
+  lang: {
+    color: "#A1A1AA",
+    fontWeight: 600,
+    textTransform: "uppercase",
+    letterSpacing: "0.05em",
+    fontSize: 10,
+  },
+  iconBtn: {
+    background: "transparent",
+    color: "#A1A1AA",
+    border: "1px solid #3F3F46",
+    borderRadius: 4,
+    padding: "2px 8px",
+    fontSize: 11,
+    cursor: "pointer",
+  },
+  runBtn: {
+    background: "#10B981",
+    color: "#052e1c",
+    border: "0",
+    borderRadius: 4,
+    padding: "2px 10px",
+    fontSize: 11,
+    fontWeight: 600,
+    cursor: "pointer",
+  },
+  code: {
+    margin: 0,
+    padding: "12px 14px",
+    fontFamily: "ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace",
+    fontSize: 12.5,
+    lineHeight: 1.55,
+    color: "#E4E4E7",
+    whiteSpace: "pre-wrap",
+    wordBreak: "break-word",
+    overflowX: "auto",
+  },
+  output: {
+    background: "#0c0c10",
+    borderTop: "1px solid #27272A",
+    padding: "8px 14px 10px",
+  },
+  outputHead: {
+    display: "flex",
+    alignItems: "center",
+    justifyContent: "space-between",
+    marginBottom: 6,
+  },
+  outputLabel: {
+    fontSize: 10,
+    fontWeight: 600,
+    color: "#A1A1AA",
+    textTransform: "uppercase",
+    letterSpacing: "0.05em",
+  },
+  metaRow: { display: "flex", gap: 6, alignItems: "center" },
+  okPill: {
+    fontSize: 10,
+    fontWeight: 600,
+    padding: "1px 6px",
+    borderRadius: 9,
+    background: "rgba(16, 185, 129, 0.12)",
+    color: "#10B981",
+    border: "1px solid rgba(16, 185, 129, 0.35)",
+  },
+  failPill: {
+    fontSize: 10,
+    fontWeight: 600,
+    padding: "1px 6px",
+    borderRadius: 9,
+    background: "rgba(239, 68, 68, 0.12)",
+    color: "#ef4444",
+    border: "1px solid rgba(239, 68, 68, 0.35)",
+  },
+  warnPill: {
+    fontSize: 10,
+    fontWeight: 600,
+    padding: "1px 6px",
+    borderRadius: 9,
+    background: "rgba(217, 119, 6, 0.12)",
+    color: "#f59e0b",
+    border: "1px solid rgba(217, 119, 6, 0.35)",
+  },
+  backendPill: {
+    fontSize: 10,
+    fontWeight: 600,
+    padding: "1px 6px",
+    borderRadius: 9,
+    background: "rgba(79, 70, 229, 0.12)",
+    color: "#a5b4fc",
+    border: "1px solid rgba(79, 70, 229, 0.35)",
+  },
+  dim: { color: "#71717A", fontSize: 11 },
+  stdout: {
+    margin: "4px 0 0",
+    padding: "6px 8px",
+    fontFamily: "ui-monospace, SFMono-Regular, Menlo, monospace",
+    fontSize: 12,
+    color: "#D4D4D8",
+    background: "#000",
+    borderRadius: 4,
+    whiteSpace: "pre-wrap",
+    wordBreak: "break-word",
+  },
+  stderr: {
+    margin: "4px 0 0",
+    padding: "6px 8px",
+    fontFamily: "ui-monospace, SFMono-Regular, Menlo, monospace",
+    fontSize: 12,
+    color: "#fca5a5",
+    background: "#0a0000",
+    borderRadius: 4,
+    whiteSpace: "pre-wrap",
+    wordBreak: "break-word",
+  },
+};
diff --git a/frontend/components/SettingsModal.jsx b/frontend/components/SettingsModal.jsx
index 24d43b1..61a44dd 100644
--- a/frontend/components/SettingsModal.jsx
+++ b/frontend/components/SettingsModal.jsx
@@ -1,5 +1,23 @@
 import React, { useEffect, useState } from "react";
 
+const SANDBOX_BACKENDS = [
+  {
+    id: "subprocess",
+    label: "Local",
+    sub: "Host subprocess with a workspace jail. Default — best for trying simple snippets.",
+  },
+  {
+    id: "matrixlab",
+    label: "MatrixLab",
+    sub: "Containerised, ephemeral sandboxes from a MatrixLab Runner. Recommended for enterprise.",
+  },
+  {
+    id: "off",
+    label: "Pass-through",
+    sub: "Run on the host with no jail. Local development only.",
+  },
+];
+
 export default function SettingsModal({ onClose }) {
   const [settings, setSettings] = useState(null);
   const [models, setModels] = useState([]);
@@ -7,15 +25,128 @@ export default function SettingsModal({ onClose }) {
   const [loadingModels, setLoadingModels] = useState(false);
   const [testResult, setTestResult] = useState(null); // { ok: bool, message: string }
   const [testing, setTesting] = useState(false);
+  // Sandbox runtime state. ``sandbox`` is the persisted block from the
+  // settings response; ``sandboxStatus`` is the live probe result
+  // (ok / error). Both are independent of LLM settings so a failed
+  // MatrixLab probe doesn't block provider switching.
+  const [sandbox, setSandbox] = useState(null);
+  const [sandboxStatus, setSandboxStatus] = useState(null);
+  const [sandboxTokenInput, setSandboxTokenInput] = useState("");
+  const [sandboxBusy, setSandboxBusy] = useState(false);
+  // MatrixLab lifecycle state — separate from the sandbox runtime state
+  // because the lifecycle endpoints can run for many seconds (docker
+  // pulls) and we don't want to block the "switch backend" buttons on
+  // a running install.
+  const [lifecycle, setLifecycle] = useState(null);
+  const [lifecycleBusy, setLifecycleBusy] = useState(null); // "install" | "start" | "stop" | null
+  const [lifecycleLog, setLifecycleLog] = useState([]);
+  const [showLifecycleLog, setShowLifecycleLog] = useState(false);
 
   const loadSettings = async () => {
     const res = await fetch("/api/settings");
     const data = await res.json();
     setSettings(data);
+    if (data?.sandbox) setSandbox(data.sandbox);
+  };
+
+  const loadSandboxStatus = async () => {
+    try {
+      const res = await fetch("/api/sandbox/status");
+      const data = await res.json();
+      setSandboxStatus({ ok: data.ok, error: data.error, remote: data.remote });
+      // /status returns the same shape as the persisted block, so refresh
+      // the form state from it — the env vars may override settings.json
+      // and we want the UI to show what's actually live.
+      setSandbox((prev) => ({
+        ...(prev || {}),
+        backend: data.backend,
+        matrixlab_url: data.matrixlab_url,
+        matrixlab_image: data.matrixlab_image,
+        allow_network: data.allow_network,
+        timeout_sec: data.timeout_sec,
+        has_token: data.has_token,
+      }));
+    } catch (err) {
+      setSandboxStatus({ ok: false, error: err.message || "status probe failed" });
+    }
+  };
+
+  const updateSandbox = async (patch) => {
+    setSandboxBusy(true);
+    try {
+      const res = await fetch("/api/sandbox/config", {
+        method: "PUT",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify(patch),
+      });
+      const data = await res.json();
+      if (!res.ok) {
+        setSandboxStatus({ ok: false, error: data.detail || "update failed" });
+        return;
+      }
+      setSandbox((prev) => ({
+        ...(prev || {}),
+        backend: data.backend,
+        matrixlab_url: data.matrixlab_url,
+        matrixlab_image: data.matrixlab_image,
+        allow_network: data.allow_network,
+        timeout_sec: data.timeout_sec,
+        has_token: data.has_token,
+      }));
+      setSandboxStatus({ ok: data.ok, error: data.error, remote: data.remote });
+      // Always clear the local token input after a save so a stale value
+      // doesn't sit in the DOM. The backend stores it; we don't need to
+      // hold it client-side.
+      if ("matrixlab_token" in patch) setSandboxTokenInput("");
+    } finally {
+      setSandboxBusy(false);
+    }
+  };
+
+  const loadLifecycle = async () => {
+    try {
+      const res = await fetch("/api/sandbox/matrixlab/lifecycle");
+      const data = await res.json();
+      setLifecycle(data);
+      if (Array.isArray(data.steps) && data.steps.length) {
+        setLifecycleLog(data.steps);
+      }
+    } catch (err) {
+      setLifecycle({
+        docker_available: false,
+        installed: false,
+        running: false,
+        lifecycle_enabled: false,
+        error: err.message || "lifecycle probe failed",
+      });
+    }
+  };
+
+  const runLifecycle = async (action) => {
+    if (!["install", "start", "stop"].includes(action)) return;
+    setLifecycleBusy(action);
+    setShowLifecycleLog(true);
+    try {
+      const res = await fetch(`/api/sandbox/matrixlab/${action}`, { method: "POST" });
+      const data = await res.json();
+      if (!res.ok) {
+        setLifecycle((prev) => ({ ...(prev || {}), error: data.detail || `HTTP ${res.status}` }));
+        return;
+      }
+      setLifecycle(data);
+      setLifecycleLog(data.steps || []);
+      // Refresh the runtime status — a successful start should flip
+      // sandboxStatus.ok to true.
+      loadSandboxStatus();
+    } finally {
+      setLifecycleBusy(null);
+    }
   };
 
   useEffect(() => {
     loadSettings();
+    loadSandboxStatus();
+    loadLifecycle();
   }, []);
 
   const changeProvider = async (provider) => {
@@ -327,6 +458,376 @@ export default function SettingsModal({ onClose }) {
             phi-3-mini, gemma-2b, tinyllama, etc.
           </div>
         </div>
+
+        {/* Sandbox Runtime section — controls the Run button on chat
+            code blocks. Local subprocess is the default so users can
+            try simple snippets immediately; MatrixLab is the enterprise
+            opt-in for containerised isolation. */}
+        {sandbox && (
+          <div
+            style={{
+              marginTop: 16,
+              paddingTop: 12,
+              borderTop: "1px solid #2c2d46",
+              fontSize: 13,
+            }}
+          >
+            <div style={{ display: "flex", justifyContent: "space-between", alignItems: "baseline", marginBottom: 8 }}>
+              <div style={{ color: "#c3c5dd", fontWeight: 600 }}>Sandbox runtime</div>
+              {sandboxStatus && (
+                <span style={{
+                  display: "inline-flex",
+                  alignItems: "center",
+                  gap: 6,
+                  fontSize: 11,
+                  fontWeight: 600,
+                  padding: "2px 8px",
+                  borderRadius: 10,
+                  background: sandboxStatus.ok ? "#0d3320" : "#3d1111",
+                  border: `1px solid ${sandboxStatus.ok ? "#166534" : "#7f1d1d"}`,
+                  color: sandboxStatus.ok ? "#86efac" : "#fca5a5",
+                }}>
+                  <span style={{
+                    width: 6, height: 6, borderRadius: "50%",
+                    background: sandboxStatus.ok ? "#10B981" : "#ef4444",
+                  }} />
+                  {sandboxStatus.ok ? "Reachable" : "Unreachable"}
+                </span>
+              )}
+            </div>
+            <div style={{ fontSize: 11, color: "#9092b5", lineHeight: 1.5, marginBottom: 10 }}>
+              Where the Run button on generated code blocks executes. Choose Local
+              for a quick try, or install MatrixLab and switch to it for isolated
+              enterprise sandboxes.
+            </div>
+
+            <div style={{ display: "flex", flexDirection: "column", gap: 6 }}>
+              {SANDBOX_BACKENDS.map((b) => (
+                <label
+                  key={b.id}
+                  style={{
+                    display: "flex",
+                    gap: 10,
+                    padding: "8px 10px",
+                    borderRadius: 6,
+                    border: `1px solid ${sandbox.backend === b.id ? "#4f46e5" : "#2c2d46"}`,
+                    background: sandbox.backend === b.id ? "#1a1a3a" : "transparent",
+                    cursor: sandboxBusy ? "not-allowed" : "pointer",
+                    opacity: sandboxBusy ? 0.6 : 1,
+                  }}
+                >
+                  <input
+                    type="radio"
+                    name="sandbox-backend"
+                    value={b.id}
+                    checked={sandbox.backend === b.id}
+                    disabled={sandboxBusy}
+                    onChange={() => updateSandbox({ backend: b.id })}
+                    style={{ marginTop: 2 }}
+                  />
+                  <div style={{ flex: 1 }}>
+                    <div style={{ fontSize: 13, fontWeight: 500, color: "#e6e8ff" }}>{b.label}</div>
+                    <div style={{ fontSize: 11, color: "#9092b5", marginTop: 2 }}>{b.sub}</div>
+                  </div>
+                </label>
+              ))}
+            </div>
+
+            {sandbox.backend === "matrixlab" && (
+              <div style={{ marginTop: 10, padding: 10, background: "#0e0f24", borderRadius: 6, border: "1px solid #2c2d46" }}>
+                <div style={{ display: "grid", gridTemplateColumns: "120px 1fr", gap: 8, alignItems: "center" }}>
+                  <label style={{ fontSize: 12, color: "#c3c5dd" }}>Runner URL</label>
+                  <input
+                    type="text"
+                    value={sandbox.matrixlab_url || ""}
+                    onChange={(e) => setSandbox({ ...sandbox, matrixlab_url: e.target.value })}
+                    onBlur={() => updateSandbox({ matrixlab_url: sandbox.matrixlab_url })}
+                    placeholder="http://localhost:8000"
+                    style={{
+                      fontSize: 12, padding: "4px 6px",
+                      background: "#14152a", color: "#e6e8ff",
+                      border: "1px solid #2c2d46", borderRadius: 4,
+                    }}
+                  />
+                  <label style={{ fontSize: 12, color: "#c3c5dd" }}>Bearer token</label>
+                  <div style={{ display: "flex", gap: 6 }}>
+                    <input
+                      type="password"
+                      value={sandboxTokenInput}
+                      onChange={(e) => setSandboxTokenInput(e.target.value)}
+                      placeholder={sandbox.has_token ? "•••••••• (saved)" : "Optional"}
+                      style={{
+                        flex: 1,
+                        fontSize: 12, padding: "4px 6px",
+                        background: "#14152a", color: "#e6e8ff",
+                        border: "1px solid #2c2d46", borderRadius: 4,
+                      }}
+                    />
+                    <button
+                      type="button"
+                      className="chat-btn secondary"
+                      style={{ padding: "2px 8px", fontSize: 11 }}
+                      disabled={sandboxBusy || !sandboxTokenInput}
+                      onClick={() => updateSandbox({ matrixlab_token: sandboxTokenInput })}
+                    >
+                      Save token
+                    </button>
+                    {sandbox.has_token && (
+                      <button
+                        type="button"
+                        className="chat-btn secondary"
+                        style={{ padding: "2px 8px", fontSize: 11 }}
+                        disabled={sandboxBusy}
+                        onClick={() => updateSandbox({ matrixlab_token: "" })}
+                        title="Clear the saved token"
+                      >
+                        Clear
+                      </button>
+                    )}
+                  </div>
+                  <label style={{ fontSize: 12, color: "#c3c5dd" }}>Default image</label>
+                  <input
+                    type="text"
+                    value={sandbox.matrixlab_image || ""}
+                    onChange={(e) => setSandbox({ ...sandbox, matrixlab_image: e.target.value })}
+                    onBlur={() => updateSandbox({ matrixlab_image: sandbox.matrixlab_image })}
+                    placeholder="matrixlab-python (let runner pick)"
+                    style={{
+                      fontSize: 12, padding: "4px 6px",
+                      background: "#14152a", color: "#e6e8ff",
+                      border: "1px solid #2c2d46", borderRadius: 4,
+                    }}
+                  />
+                </div>
+              </div>
+            )}
+
+            {/* MatrixLab lifecycle card — only shown when MatrixLab is
+                the selected backend. The button label tracks the
+                detected state: Install → Start → Stop. When the
+                operator hasn't enabled GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE
+                the actions are disabled and an inline hint explains how
+                to flip the env flag. */}
+            {sandbox.backend === "matrixlab" && lifecycle && (
+              <div style={{
+                marginTop: 10, padding: 10,
+                background: "#0e0f24", borderRadius: 6,
+                border: "1px solid #2c2d46",
+              }}>
+                <div style={{ display: "flex", justifyContent: "space-between", alignItems: "center", marginBottom: 8 }}>
+                  <div style={{ fontSize: 12, fontWeight: 600, color: "#c3c5dd" }}>
+                    MatrixLab lifecycle
+                  </div>
+                  <span style={{
+                    display: "inline-flex", alignItems: "center", gap: 6,
+                    fontSize: 11, fontWeight: 600, padding: "2px 8px", borderRadius: 10,
+                    background: lifecycle.running ? "#0d3320"
+                      : lifecycle.installed ? "#3d2d11"
+                      : "#3d1111",
+                    border: `1px solid ${lifecycle.running ? "#166534"
+                      : lifecycle.installed ? "#854d0e"
+                      : "#7f1d1d"}`,
+                    color: lifecycle.running ? "#86efac"
+                      : lifecycle.installed ? "#fde68a"
+                      : "#fca5a5",
+                  }}>
+                    <span style={{
+                      width: 6, height: 6, borderRadius: "50%",
+                      background: lifecycle.running ? "#10B981"
+                        : lifecycle.installed ? "#f59e0b"
+                        : "#ef4444",
+                    }} />
+                    {lifecycle.running ? "Running"
+                      : lifecycle.installed ? "Installed · stopped"
+                      : "Not installed"}
+                  </span>
+                  {sandbox.env_override && (
+                    <span style={{
+                      marginLeft: 8, fontSize: 10, fontWeight: 600,
+                      padding: "1px 6px", borderRadius: 9,
+                      background: "rgba(217, 119, 6, 0.12)",
+                      color: "#f59e0b",
+                      border: "1px solid rgba(217, 119, 6, 0.35)",
+                    }}
+                    title={`Env var ${sandbox.env_override} is overriding the persisted setting`}>
+                    env override
+                    </span>
+                  )}
+                </div>
+
+                <div style={{ display: "flex", gap: 6, flexWrap: "wrap", marginBottom: 6 }}>
+                  {/* Running > Installed > Not-installed.  Checking
+                      ``running`` first matters when the operator
+                      brought MatrixLab up from source (e.g. `make run`
+                      inside a checkout) so the image tag doesn't match
+                      our canonical ``ruslanmv/matrixlab-runner:latest``.
+                      The URL still answers, so we must not offer to
+                      install on top of a healthy runner. */}
+                  {lifecycle.running ? (
+                    <button
+                      type="button"
+                      className="chat-btn secondary"
+                      style={{ padding: "4px 10px", fontSize: 11 }}
+                      disabled={!lifecycle.lifecycle_enabled || lifecycleBusy != null}
+                      onClick={() => runLifecycle("stop")}
+                      title="docker stop the GitPilot-managed runner container"
+                    >
+                      {lifecycleBusy === "stop" ? "Stopping…" : "Stop"}
+                    </button>
+                  ) : lifecycle.installed ? (
+                    <button
+                      type="button"
+                      className="chat-btn"
+                      style={{ padding: "4px 10px", fontSize: 11, fontWeight: 600 }}
+                      disabled={!lifecycle.lifecycle_enabled || lifecycleBusy != null}
+                      onClick={() => runLifecycle("start")}
+                      title="docker run the MatrixLab runner container"
+                    >
+                      {lifecycleBusy === "start" ? "Starting…" : "Start"}
+                    </button>
+                  ) : (
+                    <button
+                      type="button"
+                      className="chat-btn"
+                      style={{ padding: "4px 10px", fontSize: 11, fontWeight: 600 }}
+                      disabled={!lifecycle.lifecycle_enabled || !lifecycle.docker_available || lifecycleBusy != null}
+                      onClick={() => runLifecycle("install")}
+                      title="docker pull the MatrixLab runner + sandbox images"
+                    >
+                      {lifecycleBusy === "install" ? "Installing…" : "Install"}
+                    </button>
+                  )}
+                  <button
+                    type="button"
+                    className="chat-btn secondary"
+                    style={{ padding: "4px 10px", fontSize: 11 }}
+                    disabled={lifecycleBusy != null}
+                    onClick={loadLifecycle}
+                  >
+                    Refresh
+                  </button>
+                  {lifecycleLog.length > 0 && (
+                    <button
+                      type="button"
+                      className="chat-btn secondary"
+                      style={{ padding: "4px 10px", fontSize: 11 }}
+                      onClick={() => setShowLifecycleLog((v) => !v)}
+                    >
+                      {showLifecycleLog ? "Hide log" : `Show log (${lifecycleLog.length})`}
+                    </button>
+                  )}
+                </div>
+
+                {lifecycle.instructions && (
+                  <div style={{
+                    fontSize: 11, lineHeight: 1.5, color: "#fde68a",
+                    padding: "6px 8px", background: "#2a210d",
+                    border: "1px solid #854d0e", borderRadius: 4,
+                    marginBottom: 6,
+                  }}>
+                    {lifecycle.instructions}
+                  </div>
+                )}
+                {lifecycle.error && (
+                  <div style={{
+                    fontSize: 11, color: "#fca5a5", fontFamily: "ui-monospace, monospace",
+                    padding: "6px 8px", background: "#3d1111", border: "1px solid #7f1d1d",
+                    borderRadius: 4, marginBottom: 6,
+                  }}>
+                    {lifecycle.error}
+                  </div>
+                )}
+
+                {/* Per-step transcript — surfaced so failures are
+                    debuggable from the UI without SSH'ing to the host. */}
+                {showLifecycleLog && lifecycleLog.length > 0 && (
+                  <div style={{
+                    marginTop: 4, padding: 8, background: "#000",
+                    borderRadius: 4, border: "1px solid #2c2d46",
+                    fontFamily: "ui-monospace, monospace", fontSize: 11,
+                    maxHeight: 240, overflow: "auto",
+                  }}>
+                    {lifecycleLog.map((step, i) => (
+                      <div key={i} style={{ marginBottom: 8, paddingBottom: 6, borderBottom: i === lifecycleLog.length - 1 ? "0" : "1px dashed #2c2d46" }}>
+                        <div style={{ color: "#a5b4fc" }}>$ {step.cmd}</div>
+                        <div style={{
+                          color: step.exit_code === 0 ? "#86efac" : "#fca5a5",
+                          fontSize: 10, marginTop: 2,
+                        }}>
+                          exit {step.exit_code} · {step.duration_ms} ms
+                        </div>
+                        {step.stdout && <pre style={{ margin: "4px 0 0", color: "#D4D4D8", whiteSpace: "pre-wrap" }}>{step.stdout}</pre>}
+                        {step.stderr && <pre style={{ margin: "4px 0 0", color: "#fca5a5", whiteSpace: "pre-wrap" }}>{step.stderr}</pre>}
+                      </div>
+                    ))}
+                  </div>
+                )}
+              </div>
+            )}
+
+            <div style={{ display: "flex", gap: 8, alignItems: "center", marginTop: 10, flexWrap: "wrap" }}>
+              <button
+                type="button"
+                className="chat-btn secondary"
+                style={{ padding: "4px 8px", fontSize: 11 }}
+                onClick={loadSandboxStatus}
+                disabled={sandboxBusy}
+              >
+                Test connection
+              </button>
+              <label style={{ display: "flex", gap: 6, alignItems: "center", fontSize: 12, color: "#c3c5dd" }}>
+                <input
+                  type="checkbox"
+                  checked={!!sandbox.allow_network}
+                  disabled={sandboxBusy}
+                  onChange={(e) => updateSandbox({ allow_network: e.target.checked })}
+                />
+                Allow network egress
+              </label>
+              <label style={{ display: "flex", gap: 6, alignItems: "center", fontSize: 12, color: "#c3c5dd" }}>
+                Timeout
+                <input
+                  type="number"
+                  min={1}
+                  max={600}
+                  value={sandbox.timeout_sec || 120}
+                  disabled={sandboxBusy}
+                  onChange={(e) => setSandbox({ ...sandbox, timeout_sec: Number(e.target.value) || 120 })}
+                  onBlur={() => updateSandbox({ timeout_sec: Number(sandbox.timeout_sec) || 120 })}
+                  style={{
+                    width: 64,
+                    fontSize: 12, padding: "2px 6px",
+                    background: "#14152a", color: "#e6e8ff",
+                    border: "1px solid #2c2d46", borderRadius: 4,
+                  }}
+                />
+                <span style={{ color: "#9092b5" }}>s</span>
+              </label>
+            </div>
+
+            {sandboxStatus?.error && (
+              <div style={{
+                marginTop: 8,
+                padding: "6px 10px",
+                borderRadius: 6,
+                background: "#3d1111",
+                border: "1px solid #7f1d1d",
+                color: "#fca5a5",
+                fontSize: 11,
+                fontFamily: "ui-monospace, SFMono-Regular, Menlo, monospace",
+              }}>
+                {sandboxStatus.error}
+              </div>
+            )}
+            {sandbox.backend === "matrixlab" && sandboxStatus?.ok && sandboxStatus?.remote?.version && (
+              <div style={{ marginTop: 6, fontSize: 11, color: "#9092b5" }}>
+                MatrixLab Runner v{sandboxStatus.remote.version}
+                {typeof sandboxStatus.remote.uptime_s === "number" &&
+                  ` · up ${Math.round(sandboxStatus.remote.uptime_s / 60)} min`}
+              </div>
+            )}
+          </div>
+        )}
       </div>
     </div>
   );
diff --git a/frontend/components/TasksPanel.jsx b/frontend/components/TasksPanel.jsx
new file mode 100644
index 0000000..114d033
--- /dev/null
+++ b/frontend/components/TasksPanel.jsx
@@ -0,0 +1,382 @@
+// frontend/components/TasksPanel.jsx
+//
+// Right-sidebar Tasks panel — Claude Code-style trace of every AI
+// invocation in the active session.  Trigger is a small ⊞ icon next
+// to the context meter; clicking it opens a popover anchored to the
+// composer rail.
+//
+// V1 contract (simplest cut):
+//   - One task per top-level user action (Plan, Execute).
+//   - Lazy fetch on open + manual ↻ refresh.  Zero idle traffic.
+//   - No cost row.  Token counts shown only when the provider exposes
+//     them; otherwise "—".
+//
+// GitPilot brand orange #D95C3D marks the running state only (pulsing
+// dot and trigger outline); completed is green, failed is red.  No new
+// deps; inline styles + scoped <style> block, same as ContextMeter.
+
+import React, { useEffect, useRef, useState } from "react";
+
+const GITPILOT_ORANGE = "#D95C3D";
+const SUCCESS_GREEN = "#10B981";
+const FAIL_RED = "#EF4444";
+const DIM = "#9aa0b4";
+
+const fmtMs = (ms) => {
+  if (ms == null) return "—";
+  if (ms < 1000) return `${ms} ms`;
+  const s = ms / 1000;
+  if (s < 60) return `${s.toFixed(s < 10 ? 1 : 0)} s`;
+  const total = Math.round(s); // round once so the seconds part stays 0-59
+  const m = Math.floor(total / 60);
+  return `${m} m ${(total % 60).toString().padStart(2, "0")} s`;
+};
+
+const fmtTokens = (n) => {
+  if (n == null) return "—";
+  if (n < 1000) return `${n}`;
+  if (n < 1_000_000) return `${(n / 1000).toFixed(1)}k`;
+  return `${(n / 1_000_000).toFixed(2)}M`;
+};
+
+function StatusGlyph({ status }) {
+  if (status === "running") {
+    return (
+      <span
+        aria-label="running"
+        title="running"
+        style={{
+          display: "inline-block",
+          width: 10,
+          height: 10,
+          borderRadius: "50%",
+          background: GITPILOT_ORANGE,
+          animation: "gitpilot-tasks-pulse 1.2s ease-in-out infinite",
+        }}
+      />
+    );
+  }
+  if (status === "failed") {
+    return (
+      <span aria-label="failed" title="failed" style={{ color: FAIL_RED, fontSize: 13 }}>
+        ✕
+      </span>
+    );
+  }
+  return (
+    <span aria-label="completed" title="completed" style={{ color: SUCCESS_GREEN, fontSize: 13 }}>
+      ✓
+    </span>
+  );
+}
+
+function TaskRow({ task }) {
+  const kindLabel = task.kind ? task.kind[0].toUpperCase() + task.kind.slice(1) : "—";
+  const parts = [kindLabel];
+  parts.push(fmtMs(task.duration_ms));
+  if (task.prompt_tokens != null || task.completion_tokens != null) {
+    const t = (task.prompt_tokens || 0) + (task.completion_tokens || 0);
+    if (t > 0) parts.push(`${fmtTokens(t)} tokens`);
+  }
+  parts.push(task.status === "completed" ? "✓ completed" : task.status);
+  return (
+    <div className="ctx-task-row">
+      <div className="ctx-task-glyph">
+        <StatusGlyph status={task.status} />
+      </div>
+      <div className="ctx-task-body">
+        <div className="ctx-task-title" title={task.title}>
+          {task.title || "(untitled)"}
+        </div>
+        <div className="ctx-task-meta">{parts.join(" · ")}</div>
+        {task.error && (
+          <div className="ctx-task-err" title={task.error}>
+            {task.error}
+          </div>
+        )}
+      </div>
+    </div>
+  );
+}
+
+export default function TasksPanel({ sessionId = null }) {
+  const [open, setOpen] = useState(false);
+  const [tasks, setTasks] = useState([]);
+  const [loading, setLoading] = useState(false);
+  const [error, setError] = useState(null);
+  const popoverRef = useRef(null);
+  const triggerRef = useRef(null);
+
+  const fetchTasks = async () => {
+    if (!sessionId) {
+      setTasks([]);
+      return;
+    }
+    setLoading(true);
+    setError(null);
+    try {
+      const r = await fetch(`/api/sessions/${encodeURIComponent(sessionId)}/tasks`);
+      if (!r.ok) {
+        if (r.status === 404) {
+          setError("disabled");
+          setTasks([]);
+        } else {
+          setError(`http ${r.status}`);
+        }
+      } else {
+        const data = await r.json();
+        setTasks(Array.isArray(data.tasks) ? data.tasks : []);
+      }
+    } catch (e) {
+      setError(String(e?.message || e));
+    } finally {
+      setLoading(false);
+    }
+  };
+
+  useEffect(() => {
+    if (open) fetchTasks();
+  }, [open, sessionId]); // eslint-disable-line react-hooks/exhaustive-deps
+
+  useEffect(() => {
+    setTasks([]);
+  }, [sessionId]);
+
+  useEffect(() => {
+    if (!open) return;
+    const onDocClick = (e) => {
+      if (
+        popoverRef.current &&
+        !popoverRef.current.contains(e.target) &&
+        triggerRef.current &&
+        !triggerRef.current.contains(e.target)
+      ) {
+        setOpen(false);
+      }
+    };
+    const onKey = (e) => {
+      if (e.key === "Escape") setOpen(false);
+    };
+    document.addEventListener("mousedown", onDocClick);
+    document.addEventListener("keydown", onKey);
+    return () => {
+      document.removeEventListener("mousedown", onDocClick);
+      document.removeEventListener("keydown", onKey);
+    };
+  }, [open]);
+
+  if (error === "disabled") return null;
+
+  const running = tasks.filter((t) => t.status === "running");
+  const completed = tasks.filter((t) => t.status !== "running");
+
+  return (
+    <span className="gitpilot-tasks-panel" style={{ position: "relative", display: "inline-flex" }}>
+      <style>{`
+        @keyframes gitpilot-tasks-pulse {
+          0%, 100% { opacity: 1; }
+          50% { opacity: 0.35; }
+        }
+        .gitpilot-tasks-panel .tasks-trigger {
+          background: transparent;
+          border: 1px solid rgba(255,255,255,0.12);
+          color: ${DIM};
+          width: 22px;
+          height: 22px;
+          border-radius: 4px;
+          display: inline-flex;
+          align-items: center;
+          justify-content: center;
+          font-size: 12px;
+          line-height: 1;
+          cursor: pointer;
+          padding: 0;
+          transition: color 120ms ease, border-color 120ms ease;
+        }
+        .gitpilot-tasks-panel .tasks-trigger:hover,
+        .gitpilot-tasks-panel .tasks-trigger:focus-visible {
+          color: #e5e7eb;
+          border-color: rgba(255,255,255,0.28);
+          outline: none;
+        }
+        .gitpilot-tasks-panel .tasks-trigger[data-running="1"] { color: ${GITPILOT_ORANGE}; border-color: ${GITPILOT_ORANGE}55; }
+        .gitpilot-tasks-panel .tasks-popover {
+          position: absolute;
+          right: 0;
+          bottom: calc(100% + 8px);
+          width: 380px;
+          max-height: 480px;
+          overflow-y: auto;
+          background: #1a1c25;
+          border: 1px solid rgba(255,255,255,0.10);
+          border-radius: 8px;
+          box-shadow: 0 8px 24px rgba(0,0,0,0.45);
+          padding: 12px 14px;
+          z-index: 50;
+          font-family: ui-sans-serif, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
+        }
+        .gitpilot-tasks-panel .tasks-popover h4 {
+          margin: 0 0 8px 0;
+          font-size: 12px;
+          font-weight: 600;
+          letter-spacing: 0.04em;
+          text-transform: uppercase;
+          color: ${DIM};
+        }
+        .gitpilot-tasks-panel .tasks-section {
+          margin-bottom: 12px;
+        }
+        .gitpilot-tasks-panel .tasks-section-label {
+          font-size: 11px;
+          color: ${DIM};
+          margin-bottom: 6px;
+        }
+        .gitpilot-tasks-panel .ctx-task-row {
+          display: flex;
+          align-items: flex-start;
+          gap: 10px;
+          padding: 8px 10px;
+          border: 1px solid rgba(255,255,255,0.06);
+          border-radius: 6px;
+          margin-bottom: 6px;
+        }
+        .gitpilot-tasks-panel .ctx-task-glyph {
+          flex: 0 0 16px;
+          padding-top: 2px;
+          display: flex;
+          justify-content: center;
+        }
+        .gitpilot-tasks-panel .ctx-task-body { flex: 1; min-width: 0; }
+        .gitpilot-tasks-panel .ctx-task-title {
+          font-size: 13px;
+          color: #e5e7eb;
+          overflow: hidden;
+          text-overflow: ellipsis;
+          white-space: nowrap;
+        }
+        .gitpilot-tasks-panel .ctx-task-meta {
+          font-size: 11px;
+          color: ${DIM};
+          font-variant-numeric: tabular-nums;
+          margin-top: 2px;
+        }
+        .gitpilot-tasks-panel .ctx-task-err {
+          font-size: 11px;
+          color: ${FAIL_RED};
+          margin-top: 4px;
+          overflow: hidden;
+          text-overflow: ellipsis;
+          white-space: nowrap;
+        }
+        .gitpilot-tasks-panel .tasks-footer {
+          display: flex;
+          justify-content: space-between;
+          align-items: center;
+          margin-top: 8px;
+          font-size: 11px;
+          color: ${DIM};
+        }
+        .gitpilot-tasks-panel .tasks-refresh {
+          background: transparent;
+          border: 1px solid rgba(255,255,255,0.14);
+          color: #cbd1e3;
+          font-size: 11px;
+          padding: 2px 8px;
+          border-radius: 4px;
+          cursor: pointer;
+        }
+        .gitpilot-tasks-panel .tasks-refresh:hover { color: #fff; border-color: rgba(255,255,255,0.3); }
+        .gitpilot-tasks-panel .tasks-refresh:disabled { opacity: 0.5; cursor: default; }
+        .gitpilot-tasks-panel .tasks-empty {
+          font-size: 12px;
+          color: ${DIM};
+          text-align: center;
+          padding: 20px 0;
+        }
+      `}</style>
+
+      <button
+        ref={triggerRef}
+        type="button"
+        className="tasks-trigger"
+        aria-label="Tasks"
+        aria-haspopup="dialog"
+        aria-expanded={open}
+        data-running={running.length > 0 ? "1" : "0"}
+        onClick={() => setOpen((v) => !v)}
+        title="Tasks"
+      >
+        {/* Simple grid glyph to match the screenshot's column icon. */}
+        <svg width="12" height="12" viewBox="0 0 16 16" fill="none" aria-hidden="true">
+          <rect x="1" y="1" width="6" height="6" rx="1" stroke="currentColor" strokeWidth="1.4" />
+          <rect x="9" y="1" width="6" height="6" rx="1" stroke="currentColor" strokeWidth="1.4" />
+          <rect x="1" y="9" width="6" height="6" rx="1" stroke="currentColor" strokeWidth="1.4" />
+          <rect x="9" y="9" width="6" height="6" rx="1" stroke="currentColor" strokeWidth="1.4" />
+        </svg>
+      </button>
+
+      {open && (
+        <div
+          ref={popoverRef}
+          className="tasks-popover"
+          role="dialog"
+          aria-label="Tasks panel"
+        >
+          <h4>Tasks</h4>
+
+          {!sessionId && (
+            <div className="tasks-empty">Start a chat to see tasks here.</div>
+          )}
+
+          {sessionId && loading && tasks.length === 0 && (
+            <div className="tasks-empty">Loading…</div>
+          )}
+
+          {sessionId && error && error !== "disabled" && (
+            <div className="tasks-empty" style={{ color: "#ffb3b7" }}>
+              Couldn't load: {error}
+            </div>
+          )}
+
+          {sessionId && !loading && !error && tasks.length === 0 && (
+            <div className="tasks-empty">No tasks yet.</div>
+          )}
+
+          {running.length > 0 && (
+            <div className="tasks-section">
+              <div className="tasks-section-label">In flight</div>
+              {running.map((t) => <TaskRow key={t.id} task={t} />)}
+            </div>
+          )}
+
+          {completed.length > 0 && (
+            <div className="tasks-section">
+              <div className="tasks-section-label">
+                Completed ({completed.length})
+              </div>
+              {completed
+                .slice()
+                .reverse()
+                .map((t) => <TaskRow key={t.id} task={t} />)}
+            </div>
+          )}
+
+          {sessionId && (
+            <div className="tasks-footer">
+              <span>One row per AI invocation.</span>
+              <button
+                type="button"
+                className="tasks-refresh"
+                onClick={fetchTasks}
+                disabled={loading}
+                aria-label="Refresh tasks"
+              >
+                {loading ? "…" : "↻ refresh"}
+              </button>
+            </div>
+          )}
+        </div>
+      )}
+    </span>
+  );
+}
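
Review note on the `fmtMs` / `fmtTokens` helpers in `TasksPanel.jsx`: the bucketing rules are easy to check in isolation. Below is a minimal Python sketch of the same rules, assuming total seconds are rounded once before being split into minutes and seconds. `fmt_ms`, `fmt_tokens` and `PLACEHOLDER` are hypothetical names used only for this illustration; they are not part of the PR.

```python
import math

PLACEHOLDER = "\u2014"  # the dash the panel renders for missing values


def fmt_ms(ms):
    """Format a duration in milliseconds using the panel's buckets."""
    if ms is None:
        return PLACEHOLDER
    if ms < 1000:
        return f"{ms} ms"
    s = ms / 1000
    if s < 60:
        # One decimal below 10 s, whole seconds from 10 s up.
        return f"{s:.1f} s" if s < 10 else f"{s:.0f} s"
    # Round half up once (like JS Math.round for positive values),
    # then split, so the seconds part can never reach 60.
    total = math.floor(s + 0.5)
    minutes, rem = divmod(total, 60)
    return f"{minutes} m {rem:02d} s"


def fmt_tokens(n):
    """Format a token count as a plain number, thousands (k) or millions (M)."""
    if n is None:
        return PLACEHOLDER
    if n < 1000:
        return f"{n}"
    if n < 1_000_000:
        return f"{n / 1000:.1f}k"
    return f"{n / 1_000_000:.2f}M"
```

With this shape, a duration of 119600 ms formats as `2 m 00 s`; rounding the remainder separately from the minutes could instead yield a seconds field of 60.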
diff --git a/gitpilot/agent_prompts.py b/gitpilot/agent_prompts.py
new file mode 100644
index 0000000..a284da8
--- /dev/null
+++ b/gitpilot/agent_prompts.py
@@ -0,0 +1,414 @@
+"""Lean agent-prompt templates for GitPilot (Batch B12).
+
+Rewrites every agent persona and task description with the small-
+model rules:
+
+* No emotional intensifiers (CRITICAL, THOROUGHLY).
+* No "etc." — explicit list or omit.
+* No speculative example filenames (package.json on a repo that
+  doesn't have one is hallucination bait).
+* Facts block lives at the bottom of every prompt — small models
+  over-weight the last segment.
+* Per-intent rule blocks instead of one universal block — we only
+  inject the create / modify / delete / info rules that match what
+  the user actually asked for, picked off the B9 query router's
+  ``RouterDecision.intent``.
+
+The templates are plain ``str.format`` strings; the ``render_*``
+helpers fill the placeholders for callers.  Keeping them in one
+module lets tests pin character budgets and forbidden-keyword bans
+without chasing duplicated strings across ``agentic.py``.
+
+Gated by the ``lean_prompts`` feature flag (default **on**).  When
+off, callers fall back to the legacy verbose prompts in agentic.py.
+"""
+from __future__ import annotations
+
+from . import flags
+
+FLAG_LEAN_PROMPTS = "lean_prompts"
+
+# Character budgets per prompt — pinned by tests so a future
+# "let me add one more rule" edit can't silently bloat them.
+# Budget covers framing + a typical 10-15 file list.  Larger repos
+# pay more in the file-list section; those are facts the planner
+# needs, so that cost is counted under the planner-stack test instead.
+PLAN_TASK_CHAR_BUDGET           = 1_400
+EXPLORER_TASK_CHAR_BUDGET       =   500
+CREATE_FILE_TASK_CHAR_BUDGET    =   700
+MODIFY_FILE_TASK_CHAR_BUDGET    =   600
+CODE_WRITER_BACKSTORY_BUDGET    =   300
+EXPLORER_BACKSTORY_BUDGET       =   200
+PLANNER_BACKSTORY_BUDGET        =   250
+SPECIALIST_BACKSTORY_BUDGET     =   220
+
+# Words to scrub from every prompt.  These are the high-volume,
+# small-model-confusing tokens identified in the inventory.
+FORBIDDEN_KEYWORDS = (
+    "CRITICAL",
+    "THOROUGHLY",
+    "MUST",            # emotional imperative — replace with the verb
+    "etc.",
+    "package.json",   # speculative example file that primed hallucination
+)
+
+
+# ----------------------------------------------------------------------
+# Backstories
+# ----------------------------------------------------------------------
+
+EXPLORER_BACKSTORY = (
+    "You inspect repositories using the supplied tools. "
+    "You report only what the tools return.  You do not "
+    "speculate about files or structure."
+)
+
+EXPLORER_GOAL = (
+    "Inspect the repository and produce a fact-only summary"
+)
+
+
+PLANNER_BACKSTORY = (
+    "You write structured refactor plans from verified repository "
+    "facts.  You only reference files that appear in the supplied "
+    "file list.  DELETE actions require that the file exists in the "
+    "list.  CREATE actions require that the path does not."
+)
+
+PLANNER_GOAL = (
+    "Design a JSON plan for the user goal using only verified files"
+)
+
+
+CODE_WRITER_BACKSTORY = (
+    "You write clean, working code or documentation that matches "
+    "the requested file path's extension.  When reading existing "
+    "files is needed, you use the supplied tools first."
+)
+
+CODE_WRITER_GOAL = (
+    "Generate file content that satisfies the plan step"
+)
+
+
+# ----------------------------------------------------------------------
+# Explorer task
+# ----------------------------------------------------------------------
+
+EXPLORER_TASK_TEMPLATE = """\
+Repository: {repo_full_name}
+Active ref: {active_ref}
+
+Required tool calls (in order):
+1. Get repository summary
+2. List all files in repository
+3. Get directory structure
+4. Read README.md only if it appears in the list
+
+Rules:
+- Mention only files returned by the tools.
+- Do not invent files or folders.
+
+Return exactly:
+
+REPOSITORY EXPLORATION REPORT
+Files Found:
+- <one path per line>
+
+Key Files:
+- <subset>
+
+Directory Structure:
+<tree>
+
+File Types:
+<ext>=<count>, ...
+"""
+
+
+# ----------------------------------------------------------------------
+# Plan task — intent-routed rule blocks
+# ----------------------------------------------------------------------
+
+# Header + footer wrap each per-intent block.  The footer is the
+# "facts block" the user flagged: lives at the bottom so small
+# models give it the most attention weight.
+PLAN_TASK_HEADER_TEMPLATE = """\
+User goal: {goal}
+Repository: {repo_full_name}
+Active ref: {active_ref}
+
+Existing files (verified by tools):
+{file_list_lines}
+
+"""
+
+PLAN_TASK_RULES_CREATE = """\
+Rules:
+- The user asked to create new content.  Include at least one CREATE file.
+- READ existing files only when needed as input for the new file.
+- Do not include MODIFY or DELETE unless the goal asks for them.
+"""
+
+PLAN_TASK_RULES_MODIFY = """\
+Rules:
+- Use MODIFY only for files in the existing-files list above.
+- READ a file when you need its content before modifying.
+- Do not CREATE or DELETE unless the goal asks for it.
+"""
+
+PLAN_TASK_RULES_DELETE = """\
+Rules:
+- Use DELETE only for files in the existing-files list above.
+- Do not include CREATE or MODIFY actions.
+- Files the user wants to keep are absent from the plan.
+"""
+
+PLAN_TASK_RULES_FIND = """\
+Rules:
+- The user asked a search question.
+- Plan READ actions for files likely to contain the answer.
+- Include a substantive summary that answers the question.
+"""
+
+PLAN_TASK_RULES_INFO = """\
+Rules:
+- The user asked an informational question.
+- Empty steps is fine; the summary itself is the answer.
+- Use READ only when you need a specific file's content.
+"""
+
+PLAN_TASK_RULES_UNKNOWN = """\
+Rules:
+- READ / MODIFY / DELETE only for files in the existing-files list.
+- CREATE only for paths NOT in that list.
+- Match the action to what the user goal asks for.
+"""
+
+# Schema block kept tight — one example object, no prose explanations.
+PLAN_TASK_SCHEMA = """\
+Return one JSON object only (no markdown fences, no prose):
+{
+  "goal": "...",
+  "summary": "...",
+  "steps": [
+    {
+      "step_number": 1,
+      "title": "...",
+      "description": "...",
+      "files": [
+        {"path": "<existing file>", "action": "READ"},
+        {"path": "<new file>",      "action": "CREATE"}
+      ],
+      "risks": null
+    }
+  ]
+}
+
+JSON rules:
+- "action" is one of: CREATE, MODIFY, DELETE, READ, INDEX
+- "step_number" is a positive integer
+- "risks" is either a string or null (the JSON null literal)
+- The entire response is the JSON object — nothing before or after
+"""
+
+# Footer = the facts block.  Last 200-300 chars of the prompt =
+# highest attention weight on small models.
+PLAN_TASK_FOOTER_TEMPLATE = """\
+Known facts:
+- Total files in repository: {file_count}
+- A path NOT in the existing-files list above does NOT exist.
+- Never mention a file that is not in that list as if it exists.
+
+Now produce the JSON plan.
+"""
+
+
+_INTENT_TO_RULES = {
+    "create":  PLAN_TASK_RULES_CREATE,
+    "modify":  PLAN_TASK_RULES_MODIFY,
+    "fix":     PLAN_TASK_RULES_MODIFY,    # fix = modify under the hood
+    "delete":  PLAN_TASK_RULES_DELETE,
+    "find":    PLAN_TASK_RULES_FIND,
+    "info":    PLAN_TASK_RULES_INFO,
+    "unknown": PLAN_TASK_RULES_UNKNOWN,
+}
+
+
+def render_plan_task(
+    *,
+    goal: str,
+    repo_full_name: str,
+    active_ref: str,
+    file_list: list[str],
+    intent: str | None,
+) -> str:
+    """Build the planner's task description from verified facts.
+
+    ``intent`` is the literal from :class:`gitpilot.query_router.RouterDecision`.
+    Unknown / missing intent falls back to the generic rule block.
+    """
+    file_lines = "\n".join(f"- {p}" for p in file_list) if file_list else "(empty repository)"
+    rules = _INTENT_TO_RULES.get((intent or "unknown").lower(), PLAN_TASK_RULES_UNKNOWN)
+    return (
+        PLAN_TASK_HEADER_TEMPLATE.format(
+            goal=goal, repo_full_name=repo_full_name,
+            active_ref=active_ref, file_list_lines=file_lines,
+        )
+        + rules
+        + "\n"
+        + PLAN_TASK_SCHEMA
+        + "\n"
+        + PLAN_TASK_FOOTER_TEMPLATE.format(file_count=len(file_list))
+    )
+
+
+def render_explorer_task(*, repo_full_name: str, active_ref: str) -> str:
+    return EXPLORER_TASK_TEMPLATE.format(
+        repo_full_name=repo_full_name, active_ref=active_ref,
+    )
+
+
+# ----------------------------------------------------------------------
+# Code-writer tasks — CREATE and MODIFY
+# ----------------------------------------------------------------------
+
+CREATE_FILE_TASK_TEMPLATE = """\
+Generate the full contents of a new file: {file_path}
+
+Goal: {goal}
+Step context: {step_description}
+
+Rules:
+- Match the file extension's conventions ({extension}).
+- If existing files are relevant, use the supplied tools to read them.
+- Output ONLY the file content (no explanations, no markdown fences).
+"""
+
+MODIFY_FILE_TASK_TEMPLATE = """\
+Modify the file: {file_path}
+
+Goal: {goal}
+Step context: {step_description}
+
+Current file content:
+{current_content}
+
+Rules:
+- Preserve every line that does not need to change.
+- Match the file extension's conventions ({extension}).
+- Output ONLY the complete updated file (no explanations, no fences).
+"""
+
+
+def render_create_file_task(
+    *,
+    file_path: str,
+    goal: str,
+    step_description: str,
+) -> str:
+    return CREATE_FILE_TASK_TEMPLATE.format(
+        file_path=file_path,
+        goal=goal,
+        step_description=step_description,
+        extension=_ext_of(file_path),
+    )
+
+
+def render_modify_file_task(
+    *,
+    file_path: str,
+    goal: str,
+    step_description: str,
+    current_content: str,
+) -> str:
+    return MODIFY_FILE_TASK_TEMPLATE.format(
+        file_path=file_path,
+        goal=goal,
+        step_description=step_description,
+        extension=_ext_of(file_path),
+        current_content=current_content,
+    )
+
+
+def _ext_of(path: str) -> str:
+    name = path.rsplit("/", 1)[-1]
+    if "." not in name:
+        return name or "(no extension)"
+    return "." + name.rsplit(".", 1)[-1].lower()
+
+
+# ----------------------------------------------------------------------
+# Specialist agent backstories (Issue / PR / Search / Code Review / …)
+# ----------------------------------------------------------------------
+#
+# These were ~500-char persona blocks under the previous design.  Each
+# is now ~90-135 chars: role, scope, and a tool-use sentence if needed.
+
+SPECIALIST_BACKSTORIES = {
+    "issue_management": (
+        "You manage GitHub issues — list, create, comment, label, close, assign. "
+        "You use the supplied issue tools and report concrete results."
+    ),
+    "pr_management": (
+        "You manage GitHub pull requests — list, create, review, comment, merge. "
+        "You use the supplied PR tools and report concrete results."
+    ),
+    "search_discovery": (
+        "You answer search and discovery questions about the repository. "
+        "You use file-listing and content-search tools and cite exact matches."
+    ),
+    "code_review": (
+        "You review code for correctness, clarity, and obvious bugs. "
+        "You quote the specific lines you reference."
+    ),
+    "learning_guidance": (
+        "You answer GitHub how-to and convention questions in plain text. "
+        "You do not modify the repository."
+    ),
+    "local_editor": (
+        "You edit local files using the supplied filesystem tools. "
+        "You preserve existing content unless instructed to change it."
+    ),
+    "terminal_executor": (
+        "You run terminal commands using the supplied shell tool. "
+        "You explain command output briefly."
+    ),
+}
+
+
+# ----------------------------------------------------------------------
+# Flag check helper
+# ----------------------------------------------------------------------
+
+def lean_prompts_enabled() -> bool:
+    """Single source of truth for callers in agentic.py — flag-on means
+    use the templates here; flag-off falls back to legacy verbose
+    strings still defined inline in agentic.py."""
+    return flags.is_on(FLAG_LEAN_PROMPTS, default=True)
+
+
+__all__ = [
+    "FLAG_LEAN_PROMPTS",
+    "FORBIDDEN_KEYWORDS",
+    "PLAN_TASK_CHAR_BUDGET",
+    "EXPLORER_TASK_CHAR_BUDGET",
+    "CREATE_FILE_TASK_CHAR_BUDGET",
+    "MODIFY_FILE_TASK_CHAR_BUDGET",
+    "CODE_WRITER_BACKSTORY_BUDGET",
+    "EXPLORER_BACKSTORY_BUDGET",
+    "PLANNER_BACKSTORY_BUDGET",
+    "SPECIALIST_BACKSTORY_BUDGET",
+    "EXPLORER_BACKSTORY",
+    "EXPLORER_GOAL",
+    "PLANNER_BACKSTORY",
+    "PLANNER_GOAL",
+    "CODE_WRITER_BACKSTORY",
+    "CODE_WRITER_GOAL",
+    "SPECIALIST_BACKSTORIES",
+    "lean_prompts_enabled",
+    "render_create_file_task",
+    "render_explorer_task",
+    "render_modify_file_task",
+    "render_plan_task",
+]
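
Review note on `agent_prompts.py`: the module docstring mentions tests that pin character budgets and forbidden-keyword bans. A minimal sketch of such checks follows. The constants are copied from the diff as stand-ins so the block runs on its own (a real test would presumably import them from `gitpilot.agent_prompts`). `forbidden_hits` and `within_budget` are hypothetical helper names, and case-sensitive substring matching is an assumption about how the ban is enforced.

```python
# Stand-ins copied from the diff; a real test would import these instead.
FORBIDDEN_KEYWORDS = ("CRITICAL", "THOROUGHLY", "MUST", "etc.", "package.json")
EXPLORER_BACKSTORY_BUDGET = 200
EXPLORER_BACKSTORY = (
    "You inspect repositories using the supplied tools. "
    "You report only what the tools return.  You do not "
    "speculate about files or structure."
)


def forbidden_hits(prompt, keywords=FORBIDDEN_KEYWORDS):
    """Return the banned keywords found in the prompt, in keyword order.

    Assumption: matching is a case-sensitive substring test, so a
    lower-case "must" would not count as a hit.
    """
    return [kw for kw in keywords if kw in prompt]


def within_budget(prompt, budget):
    """True when the prompt fits inside its pinned character budget."""
    return len(prompt) <= budget


if __name__ == "__main__":
    # The lean backstory should be clean and inside its budget.
    print(forbidden_hits(EXPLORER_BACKSTORY))
    print(within_budget(EXPLORER_BACKSTORY, EXPLORER_BACKSTORY_BUDGET))
    # A legacy-style sentence trips two of the bans.
    print(forbidden_hits("You MUST read package.json first"))
```

Checks like these fail as soon as an edit pushes a prompt past its budget or reintroduces a banned token, which is the guard the budget comment describes.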
diff --git a/gitpilot/agent_tools.py b/gitpilot/agent_tools.py
index e0a34ea..160e4be 100644
--- a/gitpilot/agent_tools.py
+++ b/gitpilot/agent_tools.py
@@ -8,7 +8,7 @@
 
 from crewai.tools import tool
 
-from .github_api import get_repo_tree, get_file
+from .github_api import get_file, get_repo_tree
 
 
 def _sanitize_tool_arg(value: Any, fallback_key: str = "description") -> str:
@@ -198,32 +198,232 @@ def get_directory_structure() -> str:
         return f"Error: {str(e)}"
 
 
-@tool("Read file content")
-def read_file(file_path: Any) -> str:
-    """Read the content of a file from the active repository.
+# ----------------------------------------------------------------------
+# Windowed-Read defaults — match Claude Code's contract
+# ----------------------------------------------------------------------
+READ_DEFAULT_LIMIT = 2000        # default line cap when limit is omitted
+READ_MAX_LIMIT = 10_000          # hard ceiling — beyond this the caller
+                                 # must paginate via offset
+GLOB_DEFAULT_MAX_RESULTS = 200   # cap for "Find files matching a pattern"
+GLOB_HARD_MAX_RESULTS = 1_000
 
-    file_path: the file's path relative to the repository root, e.g.
-    "README.md" or "src/main.py".  Pass a plain string — do **not** pass
-    a dict like ``{"description": "...", "type": "str"}`` (that is the
-    parameter's schema, not its value).
+
+def _coerce_int(value: Any, default: int) -> int:
+    """CrewAI sometimes passes ints as strings or dicts.  Coerce
+    safely; anything we can't parse falls back to the default.
     """
-    file_path = _sanitize_tool_arg(file_path)
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return default
+    if isinstance(value, str):
+        value = value.strip()
+    if isinstance(value, (int, float, str)):
+        try:
+            return int(value)
+        except (TypeError, ValueError, OverflowError):
+            # Unparseable text, NaN, or infinity.
+            return default
+    # Anything else, including the common CrewAI schema-leak dict
+    # {"description": "...", "type": "int"}, uses the default.
+    return default
+
+
+@tool("Find files matching a pattern")
+def list_repository_files_glob(
+    pattern: Any,
+    max_results: Any = GLOB_DEFAULT_MAX_RESULTS,
+) -> str:
+    """Search the repository for files whose path matches a glob.
+
+    pattern: a pathlib-style glob.  Examples:
+        "**/*.py"            all Python files
+        "src/**/*.tsx"       every .tsx under src
+        "**/test_*.py"       all pytest files
+        "README*"            top-level README files
+    max_results: hard cap on the number of paths returned (default 200,
+        max 1000).  When the cap is hit the result is annotated so the
+        caller can refine.
+
+    Output: one path per line.  Path-only — no contents.  Use
+    "Read file content" afterwards if you need bytes.
+    """
+    pattern = _sanitize_tool_arg(pattern, fallback_key="pattern") or "**/*"
+    cap = max(1, min(GLOB_HARD_MAX_RESULTS, _coerce_int(max_results, GLOB_DEFAULT_MAX_RESULTS)))
     try:
         owner, repo, token, branch = get_repo_context()
 
         loop = asyncio.new_event_loop()
         asyncio.set_event_loop(loop)
         try:
-            # Pass token + ref explicitly
-            content = loop.run_until_complete(get_file(owner, repo, file_path, token=token, ref=branch))
+            tree = loop.run_until_complete(get_repo_tree(owner, repo, token=token, ref=branch))
         finally:
             loop.close()
 
+        if not tree:
+            return f"Repository is empty - no files. (Branch: {branch})"
+
+        # ``fnmatch`` understands `*`/`?`/`[…]` but treats `**` as a
+        # plain star.  Translate `**` → match-any-segments by walking
+        # the pattern manually for a tighter match on the common case.
+        paths = [item["path"] for item in tree]
+        matches = _glob_match(paths, pattern)
+        truncated = False
+        if len(matches) > cap:
+            matches = matches[:cap]
+            truncated = True
+
+        if not matches:
+            return f"No files matched pattern: {pattern}\n(Branch: {branch}, total files: {len(paths)})"
+
+        header = f"Repository: {owner}/{repo} (Branch: {branch})\nMatching: {pattern}\n"
+        body = "\n".join(f"  - {p}" for p in sorted(matches))
+        footer = f"\n…{cap}+ matches truncated. Refine the pattern.\n" if truncated else ""
+        return f"{header}{body}{footer}"
+    except Exception as e:
+        return f"Error globbing files: {str(e)}"
+
+
+import re as _re
+
+
+def _glob_to_regex(pattern: str) -> "_re.Pattern[str]":
+    """Translate a shell-style glob into a regex with proper `/`-aware
+    semantics, close to the contract Claude Code, ripgrep and bash use:
+
+    * ``*``  matches anything **except** ``/``
+    * ``**`` matches anything **including** ``/`` (any number of segments)
+    * ``?``  matches exactly one non-``/`` character
+    * ``[abc]`` character class (passed through to regex)
+    * everything else is literal
+
+    The result is anchored with ``\\A`` and ``\\Z`` so it must match the
+    full path — ``*.py`` will not falsely match ``src/foo.py``.
+    """
+    out: list[str] = []
+    i = 0
+    while i < len(pattern):
+        c = pattern[i]
+        if c == "*":
+            if i + 1 < len(pattern) and pattern[i + 1] == "*":
+                # `**` — match any number of full segments.  When the
+                # following character is `/` consume it as part of the
+                # match (so `**/foo.py` correctly matches `foo.py`
+                # at the repo root).
+                if i + 2 < len(pattern) and pattern[i + 2] == "/":
+                    out.append("(?:.*/)?")
+                    i += 3
+                    continue
+                out.append(".*")
+                i += 2
+                continue
+            out.append("[^/]*")
+            i += 1
+        elif c == "?":
+            out.append("[^/]")
+            i += 1
+        elif c == ".":
+            out.append(r"\.")
+            i += 1
+        elif c == "[":
+            # Character class — pass through up to the matching ']'.
+            j = pattern.find("]", i + 1)
+            if j == -1:
+                out.append(r"\[")
+                i += 1
+            else:
+                out.append(pattern[i : j + 1])
+                i = j + 1
+        else:
+            out.append(_re.escape(c))
+            i += 1
+    return _re.compile(r"\A" + "".join(out) + r"\Z")
+
+
+def _glob_match(paths: List[str], pattern: str) -> List[str]:
+    """Match paths against a glob with `/`-aware semantics."""
+    rx = _glob_to_regex(pattern)
+    return [p for p in paths if rx.match(p)]
+
+
+def _fetch_file_content(file_path: str) -> str | None:
+    """Fetch a file from the active repository using the current context."""
+    owner, repo, token, branch = get_repo_context()
+
+    loop = asyncio.new_event_loop()
+    asyncio.set_event_loop(loop)
+    try:
+        return loop.run_until_complete(
+            get_file(owner, repo, file_path, token=token, ref=branch)
+        )
+    finally:
+        loop.close()
+
+
+@tool("Read file content")
+def read_file(file_path: Any) -> str:
+    """Read the content of a file from the active repository.
+
+    file_path: the file's path relative to the repository root, e.g.
+    "README.md" or "src/main.py". Pass a plain string — do **not** pass
+    a dict like {"description": "...", "type": "str"}.
+    """
+    file_path = _sanitize_tool_arg(file_path)
+    try:
+        content = _fetch_file_content(file_path)
         return f"Content of {file_path}:\n---\n{content}\n---"
     except Exception as e:
         return f"Error reading file {file_path}: {str(e)}"
 
 
+@tool("Read file content window")
+def read_file_window(
+    file_path: Any,
+    offset: Any = 0,
+    limit: Any = READ_DEFAULT_LIMIT,
+) -> str:
+    """Read a line window from a file in the active repository.
+
+    This advanced pagination tool is intentionally left out of the
+    default repository tool list so that the primary "Read file content"
+    tool's schema stays simple for smaller ReAct models.
+
+    file_path: the file's path relative to the repository root.
+    offset: 0-indexed line number to start reading from.
+    limit: maximum number of lines to return (default 2000, max 10000).
+    """
+    file_path = _sanitize_tool_arg(file_path)
+    start = max(0, _coerce_int(offset, 0))
+    span = max(1, min(READ_MAX_LIMIT, _coerce_int(limit, READ_DEFAULT_LIMIT)))
+    try:
+        content = _fetch_file_content(file_path)
+        if content is None:
+            return f"Error reading file {file_path}: empty response"
+
+        lines = content.splitlines()
+        total = len(lines)
+        if total == 0:
+            return f"Content of {file_path}:\n---\n(empty file)\n---"
+        if start >= total:
+            return f"Error reading file {file_path}: offset {start} is past the end of the file ({total} lines)"
+
+        end = min(total, start + span)
+        slice_text = "\n".join(lines[start:end])
+
+        header = f"Content of {file_path}"
+        if start > 0 or end < total:
+            header += f" (lines {start + 1}-{end} of {total})"
+
+        footer = ""
+        if end < total:
+            remaining = total - end
+            footer = (
+                f"\n…{remaining} more lines. Continue with offset={end} "
+                f"to read further."
+            )
+        return f"{header}:\n---\n{slice_text}\n---{footer}"
+    except Exception as e:
+        return f"Error reading file {file_path}: {str(e)}"
+
+
 @tool("Get repository summary")
 def get_repository_summary() -> str:
     """Provides a comprehensive summary of the repository."""
@@ -247,6 +447,137 @@ def get_repository_summary() -> str:
 # Write tools — allow agents to create, update, and delete files via GitHub API
 # ---------------------------------------------------------------------------
 
+@tool("Edit a section of a file (exact string replacement)")
+def edit_file(
+    file_path: Any,
+    old_string: Any,
+    new_string: Any,
+    commit_message: Any,
+    expected_occurrences: Any = 1,
+) -> str:
+    """Surgical edit — replace a small section of a file without
+    re-emitting the rest.  Use this whenever you want to fix a bug,
+    rename a symbol, or insert a few lines into a file that already
+    exists.  Never use ``Write or update a file`` to apply a small
+    change — that requires re-emitting the whole file and corrupts
+    long files on small-context models.
+
+    file_path: path relative to the repo root.  Plain string.
+    old_string: the exact text to find — including surrounding
+        indentation and (where needed) preceding/trailing context
+        so the match is unique.  Plain string.
+    new_string: the replacement text.  Plain string.  Pass an empty
+        string to delete the matched block.
+    commit_message: short imperative commit summary.
+    expected_occurrences: how many times old_string is expected to
+        appear in the file.  Default 1.  Pass a higher number to
+        rename an identifier that appears N times; pass -1 to allow
+        any positive number.  When the actual count differs, the
+        edit is refused — widen old_string to disambiguate.
+
+    On success returns "File '<path>' edited (N occurrence(s) replaced,
+    <before> → <after> bytes). Commit: <short sha>".  On failure returns
+    an actionable error message starting with "Error".
+    """
+    from .edit_backend import EditError, apply_edit
+    from .github_api import get_file, put_file
+
+    file_path = _sanitize_tool_arg(file_path)
+    old_string_s = old_string if isinstance(old_string, str) else _sanitize_tool_arg(old_string, fallback_key="value")
+    new_string_s = new_string if isinstance(new_string, str) else _sanitize_tool_arg(new_string, fallback_key="value")
+    commit_message_s = _sanitize_tool_arg(commit_message, fallback_key="value") or f"Edit {file_path}"
+    expected = _coerce_int(expected_occurrences, 1)
+
+    try:
+        owner, repo, token, branch = get_repo_context()
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        try:
+            current = loop.run_until_complete(
+                get_file(owner, repo, file_path, token=token, ref=branch)
+            )
+            new_content, report = apply_edit(
+                current or "",
+                old_string=old_string_s,
+                new_string=new_string_s,
+                expected_occurrences=expected,
+            )
+            result = loop.run_until_complete(
+                put_file(owner, repo, file_path, new_content, commit_message_s, token=token, branch=branch)
+            )
+        finally:
+            loop.close()
+
+        sha = result.get("commit_sha", "")
+        return (
+            f"File '{file_path}' edited "
+            f"({report.occurrences_replaced} occurrence(s) replaced, "
+            f"{report.bytes_before} → {report.bytes_after} bytes). "
+            f"Commit: {sha[:8]}"
+        )
+    except EditError as e:
+        # User-facing — keep the original message so the agent can
+        # widen the context and retry.
+        return f"Error: {e}"
+    except Exception as e:
+        return f"Error editing file {file_path}: {e}"
+
+
+@tool("Apply a unified diff to a file")
+def apply_patch_to_file(
+    file_path: Any,
+    diff: Any,
+    commit_message: Any,
+) -> str:
+    """Apply a unified-diff patch to a single file.  Use this when the
+    change involves several non-contiguous edits inside one file and
+    a single ``Edit a section of a file`` call wouldn't capture all
+    of them cleanly.
+
+    file_path: path relative to the repo root.
+    diff: a single-file unified diff with one or more @@-hunks.  The
+        helper matches each hunk by *context lines* (the leading-space
+        lines around the change), so line numbers can be stale.
+        Multi-file diffs are not accepted — split them first.
+    commit_message: short imperative commit summary.
+
+    Returns the same shape as ``Edit a section of a file``.
+    """
+    from .edit_backend import EditError, apply_unified_diff
+    from .github_api import get_file, put_file
+
+    file_path = _sanitize_tool_arg(file_path)
+    diff_s = diff if isinstance(diff, str) else _sanitize_tool_arg(diff, fallback_key="value")
+    commit_message_s = _sanitize_tool_arg(commit_message, fallback_key="value") or f"Patch {file_path}"
+
+    try:
+        owner, repo, token, branch = get_repo_context()
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        try:
+            current = loop.run_until_complete(
+                get_file(owner, repo, file_path, token=token, ref=branch)
+            )
+            new_content, report = apply_unified_diff(current or "", diff_s)
+            result = loop.run_until_complete(
+                put_file(owner, repo, file_path, new_content, commit_message_s, token=token, branch=branch)
+            )
+        finally:
+            loop.close()
+
+        sha = result.get("commit_sha", "")
+        return (
+            f"File '{file_path}' patched "
+            f"({report.occurrences_replaced} hunk(s) applied, "
+            f"{report.bytes_before} → {report.bytes_after} bytes). "
+            f"Commit: {sha[:8]}"
+        )
+    except EditError as e:
+        return f"Error: {e}"
+    except Exception as e:
+        return f"Error patching file {file_path}: {e}"
+
+
 @tool("Write or update a file in the repository")
 def write_file(file_path: Any, content: Any, commit_message: Any) -> str:
     """Create or update a file in the repository.
@@ -332,5 +663,162 @@ def create_repo_branch(branch_name: str) -> str:
 
 
 # Export tools
-REPOSITORY_TOOLS = [list_repository_files, get_directory_structure, read_file, get_repository_summary]
-WRITE_TOOLS = [write_file, delete_repo_file, create_repo_branch]
+@tool("Search file contents")
+def grep_repository(
+    pattern: Any,
+    path_pattern: Any = None,
+    case_insensitive: Any = False,
+    max_results: Any = 100,
+) -> str:
+    """Search the repository for a regex pattern across file contents.
+
+    pattern: a Python-style regular expression.  Use this when you need
+        to find a symbol, string, import, or any other content that
+        listing/globbing won't reveal.
+    path_pattern: optional glob to scope the search (e.g. "**/*.py",
+        "src/**/*.ts").  Same `/`-aware semantics as
+        "Find files matching a pattern".
+    case_insensitive: pass true to match regardless of case.
+    max_results: hard cap (default 100, max 500).  Beyond the cap the
+        result is annotated so you can narrow the search.
+
+    Output: one match per line, formatted ``path:line: matched_text``.
+    """
+    from .grep_backend import (
+        GREP_DEFAULT_MAX_RESULTS,
+        format_result,
+        grep,
+    )
+
+    pattern_str = _sanitize_tool_arg(pattern, fallback_key="pattern") or ""
+    if not pattern_str:
+        return "Error: empty search pattern"
+    path_filter_str = path_pattern if isinstance(path_pattern, str) else None
+    ci_flag = case_insensitive is True or str(case_insensitive).strip().lower() in {"true", "1", "yes"}
+    cap = _coerce_int(max_results, GREP_DEFAULT_MAX_RESULTS)
+
+    try:
+        owner, repo, token, branch = get_repo_context()
+
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        try:
+            tree = loop.run_until_complete(get_repo_tree(owner, repo, token=token, ref=branch))
+        finally:
+            loop.close()
+
+        if not tree:
+            return f"Repository is empty - no files to search. (Branch: {branch})"
+
+        # Pre-filter file list by path glob BEFORE fetching contents —
+        # this is the single biggest cost saving on GitHub-backed repos.
+        paths = [item["path"] for item in tree]
+        if path_filter_str:
+            paths = _glob_match(paths, path_filter_str)
+        if not paths:
+            return (
+                f"No files matched path_pattern: {path_filter_str}\n"
+                f"(Branch: {branch}, total files: {len(tree)})"
+            )
+
+        # Cap the number of files we fetch — at 200 paths × ~50 KB each
+        # that's already 10 MB.  Anything beyond is the caller's job
+        # to narrow with a tighter path_pattern.
+        FILE_FETCH_CAP = 200
+        paths = paths[:FILE_FETCH_CAP]
+
+        # Fetch contents concurrently.  ``get_file`` is async so we batch.
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        try:
+            async def _gather():
+                import asyncio as _aio
+                async def _fetch(p):
+                    try:
+                        return p, await get_file(owner, repo, p, token=token, ref=branch)
+                    except Exception:
+                        return p, None
+                return await _aio.gather(*(_fetch(p) for p in paths))
+            results = loop.run_until_complete(_gather())
+        finally:
+            loop.close()
+
+        files = {p: c for p, c in results if isinstance(c, str)}
+        if not files:
+            return f"Could not fetch any matching files. (Tried {len(paths)} paths.)"
+
+        rx_path_filter = _glob_to_regex(path_filter_str) if path_filter_str else None
+        result = grep(
+            files,
+            pattern_str,
+            case_insensitive=ci_flag,
+            max_results=cap,
+            path_filter=rx_path_filter,
+        )
+        return format_result(result, pattern=pattern_str)
+    except Exception as e:
+        return f"Error in grep_repository: {str(e)}"
+
+
+@tool("Find code by semantic search")
+def semantic_search(query: Any, k: Any = 8) -> str:
+    """Find the most semantically-similar code chunks for a natural-
+    language query.  Powered by a local on-prem RAG index (ChromaDB
+    + MiniLM-L6-v2 by default; pure-Python hashing fallback when the
+    model isn't available).
+
+    query: what you want to find, in natural language.  Example
+        queries: "authentication middleware", "where do we parse the
+        plan response", "the function that talks to OpenAI".
+    k: how many results to return (default 8, max 20).
+
+    Output: one chunk per result, formatted as ``path:start-end``
+    with a similarity score, plus an excerpt of up to 200 characters.
+    Returns a "No semantic matches" message when the index hasn't
+    been built yet or nothing matched; fall back to grep / glob then.
+
+    Gated behind the ``rag_retrieval`` flag: when off, the tool is not
+    registered, and a direct call returns a "disabled" message.
+    """
+    from . import flags
+    from .rag import FLAG_RAG_RETRIEVAL, retrieve_top_k
+
+    if not flags.is_on(FLAG_RAG_RETRIEVAL, default=False):
+        return "Semantic search is disabled. Enable the rag_retrieval flag and build the index first."
+
+    q = _sanitize_tool_arg(query, fallback_key="query") or ""
+    if not q:
+        return "Error: empty search query"
+    kk = max(1, min(20, _coerce_int(k, 8)))
+    try:
+        owner, repo, token, branch = get_repo_context()
+        hits = retrieve_top_k(q, owner=owner, repo=repo, branch=branch or "HEAD", k=kk)
+        if not hits:
+            return (
+                f"No semantic matches for: {q}\n"
+                "Either the index hasn't been built yet, or no chunks "
+                "matched.  Try the 'Search file contents' tool instead."
+            )
+        lines = [f"Top {len(hits)} semantic match(es) for: {q}"]
+        for h in hits:
+            excerpt = h.text.replace("\n", " ").strip()[:200]
+            lines.append(f"  {h.path}:{h.start_line}-{h.end_line}  (score={h.score:.2f})")
+            lines.append(f"    {excerpt}")
+        return "\n".join(lines)
+    except Exception as e:
+        return f"Error in semantic_search: {str(e)}"
+
+
+REPOSITORY_TOOLS = [
+    list_repository_files,
+    get_directory_structure,
+    read_file,
+    get_repository_summary,
+]
+WRITE_TOOLS = [
+    edit_file,              # B8: surgical exact-string replacement
+    apply_patch_to_file,    # B8: unified-diff patch
+    write_file,
+    delete_repo_file,
+    create_repo_branch,
+]
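
Note on the `/`-aware glob semantics used by `_glob_match` and `grep_repository` above: the sketch below restates the translation rules from `_glob_to_regex` as a standalone function so the matching behaviour can be checked in isolation. It is an illustrative restatement, not the module's code; the name `glob_to_regex_sketch` is hypothetical and the branch order is condensed, on the assumption that the rules are exactly those listed in the docstring.

```python
import re


def glob_to_regex_sketch(pattern: str) -> "re.Pattern[str]":
    """Translate a `/`-aware glob into an anchored regex (illustration only)."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if pattern.startswith("**/", i):
            # `**/` matches zero or more whole directory segments.
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            # Bare `**` matches anything, including `/`.
            out.append(".*")
            i += 2
        elif c == "*":
            # Single `*` stays inside one path segment.
            out.append("[^/]*")
            i += 1
        elif c == "?":
            # `?` is exactly one non-separator character.
            out.append("[^/]")
            i += 1
        elif c == "[" and (j := pattern.find("]", i + 1)) != -1:
            # Character class is passed through to the regex unchanged.
            out.append(pattern[i : j + 1])
            i = j + 1
        else:
            # Everything else (including `.` and an unclosed `[`) is literal.
            out.append(re.escape(c))
            i += 1
    # Anchor both ends so the whole path must match.
    return re.compile(r"\A" + "".join(out) + r"\Z")


def matches(pattern: str, path: str) -> bool:
    return glob_to_regex_sketch(pattern).match(path) is not None
```

With these rules `matches("*.py", "src/foo.py")` is false because `*` cannot cross a `/`, while `matches("**/foo.py", "foo.py")` is true because the `**/` prefix is optional. A pattern such as `src/**/*.ts` therefore covers both `src/a.ts` and `src/a/b/c.ts`.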
diff --git a/gitpilot/agentic.py b/gitpilot/agentic.py
index 5c30ff9..22f2883 100644
--- a/gitpilot/agentic.py
+++ b/gitpilot/agentic.py
@@ -154,6 +154,82 @@ def _crewai():
 _tools_cache: dict = {}
 
 
+async def _execute_index_action(
+    owner: str, repo: str, *, token: str | None, branch_name: str | None,
+) -> str:
+    """Handle the ``INDEX`` plan-step pseudo-action (Batch B9).
+
+    Triggers a one-time RAG index build for the active repo: fetches
+    up to ``INDEX_FETCH_CAP`` files from the GitHub tree, chunks and
+    embeds them, persists the ChromaDB collection, and grants per-repo
+    consent so future fuzzy queries auto-build incrementally.
+
+    Returns a one-line summary suitable for the execution-log step
+    output.  Failures are surfaced as their own line; we never raise
+    because that would abort sibling steps in the same plan.
+    """
+    from .github_api import get_file, get_repo_tree
+    from .rag.indexer import build_index_from_files
+    from .rag_consent import grant_consent
+
+    try:
+        tree = await get_repo_tree(owner, repo, token=token, ref=branch_name)
+    except Exception as exc:
+        logger.warning("[index] could not list repo tree: %s", exc)
+        return f"! Failed to list repo for indexing: {exc}"
+
+    paths = [item["path"] for item in (tree or []) if item.get("path")]
+    if not paths:
+        return "i Repo is empty — nothing to index."
+
+    # Cap how many files we'll embed in one user-approved build to
+    # bound time + disk.  Anything over the cap still produces a
+    # usable index, but it only covers the first INDEX_FETCH_CAP paths
+    # in tree order; the rest can be added on subsequent builds.
+    INDEX_FETCH_CAP = 500
+    paths = paths[:INDEX_FETCH_CAP]
+
+    async def _fetch(p: str) -> tuple[str, str | None]:
+        try:
+            return p, await get_file(owner, repo, p, token=token, ref=branch_name)
+        except Exception:
+            return p, None
+
+    import asyncio as _aio
+    results = await _aio.gather(*(_fetch(p) for p in paths))
+    files: list[tuple[str, str]] = [
+        (p, c) for p, c in results if isinstance(c, str) and c
+    ]
+    if not files:
+        return "! Could not fetch any repo files for indexing."
+
+    # Build synchronously inside this coroutine.  Embedding is CPU-bound,
+    # so the call blocks the event loop until it finishes; we want the
+    # user to see "indexing complete" before the next plan step runs.
+    try:
+        report = build_index_from_files(
+            files,
+            owner=owner,
+            repo=repo,
+            branch=branch_name or "HEAD",
+        )
+    except Exception as exc:
+        logger.warning("[index] build failed: %s", exc)
+        return f"! Index build failed: {exc}"
+
+    try:
+        grant_consent(owner, repo)
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("[index] could not grant consent: %s", exc)
+
+    return (
+        f"+ Indexed {report.files_indexed} file(s) "
+        f"({report.chunks_added} chunks, embedder={report.embedder_name}, "
+        f"skipped={report.files_skipped}).  "
+        f"Semantic search is now available for {owner}/{repo}."
+    )
+
+
 def _tools():
     """Return cached tool collections (lazy-loaded on first use)."""
     if not _tools_cache:
@@ -185,9 +261,18 @@ def _build_llm():
 
 
 class PlanFile(BaseModel):
-    """Represents a file operation in a plan step."""
+    """Represents a file operation in a plan step.
+
+    ``INDEX`` (Batch B9) is a special pseudo-action: the ``path`` is
+    treated as a marker ("__repo__") rather than a real file, and the
+    executor branch triggers a one-time RAG index build for the active
+    repo.  Surfaced as its own plan step so the user approves the
+    indexing cost (time + disk) just like any other action.
+    """
     path: str
-    action: Literal["CREATE", "MODIFY", "DELETE", "READ"] = "MODIFY"
+    action: Literal[
+        "CREATE", "MODIFY", "DELETE", "READ", "INDEX",
+    ] = "MODIFY"
 
 
 class PlanStep(BaseModel):
@@ -262,9 +347,23 @@ async def generate_plan(
     repo_full_name: str,
     token: str | None = None,
     branch_name: str | None = None,
+    *,
+    routing_hint: str | None = None,
+    intent: str | None = None,
 ) -> PlanResult:
     """Agentic planning: create a structured plan but DO NOT modify the repo.
 
+    ``intent`` is the literal from :class:`gitpilot.query_router.RouterDecision`
+    (fix / find / info / create / delete / modify).  When supplied AND
+    the ``lean_prompts`` flag is on, the planner's task description
+    uses only the rule block matching the intent — small models stop
+    drowning in irrelevant create-vs-delete-vs-modify rules.
+
+    ``routing_hint`` is an optional pre-classified directive from
+    :mod:`gitpilot.query_router` that gets concatenated into the
+    planner's context_pack.  Advisory — the planner can override
+    when context demands more exploration.
+
     Two-phase approach:
     1) Explore and understand the repository (on the correct branch)
     2) Create a plan based on actual repository state
@@ -285,6 +384,12 @@ async def generate_plan(
     if context_pack:
         logger.info("[GitPilot] Context pack loaded (%d chars)", len(context_pack))
 
+    # Batch B9 — append the API-layer router's strategy hint so the
+    # planner sees the recommended intent / target files / tool order.
+    if routing_hint:
+        context_pack = (context_pack or "") + ("\n\n" if context_pack else "") + routing_hint
+        logger.info("[GitPilot] Router hint injected (%d chars)", len(routing_hint))
+
     # PHASE 1: Explore repository (correct branch)
     logger.info("[GitPilot] Phase 1: Exploring repository %s (ref=%s)...", repo_full_name, active_ref)
 
@@ -295,10 +400,49 @@ async def generate_plan(
         active_ref,
     )
 
+    # Batch B6: pin a compact "repo map" into the planner's context.
+    # Same idea Aider, Cursor and Claude Code use — give the planner a
+    # high-level site map (key files + modules + language histogram)
+    # in <= 500 tokens, persisted to disk so we don't rebuild it on
+    # every turn.  Best-effort: a failure here must never block the
+    # planner.
+    try:
+        from . import flags as _flags
+        from .repo_map import FLAG_REPO_MAP, build_repo_map
+
+        if _flags.is_on(FLAG_REPO_MAP, default=True):
+            _all_files = list(repo_context_data.get("all_files") or [])
+            if _all_files:
+                _map = build_repo_map(
+                    owner=owner, repo=repo, branch=active_ref or "HEAD",
+                    paths=_all_files,
+                )
+                if _map.agents_md:
+                    context_pack = (context_pack or "") + (
+                        "\n\n" if context_pack else ""
+                    ) + _map.agents_md
+                    logger.info(
+                        "[GitPilot] Repo map pinned (%d tokens, %d modules, %d key files)",
+                        len(_map.agents_md.split()),  # rough proxy
+                        len(_map.modules),
+                        len(_map.key_files),
+                    )
+    except Exception as _map_err:  # pragma: no cover - defensive
+        logger.debug("[GitPilot] repo map injection skipped: %s", _map_err)
+
+    # Batch B12 — when ``lean_prompts`` is on, every persona / task
+    # description is sourced from ``gitpilot.agent_prompts`` so prompt
+    # budgets are pinned by tests and never accidentally bloated.
+    from . import agent_prompts as _ap
+
+    _lean = _ap.lean_prompts_enabled()
+
     explorer = _crewai()["Agent"](
         role="Repository Explorer",
-        goal="Thoroughly explore and document the current state of the repository",
-        backstory=(
+        goal=_ap.EXPLORER_GOAL if _lean else (
+            "Thoroughly explore and document the current state of the repository"
+        ),
+        backstory=_ap.EXPLORER_BACKSTORY if _lean else (
             "You are a meticulous code archaeologist who explores repositories "
             "to understand their complete structure before any changes are made. "
             "You use all available tools to build a comprehensive picture."
@@ -309,8 +453,13 @@ async def generate_plan(
         allow_delegation=False,
     )
 
-    explore_task = _crewai()["Task"](
-        description=dedent(f"""
+    if _lean:
+        _explore_description = _ap.render_explorer_task(
+            repo_full_name=repo_full_name, active_ref=active_ref,
+        )
+        _explore_expected = "A repository exploration report in the documented format"
+    else:
+        _explore_description = dedent(f"""
             Repository: {repo_full_name}
             Active Ref (branch/tag/SHA): {active_ref}
 
@@ -338,8 +487,14 @@ async def generate_plan(
             File Types: [count files by extension]
 
             Your report MUST be based on ACTUAL tool calls, not assumptions.
-        """),
-        expected_output="A detailed exploration report listing ALL files found in the repository",
+        """)
+        _explore_expected = (
+            "A detailed exploration report listing ALL files found in the repository"
+        )
+
+    explore_task = _crewai()["Task"](
+        description=_explore_description,
+        expected_output=_explore_expected,
         agent=explorer,
     )
 
@@ -375,34 +530,64 @@ def _explore():
             "request, or switch to a stronger LLM via Settings → Provider."
         ) from exc
 
-    exploration_report = exploration_result.raw if hasattr(exploration_result, "raw") else str(exploration_result)
-    logger.info("[GitPilot] Exploration complete. Report length: %s chars", len(exploration_report))
+    exploration_report_raw = exploration_result.raw if hasattr(exploration_result, "raw") else str(exploration_result)
+    logger.info("[GitPilot] Exploration complete. Report length: %s chars", len(exploration_report_raw))
+
+    # Batch B5: protect the planner's context by compressing the
+    # explorer's free-form report into a fixed-budget summary.  When
+    # the report already fits (small repos, small models) this is a
+    # no-op; on big repos it can shave 3–6 KB off the planner prompt
+    # without losing any concrete file paths.
+    try:
+        from .explorer_summary import compress_exploration_report
+
+        exploration_report, _exp_metrics = compress_exploration_report(exploration_report_raw)
+        if _exp_metrics.compressed_tokens < _exp_metrics.original_tokens:
+            logger.info(
+                "[GitPilot] Compressed exploration report: %d → %d tokens "
+                "(%d/%d files kept)",
+                _exp_metrics.original_tokens,
+                _exp_metrics.compressed_tokens,
+                _exp_metrics.files_kept,
+                _exp_metrics.files_in_original,
+            )
+    except Exception as _exp_err:  # pragma: no cover - defensive
+        logger.debug("[GitPilot] explorer compression failed: %s", _exp_err)
+        exploration_report = exploration_report_raw
 
     # PHASE 2: Plan creation based on exploration
     logger.info("[GitPilot] Phase 2: Creating plan based on repository exploration (ref=%s)...", active_ref)
 
     # Build planner backstory with optional context pack injection
-    _planner_backstory = (
-        "You are an experienced staff engineer who creates plans based on FACTS, not assumptions. "
-        "You have received a complete exploration report of the repository. "
-        "You ONLY create plans for files that actually exist in the exploration report. "
-        "You are extremely careful with DELETE actions - you verify the file exists "
-        "and that it's not on the 'keep' list before marking it for deletion. "
-        "When users ask to delete files, you delete individual FILES, not directory names. "
-        "When users ask to ANALYZE files and GENERATE new content (code, docs, examples), "
-        "you create plans that READ existing files and CREATE new files with generated content. "
-        "You understand that 'analyze X and create Y' means: use tools to read X, then plan to CREATE Y. "
-        "You never make changes yourself, only create detailed plans."
-    )
-    if context_pack:
+    if _lean:
+        _planner_backstory = _ap.PLANNER_BACKSTORY
+        _planner_goal = _ap.PLANNER_GOAL
+    else:
+        _planner_backstory = (
+            "You are an experienced staff engineer who creates plans based on FACTS, not assumptions. "
+            "You have received a complete exploration report of the repository. "
+            "You ONLY create plans for files that actually exist in the exploration report. "
+            "You are extremely careful with DELETE actions - you verify the file exists "
+            "and that it's not on the 'keep' list before marking it for deletion. "
+            "When users ask to delete files, you delete individual FILES, not directory names. "
+            "When users ask to ANALYZE files and GENERATE new content (code, docs, examples), "
+            "you create plans that READ existing files and CREATE new files with generated content. "
+            "You understand that 'analyze X and create Y' means: use tools to read X, then plan to CREATE Y. "
+            "You never make changes yourself, only create detailed plans."
+        )
+        _planner_goal = (
+            "Design safe, step-by-step refactor plans based on ACTUAL repository state "
+            "discovered during exploration"
+        )
+    # The whole context_pack (including the B6 repo map and B9 routing
+    # hint) is only appended to the backstory in non-lean mode; on small
+    # models it bloats the prompt and pushes the JSON-schema rules out of the attention window.
+    if context_pack and not _lean:
         _planner_backstory += "\n\n" + context_pack
 
     planner = _crewai()["Agent"](
         role="Repository Refactor Planner",
-        goal=(
-            "Design safe, step-by-step refactor plans based on ACTUAL repository state "
-            "discovered during exploration"
-        ),
+        goal=_planner_goal,
         backstory=_planner_backstory,
         llm=llm,
         tools=_tools()["REPOSITORY_TOOLS"],
@@ -410,8 +595,20 @@ def _explore():
         allow_delegation=False,
     )
 
-    plan_task = _crewai()["Task"](
-        description=dedent(f"""
+    if _lean:
+        # Use the per-intent compact template from agent_prompts.
+        # Pass the verified file list directly so the planner sees the
+        # facts block at the bottom of the prompt — highest attention
+        # weight on small models.
+        _plan_description = _ap.render_plan_task(
+            goal="{goal}",     # CrewAI inputs substitution happens later
+            repo_full_name=repo_full_name,
+            active_ref=active_ref or "HEAD",
+            file_list=list(repo_context_data.get("all_files") or []),
+            intent=intent,
+        )
+    else:
+        _plan_description = dedent(f"""
             User goal: {{goal}}
             Repository: {repo_full_name}
             Active Ref (branch/tag/SHA): {active_ref}
@@ -485,7 +682,9 @@ def _explore():
             - Do NOT wrap the JSON in markdown code fences
             - Do NOT add any explanation before or after the JSON
             - The ENTIRE response MUST be ONLY the JSON object, starting with '{{' and ending with '}}'
-        """),
+        """)
+    plan_task = _crewai()["Task"](
+        description=_plan_description,
         expected_output=dedent("""
             A single valid JSON object matching the PlanResult schema:
             - goal: string
@@ -736,9 +935,19 @@ async def generate_plan_lite(
     repo_full_name: str,
     token: str | None = None,
     branch_name: str | None = None,
+    *,
+    routing_hint: str | None = None,
+    intent: str | None = None,
 ) -> PlanResult:
     """Lite Mode planning: smart intent detection + single agent + pre-fetched context.
 
+    ``routing_hint`` is accepted for signature parity with
+    :func:`generate_plan`.  Lite Mode has its own simpler routing
+    via regex intent classification, so the hint is currently
+    treated as advisory metadata only — it does not change the
+    Lite planner's behaviour.  Kept here so call sites can use a
+    single signature for both planners.
+
     The topology is:
       1. Classify intent (regex — instant, no LLM)
       2. Pre-fetch repo context from GitHub API (no LLM tool-calling)
@@ -1078,6 +1287,14 @@ def _modify():
                 elif file.action == "READ":
                     step_summary += f"\n  i Inspected {file.path}"
 
+                elif file.action == "INDEX":
+                    # Batch B9 — INDEX is a special plan step that
+                    # triggers the local RAG index build for this repo.
+                    summary_line = await _execute_index_action(
+                        owner, repo, token=token, branch_name=branch_name,
+                    )
+                    step_summary += f"\n  {summary_line}"
+
             except Exception as e:
                 logger.exception("Lite: Error processing %s: %s", file.path, e)
                 step_summary += f"\n  ! Error: {file.path}: {e}"
@@ -1129,10 +1346,16 @@ async def execute_plan(
     # CRITICAL: ensure tools read from the ACTIVE execution branch
     _tools()["set_repo_context"](owner, repo, token=token, branch=branch_name)
 
+    # Batch B12 — lean persona from agent_prompts when the flag is on.
+    from . import agent_prompts as _ap
+    _lean_writer = _ap.lean_prompts_enabled()
+
     code_writer = _crewai()["Agent"](
         role="Expert Code Writer",
-        goal="Generate high-quality, production-ready code and documentation based on requirements.",
-        backstory=(
+        goal=_ap.CODE_WRITER_GOAL if _lean_writer else (
+            "Generate high-quality, production-ready code and documentation based on requirements."
+        ),
+        backstory=_ap.CODE_WRITER_BACKSTORY if _lean_writer else (
             "You are a senior software engineer with expertise in multiple programming languages. "
             "You write clean, well-documented, and functional code. "
             "You understand context and generate appropriate content for each file type. "
@@ -1155,8 +1378,14 @@ async def execute_plan(
         for file in step.files:
             try:
                 if file.action == "CREATE":
-                    create_task = _crewai()["Task"](
-                        description=(
+                    if _lean_writer:
+                        _create_description = _ap.render_create_file_task(
+                            file_path=file.path,
+                            goal=plan.goal,
+                            step_description=step.description,
+                        )
+                    else:
+                        _create_description = (
                             f"Generate complete content for a new file: {file.path}\n\n"
                             f"Overall Goal: {plan.goal}\n"
                             f"Step Context: {step.description}\n\n"
@@ -1177,8 +1406,10 @@ async def execute_plan(
                             "- Do NOT include placeholder comments like 'TODO' or 'IMPLEMENT THIS'\n"
                             "- The content should be fully functional and informative\n\n"
                             "Return ONLY the file content, no explanations or markdown code blocks."
-                        ),
-                        expected_output=f"Complete, production-ready content for {file.path}",
+                        )
+                    create_task = _crewai()["Task"](
+                        description=_create_description,
+                        expected_output=f"Complete content for {file.path}",
                         agent=code_writer,
                     )
 
@@ -1302,6 +1533,13 @@ def _modify():
                 elif file.action == "READ":
                     step_summary += f"\n  ℹ️ READ-only: inspected {file.path}"
 
+                elif file.action == "INDEX":
+                    # Batch B9 — triggers the per-repo RAG index build.
+                    summary_line = await _execute_index_action(
+                        owner, repo, token=token, branch_name=branch_name,
+                    )
+                    step_summary += f"\n  {summary_line}"
+
             except Exception as e:  # noqa: BLE001
                 logger.exception(
                     "Error processing file %s in step %s: %s",
@@ -1440,11 +1678,18 @@ def _build_terminal_agent(llm) -> Agent:
         role="Terminal & Shell Executor",
         goal="Execute shell commands safely in the workspace and report results",
         backstory=(
-            "You are a terminal expert that runs shell commands in a sandboxed "
-            "environment. You can run tests, linters, build tools, and other "
-            "development commands. You always report exit codes and output. "
-            "You refuse to run destructive commands like rm -rf / or format disks. "
-            "You explain command output clearly to the user."
+            "You are a terminal expert that runs shell commands in the "
+            "sandbox the user picked in Settings (local subprocess by "
+            "default, MatrixLab for containerised enterprise isolation). "
+            "Both run_command and run_in_sandbox route through the same "
+            "backend, so the user's runtime choice applies to your "
+            "autonomous loop too — not just to the Run button in chat. "
+            "Use run_command for workspace commands (tests, linters, "
+            "builds) and run_in_sandbox(language, code) when you want "
+            "to validate a self-contained snippet before returning it. "
+            "Always report the exit code and surface stderr verbatim "
+            "when a run fails: the trace is your debugging signal. "
+            "You refuse destructive commands like 'rm -rf /' or 'mkfs'. "
         ),
         llm=llm,
         tools=_tools()["LOCAL_SHELL_TOOLS"] + _tools()["LOCAL_GIT_TOOLS"],
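
Reviewer note on the `gitpilot/api.py` changes that follow: the `_track_task` decorator starts from `status = "failed"` and flips to `"completed"` only after the handler returns, so every exception path is recorded without extra bookkeeping. Below is a minimal, self-contained sketch of that pattern. `RECORDED`, `track_task`, and `make_plan` are hypothetical stand-ins for illustration only; the real code delegates to `task_recorder.begin_task` / `finish_task`, which are not reproduced here.

```python
import asyncio
import functools

# Hypothetical in-memory stand-in for the task recorder (not GitPilot's API).
RECORDED = []


def track_task(kind, title_fn=None):
    """Record one entry per handler call, defaulting to 'failed'."""
    def deco(handler):
        @functools.wraps(handler)
        async def wrapper(req, *args, **kwargs):
            try:
                raw_title = title_fn(req) if title_fn else None
            except Exception:
                raw_title = None  # a broken title_fn must not block the call
            entry = {
                "kind": kind,
                "title": (raw_title or kind.title())[:160],
                "status": "failed",
                "error": None,
            }
            try:
                result = await handler(req, *args, **kwargs)
                entry["status"] = "completed"
                return result
            except Exception as exc:
                entry["error"] = str(exc)
                raise
            finally:
                # Runs on success and on failure, so no path is missed.
                RECORDED.append(entry)
        return wrapper
    return deco


@track_task("plan", title_fn=lambda req: req["goal"])
async def make_plan(req):
    if not req["goal"]:
        raise ValueError("empty goal")
    return {"steps": []}


asyncio.run(make_plan({"goal": "add tests"}))
```

Because the entry is appended in `finally`, a raised exception still reaches the caller unchanged while the trace shows the failure and its message.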
diff --git a/gitpilot/api.py b/gitpilot/api.py
index 0107bce..75fcc85 100644
--- a/gitpilot/api.py
+++ b/gitpilot/api.py
@@ -311,6 +311,17 @@ def _env_bool(name: str, default: bool) -> bool:
 except Exception:  # noqa: BLE001
     logger.exception("MCP admin API failed to mount; tab will show as unavailable")
 
+# Sandbox runtime API (Settings → Sandbox runtime, Run button on chat
+# code blocks).  Mounting is non-fatal so a partial deployment can still
+# serve chat / planner endpoints if this module fails to import.
+try:
+    from .sandbox_api import router as sandbox_router
+
+    app.include_router(sandbox_router)
+    logger.info("Sandbox API enabled (mounting /api/sandbox/* endpoints)")
+except Exception:  # noqa: BLE001
+    logger.exception("Sandbox API failed to mount; Run button will be disabled")
+
 # GitPilot-as-MCP-server (turns GitPilot into an MCP server other agents
 # can drive). Off by default; mount only when GITPILOT_EXPOSE_MCP_SERVER=true.
 try:
@@ -591,7 +602,32 @@ def _build_local_repo_aware_prompt(req, session) -> str:
         "- Output the COMPLETE file content, not just a snippet.\n"
         "- For edits to existing files, output the full updated file.\n"
         "- Be explicit about which files to create or modify and why.\n"
-        "- Prefer incremental, production-safe changes over large rewrites."
+        "- Prefer incremental, production-safe changes over large rewrites.\n"
+        "\n"
+        "RUNNABLE EXAMPLES (separate from file-output fences):\n"
+        "When the user asks for a small example they could try out — "
+        "\"write a hello-world\", \"give me a snippet that ...\", "
+        "\"show me how to call X\" — emit the example as a fenced block "
+        "with ONLY the language on the opening line (no filepath):\n"
+        "\n"
+        "  ```python\n"
+        "  print('Hello, world!')\n"
+        "  ```\n"
+        "\n"
+        "  ```javascript\n"
+        "  console.log('Hello, world!');\n"
+        "  ```\n"
+        "\n"
+        "  ```bash\n"
+        "  echo 'Hello, world!'\n"
+        "  ```\n"
+        "\n"
+        "The chat UI shows a per-block ▶ Run button next to these "
+        "snippets and executes them in the user's selected sandbox "
+        "(local subprocess or MatrixLab). Supported languages: python, "
+        "javascript (or js/node), bash (or sh/shell). Keep snippets "
+        "self-contained — they run in a fresh tempdir with no project "
+        "files mounted — and short enough to read at a glance."
     )
 
     sections = [system_block]
@@ -701,6 +737,10 @@ class SettingsResponse(BaseModel):
     ollabridge: dict
     langflow_url: str
     has_langflow_plan_flow: bool
+    # Sandbox runtime selection — populated by settings_response_from.  The
+    # field is Optional so older serialised payloads continue to validate
+    # even though the runtime always writes a value today.
+    sandbox: Optional[dict] = None
 
 
 class ProviderModelsResponse(BaseModel):
@@ -718,6 +758,16 @@ class ChatPlanRequest(BaseModel):
     repo_name: str
     goal: str
     branch_name: Optional[str] = None
+    # Optional: when present, the planner invocation is recorded as a
+    # Task on the active session so the right-sidebar Tasks panel can
+    # trace it.  Older frontends that omit this field continue to work
+    # — no task is recorded, no error raised.
+    session_id: Optional[str] = None
+    # Batch B9: set by the post-Reject "retry with grep" path so the
+    # router suppresses RAG / INDEX recommendations on the next
+    # attempt of the same goal.  Default False — older frontends are
+    # unaffected.
+    force_no_rag: bool = False
 
 
 class ExecutePlanRequest(BaseModel):
@@ -725,6 +775,12 @@ class ExecutePlanRequest(BaseModel):
     repo_name: str
     plan: PlanResult
     branch_name: Optional[str] = None
+    # Optional: when present, the active session's `branch` (and the
+    # matching `repos[i].branch`) is updated to the branch the executor
+    # actually wrote to, so reopening the session jumps to that branch
+    # instead of the one it was created on.  Older frontends that omit
+    # this field continue to work — no session update is attempted.
+    session_id: Optional[str] = None
 
 
 class AuthUrlResponse(BaseModel):
@@ -1017,6 +1073,13 @@ async def api_put_file(
 # ============================================================================
 
 def settings_response_from(s: AppSettings) -> SettingsResponse:
+    sandbox_dump = s.sandbox.model_dump()
+    # Strip the secret value before it leaves the process — the frontend
+    # only needs to know whether a token is configured, not the token
+    # itself.  Keeps GET /api/settings safe to log and to surface in the
+    # browser devtools.
+    token = sandbox_dump.pop("matrixlab_token", "")
+    sandbox_payload = {**sandbox_dump, "has_token": bool(token)}
     return SettingsResponse(
         provider=s.provider,
         providers=[
@@ -1033,6 +1096,7 @@ def settings_response_from(s: AppSettings) -> SettingsResponse:
         ollabridge=s.ollabridge.model_dump(),
         langflow_url=s.langflow_url,
         has_langflow_plan_flow=bool(s.langflow_plan_flow_id),
+        sandbox=sandbox_payload,
     )
 
 
@@ -1213,8 +1277,110 @@ async def api_context_usage(session_id: Optional[str] = Query(None)):
 # Chat Endpoints
 # ============================================================================
 
+
+def _track_task(*, kind: str, title_fn=None):
+    """Decorator: wrap a chat endpoint so its run is recorded as a Task
+    on the active session (right-sidebar trace).
+
+    Reads ``session_id`` directly off the request model.  ``title_fn``
+    is a small callable that derives the human title from the request
+    object — keeps the decorator decoupled from any specific schema.
+    Endpoints whose requests don't carry a session_id behave exactly
+    as before — no Task is recorded, no error is raised.
+    """
+    import functools
+
+    from .task_recorder import begin_task as _begin_task
+    from .task_recorder import finish_task as _finish_task
+
+    def _default_title(_req):
+        return kind.title()
+
+    extract_title = title_fn or _default_title
+
+    def deco(handler):
+        @functools.wraps(handler)
+        async def wrapper(req, *args, **kwargs):
+            session_id = getattr(req, "session_id", None)
+            try:
+                raw_title = extract_title(req)
+            except Exception:
+                raw_title = None
+            title = (raw_title or kind.title())[:160]
+            task = _begin_task(_session_mgr, session_id, kind=kind, title=title)
+            status = "failed"
+            err: Optional[str] = None
+            try:
+                result = await handler(req, *args, **kwargs)
+                status = "completed"
+                return result
+            except HTTPException as exc:
+                # HTTPException paths are still "failed" from the
+                # tasks-panel point of view (the user did not get a
+                # plan / commit).  Preserve the detail as the error.
+                err = str(exc.detail) if exc.detail else None
+                raise
+            except Exception as exc:
+                err = str(exc)
+                raise
+            finally:
+                _finish_task(
+                    _session_mgr,
+                    session_id,
+                    task,
+                    status=status,
+                    error=err,
+                )
+        return wrapper
+    return deco
+
+
+def _maybe_compact_session_for_request(session_id: Optional[str]) -> None:
+    """Best-effort auto-compaction hook (Batch B3).
+
+    Called at the start of /api/chat/plan + /api/chat/execute.  If the
+    persisted session is over 70 % of the active model's context
+    window, fold the older messages into a single summary entry and
+    record a Task row so the user sees what happened.  A failure here
+    must never block the agent run.
+    """
+    if not session_id:
+        return
+    try:
+        from .auto_compact import maybe_compact_session
+        from .context_meter import resolve_context_window
+        from .task_recorder import begin_task, finish_task
+
+        s = get_settings()
+        window = resolve_context_window(s)
+        report = maybe_compact_session(
+            _session_mgr, session_id, context_window=window
+        )
+        if report.compacted:
+            # Surface the compaction in the right-sidebar trace so the
+            # operator can see "Conversation summarised 24 → 1" rather
+            # than wonder where their messages went.
+            task = begin_task(
+                _session_mgr, session_id,
+                kind="compact",
+                title=(
+                    f"Compacted: {report.messages_folded} older messages "
+                    f"({report.before_tokens} → {report.after_tokens} tokens)"
+                ),
+            )
+            finish_task(
+                _session_mgr, session_id, task,
+                status="completed",
+                prompt_tokens=report.after_tokens,
+            )
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("[compact] hook failed: %s", exc)
+
+
 @app.post("/api/chat/plan")
+@_track_task(kind="plan", title_fn=lambda req: req.goal)
 async def api_chat_plan(req: ChatPlanRequest, authorization: Optional[str] = Header(None)):
+    _maybe_compact_session_for_request(req.session_id)
     token = get_github_token(authorization)
 
     logger.info(
@@ -1230,8 +1396,57 @@ async def api_chat_plan(req: ChatPlanRequest, authorization: Optional[str] = Hea
         # Use lite planner when Lite Mode is active (setting OR topology)
         planner = generate_plan_lite if _is_lite_mode_active() else generate_plan
 
+        # Batch B9 — deterministic query router.  Runs BEFORE the LLM
+        # so even small models that pick poorly without guidance see
+        # a strategy hint up front.  Best-effort: any failure falls
+        # back to today's no-hint behaviour rather than 500-ing.
+        routing_hint = None
+        routing_intent: Optional[str] = None
+        try:
+            from . import flags as _flags
+            if _flags.is_on("query_router", default=True):
+                from .query_router import classify, render_planner_hint
+                from .rag_consent import has_consent
+
+                # Cheap path: a flat list of repo files for the
+                # classifier's path-verification step.  Failure is
+                # tolerated — router falls back to "no targets".
+                repo_paths: list[str] = []
+                try:
+                    from .github_api import get_repo_tree
+                    _tree = await get_repo_tree(
+                        req.repo_owner, req.repo_name,
+                        token=token, ref=req.branch_name,
+                    )
+                    repo_paths = [t["path"] for t in (_tree or []) if t.get("path")]
+                except Exception as _tree_err:
+                    logger.debug("[router] repo tree unavailable: %s", _tree_err)
+
+                rag_index_present = (
+                    has_consent(req.repo_owner, req.repo_name)
+                )
+
+                decision = classify(
+                    req.goal,
+                    repo_files=repo_paths,
+                    rag_index_exists=rag_index_present,
+                    force_no_rag=bool(req.force_no_rag),
+                )
+                routing_hint = render_planner_hint(decision)
+                routing_intent = decision.intent
+                logger.info("[router] %s", decision.rationale)
+        except Exception as _route_err:  # pragma: no cover - defensive
+            logger.debug("[router] skipped: %s", _route_err)
+            routing_hint = None
+            routing_intent = None
+
         try:
-            plan = await planner(req.goal, full_name, token=token, branch_name=req.branch_name)
+            plan = await planner(
+                req.goal, full_name,
+                token=token, branch_name=req.branch_name,
+                routing_hint=routing_hint,
+                intent=routing_intent,
+            )
             return plan
         except Exception as exc:
             error_msg = str(exc)
@@ -1308,6 +1523,8 @@ async def api_chat_plan(req: ChatPlanRequest, authorization: Optional[str] = Hea
                         full_name,
                         token=token,
                         branch_name=req.branch_name,
+                        routing_hint=routing_hint,
+                        intent=routing_intent,
                     )
                 except Exception as lite_exc:
                     logger.exception(
@@ -1343,10 +1560,15 @@ async def api_chat_plan(req: ChatPlanRequest, authorization: Optional[str] = Hea
 
 
 @app.post("/api/chat/execute")
+@_track_task(
+    kind="execute",
+    title_fn=lambda req: getattr(getattr(req, "plan", None), "goal", None) or "Execute plan",
+)
 async def api_chat_execute(
     req: ExecutePlanRequest,
     authorization: Optional[str] = Header(None)
 ):
+    _maybe_compact_session_for_request(req.session_id)
     token = get_github_token(authorization)
 
     with execution_context(token, ref=req.branch_name):
@@ -1408,6 +1630,39 @@ async def api_chat_execute(
                 "mode",
                 "sticky" if req.branch_name else "hard-switch",
             )
+
+        # Persist the branch the executor actually wrote to onto the
+        # session record so reopening this session jumps back to that
+        # branch (instead of the master/default it was created on).
+        # Best-effort: a failure to update the session must never block
+        # the user-facing execute result.
+        new_branch = (
+            result.get("branch") if isinstance(result, dict) else None
+        ) or req.branch_name
+        if req.session_id and new_branch:
+            try:
+                session = _session_mgr.load(req.session_id)
+                session.branch = new_branch
+                # Multi-repo support: update the matching repos[] entry
+                # too if it exists, so callers that read from there see
+                # a consistent value.
+                if session.repos:
+                    for entry in session.repos:
+                        if entry.get("full_name") == full_name:
+                            entry["branch"] = new_branch
+                _session_mgr.save(session)
+            except FileNotFoundError:
+                logger.debug(
+                    "[exec] session %s not found — skipping branch persist",
+                    req.session_id,
+                )
+            except Exception as exc:  # pragma: no cover - defensive
+                logger.warning(
+                    "[exec] could not persist branch on session %s: %s",
+                    req.session_id,
+                    exc,
+                )
+
         return result
 
 
@@ -3017,6 +3272,45 @@ async def api_get_session_messages(session_id: str):
     }
 
 
+@app.get("/api/sessions/{session_id}/tasks")
+async def api_get_session_tasks(session_id: str):
+    """Return the right-sidebar Tasks trace for one session.
+
+    Read-only.  Gated behind the ``tasks_sidebar`` flag — when off the
+    endpoint 404s so an old frontend can detect "feature absent" with
+    the same code path it uses for "session deleted".
+    """
+    from . import flags
+    from .task_recorder import FLAG_TASKS_SIDEBAR
+
+    if not flags.is_on(FLAG_TASKS_SIDEBAR, default=True):
+        raise HTTPException(status_code=404, detail="Tasks sidebar is disabled")
+
+    try:
+        session = _session_mgr.load(session_id)
+    except FileNotFoundError:
+        raise HTTPException(status_code=404, detail="Session not found") from None
+
+    return {
+        "session_id": session.id,
+        "tasks": [
+            {
+                "id": t.id,
+                "kind": t.kind,
+                "title": t.title,
+                "status": t.status,
+                "started_at": t.started_at,
+                "completed_at": t.completed_at,
+                "duration_ms": t.duration_ms,
+                "prompt_tokens": t.prompt_tokens,
+                "completion_tokens": t.completion_tokens,
+                "error": t.error,
+            }
+            for t in session.tasks
+        ],
+    }
+
+
 @app.get("/api/sessions/{session_id}/diff")
 async def api_get_session_diff(session_id: str):
     """Get diff stats for a session (placeholder for sandbox integration)."""
diff --git a/gitpilot/auto_compact.py b/gitpilot/auto_compact.py
new file mode 100644
index 0000000..c434909
--- /dev/null
+++ b/gitpilot/auto_compact.py
@@ -0,0 +1,217 @@
+"""Auto-compaction of chat session history (Batch B3).
+
+When a session's persisted conversation crosses 70 % of the active
+model's context window (less the reserved response headroom), we fold
+the older non-essential messages into a single summary entry, the same
+strategy Claude Code, Cursor and Continue use to keep sessions usable
+across many turns without silently truncating mid-stream.
+
+Design notes:
+
+* **Pure Python.**  We reuse :mod:`gitpilot.context_budget`'s
+  deterministic ``_default_summariser`` so compaction never depends
+  on a live LLM call.  Production deployments can later inject a
+  smarter summariser without changing this module's interface.
+* **Append-only audit trail.**  Each compaction also lands as a
+  ``kind="compact"`` Task in the right-sidebar trace so the user can
+  see "Conversation summarised: 24 messages → 1 summary".
+* **Idempotent.**  We tag the summary message with
+  ``metadata["compacted"] = "1"`` so a later pass over already-compacted
+  history doesn't fold the same content again.
+* **Best-effort.**  Failure to load or save the session must never
+  block the user-facing endpoint — log and proceed.  The chat
+  continues to work; it just won't shrink this turn.
+
+Wired in at the API boundary in :mod:`gitpilot.api` (``/api/chat/plan``
++ ``/api/chat/execute``), so agentic.py is untouched.
+"""
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+from typing import Optional
+
+from . import flags
+from .context_budget import (
+    BudgetPolicy,
+    Message as BudgetMessage,
+    _default_summariser,
+    estimate_tokens,
+)
+from .session import Message as SessionMessage, Session, SessionManager
+
+logger = logging.getLogger(__name__)
+
+FLAG_AUTO_COMPACT = "auto_compact"
+
+# Tunable knobs.  Centralised so future tuning (or per-provider
+# overrides) doesn't require code changes in two places.
+DEFAULT_CONDENSE_AT_RATIO = 0.70   # fire at 70 % of window
+DEFAULT_KEEP_RECENT_TURNS = 6      # last N messages always preserved
+DEFAULT_RESERVED_RESPONSE = 4_096  # mirror context_meter constant
+COMPACTED_FLAG = "compacted"
+SUMMARY_LABEL = "Conversation summary (older turns condensed)"
+
+
+@dataclass
+class CompactionReport:
+    """Returned by :func:`maybe_compact_session` so the caller can log
+    a Task entry with concrete before/after numbers."""
+    compacted: bool = False
+    before_tokens: int = 0
+    after_tokens: int = 0
+    messages_folded: int = 0
+    reason: Optional[str] = None      # human-readable explanation
+
+
+# ----------------------------------------------------------------------
+# Internal helpers
+# ----------------------------------------------------------------------
+
+def _budget_messages_from_session(session: Session) -> list[BudgetMessage]:
+    """Bridge SessionMessage → BudgetMessage."""
+    out: list[BudgetMessage] = []
+    for m in session.messages:
+        role = m.role if m.role in ("user", "assistant", "system", "tool") else "user"
+        importance = "pinned" if (m.metadata or {}).get(COMPACTED_FLAG) == "1" else "normal"
+        # Best-effort role narrowing — BudgetMessage role is a Literal.
+        out.append(
+            BudgetMessage(
+                role=role,  # type: ignore[arg-type]
+                content=m.content or "",
+                importance=importance,  # type: ignore[arg-type]
+            )
+        )
+    return out
+
+
+def _session_total_tokens(session: Session) -> int:
+    return sum(estimate_tokens(m.content or "") for m in session.messages)
+
+
+# ----------------------------------------------------------------------
+# Public entry point
+# ----------------------------------------------------------------------
+
+def maybe_compact_session(
+    session_mgr: SessionManager,
+    session_id: Optional[str],
+    *,
+    context_window: int,
+    reserved_response: int = DEFAULT_RESERVED_RESPONSE,
+    condense_at_ratio: float = DEFAULT_CONDENSE_AT_RATIO,
+    keep_recent_turns: int = DEFAULT_KEEP_RECENT_TURNS,
+) -> CompactionReport:
+    """Condense the session's history if it's crossed the threshold.
+
+    Returns a :class:`CompactionReport` so the caller can record a
+    Task entry with the concrete numbers.  A no-op report
+    (``compacted=False``) is returned silently when:
+
+    * the feature flag is off,
+    * no session id was supplied,
+    * the session can't be loaded,
+    * we're below the threshold,
+    * there are not enough non-recent messages to fold.
+    """
+    if not flags.is_on(FLAG_AUTO_COMPACT, default=True):
+        return CompactionReport(reason="flag off")
+    if not session_id:
+        return CompactionReport(reason="no session id")
+    if context_window <= 0:
+        return CompactionReport(reason="unknown context window")
+
+    try:
+        session = session_mgr.load(session_id)
+    except Exception as exc:
+        logger.debug("[compact] session %s not loadable: %s", session_id, exc)
+        return CompactionReport(reason="session not loadable")
+
+    before = _session_total_tokens(session)
+    # The user's *effective* budget excludes the reserved response
+    # headroom — that's the budget we actually need to keep below.
+    effective_window = max(0, context_window - reserved_response)
+    threshold = int(effective_window * condense_at_ratio)
+    if before < threshold:
+        return CompactionReport(
+            compacted=False,
+            before_tokens=before,
+            after_tokens=before,
+            reason="below threshold",
+        )
+
+    # Fold using the existing deterministic summariser.  We keep:
+    #   - any message already marked compacted (pinned)
+    #   - the last ``keep_recent_turns`` messages
+    # Everything else gets summarised into one system message.
+    msgs = session.messages
+    if len(msgs) <= keep_recent_turns + 1:
+        return CompactionReport(
+            compacted=False,
+            before_tokens=before,
+            after_tokens=before,
+            reason="not enough history to fold",
+        )
+
+    pinned = [m for m in msgs if (m.metadata or {}).get(COMPACTED_FLAG) == "1"]
+    rest = [m for m in msgs if (m.metadata or {}).get(COMPACTED_FLAG) != "1"]
+    keep_n = max(0, keep_recent_turns)
+    foldable = rest[:-keep_n] if keep_n else rest
+    kept = rest[-keep_n:] if keep_n else []
+
+    if not foldable:
+        return CompactionReport(
+            compacted=False,
+            before_tokens=before,
+            after_tokens=before,
+            reason="nothing foldable",
+        )
+
+    # Use BudgetMessage objects for the summariser — the existing
+    # summariser was written against that shape.
+    budget_foldable = [
+        BudgetMessage(
+            role=(m.role if m.role in ("user", "assistant", "system", "tool") else "user"),  # type: ignore[arg-type]
+            content=m.content or "",
+        )
+        for m in foldable
+    ]
+    summary_body = _default_summariser(budget_foldable)
+    summary_msg = SessionMessage(
+        role="system",
+        content=f"## {SUMMARY_LABEL}\n\n{summary_body}",
+        metadata={COMPACTED_FLAG: "1"},
+    )
+
+    session.messages = pinned + [summary_msg] + kept
+    after = _session_total_tokens(session)
+
+    try:
+        session_mgr.save(session)
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.warning("[compact] could not save session %s: %s", session_id, exc)
+        return CompactionReport(
+            compacted=False,
+            before_tokens=before,
+            after_tokens=after,
+            reason=f"save failed: {exc}",
+        )
+
+    return CompactionReport(
+        compacted=True,
+        before_tokens=before,
+        after_tokens=after,
+        messages_folded=len(foldable),
+        reason=f"folded {len(foldable)} older messages",
+    )
+
+
+__all__ = [
+    "FLAG_AUTO_COMPACT",
+    "CompactionReport",
+    "DEFAULT_CONDENSE_AT_RATIO",
+    "DEFAULT_KEEP_RECENT_TURNS",
+    "DEFAULT_RESERVED_RESPONSE",
+    "SUMMARY_LABEL",
+    "maybe_compact_session",
+]
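
Reviewer note on `gitpilot/auto_compact.py`: the threshold arithmetic and the pinned / foldable / kept split can be checked in isolation. The sketch below mirrors the logic of `maybe_compact_session` under stated assumptions: messages are plain dicts rather than `SessionMessage` objects, and `compaction_threshold` and `partition` are hypothetical helper names that do not exist in the module.

```python
def compaction_threshold(context_window, reserved_response=4096, ratio=0.70):
    """Token count at which folding starts.

    The ratio applies to the window left after reserving response headroom,
    not to the raw context window.
    """
    effective_window = max(0, context_window - reserved_response)
    return int(effective_window * ratio)


def partition(messages, keep_recent=6):
    """Split messages into (pinned, foldable, kept) as the fold step does."""
    pinned = [m for m in messages if m.get("compacted") == "1"]
    rest = [m for m in messages if m.get("compacted") != "1"]
    foldable = rest[:-keep_recent] if keep_recent else rest
    kept = rest[-keep_recent:] if keep_recent else []
    return pinned, foldable, kept


# Ten messages; the first is a summary left by an earlier compaction.
history = [{"id": i} for i in range(10)]
history[0]["compacted"] = "1"

pinned, foldable, kept = partition(history)
print(compaction_threshold(8192))                # 2867
print(len(pinned), len(foldable), len(kept))     # 1 3 6
```

With an 8 192-token window and the default 4 096-token reserve, folding starts at 2 867 tokens, well below 70 % of the raw window. When the window is no larger than the reserve the threshold is 0, so the "below threshold" early return never fires.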
diff --git a/gitpilot/edit_backend.py b/gitpilot/edit_backend.py
new file mode 100644
index 0000000..fde0c8f
--- /dev/null
+++ b/gitpilot/edit_backend.py
@@ -0,0 +1,317 @@
+"""Surgical edit operations for the executor (Batch B8).
+
+Pure text-in / text-out functions — no GitHub / disk I/O.  The agent
+tool wrappers in :mod:`gitpilot.agent_tools` are responsible for
+fetching the current file bytes (GitHub mode) or reading from disk
+(local mode), passing them through these helpers, and then writing
+the result back.
+
+Two operations:
+
+* :func:`apply_edit` — exact-string find-and-replace with
+  *strict occurrence validation*.  The model passes a small
+  ``old_string`` and a small ``new_string``; we refuse to apply
+  unless ``old_string`` occurs exactly the expected number of times.
+  Inspired by Claude Code's ``Edit`` tool — the contract that makes
+  fixing line 1 482 of a 2 000-line file reliable across any model.
+
+* :func:`apply_unified_diff` — parse a minimal subset of unified
+  diff and apply it by *matching the leading context lines* rather
+  than trusting the line numbers in the hunk header.  Line numbers
+  drift the moment another edit lands; context survives.  This is
+  the same trick Codex's ``apply_patch`` uses internally.
+
+Both functions raise :class:`EditError` with a precise, actionable
+message rather than silently mis-editing.  The executor must surface
+that error to the user and refuse to commit.
+"""
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from typing import List, Optional, Sequence, Tuple
+
+
+class EditError(ValueError):
+    """Raised when an edit cannot be applied safely.
+
+    The message is user-facing — keep it concrete: file path, what
+    failed, what the caller can do about it.  Never log a stack trace
+    in place of a clear sentence.
+    """
+
+
+@dataclass(frozen=True)
+class EditReport:
+    """Returned alongside the new content so callers can record a
+    Task row with concrete numbers (sizes count characters, not bytes)."""
+    occurrences_replaced: int
+    bytes_before: int
+    bytes_after: int
+
+
+# ----------------------------------------------------------------------
+# apply_edit — exact-string find-and-replace
+# ----------------------------------------------------------------------
+
+def apply_edit(
+    content: str,
+    *,
+    old_string: str,
+    new_string: str,
+    expected_occurrences: int = 1,
+) -> Tuple[str, EditReport]:
+    """Replace ``old_string`` with ``new_string`` in ``content``.
+
+    ``expected_occurrences`` is a *contract*: we will only apply the
+    edit when ``old_string`` appears in ``content`` exactly this many
+    times.  Any deviation raises :class:`EditError` — agents must
+    disambiguate by widening ``old_string`` or specifying the right
+    count.
+
+    Pass ``expected_occurrences=-1`` to allow any positive number of
+    matches; useful for "rename this identifier everywhere".
+
+    The function is pure: same inputs → same outputs, no I/O.
+    """
+    if old_string is None:
+        raise EditError("apply_edit: old_string is required")
+    if new_string is None:
+        raise EditError("apply_edit: new_string is required (use empty string to delete)")
+    if old_string == new_string:
+        raise EditError(
+            "apply_edit: old_string and new_string are identical — "
+            "no edit would be applied.  This is almost always a bug "
+            "in the planner; refuse rather than commit a no-op."
+        )
+
+    # old_string is matched literally (no regex), including whitespace
+    # and newlines.  An empty needle would match everywhere; reject it.
+    if old_string == "":
+        raise EditError("apply_edit: old_string must not be empty")
+
+    n = content.count(old_string)
+    if n == 0:
+        # Provide a short hint about why nothing matched: indentation
+        # mismatch is by far the most common cause on Python files.
+        hint = ""
+        if old_string.strip() and old_string.strip() in content:
+            hint = (
+                "  Hint: a stripped form of old_string IS present — "
+                "the indentation in your edit does not match the file. "
+                "Re-read the surrounding lines and copy them exactly."
+            )
+        raise EditError(
+            "apply_edit: old_string was not found in the file." + hint
+        )
+
+    if expected_occurrences == -1:
+        # "replace all" mode — at least one match suffices.
+        pass
+    elif n != expected_occurrences:
+        raise EditError(
+            f"apply_edit: old_string occurs {n} time(s) in the file, "
+            f"but expected_occurrences was {expected_occurrences}. "
+            "Widen old_string to include more surrounding context, or "
+            "set expected_occurrences to the correct number."
+        )
+
+    new_content = content.replace(old_string, new_string)
+    return new_content, EditReport(
+        occurrences_replaced=n,
+        bytes_before=len(content),
+        bytes_after=len(new_content),
+    )
+
+
+# ----------------------------------------------------------------------
+# apply_unified_diff — minimal patch parser + context-match applier
+# ----------------------------------------------------------------------
+
+_HUNK_HEADER_RE = re.compile(
+    r"^@@\s+-(?P<old_start>\d+)(?:,(?P<old_count>\d+))?\s+"
+    r"\+(?P<new_start>\d+)(?:,(?P<new_count>\d+))?\s+@@"
+)
+
+
+@dataclass
+class _Hunk:
+    """One hunk extracted from a unified diff."""
+    old_start: int          # 1-indexed line number from the @@ header
+    new_start: int
+    lines: List[str]        # raw lines including the leading char
+
+
+def _parse_unified_diff(diff: str) -> List[_Hunk]:
+    """Tolerant parser — accepts diffs with or without file headers
+    (``--- a/file`` / ``+++ b/file``).  We only care about the hunks.
+    """
+    hunks: List[_Hunk] = []
+    current: Optional[_Hunk] = None
+    for raw in diff.splitlines():
+        m = _HUNK_HEADER_RE.match(raw)
+        if m:
+            if current is not None:
+                hunks.append(current)
+            current = _Hunk(
+                old_start=int(m.group("old_start")),
+                new_start=int(m.group("new_start")),
+                lines=[],
+            )
+            continue
+        if current is None:
+            # Pre-hunk preamble (--- / +++ / diff --git) — ignore.
+            continue
+        if not raw:
+            # An empty line inside a hunk represents a context blank
+            # line (some tools emit a bare "\n" with no leading space).
+            current.lines.append(" ")
+            continue
+        prefix = raw[0]
+        if prefix in (" ", "+", "-"):
+            current.lines.append(raw)
+        elif prefix == "\\":
+            # "\ No newline at end of file" — silently skip.
+            continue
+        else:
+            # Foreign line inside a hunk — fail loudly so we never
+            # silently corrupt the file.
+            raise EditError(
+                f"apply_unified_diff: malformed hunk line: {raw!r}.  "
+                "Lines must start with ' ', '+' or '-'."
+            )
+    if current is not None:
+        hunks.append(current)
+    if not hunks:
+        raise EditError(
+            "apply_unified_diff: no @@ hunks found.  The diff appears empty "
+            "or only contains file headers."
+        )
+    return hunks
+
+
+def _hunk_old_block(hunk: _Hunk) -> List[str]:
+    """Return the contiguous list of pre-edit lines (the ones with
+    ``' '`` or ``'-'`` prefix).  These are what we match against."""
+    out: List[str] = []
+    for ln in hunk.lines:
+        if not ln:
+            out.append("")
+            continue
+        if ln[0] in (" ", "-"):
+            out.append(ln[1:])
+    return out
+
+
+def _hunk_new_block(hunk: _Hunk) -> List[str]:
+    """Return the contiguous list of post-edit lines (``' '`` or
+    ``'+'`` prefix)."""
+    out: List[str] = []
+    for ln in hunk.lines:
+        if not ln:
+            out.append("")
+            continue
+        if ln[0] in (" ", "+"):
+            out.append(ln[1:])
+    return out
+
+
+def _find_block(haystack: Sequence[str], needle: Sequence[str], near: int) -> int:
+    """Locate ``needle`` inside ``haystack`` as a contiguous slice.
+
+    Returns the 0-indexed start position.  Prefers a match near
+    ``near`` (1-indexed line from the hunk header, translated by the
+    caller) so when the file has several identical blocks the patch
+    lands close to where it was authored.  Raises if no exact match.
+    """
+    if not needle:
+        raise EditError("apply_unified_diff: empty hunk")
+    matches: List[int] = []
+    for i in range(0, len(haystack) - len(needle) + 1):
+        if list(haystack[i : i + len(needle)]) == list(needle):
+            matches.append(i)
+    if not matches:
+        raise EditError(
+            "apply_unified_diff: could not locate the hunk's context in "
+            "the file — the surrounding lines have drifted.  Re-read the "
+            "file and regenerate the diff."
+        )
+    if len(matches) == 1:
+        return matches[0]
+    # Multiple identical blocks — pick the one nearest to the hunk header.
+    target = max(0, near - 1)
+    return min(matches, key=lambda pos: abs(pos - target))
+
+
+def apply_unified_diff(content: str, diff: str) -> Tuple[str, EditReport]:
+    """Apply a unified diff to ``content`` by matching context lines
+    rather than trusting hunk line numbers.
+
+    Limitations (documented, intentional):
+
+    * Single-file diffs only.  If ``diff`` looks like a multi-file
+      patch (``diff --git`` separator with more than one file), the
+      caller must split it.
+    * No fuzz matching.  Context must match byte-for-byte.  Drift
+      caused by another concurrent edit raises :class:`EditError`
+      with an actionable message rather than silently mis-editing.
+    """
+    if diff is None or not diff.strip():
+        raise EditError("apply_unified_diff: diff is empty")
+
+    # Detect multi-file diffs BEFORE parsing so the parser doesn't
+    # trip on the second file's ``diff --git`` header.
+    gits = diff.count("\ndiff --git ") + (1 if diff.startswith("diff --git ") else 0)
+    if gits > 1:
+        raise EditError(
+            "apply_unified_diff: multi-file diff detected; this helper "
+            "applies to one file at a time.  Split the patch first."
+        )
+
+    hunks = _parse_unified_diff(diff)
+
+    lines = content.splitlines(keepends=True)
+    # Strip each line's trailing "\n" so block matching is line-exact;
+    # the output is re-joined with "\n" below.
+    raw_lines = [ln.rstrip("\n") for ln in lines]
+    # Preserve the original trailing newline state so we don't
+    # accidentally drop or add one.
+    had_trailing_newline = content.endswith("\n")
+
+    output: List[str] = list(raw_lines)
+    total_replacements = 0
+
+    # Apply hunks in order, tracking a running offset so the second
+    # hunk's context match accounts for earlier hunks' line-count
+    # changes.
+    offset = 0
+    for hunk in hunks:
+        old_block = _hunk_old_block(hunk)
+        new_block = _hunk_new_block(hunk)
+        near = hunk.old_start + offset
+        pos = _find_block(output, old_block, near=near)
+        output[pos : pos + len(old_block)] = new_block
+        offset += len(new_block) - len(old_block)
+        total_replacements += 1
+
+    new_content = "\n".join(output)
+    if had_trailing_newline and not new_content.endswith("\n"):
+        new_content += "\n"
+    elif not had_trailing_newline and new_content.endswith("\n"):
+        # Preserve "no newline at end of file" if the original lacked
+        # one — only when the diff didn't add it.
+        new_content = new_content.rstrip("\n")
+
+    return new_content, EditReport(
+        occurrences_replaced=total_replacements,
+        bytes_before=len(content),
+        bytes_after=len(new_content),
+    )
+
+
+__all__ = [
+    "EditError",
+    "EditReport",
+    "apply_edit",
+    "apply_unified_diff",
+]
diff --git a/gitpilot/explorer_summary.py b/gitpilot/explorer_summary.py
new file mode 100644
index 0000000..cdcf310
--- /dev/null
+++ b/gitpilot/explorer_summary.py
@@ -0,0 +1,286 @@
+"""Explorer-report compression (Batch B5).
+
+The Repository Explorer agent produces a free-form "REPOSITORY
+EXPLORATION REPORT" that grows linearly with the repo size.  On a
+200-file repo the file-listing section alone can run 4–6 KB — enough
+to crowd the planner's prompt on an 8 k-context model like
+llama3:8b.
+
+This module compresses that report into a fixed-budget summary the
+planner sees instead of the raw transcript.  Strict properties:
+
+* **Deterministic.**  No LLM call needed; the compression is pure
+  string manipulation.  Easy to test, reproducible across runs.
+* **Fact-preserving.**  Concrete file paths are kept verbatim (file
+  lists, key files, directory tree), up to the caps defined below.
+  Free-form prose outside those sections is dropped.
+* **Hard-capped.**  Default 800 tokens, configurable.  When the raw
+  report is already under cap we emit it unchanged (no churn).
+* **Format-stable.**  Output is the same "REPOSITORY EXPLORATION
+  REPORT" header the planner already knows how to read — no prompt
+  template change in agentic.py beyond passing the compressed string.
+
+Wired in at the boundary between explorer.kickoff() and the planner
+task description, so neither agent changes shape.
+"""
+from __future__ import annotations
+
+import logging
+import re
+from collections import Counter
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+from . import flags
+from .context_budget import estimate_tokens
+
+logger = logging.getLogger(__name__)
+
+FLAG_SUBAGENT_EXPLORER = "subagent_explorer"
+
+# Tunable budgets.  Centralised so a future per-provider override can
+# tighten them on small-context models without touching call sites.
+DEFAULT_TOKEN_BUDGET = 800        # planner-injection cap
+MAX_FILES_LISTED = 60             # absolute hard cap on enumerated paths
+MAX_KEY_FILES = 8
+MAX_DIRECTORY_LINES = 25
+PROSE_PARAGRAPH_CAP = 280         # chars per free-text paragraph
+
+
+@dataclass
+class CompressionReport:
+    """Returned alongside the compressed string so the caller can land
+    a Task row showing concrete before/after numbers."""
+    original_tokens: int = 0
+    compressed_tokens: int = 0
+    files_in_original: int = 0
+    files_kept: int = 0
+    truncated: bool = False
+    reason: Optional[str] = None
+
+
+@dataclass
+class _ParsedReport:
+    """Best-effort split of the explorer's free-form report into the
+    sections we actually use.  Missing sections default to empty
+    strings; we never crash on a malformed report."""
+    files_found: List[str] = field(default_factory=list)
+    key_files: List[str] = field(default_factory=list)
+    directory_structure: str = ""
+    file_types: str = ""
+    other_prose: List[str] = field(default_factory=list)
+
+
+# Regexes for the section headers the explorer's prompt template
+# instructs it to emit.  Case-insensitive on purpose — small models
+# sometimes wobble the capitalisation.
+_SECTION_RE = re.compile(
+    r"(?im)^\s*(?P<name>files\s+found|key\s+files|directory\s+structure|file\s+types|repository\s+exploration\s+report)\s*:?\s*$"
+)
+_BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+\.)\s+(?P<rest>.+?)\s*$")
+_PATH_RE = re.compile(r"[\w./\-]+\.(?:md|py|ts|tsx|js|jsx|json|yml|yaml|toml|cfg|ini|txt|rst|sh|bash|go|rs|rb|java|c|h|cpp|hpp|html|css|scss)")
+
+
+def _split_into_sections(report: str) -> _ParsedReport:
+    """Walk the report top-to-bottom, routing lines into buckets based
+    on the most-recent section header we've seen.  Tolerant: unknown
+    sections fall through to ``other_prose``."""
+    parsed = _ParsedReport()
+    current = "other"
+    for raw_line in report.splitlines():
+        line = raw_line.rstrip()
+        if not line.strip():
+            continue
+        m = _SECTION_RE.match(line)
+        if m:
+            name = m.group("name").lower().replace(" ", "_")
+            if "files_found" in name:
+                current = "files_found"
+            elif "key_files" in name:
+                current = "key_files"
+            elif "directory" in name:
+                current = "directory"
+            elif "file_types" in name:
+                current = "file_types"
+            else:
+                current = "other"
+            continue
+
+        if current == "files_found":
+            # Pull path-like tokens out of the line (handles "- file.py"
+            # and "1. file.py" and bare "file.py").
+            for path in _PATH_RE.findall(line):
+                if path not in parsed.files_found:
+                    parsed.files_found.append(path)
+            # Also catch lines that look like bullets but don't have a
+            # standard extension — we don't want to drop a "Dockerfile".
+            bm = _BULLET_RE.match(line)
+            if bm:
+                rest = bm.group("rest").strip().strip("`'\"")
+                if (rest and "/" in rest) or _looks_like_filename(rest):
+                    if rest not in parsed.files_found:
+                        parsed.files_found.append(rest)
+        elif current == "key_files":
+            bm = _BULLET_RE.match(line)
+            if bm:
+                rest = bm.group("rest").strip().strip("`'\"")
+                if rest and rest not in parsed.key_files:
+                    parsed.key_files.append(rest)
+            else:
+                # Sometimes the explorer just lists key files inline.
+                for path in _PATH_RE.findall(line):
+                    if path not in parsed.key_files:
+                        parsed.key_files.append(path)
+        elif current == "directory":
+            parsed.directory_structure += line + "\n"
+        elif current == "file_types":
+            parsed.file_types += line + "\n"
+        else:
+            parsed.other_prose.append(line)
+    return parsed
+
+
+def _looks_like_filename(s: str) -> bool:
+    """Heuristic for tokens like ``Dockerfile``, ``Makefile``, ``LICENSE``
+    that don't have an extension but are obviously file names."""
+    s = s.strip()
+    if not s or "\n" in s or " " in s:
+        return False
+    if s.startswith(".") and len(s) > 1:
+        return True   # .gitignore, .env
+    bare = {"Dockerfile", "Makefile", "LICENSE", "CHANGELOG", "NOTICE", "AUTHORS"}
+    return s in bare
+
+
+def _truncate_directory(structure: str) -> str:
+    """Keep only the first ``MAX_DIRECTORY_LINES`` lines of the
+    directory tree.  Append a marker when trimmed."""
+    lines = [ln for ln in structure.splitlines() if ln.strip()]
+    if len(lines) <= MAX_DIRECTORY_LINES:
+        return "\n".join(lines)
+    return "\n".join(lines[:MAX_DIRECTORY_LINES]) + f"\n  …{len(lines) - MAX_DIRECTORY_LINES} more entries"
+
+
+def _file_extension_histogram(files: List[str]) -> str:
+    """When the explorer hasn't produced a 'File Types' section, derive
+    one from the file list ourselves.  Cheap and always-on."""
+    counter: Counter[str] = Counter()
+    for f in files:
+        if "." in f.split("/")[-1]:
+            ext = f.rsplit(".", 1)[-1].lower()
+            counter[ext] += 1
+        else:
+            counter["(no-ext)"] += 1
+    if not counter:
+        return ""
+    return ", ".join(f"{ext}={n}" for ext, n in counter.most_common(8))
+
+
+def compress_exploration_report(
+    report: str,
+    *,
+    token_budget: int = DEFAULT_TOKEN_BUDGET,
+) -> tuple[str, CompressionReport]:
+    """Return a fixed-budget compressed form of the explorer's report.
+
+    Always returns a string the planner can read using its existing
+    template; never raises on malformed input.
+
+    Falls back to the raw report (no compression) when:
+    * the feature flag is off, OR
+    * the raw report already fits under the budget.
+    """
+    metrics = CompressionReport(
+        original_tokens=estimate_tokens(report or ""),
+    )
+    if not flags.is_on(FLAG_SUBAGENT_EXPLORER, default=True):
+        metrics.compressed_tokens = metrics.original_tokens
+        metrics.reason = "flag off"
+        return report, metrics
+
+    if not report or not report.strip():
+        metrics.reason = "empty report"
+        return report, metrics
+
+    if metrics.original_tokens <= token_budget:
+        metrics.compressed_tokens = metrics.original_tokens
+        metrics.reason = "under budget"
+        return report, metrics
+
+    parsed = _split_into_sections(report)
+    metrics.files_in_original = len(parsed.files_found)
+
+    # File list — cap and annotate if we trimmed.
+    files = parsed.files_found[:MAX_FILES_LISTED]
+    truncated_files = len(parsed.files_found) > MAX_FILES_LISTED
+    metrics.files_kept = len(files)
+    metrics.truncated = truncated_files
+
+    # Key files — preserve order, cap.
+    key = parsed.key_files[:MAX_KEY_FILES]
+
+    # Directory structure — cap to first N lines.
+    directory = _truncate_directory(parsed.directory_structure)
+
+    # File-type histogram — prefer explorer's own; else compute one.
+    file_types = parsed.file_types.strip() or _file_extension_histogram(parsed.files_found)
+
+    # Assemble the compressed report using the same header the planner
+    # already expects.
+    lines: list[str] = ["REPOSITORY EXPLORATION REPORT", "============================="]
+    lines.append("")
+    lines.append("Files Found:")
+    for path in files:
+        lines.append(f"  - {path}")
+    if truncated_files:
+        lines.append(
+            f"  …{len(parsed.files_found) - MAX_FILES_LISTED} more files. "
+            "Use 'Find files matching a pattern' or 'Search file contents' "
+            "to drill down."
+        )
+    if key:
+        lines.append("")
+        lines.append("Key Files:")
+        for k in key:
+            lines.append(f"  - {k}")
+    if directory:
+        lines.append("")
+        lines.append("Directory Structure:")
+        lines.append(directory.rstrip())
+    if file_types:
+        lines.append("")
+        lines.append(f"File Types: {file_types}")
+
+    compressed = "\n".join(lines)
+    metrics.compressed_tokens = estimate_tokens(compressed)
+
+    # If our compression somehow blew the budget (very pathological
+    # input), trim from the file list as the last resort.
+    while metrics.compressed_tokens > token_budget and len(files) > 5:
+        files = files[: max(5, int(len(files) * 0.75))]
+        metrics.files_kept, metrics.truncated = len(files), True
+        rebuilt = [
+            "REPOSITORY EXPLORATION REPORT", "=============================", "",
+            "Files Found:",
+            *(f"  - {p}" for p in files),
+            f"  …{len(parsed.files_found) - len(files)} more files. "
+            "Use 'Find files matching a pattern' or 'Search file contents' to drill down.",
+        ]
+        if key:
+            rebuilt.extend(["", "Key Files:", *(f"  - {k}" for k in key)])
+        if directory:
+            rebuilt.extend(["", "Directory Structure:", directory.rstrip()])
+        if file_types:
+            rebuilt.extend(["", f"File Types: {file_types}"])
+        compressed = "\n".join(rebuilt)
+        metrics.compressed_tokens = estimate_tokens(compressed)
+
+    return compressed, metrics
+
+
+__all__ = [
+    "FLAG_SUBAGENT_EXPLORER",
+    "DEFAULT_TOKEN_BUDGET",
+    "CompressionReport",
+    "compress_exploration_report",
+]
diff --git a/gitpilot/grep_backend.py b/gitpilot/grep_backend.py
new file mode 100644
index 0000000..78db1f7
--- /dev/null
+++ b/gitpilot/grep_backend.py
@@ -0,0 +1,275 @@
+"""Grep backend — pure-Python regex search across repo files.
+
+Powers the ``Search file contents`` agent tool.  Designed for local
+on-prem use first; no shell invocation, no required external
+dependency.  When ``ripgrep`` is present it is used as a fast path
+(run via subprocess args); otherwise we fall back to a pure-Python
+loop that's still fast on the typical GitPilot repo (a few hundred files).
+
+Contract (pinned by the tests):
+
+* Returns a :class:`GrepResult` whose ``hits`` are :class:`GrepHit`
+  records (``path``, ``line``, ``match``).  Results are capped at
+  ``max_results``; ``truncated=True`` tells the caller to refine.
+* Result order: stable — files are sorted, lines within a file are
+  in ascending order.  Reproducible runs for tests.
+
+Security:
+* The pattern is a regular expression (validated up front).  No
+  shell injection: we never pass it to a shell when using the rg
+  binary — only via subprocess args.
+* The ``path_pattern`` filter goes through the same glob → regex
+  translator used by Batch B1, so the same `/`-aware semantics apply.
+"""
+from __future__ import annotations
+
+import logging
+import re
+import shutil
+import subprocess
+from dataclasses import dataclass, field
+from typing import Iterable, List, Optional
+
+logger = logging.getLogger(__name__)
+
+# Hard cap — never return more than this regardless of caller value.
+GREP_HARD_MAX_RESULTS = 500
+GREP_DEFAULT_MAX_RESULTS = 100
+RIPGREP_TIMEOUT_S = 10
+
+
+@dataclass
+class GrepHit:
+    path: str
+    line: int
+    match: str
+
+
+@dataclass
+class GrepResult:
+    hits: List[GrepHit] = field(default_factory=list)
+    truncated: bool = False
+    backend: str = "python"   # "ripgrep" | "python"
+    error: Optional[str] = None
+
+
+# ----------------------------------------------------------------------
+# Public entry point
+# ----------------------------------------------------------------------
+
+def grep(
+    files: dict[str, str],
+    pattern: str,
+    *,
+    case_insensitive: bool = False,
+    max_results: int = GREP_DEFAULT_MAX_RESULTS,
+    path_filter: Optional[re.Pattern[str]] = None,
+) -> GrepResult:
+    """Run a regex search over the supplied (path → content) mapping.
+
+    The caller is responsible for assembling the file map — for the
+    GitHub-only path that means downloading the relevant files first;
+    for the local-checkout path that's just ``Path.read_text`` per
+    matching file.  Keeping the backend file-source-agnostic lets us
+    test it without touching GitHub or the disk.
+    """
+    cap = max(1, min(GREP_HARD_MAX_RESULTS, int(max_results)))
+
+    try:
+        flags = re.IGNORECASE if case_insensitive else 0
+        rx = re.compile(pattern, flags)
+    except re.error as exc:
+        return GrepResult(error=f"invalid regex: {exc}")
+
+    hits: List[GrepHit] = []
+    for path in sorted(files.keys()):
+        if path_filter is not None and not path_filter.match(path):
+            continue
+        content = files[path]
+        if not content:
+            continue
+        for lineno, line in enumerate(content.splitlines(), start=1):
+            if rx.search(line):
+                hits.append(GrepHit(path=path, line=lineno, match=line.rstrip()))
+                if len(hits) >= cap:
+                    return GrepResult(hits=hits, truncated=True, backend="python")
+
+    return GrepResult(hits=hits, truncated=False, backend="python")
+
+
+# ----------------------------------------------------------------------
+# Local-checkout fast path: shell out to ripgrep when available
+# ----------------------------------------------------------------------
+
+def grep_local(
+    workdir: str,
+    pattern: str,
+    *,
+    case_insensitive: bool = False,
+    max_results: int = GREP_DEFAULT_MAX_RESULTS,
+    glob_filter: Optional[str] = None,
+) -> GrepResult:
+    """Search files under ``workdir`` using ripgrep if available,
+    falling back to a pure-Python walk otherwise.
+
+    Used for the local-checkout / local-git modes.  GitHub-only
+    sessions go through :func:`grep` instead because they don't have
+    a tree on disk.
+    """
+    if shutil.which("rg"):
+        return _grep_via_ripgrep(
+            workdir,
+            pattern,
+            case_insensitive=case_insensitive,
+            max_results=max_results,
+            glob_filter=glob_filter,
+        )
+    # Pure-Python fallback — walk the tree, read each file, match.
+    # Kept here (rather than in the GitHub helper) because the local
+    # path benefits from a file-handle-streaming walk that doesn't
+    # materialise the whole repo into memory.
+    return _grep_via_python_walk(
+        workdir,
+        pattern,
+        case_insensitive=case_insensitive,
+        max_results=max_results,
+        glob_filter=glob_filter,
+    )
+
+
+def _grep_via_ripgrep(
+    workdir: str,
+    pattern: str,
+    *,
+    case_insensitive: bool,
+    max_results: int,
+    glob_filter: Optional[str],
+) -> GrepResult:
+    cap = max(1, min(GREP_HARD_MAX_RESULTS, int(max_results)))
+    argv = [
+        "rg",
+        "--no-config",         # ignore user's ~/.ripgreprc
+        "--no-heading",
+        "--line-number",
+        "--with-filename",
+        "--color", "never",
+        "--max-count", str(cap),
+        # Do not pass --text: it forces binary files to be searched as
+        # text.  ripgrep skips binaries by default, which keeps binary
+        # noise out of the results (as Claude Code / Cursor do).
+    ]
+    if case_insensitive:
+        argv.append("-i")
+    if glob_filter:
+        argv.extend(["-g", glob_filter])
+    argv.extend(["--", pattern, workdir])
+
+    try:
+        proc = subprocess.run(
+            argv,
+            capture_output=True,
+            text=True,
+            timeout=RIPGREP_TIMEOUT_S,
+            check=False,
+        )
+    except subprocess.TimeoutExpired:
+        return GrepResult(
+            error=f"ripgrep timed out after {RIPGREP_TIMEOUT_S}s",
+            backend="ripgrep",
+        )
+    except FileNotFoundError:
+        # rg disappeared between which() and run() — degrade gracefully.
+        return _grep_via_python_walk(
+            workdir, pattern,
+            case_insensitive=case_insensitive,
+            max_results=max_results,
+            glob_filter=glob_filter,
+        )
+
+    # rg exits 1 when there are zero matches — that's not an error.
+    if proc.returncode not in (0, 1):
+        err = proc.stderr.strip().splitlines()
+        return GrepResult(error="; ".join(err[:3]) if err else "ripgrep failed", backend="ripgrep")
+
+    hits: List[GrepHit] = []
+    truncated = False
+    for raw in proc.stdout.splitlines():
+        # Format: <path>:<lineno>:<match>
+        parts = raw.split(":", 2)
+        if len(parts) < 3:
+            continue
+        path, lineno_s, match = parts
+        try:
+            lineno = int(lineno_s)
+        except ValueError:
+            continue
+        # Trim the leading workdir prefix so paths look repo-relative.
+        if path.startswith(workdir + "/"):
+            path = path[len(workdir) + 1:]
+        hits.append(GrepHit(path=path, line=lineno, match=match.rstrip()))
+        if len(hits) >= cap:
+            truncated = True
+            break
+    return GrepResult(hits=hits, truncated=truncated, backend="ripgrep")
+
+
+def _grep_via_python_walk(
+    workdir: str,
+    pattern: str,
+    *,
+    case_insensitive: bool,
+    max_results: int,
+    glob_filter: Optional[str],
+) -> GrepResult:
+    import pathlib
+
+    try:
+        flags = re.IGNORECASE if case_insensitive else 0
+        rx = re.compile(pattern, flags)
+    except re.error as exc:
+        return GrepResult(error=f"invalid regex: {exc}")
+
+    # Local import to avoid a hard module-load dep when grep isn't used.
+    from .agent_tools import _glob_to_regex
+
+    pf = _glob_to_regex(glob_filter) if glob_filter else None
+    cap = max(1, min(GREP_HARD_MAX_RESULTS, int(max_results)))
+    root = pathlib.Path(workdir)
+    hits: List[GrepHit] = []
+    # Walk deterministically so tests are reproducible.
+    for path in sorted(root.rglob("*")):
+        if not path.is_file():
+            continue
+        rel = path.relative_to(root).as_posix()
+        if pf is not None and not pf.match(rel):
+            continue
+        try:
+            content = path.read_text(encoding="utf-8", errors="replace")
+        except (OSError, UnicodeDecodeError):
+            continue
+        for lineno, line in enumerate(content.splitlines(), start=1):
+            if rx.search(line):
+                hits.append(GrepHit(path=rel, line=lineno, match=line.rstrip()))
+                if len(hits) >= cap:
+                    return GrepResult(hits=hits, truncated=True, backend="python")
+    return GrepResult(hits=hits, truncated=False, backend="python")
+
+
+# ----------------------------------------------------------------------
+# Formatter for the agent tool wrapper
+# ----------------------------------------------------------------------
+
+def format_result(result: GrepResult, *, pattern: str) -> str:
+    if result.error:
+        return f"Error: {result.error}"
+    if not result.hits:
+        return f"No matches for pattern: {pattern}"
+    lines = [f"Found {len(result.hits)} match(es) for: {pattern}"]
+    for hit in result.hits:
+        lines.append(f"  {hit.path}:{hit.line}: {hit.match[:200]}")
+    if result.truncated:
+        lines.append(
+            f"…truncated at {len(result.hits)} hits. "
+            "Narrow the pattern or pass max_results to see more."
+        )
+    return "\n".join(lines)
diff --git a/gitpilot/local_tools.py b/gitpilot/local_tools.py
index 05b18d2..476b0cd 100644
--- a/gitpilot/local_tools.py
+++ b/gitpilot/local_tools.py
@@ -165,11 +165,27 @@ def git_log(count: str = "10") -> str:
 def run_command(command: str, timeout: str = "120") -> str:
     """Run a shell command in the workspace directory.
     Returns stdout, stderr, and exit code.
-    Examples: 'npm test', 'python -m pytest', 'make build', 'ls -la'."""
+    Examples: 'npm test', 'python -m pytest', 'make build', 'ls -la'.
+
+    When the user has selected a non-local sandbox in Settings (e.g.
+    MatrixLab), this tool transparently delegates to that backend so
+    the agent's autonomous build/test loop runs in the same isolation
+    the chat UI's Run button uses. With the default ``subprocess``
+    backend the call still goes through :class:`SubprocessSandbox`,
+    which jails cwd to the workspace and scrubs secrets — strictly
+    stronger than the previous host-direct path."""
     ws = _require_workspace()
+    timeout_int = _coerce_timeout(timeout)
+    try:
+        # Prefer the configured sandbox.  Falls back to the legacy
+        # TerminalSession path only on import errors so an environment
+        # without httpx still runs the agent (existing behaviour).
+        return _run_via_sandbox(command, timeout_int, ws.path)
+    except _SandboxFallback:
+        pass
     try:
         session = TerminalSession(workspace_path=ws.path)
-        result = _run_async(_executor.execute(session, command, int(timeout)))
+        result = _run_async(_executor.execute(session, command, timeout_int))
         output = f"Exit code: {result.exit_code}\n"
         if result.stdout:
             output += f"--- stdout ---\n{result.stdout}\n"
@@ -186,6 +202,199 @@ def run_command(command: str, timeout: str = "120") -> str:
         return f"Error: {e}"
 
 
+@tool("Run code in sandbox")
+def run_in_sandbox(language: str, code: str, timeout: str = "120") -> str:
+    """Execute a self-contained code snippet in the configured sandbox.
+
+    Use this when you want to verify that code you produced *actually
+    works* before handing it back to the user — write the snippet,
+    call this tool, read the captured stdout / stderr / exit code,
+    and iterate. The snippet runs in an ephemeral tempdir (not the
+    workspace), so file-system side effects don't pollute the repo.
+
+    Supported languages: python, javascript (or js/node), bash (or
+    sh/shell). Returns a single text block with the backend (local
+    subprocess / MatrixLab / pass-through), exit code, duration,
+    stdout, and stderr so you can tell which sandbox ran the snippet.
+
+    Error retrieval is the point of this tool: when the snippet
+    fails, the full stderr trace comes back verbatim — the agent
+    should read it, decide how to fix the bug, and re-run."""
+    timeout_int = _coerce_timeout(timeout)
+    try:
+        return _run_snippet_via_sandbox(language, code, timeout_int)
+    except _SandboxFallback as exc:
+        return f"Error: sandbox unavailable ({exc}); cannot run snippet."
+
+
+# ---------------------------------------------------------------------
+# Sandbox helpers
+# ---------------------------------------------------------------------
+
+class _SandboxFallback(Exception):
+    """Raised when the sandbox path is unusable and the caller should
+    fall back to the legacy TerminalSession executor."""
+
+
+def _coerce_timeout(value: object) -> int:
+    try:
+        n = int(str(value))
+    except (TypeError, ValueError):
+        return 120
+    if n <= 0:
+        return 120
+    return min(n, 600)
+
+
+def _format_sandbox_output(result, label: str) -> str:
+    """Render a SandboxResult / SandboxRunResponse-shaped object as the
+    same text block the agent has been reading from ``run_command``,
+    so existing prompt parsing keeps working. ``Sandbox:`` and
+    ``Command:`` header lines are prepended so the agent (and the user
+    reading the trace) can see which sandbox ran the command."""
+    backend = getattr(result, "backend", None) or "subprocess"
+    pretty = {
+        "subprocess": "local subprocess",
+        "matrixlab": "MatrixLab",
+        "off": "pass-through (host)",
+    }.get(backend, backend)
+    lines = [f"Sandbox: {pretty}", f"Command: {label}", f"Exit code: {result.exit_code}"]
+    if getattr(result, "duration_ms", None) is not None:
+        lines.append(f"Duration: {result.duration_ms} ms")
+    if result.stdout:
+        lines.append("--- stdout ---")
+        lines.append(result.stdout)
+    if result.stderr:
+        lines.append("--- stderr ---")
+        lines.append(result.stderr)
+    if getattr(result, "timed_out", False):
+        lines.append("WARNING: Command timed out")
+    if getattr(result, "truncated", False):
+        lines.append("WARNING: Output was truncated")
+    sbid = getattr(result, "sandbox_id", None)
+    if sbid:
+        lines.append(f"sandbox_id: {sbid}")
+    return "\n".join(lines) + "\n"
+
+
+def _run_via_sandbox(command: str, timeout: int, workspace_path) -> str:
+    """Route ``run_command`` through the configured sandbox backend.
+
+    Raises :class:`_SandboxFallback` so the caller can drop to the
+    legacy TerminalSession path if the sandbox can't be constructed
+    (e.g. httpx missing in a stripped runtime)."""
+    try:
+        from pathlib import Path
+
+        from .sandbox import (
+            BACKEND_MATRIXLAB,
+            BACKEND_OFF,
+            BACKEND_SUBPROCESS,
+            MatrixLabSandbox,
+            NullSandbox,
+            SandboxPolicy,
+            SandboxRunError,
+            SandboxUnavailableError,
+            SubprocessSandbox,
+        )
+        from .settings import get_settings
+    except Exception as exc:  # noqa: BLE001
+        raise _SandboxFallback(str(exc)) from exc
+
+    cfg = get_settings().sandbox
+    backend = (cfg.backend or BACKEND_SUBPROCESS).strip().lower()
+
+    # MatrixLab's /repo/run endpoint requires a real ``repo_url`` —
+    # that's the contract for cloning + running CI against a remote
+    # repo, not for arbitrary in-workspace shell commands.  Route
+    # workspace commands through the snippet path instead (POST
+    # /api/sandbox/run with language=bash), which already dispatches
+    # to MatrixLab /code/run.  Keeps the agent's run_command working
+    # whichever backend the user picked.
+    if backend == BACKEND_MATRIXLAB:
+        return _run_snippet_via_sandbox("bash", command, timeout)
+
+    policy = SandboxPolicy(
+        workspace=Path(workspace_path),
+        timeout_sec=timeout,
+        allow_network=cfg.allow_network,
+        image=cfg.matrixlab_image or None,
+    )
+    if backend == BACKEND_OFF:
+        sb = NullSandbox(policy)
+    else:
+        sb = SubprocessSandbox(policy)
+
+    # Run + close in a SINGLE event loop.  MatrixLabSandbox would
+    # lazily build an httpx.AsyncClient on first use; closing it in a
+    # different loop than it was created in is the textbook asyncio
+    # antipattern (RuntimeError: Event loop is closed).  Two separate
+    # asyncio.run() calls would do exactly that.  (For the
+    # subprocess/null path the close is a no-op, but keeping the
+    # pattern uniform means future backends can rely on it.)
+    async def _run_and_close():
+        try:
+            return await sb.run(command, timeout=timeout)
+        finally:
+            aclose = getattr(sb, "aclose", None)
+            if aclose is not None:
+                try:
+                    await aclose()
+                except Exception:  # noqa: BLE001
+                    pass
+    try:
+        result = _run_async(_run_and_close())
+    except SandboxUnavailableError as exc:
+        return f"Error: sandbox backend {backend!r} unreachable: {exc}\n"
+    except SandboxRunError as exc:
+        return f"Error: sandbox backend {backend!r} reported an error: {exc}\n"
+    except PermissionError as exc:
+        return f"Permission denied by sandbox policy: {exc}\n"
+    return _format_sandbox_output(result, command)
+
+
+def _run_snippet_via_sandbox(language: str, code: str, timeout: int) -> str:
+    """Execute a fenced snippet by POSTing to GitPilot's own
+    /api/sandbox/run endpoint so the agent and the chat UI share one
+    code path. Going via the HTTP surface (rather than reaching into
+    sandbox_api internals) keeps the lifecycle / cleanup behaviour
+    identical between the two callers."""
+    try:
+        import os
+
+        import httpx
+    except Exception as exc:  # noqa: BLE001
+        raise _SandboxFallback(str(exc)) from exc
+
+    port = os.environ.get("GITPILOT_PORT") or "8765"
+    base = os.environ.get("GITPILOT_INTERNAL_URL") or f"http://127.0.0.1:{port}"
+    body = {"language": language, "code": code, "timeout_sec": timeout}
+    try:
+        with httpx.Client(timeout=timeout + 10) as client:
+            resp = client.post(f"{base}/api/sandbox/run", json=body)
+    except httpx.HTTPError as exc:
+        return f"Error: could not reach the in-process sandbox API: {exc}\n"
+    if resp.status_code >= 400:
+        try:
+            detail = resp.json().get("detail", resp.text)
+        except Exception:  # noqa: BLE001
+            detail = resp.text
+        return f"Sandbox error ({resp.status_code}): {detail}\n"
+    data = resp.json()
+
+    class _R:
+        backend = data.get("backend")
+        exit_code = data.get("exit_code")
+        stdout = data.get("stdout", "")
+        stderr = data.get("stderr", "")
+        duration_ms = data.get("duration_ms")
+        timed_out = data.get("timed_out", False)
+        truncated = data.get("truncated", False)
+        sandbox_id = data.get("sandbox_id")
+
+    return _format_sandbox_output(_R(), f"{language} <snippet>")
+
+
 # -----------------------------------------------------------------------
 # Exports
 # -----------------------------------------------------------------------
@@ -207,6 +416,7 @@ def run_command(command: str, timeout: str = "120") -> str:
 
 LOCAL_SHELL_TOOLS = [
     run_command,
+    run_in_sandbox,
 ]
 
 LOCAL_TOOLS = LOCAL_FILE_TOOLS + LOCAL_GIT_TOOLS + LOCAL_SHELL_TOOLS
diff --git a/gitpilot/query_router.py b/gitpilot/query_router.py
new file mode 100644
index 0000000..b78f431
--- /dev/null
+++ b/gitpilot/query_router.py
@@ -0,0 +1,456 @@
+"""Deterministic query router (Batch B9).
+
+Classifies a user goal into one of a handful of *intents* (fix /
+find / info / create / delete / modify), extracts any files the user
+mentioned, and emits a strategy hint the planner can either follow
+or override.
+
+Pure Python, no LLM call.  The point is that **small local models
+(llama3:8b)** sometimes fail to pick the right tool even with rich
+descriptions — a deterministic pre-router keeps them on the rails.
+Big models can ignore the hint when their judgment is better than
+the heuristic; we treat it as advisory, not constraining.
+
+Auto-RAG decision:
+* The router signals ``auto_index_repo=True`` only when the query
+  is *fuzzy* (natural-language, no symbol tokens, no path mentions),
+  the intent is find / fix / modify / unknown, the repo is not known
+  to be small (< 50 files), AND a RAG index doesn't already exist.
+* When consent has not been granted yet, the API layer turns that
+  signal into an INDEX plan step (see Batch B9 design).  When
+  consent IS granted, the API layer auto-builds in the background.
+
+This module returns the *decision*; the API layer translates it
+into either a plan step or a background task.
+"""
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass, field
+from typing import List, Literal, Optional, Sequence
+
+# ----------------------------------------------------------------------
+# Constants — exposed so tests can pin the heuristics
+# ----------------------------------------------------------------------
+
+INTENT_LITERALS = (
+    "fix",
+    "find",
+    "info",
+    "create",
+    "delete",
+    "modify",
+    "unknown",
+)
+
+# Per-intent trigger words, lowercased.  Order in this table = priority
+# when several intents match (first hit wins).  "fix" is listed before
+# "modify" because every fix is a modify but not every modify is a fix.
+_INTENT_TRIGGERS: list[tuple[str, tuple[str, ...]]] = [
+    ("fix",    ("fix ", "bug", " error", "broken", "doesn't work",
+                "doesnt work", "crash", "traceback", "exception",
+                "fails", "failing", "regression")),
+    ("delete", ("delete ", "remove ", "drop ", "get rid of",
+                "uninstall", "clean up")),
+    ("create", ("create ", "add ", "generate ", "new file",
+                "write a new", "build a", "make a", "scaffold")),
+    ("modify", ("modify ", "change ", "update ", "rename ", "refactor ",
+                "replace ", "rewrite ", "convert ", "migrate ", "edit ")),
+    ("find",   ("where ", "find ", "search ", "locate ", "show me ",
+                "list ", "which file", "look for")),
+    ("info",   ("what is ", "what does ", "explain ", "describe ",
+                "how does ", "how do ", "tell me ", "summari",
+                "overview", "what do you think")),
+]
+
+# Repo-file extension whitelist for path extraction: "real"
+# extensions plus well-known extensionless files.  Intended to keep
+# the extractor from grabbing words like "asyncio"; note that
+# ``_extract_path_candidates`` below does not consult it yet.
+_PATH_EXTENSIONS = (
+    "py", "ts", "tsx", "js", "jsx", "mjs", "cjs",
+    "go", "rs", "rb", "java", "kt", "scala", "swift",
+    "c", "h", "cpp", "hpp", "cc", "cxx",
+    "md", "rst", "txt", "html", "css", "scss", "sass",
+    "json", "yml", "yaml", "toml", "ini", "cfg",
+    "sh", "bash", "zsh", "fish",
+    "sql", "graphql", "proto",
+)
+_EXTENSIONLESS_KEY_FILES = (
+    "Dockerfile", "Makefile", "LICENSE", "CHANGELOG", "NOTICE",
+    "AUTHORS", "Procfile", "Gemfile", "Rakefile", ".gitignore",
+    ".env", ".env.example", ".dockerignore",
+)
+
+# Indentation-sensitive extensions — the planner gets a stronger hint
+# to use Edit (surgical) rather than Write (regenerate).
+_INDENTATION_SENSITIVE_EXTS = (
+    "py", "yml", "yaml", "haml", "slim", "pug", "jade",
+)
+
+# Generated / lock files we refuse to MODIFY.
+_FORBIDDEN_EDIT_EXTS = (
+    "lock", "min.js", "min.css",
+)
+_FORBIDDEN_EDIT_NAMES = (
+    "poetry.lock", "package-lock.json", "yarn.lock",
+    "Cargo.lock", "Gemfile.lock", "go.sum",
+)
+
+# Quoted-path regex covers `'README.md'`, `"src/main.py"`, `` `LICENSE` ``.
+_QUOTED_PATH_RE = re.compile(r"""[`'"]([\w./\-]+)[`'"]""")
+# Bareword path: ``src/main.py``, ``README.md``.  Must contain at
+# least one ".ext" or "/".
+_BAREWORD_PATH_RE = re.compile(
+    r"(?<![\w/])([\w./\-]*[A-Za-z0-9](?:/[\w.\-]+)+|[\w.\-]+\.[A-Za-z][A-Za-z0-9]+)(?!\w)"
+)
+
+# A query is "fuzzy" if it reads like natural language (≥4 words,
+# no exact symbols, no path-shaped tokens).  Tuned by experimentation
+# on representative dev prompts.
+_FUZZY_MIN_WORDS = 4
+_SYMBOL_RE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+)+|[a-z_]+_[a-z_]+|[A-Z]{2,}")
+
+
+# ----------------------------------------------------------------------
+# Public types
+# ----------------------------------------------------------------------
+
+Intent = Literal["fix", "find", "info", "create", "delete", "modify", "unknown"]
+EditStrategy = Literal["surgical", "regenerate", "reject"]
+
+
+@dataclass(frozen=True)
+class RouterDecision:
+    """Everything the planner / executor needs to pick a strategy."""
+
+    intent: Intent
+    target_files: List[str] = field(default_factory=list)
+    tool_priority: List[str] = field(default_factory=list)
+    rag_recommended: bool = False
+    auto_index_repo: bool = False
+    edit_strategy: EditStrategy = "surgical"
+    file_policy_notes: str = ""
+    rationale: str = ""        # one-liner for the Tasks panel
+    repo_too_small_for_rag: bool = False
+
+
+# ----------------------------------------------------------------------
+# Heuristics
+# ----------------------------------------------------------------------
+
+def _detect_intent(goal: str) -> Intent:
+    q = " " + goal.lower() + " "
+    for intent, triggers in _INTENT_TRIGGERS:
+        for t in triggers:
+            if t in q:
+                return intent  # type: ignore[return-value]
+    return "unknown"
+
+
+def _extract_path_candidates(goal: str) -> List[str]:
+    """Pull every plausibly-file-shaped token out of the goal."""
+    candidates: list[str] = []
+    seen: set[str] = set()
+
+    def _push(tok: str) -> None:
+        tok = tok.strip().strip(".,:;()")
+        if not tok or tok in seen:
+            return
+        seen.add(tok)
+        candidates.append(tok)
+
+    for m in _QUOTED_PATH_RE.findall(goal):
+        _push(m)
+    for m in _BAREWORD_PATH_RE.findall(goal):
+        _push(m)
+    return candidates
+
+
+def _verify_against_repo(
+    candidates: Sequence[str],
+    repo_files: Optional[Sequence[str]],
+) -> List[str]:
+    """Drop candidates that don't exist in the repo.
+
+    Match strategy: exact path, then basename, then case-insensitive
+    full path.  Returns the canonical repo path so the prompt uses the
+    file's real casing.  Without ``repo_files``, returns all candidates.
+    """
+    if not repo_files:
+        return list(candidates)
+    file_set = set(repo_files)
+    basename_map: dict[str, str] = {}
+    for p in repo_files:
+        basename = p.rsplit("/", 1)[-1]
+        basename_map.setdefault(basename, p)
+    out: list[str] = []
+    for tok in candidates:
+        if tok in file_set:
+            out.append(tok)
+        elif tok in basename_map:
+            out.append(basename_map[tok])
+        # Case-insensitive last-chance match.
+        else:
+            lower = tok.lower()
+            for p in repo_files:
+                if p.lower() == lower:
+                    out.append(p)
+                    break
+    return out
+
+
+def _ext_of(path: str) -> str:
+    name = path.rsplit("/", 1)[-1]
+    if name in _EXTENSIONLESS_KEY_FILES:
+        return ""
+    # Special-case multi-dot extensions before the simple split.
+    lname = name.lower()
+    if lname.endswith(".min.js"):
+        return "min.js"
+    if lname.endswith(".min.css"):
+        return "min.css"
+    if "." not in name:
+        return ""
+    return name.rsplit(".", 1)[-1].lower()
+
+
+def _is_fuzzy(goal: str) -> bool:
+    """A query is fuzzy when it reads like natural language — no
+    symbol-shaped tokens, no path mentions, ≥ N content words."""
+    g = goal.strip()
+    if not g:
+        return False
+    if _QUOTED_PATH_RE.search(g) or _BAREWORD_PATH_RE.search(g):
+        return False
+    if _SYMBOL_RE.search(g):
+        return False
+    words = [w for w in re.split(r"\s+", g) if len(w) > 2]
+    return len(words) >= _FUZZY_MIN_WORDS
+
+
+def _looks_like_symbol_search(goal: str) -> bool:
+    return bool(_SYMBOL_RE.search(goal))
+
+
+def _file_policy_notes(target_files: Sequence[str]) -> tuple[EditStrategy, str]:
+    """Roll up per-file policy into one human-readable note for the
+    planner prompt and one machine-readable strategy."""
+    if not target_files:
+        return "surgical", ""
+
+    forbidden = []
+    indentation_sensitive = []
+    for p in target_files:
+        ext = _ext_of(p)
+        name = p.rsplit("/", 1)[-1]
+        if ext in _FORBIDDEN_EDIT_EXTS or name in _FORBIDDEN_EDIT_NAMES:
+            forbidden.append(p)
+        elif ext in _INDENTATION_SENSITIVE_EXTS:
+            indentation_sensitive.append(p)
+
+    if forbidden:
+        notes = (
+            f"Refuse to MODIFY: {', '.join(forbidden)} — these are "
+            "generated / lock files.  Edit the source manifest instead."
+        )
+        return "reject", notes
+
+    if indentation_sensitive:
+        notes = (
+            "Use 'Edit a section of a file' (surgical) — the file "
+            "extension is indentation-sensitive.  Quote leading "
+            "whitespace exactly when constructing old_string."
+        )
+        return "surgical", notes
+
+    return "surgical", "Use 'Edit a section of a file' for any MODIFY action."
+
+
+# ----------------------------------------------------------------------
+# Public entry point
+# ----------------------------------------------------------------------
+
+DEFAULT_FUZZY_REPO_SIZE_FOR_RAG = 50
+
+
+def classify(
+    goal: str,
+    *,
+    repo_files: Optional[Sequence[str]] = None,
+    rag_index_exists: bool = False,
+    force_no_rag: bool = False,
+) -> RouterDecision:
+    """Classify a user goal into a :class:`RouterDecision`.
+
+    Pure: same inputs → identical outputs, no I/O, no LLM call.
+
+    ``repo_files`` is optional but recommended — without it we can't
+    verify that the files the user mentioned actually exist.
+    """
+    if not goal or not goal.strip():
+        return RouterDecision(
+            intent="unknown",
+            rationale="empty goal",
+            tool_priority=["Get repository summary"],
+        )
+
+    intent = _detect_intent(goal)
+    raw_candidates = _extract_path_candidates(goal)
+    targets = _verify_against_repo(raw_candidates, repo_files)
+    fuzzy = _is_fuzzy(goal)
+    repo_size = len(repo_files) if repo_files is not None else 0
+    too_small_for_rag = 0 < repo_size < DEFAULT_FUZZY_REPO_SIZE_FOR_RAG
+
+    # RAG / semantic search is only useful for *read-leaning* intents
+    # — finding something, diagnosing a fix, refactoring across files.
+    # Informational queries are answered from the repo map (no need
+    # for vectors); create / delete are structural and benefit from
+    # Glob, not embeddings.
+    _rag_eligible_intent = intent in ("find", "fix", "modify", "unknown")
+    rag_recommended = (
+        fuzzy
+        and _rag_eligible_intent
+        and not force_no_rag
+        and not too_small_for_rag
+    )
+    auto_index_repo = (
+        rag_recommended
+        and not rag_index_exists
+        and not too_small_for_rag
+    )
+
+    edit_strategy, file_notes = _file_policy_notes(targets)
+
+    tools: list[str]
+    if intent == "info":
+        # Informational: read README + repo map, no plan.
+        tools = ["Read file content", "Get repository summary"]
+    elif intent in ("fix", "modify") and targets:
+        tools = ["Read file content", "Edit a section of a file"]
+        if edit_strategy == "reject":
+            tools = ["Read file content"]
+    elif intent in ("fix", "modify") and not targets:
+        # Need to find the file first.
+        if rag_recommended:
+            tools = [
+                "Find code by semantic search",
+                "Search file contents",
+                "Read file content",
+                "Edit a section of a file",
+            ]
+        else:
+            tools = [
+                "Search file contents",
+                "Read file content",
+                "Edit a section of a file",
+            ]
+    elif intent == "find":
+        if rag_recommended:
+            tools = ["Find code by semantic search", "Search file contents",
+                     "Read file content"]
+        elif _looks_like_symbol_search(goal):
+            tools = ["Search file contents", "Read file content"]
+        else:
+            tools = ["Find files matching a pattern", "Search file contents",
+                     "Read file content"]
+    elif intent == "create":
+        tools = ["Get repository summary", "Read file content",
+                 "Write or update a file in the repository"]
+    elif intent == "delete":
+        tools = ["Find files matching a pattern",
+                 "Delete a file from the repository"]
+    else:
+        # unknown — default to the safe exploration set.
+        tools = [
+            "Get repository summary",
+            "Find files matching a pattern",
+            "Search file contents",
+            "Read file content",
+        ]
+
+    rationale = _build_rationale(
+        intent=intent, targets=targets, rag=rag_recommended,
+        auto_index=auto_index_repo, fuzzy=fuzzy,
+    )
+
+    return RouterDecision(
+        intent=intent,
+        target_files=targets,
+        tool_priority=tools,
+        rag_recommended=rag_recommended,
+        auto_index_repo=auto_index_repo,
+        edit_strategy=edit_strategy,
+        file_policy_notes=file_notes,
+        rationale=rationale,
+        repo_too_small_for_rag=too_small_for_rag,
+    )
+
+
+def _build_rationale(
+    *, intent: Intent, targets: Sequence[str], rag: bool,
+    auto_index: bool, fuzzy: bool,
+) -> str:
+    parts = [f"intent={intent}"]
+    if targets:
+        parts.append(f"targets={','.join(targets[:3])}")
+    if rag:
+        parts.append("rag=preferred")
+    if auto_index:
+        parts.append("auto-index=requested")
+    if fuzzy and not rag:
+        parts.append("fuzzy")
+    return " · ".join(parts)
+
+
+# ----------------------------------------------------------------------
+# Hint rendering — what the planner sees inside its prompt
+# ----------------------------------------------------------------------
+
+def render_planner_hint(decision: RouterDecision) -> str:
+    """Render the decision as a small markdown block to splice into
+    the planner's context_pack.  Advisory tone — the planner may
+    override when context demands it."""
+    lines = [
+        "## ROUTING HINT (advisory — override if the goal demands more)",
+        f"- Intent: **{decision.intent}**",
+    ]
+    if decision.target_files:
+        lines.append(
+            "- Likely target files: " + ", ".join(
+                f"`{p}`" for p in decision.target_files[:5]
+            )
+        )
+    if decision.tool_priority:
+        lines.append(
+            "- Preferred tools (in order): "
+            + " → ".join(f"`{t}`" for t in decision.tool_priority)
+        )
+    if decision.file_policy_notes:
+        lines.append(f"- File policy: {decision.file_policy_notes}")
+    if decision.rag_recommended:
+        lines.append(
+            "- Semantic search recommended for this fuzzy query.  "
+            "Prefer `Find code by semantic search` before `Search file contents`."
+        )
+    if decision.auto_index_repo:
+        lines.append(
+            "- A semantic index has not been built for this repo yet.  "
+            "Include a Step 1 with action `INDEX` so the user can "
+            "approve the one-time build (~30 s, local, ~12 MB)."
+        )
+    if decision.intent == "info":
+        lines.append(
+            "- This is an informational query.  Produce a plan with "
+            "READ-only file actions and a substantive summary — do NOT "
+            "create / modify / delete files."
+        )
+    return "\n".join(lines)
+
+
+__all__ = [
+    "DEFAULT_FUZZY_REPO_SIZE_FOR_RAG",
+    "RouterDecision",
+    "classify",
+    "render_planner_hint",
+]
diff --git a/gitpilot/rag/__init__.py b/gitpilot/rag/__init__.py
new file mode 100644
index 0000000..60aaada
--- /dev/null
+++ b/gitpilot/rag/__init__.py
@@ -0,0 +1,64 @@
+"""GitPilot local RAG pipeline (Batch B7).
+
+On-prem-first design — no cloud calls, no API keys.  Defaults to
+ChromaDB's bundled MiniLM-L6-v2 (downloaded once, ~80 MB on disk).
+A pure-Python ``HashingEmbedder`` is shipped alongside as a dependency-
+free fallback for tests and minimal-footprint deployments.
+
+Public entry points:
+
+* :func:`build_index_from_files` — given an iterable of
+  ``(path, content)`` pairs, chunk + embed + persist to ChromaDB.
+* :func:`retrieve_top_k` — embed a query, return the best-matching
+  chunks across the persisted index.
+* :func:`semantic_search_tool` — the CrewAI tool wrapper.
+
+Storage layout:
+
+    <RAG_ROOT>/<owner>/<repo>/<branch>/      ← Chroma persistent client
+    <RAG_ROOT>/<owner>/<repo>/<branch>/meta.json
+        {
+          "indexed_files": {"<path>": "<file_sha>"},
+          "embedder":      "default" | "hashing",
+          "embedding_dim": 384,
+          "updated_at":    "ISO-8601",
+        }
+
+Flags:
+
+* ``rag_retrieval`` — gates the agent tool registration and the
+  /api/repos/.../index endpoints.  Default **off** (opt-in apex).
+"""
+from __future__ import annotations
+
+FLAG_RAG_RETRIEVAL = "rag_retrieval"
+
+from .chunker import Chunk, chunk_file, chunk_files  # noqa: E402
+from .embedder import (  # noqa: E402
+    Embedder,
+    HashingEmbedder,
+    get_default_embedder,
+)
+from .indexer import (  # noqa: E402
+    IndexBuildReport,
+    IndexMeta,
+    build_index_from_files,
+)
+from .retriever import RetrievedChunk, retrieve_top_k  # noqa: E402
+from .store import RagStore  # noqa: E402
+
+__all__ = [
+    "FLAG_RAG_RETRIEVAL",
+    "Chunk",
+    "Embedder",
+    "HashingEmbedder",
+    "IndexBuildReport",
+    "IndexMeta",
+    "RagStore",
+    "RetrievedChunk",
+    "build_index_from_files",
+    "chunk_file",
+    "chunk_files",
+    "get_default_embedder",
+    "retrieve_top_k",
+]
diff --git a/gitpilot/rag/chunker.py b/gitpilot/rag/chunker.py
new file mode 100644
index 0000000..3eb8ee7
--- /dev/null
+++ b/gitpilot/rag/chunker.py
@@ -0,0 +1,122 @@
+"""File-to-chunk splitter for the RAG indexer (Batch B7).
+
+Strategy (simplest viable):
+
+* **Line-window chunking with overlap.**  Each chunk is up to
+  ``CHUNK_LINES`` source lines (default 40), with ``CHUNK_OVERLAP``
+  lines (default 5) of overlap to preserve context across boundaries.
+* **Binary / oversize skip.**  Files larger than ``MAX_FILE_BYTES``
+  or detected as binary are skipped silently.  We never want a 10 MB
+  minified JS or a binary blob to poison the index.
+* **Deterministic chunk ids.**  ``<sha1(path)[:12]>:<start_line>``, so
+  re-indexing the same file produces identical ids and ChromaDB's
+  upsert keeps the collection tidy.
+
+The chunker is intentionally language-naive — AST-aware splitting
+(tree-sitter per language) is the next refinement once the simpler
+approach is working.  Even the naive version dramatically outperforms
+"read every file" on >100-file repos.
+"""
+from __future__ import annotations
+
+import hashlib
+from dataclasses import dataclass
+from typing import Iterable, Iterator, List
+
+CHUNK_LINES = 40
+CHUNK_OVERLAP = 5
+MAX_FILE_BYTES = 256 * 1024   # 256 KB per file — bigger files are
+                              # almost always generated / minified.
+
+
+@dataclass(frozen=True)
+class Chunk:
+    chunk_id: str
+    path: str
+    start_line: int          # 1-indexed, inclusive
+    end_line: int            # 1-indexed, inclusive
+    text: str
+    file_sha: str            # short sha of the source file at chunk time
+
+
+def _short_sha(data: str) -> str:
+    return hashlib.sha1(data.encode("utf-8", errors="replace")).hexdigest()[:16]
+
+
+def _looks_binary(content: str) -> bool:
+    """Heuristic — null bytes or a high non-printable ratio in the
+    first chunk usually means binary.  Anything ChromaDB embeds must
+    be reasonable text."""
+    sample = content[:2048]
+    if "\x00" in sample:
+        return True
+    if not sample:
+        return False
+    bad = sum(
+        1 for c in sample
+        if not (c.isprintable() or c in "\n\r\t ")
+    )
+    return bad / max(1, len(sample)) > 0.3
+
+
+def chunk_file(
+    path: str,
+    content: str,
+    *,
+    chunk_lines: int = CHUNK_LINES,
+    overlap: int = CHUNK_OVERLAP,
+) -> List[Chunk]:
+    """Split a single file's content into overlapping line windows.
+
+    Returns an empty list when the file is empty, binary, or above
+    :data:`MAX_FILE_BYTES`.  Never raises on bad input.
+    """
+    if not content:
+        return []
+    if len(content.encode("utf-8", errors="replace")) > MAX_FILE_BYTES:
+        return []
+    if _looks_binary(content):
+        return []
+
+    chunk_lines = max(5, int(chunk_lines))
+    overlap = max(0, min(int(overlap), chunk_lines - 1))
+    step = chunk_lines - overlap
+
+    lines = content.splitlines()
+    if not lines:
+        return []
+    file_sha = _short_sha(content)
+    path_sha = hashlib.sha1(path.encode("utf-8")).hexdigest()[:12]
+
+    out: List[Chunk] = []
+    i = 0
+    while i < len(lines):
+        window = lines[i : i + chunk_lines]
+        if not window:
+            break
+        start = i + 1
+        end = i + len(window)
+        chunk_id = f"{path_sha}:{start}"
+        out.append(
+            Chunk(
+                chunk_id=chunk_id,
+                path=path,
+                start_line=start,
+                end_line=end,
+                text="\n".join(window),
+                file_sha=file_sha,
+            )
+        )
+        if end >= len(lines):
+            break
+        i += step
+    return out
+
+
+def chunk_files(
+    files: Iterable[tuple[str, str]],
+) -> Iterator[Chunk]:
+    """Yield chunks across an iterable of (path, content) pairs."""
+    for path, content in files:
+        for chunk in chunk_file(path, content):
+            yield chunk
diff --git a/gitpilot/rag/embedder.py b/gitpilot/rag/embedder.py
new file mode 100644
index 0000000..2e6650c
--- /dev/null
+++ b/gitpilot/rag/embedder.py
@@ -0,0 +1,165 @@
+"""Local embedding backends for the RAG pipeline (Batch B7).
+
+GitPilot is on-prem-first.  We refuse to require a cloud API key for
+the indexing path.  Two backends ship:
+
+* :class:`DefaultEmbedder` — wraps ChromaDB's bundled
+  ``all-MiniLM-L6-v2`` ONNX model.  ~80 MB downloaded once, 384-dim
+  vectors, free.  This is the production default.
+* :class:`HashingEmbedder` — pure-Python, deterministic, zero deps.
+  Produces a 256-dim sparse-hash representation.  Quality is below
+  MiniLM but the recall is good enough for unit tests and for
+  environments where ``onnxruntime`` can't be installed.
+
+Both implement the same :class:`Embedder` Protocol so callers can swap
+freely.  Tests inject :class:`HashingEmbedder` so the suite doesn't
+require an 80 MB download.
+
+The selection function :func:`get_default_embedder` tries the
+production embedder first and falls back transparently when the
+underlying deps aren't available.
+"""
+from __future__ import annotations
+
+import hashlib
+import logging
+import math
+import re
+from typing import Iterable, List, Protocol, runtime_checkable
+
+logger = logging.getLogger(__name__)
+
+HASHING_DIM = 256
+TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
+
+
+@runtime_checkable
+class Embedder(Protocol):
+    """Embedder Protocol — matches ChromaDB's EmbeddingFunction shape."""
+
+    @property
+    def name(self) -> str:
+        ...
+
+    @property
+    def dim(self) -> int:
+        ...
+
+    def __call__(self, texts: List[str]) -> List[List[float]]:
+        ...
+
+
+# ----------------------------------------------------------------------
+# HashingEmbedder — dependency-free fallback
+# ----------------------------------------------------------------------
+
+class HashingEmbedder:
+    """Deterministic hash-bucket embedder.
+
+    Tokenises the input on identifier boundaries, hashes each token
+    into one of :data:`HASHING_DIM` buckets, counts occurrences, then
+    L2-normalises.  Two semantically-similar code snippets that share
+    identifier vocabulary will produce vectors close in cosine
+    distance — good enough for "find the file that mentions
+    ``foo_bar``" without any model download.
+    """
+    name = "hashing-v1"
+
+    def __init__(self, dim: int = HASHING_DIM) -> None:
+        self._dim = max(32, int(dim))
+
+    @property
+    def dim(self) -> int:
+        return self._dim
+
+    def _embed_one(self, text: str) -> List[float]:
+        buckets = [0.0] * self._dim
+        for tok in TOKEN_RE.findall(text.lower()):
+            h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
+            buckets[h % self._dim] += 1.0
+        # Unit-normalise (L2); non-negative counts keep cosine in [0, 1].
+        norm = math.sqrt(sum(b * b for b in buckets))
+        if norm == 0.0:
+            return buckets
+        return [b / norm for b in buckets]
+
+    def __call__(self, texts: List[str]) -> List[List[float]]:
+        return [self._embed_one(t) for t in texts]
+
+
+# ----------------------------------------------------------------------
+# DefaultEmbedder — ChromaDB's bundled MiniLM
+# ----------------------------------------------------------------------
+
+class DefaultEmbedder:
+    """Wraps Chroma's :class:`DefaultEmbeddingFunction` so it matches
+    our :class:`Embedder` Protocol.  Lazily constructed so we don't
+    pay the ONNX model load when only HashingEmbedder is used."""
+    name = "chromadb-default-minilm-l6-v2"
+    _ef: object | None = None
+
+    @property
+    def dim(self) -> int:
+        # MiniLM-L6-v2 is 384-dim.  Hard-coded because Chroma's EF
+        # doesn't expose this through a stable attribute.
+        return 384
+
+    def _load(self) -> object:
+        if self._ef is None:
+            try:
+                from chromadb.utils.embedding_functions import (
+                    DefaultEmbeddingFunction,
+                )
+            except Exception as exc:
+                raise RuntimeError(
+                    "DefaultEmbedder requires chromadb + onnxruntime. "
+                    "Install them or use HashingEmbedder instead."
+                ) from exc
+            self._ef = DefaultEmbeddingFunction()
+        return self._ef
+
+    def __call__(self, texts: List[str]) -> List[List[float]]:
+        ef = self._load()
+        out = ef(texts)  # type: ignore[operator]
+        # Ensure we return plain lists of floats (some Chroma versions
+        # return numpy arrays).
+        return [[float(x) for x in vec] for vec in out]
+
+
+# ----------------------------------------------------------------------
+# Selection
+# ----------------------------------------------------------------------
+
+def get_default_embedder() -> Embedder:
+    """Return the production embedder if available, else the hashing
+    fallback.  Caller doesn't need to know which one was picked —
+    both honour the same Protocol."""
+    try:
+        emb = DefaultEmbedder()
+        # Trigger lazy load up-front so a missing dependency surfaces
+        # here rather than at the first ``__call__``.  Some chromadb
+        # versions may defer the onnxruntime import to that first call.
+        emb._load()
+        return emb
+    except Exception as exc:
+        logger.info(
+            "[rag] falling back to HashingEmbedder (DefaultEmbedder unavailable): %s",
+            exc,
+        )
+        return HashingEmbedder()
+
+
+def cosine_similarity(a: Iterable[float], b: Iterable[float]) -> float:
+    """Cosine similarity (extra trailing dimensions are ignored).  Used by
+    the retriever's in-process MMR re-rank; ChromaDB scores its own queries."""
+    al = list(a)
+    bl = list(b)
+    if not al or not bl:
+        return 0.0
+    n = min(len(al), len(bl))
+    dot = sum(al[i] * bl[i] for i in range(n))
+    na = math.sqrt(sum(x * x for x in al[:n]))
+    nb = math.sqrt(sum(x * x for x in bl[:n]))
+    if na == 0.0 or nb == 0.0:
+        return 0.0
+    return dot / (na * nb)
diff --git a/gitpilot/rag/indexer.py b/gitpilot/rag/indexer.py
new file mode 100644
index 0000000..5cac71a
--- /dev/null
+++ b/gitpilot/rag/indexer.py
@@ -0,0 +1,193 @@
+"""Index-builder orchestration for the RAG pipeline (Batch B7).
+
+Take a list of ``(path, content)`` pairs, run them through the
+chunker, push the chunks into the :class:`RagStore`, and persist a
+small ``meta.json`` next to the Chroma directory so subsequent runs
+can do incremental re-indexing instead of re-embedding everything.
+
+Public surface is :func:`build_index_from_files` and
+:class:`IndexBuildReport`.  The function is **synchronous** —
+embedding is CPU-bound and we deliberately stay off async so future
+batching / multi-process parallelism doesn't fight an event loop.
+"""
+from __future__ import annotations
+
+import hashlib
+import json
+import logging
+from dataclasses import dataclass, field
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Iterable, Optional
+
+from .chunker import chunk_file
+from .embedder import Embedder, get_default_embedder
+from .store import RagStore, _persist_dir
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class IndexMeta:
+    """On-disk header for an index — small JSON next to the Chroma dir."""
+    owner: str
+    repo: str
+    branch: str
+    embedder: str
+    embedding_dim: int
+    indexed_files: dict[str, str] = field(default_factory=dict)  # path -> file_sha
+    updated_at: str = field(
+        default_factory=lambda: datetime.now(UTC).isoformat(),
+    )
+
+    @classmethod
+    def load(cls, persist_dir: Path) -> Optional["IndexMeta"]:
+        path = persist_dir / "meta.json"
+        if not path.exists():
+            return None
+        try:
+            raw = json.loads(path.read_text(encoding="utf-8"))
+        except Exception:
+            return None
+        try:
+            return cls(
+                owner=str(raw.get("owner", "") or ""),
+                repo=str(raw.get("repo", "") or ""),
+                branch=str(raw.get("branch", "") or ""),
+                embedder=str(raw.get("embedder", "") or ""),
+                embedding_dim=int(raw.get("embedding_dim", 0) or 0),
+                indexed_files={
+                    str(k): str(v)
+                    for k, v in (raw.get("indexed_files") or {}).items()
+                },
+                updated_at=str(raw.get("updated_at", "") or ""),
+            )
+        except Exception:
+            return None
+
+    def save(self, persist_dir: Path) -> None:
+        persist_dir.mkdir(parents=True, exist_ok=True)
+        (persist_dir / "meta.json").write_text(
+            json.dumps(
+                {
+                    "owner": self.owner,
+                    "repo": self.repo,
+                    "branch": self.branch,
+                    "embedder": self.embedder,
+                    "embedding_dim": self.embedding_dim,
+                    "indexed_files": self.indexed_files,
+                    "updated_at": self.updated_at,
+                },
+                indent=2,
+            ),
+            encoding="utf-8",
+        )
+
+
+@dataclass
+class IndexBuildReport:
+    files_seen: int = 0
+    files_indexed: int = 0      # actually re-embedded this run
+    files_skipped: int = 0      # unchanged since last index
+    chunks_added: int = 0
+    embedder_name: str = ""
+    embedding_dim: int = 0
+
+
+def _file_sha(content: str) -> str:
+    return hashlib.sha1(content.encode("utf-8", errors="replace")).hexdigest()[:16]
+
+
+def build_index_from_files(
+    files: Iterable[tuple[str, str]],
+    *,
+    owner: str,
+    repo: str,
+    branch: str,
+    embedder: Optional[Embedder] = None,
+    persist_dir: Optional[Path] = None,
+    force_full_rebuild: bool = False,
+) -> IndexBuildReport:
+    """Index a batch of files into the per-(owner/repo/branch) store.
+
+    Incremental: a file whose content hasn't changed since the last
+    build (matching ``file_sha`` in ``meta.json``) is skipped entirely
+    — no re-chunking, no re-embedding.
+
+    ``force_full_rebuild=True`` deletes existing chunks and re-indexes
+    everything.  Used by /api/repos/.../index/build with force=True.
+    """
+    emb = embedder or get_default_embedder()
+    pdir = persist_dir or _persist_dir(owner, repo, branch)
+    store = RagStore(
+        owner=owner, repo=repo, branch=branch,
+        embedder=emb, persist_dir=pdir,
+    )
+    meta = IndexMeta.load(pdir) or IndexMeta(
+        owner=owner, repo=repo, branch=branch,
+        embedder=emb.name, embedding_dim=emb.dim,
+    )
+    if meta.embedder != emb.name or meta.embedding_dim != emb.dim:
+        # Embedder changed since last build — vectors are incomparable
+        # so we must rebuild from scratch.
+        logger.info(
+            "[rag] embedder changed (%s/%d → %s/%d) — full rebuild",
+            meta.embedder, meta.embedding_dim, emb.name, emb.dim,
+        )
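+        force_full_rebuild = True
+        # Keep the old ``meta.indexed_files`` so the loop below and the
+        # stale-path sweep can delete chunks written by the previous
+        # embedder; ``embedder`` / ``embedding_dim`` are overwritten
+        # just before ``meta.save``.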
+
+    report = IndexBuildReport(
+        embedder_name=emb.name,
+        embedding_dim=emb.dim,
+    )
+
+    new_indexed: dict[str, str] = {} if force_full_rebuild else dict(meta.indexed_files)
+
+    pending_chunks = []
+    for path, content in files:
+        if not path or content is None:
+            continue
+        report.files_seen += 1
+        sha = _file_sha(content)
+        old_sha = meta.indexed_files.get(path)
+        if not force_full_rebuild and old_sha == sha:
+            report.files_skipped += 1
+            continue
+
+        # Drop stale chunks for this file before adding fresh ones.
+        if old_sha is not None:
+            store.delete_by_path(path)
+
+        chunks = chunk_file(path, content)
+        if not chunks:
+            # File was empty / binary / too large.  Drop it from the
+            # index entirely so we don't keep referencing stale chunks.
+            new_indexed.pop(path, None)
+            continue
+        pending_chunks.extend(chunks)
+        new_indexed[path] = sha
+        report.files_indexed += 1
+
+    if force_full_rebuild:
+        # Drop everything for paths NOT in the new set as well — handles
+        # files that disappeared.
+        for stale_path in set(meta.indexed_files) - set(new_indexed):
+            store.delete_by_path(stale_path)
+
+    if pending_chunks:
+        report.chunks_added = store.add_chunks(pending_chunks)
+
+    meta.indexed_files = new_indexed
+    meta.embedder = emb.name
+    meta.embedding_dim = emb.dim
+    meta.updated_at = datetime.now(UTC).isoformat()
+    meta.save(pdir)
+
+    return report
+
+
+__all__ = ["IndexMeta", "IndexBuildReport", "build_index_from_files"]
diff --git a/gitpilot/rag/retriever.py b/gitpilot/rag/retriever.py
new file mode 100644
index 0000000..c3d9bbb
--- /dev/null
+++ b/gitpilot/rag/retriever.py
@@ -0,0 +1,132 @@
+"""Top-k semantic retrieval for the RAG pipeline (Batch B7).
+
+Public function :func:`retrieve_top_k` and dataclass
+:class:`RetrievedChunk`.  Thin wrapper over :meth:`RagStore.query`
+that also applies a simple Maximum-Marginal-Relevance (MMR) re-rank
+when ``mmr=True`` so the agent doesn't get N near-duplicates from
+the same file.
+"""
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+from pathlib import Path
+from typing import List, Optional
+
+from .embedder import Embedder, cosine_similarity, get_default_embedder
+from .store import QueryHit, RagStore, _persist_dir
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass(frozen=True)
+class RetrievedChunk:
+    path: str
+    start_line: int
+    end_line: int
+    text: str
+    score: float
+
+
+def _to_retrieved(h: QueryHit) -> RetrievedChunk:
+    return RetrievedChunk(
+        path=h.path,
+        start_line=h.start_line,
+        end_line=h.end_line,
+        text=h.text,
+        score=h.score,
+    )
+
+
+def _mmr_rerank(
+    query_vec: List[float],
+    candidates: List[QueryHit],
+    *,
+    k: int,
+    lambda_: float = 0.7,
+    embedder: Embedder,
+) -> List[QueryHit]:
+    """Maximum Marginal Relevance — pick a diverse top-k that still
+    ranks by similarity to the query.  ``lambda_`` weights relevance
+    vs. novelty: 1.0 = pure relevance, 0.0 = pure diversity."""
+    if not candidates or k <= 0:
+        return []
+    # Pre-embed the candidate texts so we can compute pairwise similarity.
+    texts = [c.text for c in candidates]
+    vecs = embedder(texts)
+
+    selected: List[int] = []
+    remaining = list(range(len(candidates)))
+    while remaining and len(selected) < k:
+        best_idx = remaining[0]
+        best_score = -1e9
+        for idx in remaining:
+            rel = cosine_similarity(query_vec, vecs[idx])
+            if selected:
+                redundancy = max(
+                    cosine_similarity(vecs[idx], vecs[s])
+                    for s in selected
+                )
+            else:
+                redundancy = 0.0
+            score = lambda_ * rel - (1 - lambda_) * redundancy
+            if score > best_score:
+                best_score = score
+                best_idx = idx
+        selected.append(best_idx)
+        remaining.remove(best_idx)
+    return [candidates[i] for i in selected]
+
+
+def retrieve_top_k(
+    query: str,
+    *,
+    owner: str,
+    repo: str,
+    branch: str,
+    k: int = 8,
+    embedder: Optional[Embedder] = None,
+    persist_dir: Optional[Path] = None,
+    mmr: bool = True,
+) -> List[RetrievedChunk]:
+    """Return the k most-relevant chunks across the persisted index.
+
+    Returns an empty list (silently) when:
+
+    * the persist dir doesn't exist yet (no index built),
+    * the store can't be opened (e.g. ``chromadb`` isn't installed),
+    * ChromaDB raises an internal error during the query.
+
+    Callers should treat "no results" as "fall back to other tools",
+    not as an error.
+    """
+    if not query or k <= 0:
+        return []
+    emb = embedder or get_default_embedder()
+    pdir = persist_dir or _persist_dir(owner, repo, branch)
+    if not pdir.exists():
+        return []
+
+    try:
+        store = RagStore(
+            owner=owner, repo=repo, branch=branch,
+            embedder=emb, persist_dir=pdir,
+        )
+    except Exception as exc:
+        logger.debug("[rag] retriever: store init failed: %s", exc)
+        return []
+
+    # Over-fetch when MMR is on so re-ranking has something to chew on.
+    over_k = max(k, k * 3) if mmr else k
+    hits = store.query(query, k=over_k)
+    if not hits:
+        return []
+    if mmr and len(hits) > k:
+        qv = emb([query])[0]
+        hits = _mmr_rerank(qv, hits, k=k, embedder=emb)
+    else:
+        hits = hits[:k]
+    return [_to_retrieved(h) for h in hits]
+
+
+__all__ = ["RetrievedChunk", "retrieve_top_k"]
diff --git a/gitpilot/rag/store.py b/gitpilot/rag/store.py
new file mode 100644
index 0000000..e92a133
--- /dev/null
+++ b/gitpilot/rag/store.py
@@ -0,0 +1,201 @@
+"""ChromaDB-backed persistent store for the RAG pipeline (Batch B7).
+
+Wraps a per-(owner, repo, branch) Chroma collection so callers don't
+have to think about embedder wiring, persistence paths, or upsert
+semantics.  Storage layout:
+
+    <RAG_ROOT>/<owner>/<repo>/<branch>/
+        chroma.sqlite3 + hnsw segments  (ChromaDB persistent client)
+
+When ``persist_dir`` is ``None`` the store resolves the default path
+above.  Tests can isolate themselves by passing their own
+``persist_dir`` or by setting ``GITPILOT_RAG_ROOT``.
+
+Embedder is injected at construction so tests can use the dependency-
+free :class:`HashingEmbedder` while production uses
+:class:`DefaultEmbedder`.
+"""
+from __future__ import annotations
+
+import logging
+import os
+import re
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, List, Optional
+
+from .embedder import Embedder
+
+logger = logging.getLogger(__name__)
+
+# Override via env so tests can isolate (and CI can dump on cleanup).
+RAG_ROOT_ENV = "GITPILOT_RAG_ROOT"
+_DEFAULT_RAG_ROOT = Path.home() / ".gitpilot" / "rag"
+
+
+def rag_root() -> Path:
+    override = os.environ.get(RAG_ROOT_ENV)
+    if override:
+        return Path(override)
+    return _DEFAULT_RAG_ROOT
+
+
+def _persist_dir(owner: str, repo: str, branch: str) -> Path:
+    return rag_root() / _sanitize(owner) / _sanitize(repo) / _sanitize(branch)
+
+
+def _sanitize(s: str) -> str:
+    """Make a path segment safe for any filesystem (never ``.`` or ``..``)."""
+    return re.sub(r"^\.{1,2}$", "_", re.sub(r"[^A-Za-z0-9._-]+", "_", s)[:80]) or "_"
+
+
+def _collection_name(owner: str, repo: str, branch: str) -> str:
+    """ChromaDB collection names must be 3–512 chars, alphanumeric +
+    underscore/hyphen, start/end alphanumeric.  Build a deterministic
+    name that meets the rules."""
+    raw = f"gp_{_sanitize(owner)}_{_sanitize(repo)}_{_sanitize(branch)}"
+    # Ensure start/end are alphanumeric.
+    raw = re.sub(r"^_+", "", raw)
+    raw = re.sub(r"_+$", "", raw)
+    return raw[:500] or "gp_default"
+
+
+@dataclass(frozen=True)
+class QueryHit:
+    chunk_id: str
+    path: str
+    start_line: int
+    end_line: int
+    text: str
+    score: float
+
+
+class RagStore:
+    """Thin wrapper around a ChromaDB persistent collection."""
+
+    def __init__(
+        self,
+        *,
+        owner: str,
+        repo: str,
+        branch: str,
+        embedder: Embedder,
+        persist_dir: Optional[Path] = None,
+    ) -> None:
+        import chromadb  # lazy import — heavy module
+
+        self.owner = owner
+        self.repo = repo
+        self.branch = branch
+        self.embedder = embedder
+        self._collection_name = _collection_name(owner, repo, branch)
+
+        if persist_dir is None:
+            persist_dir = _persist_dir(owner, repo, branch)
+        persist_dir.mkdir(parents=True, exist_ok=True)
+        self.persist_dir = persist_dir
+
+        self._client = chromadb.PersistentClient(path=str(persist_dir))
+        # ChromaDB will compute embeddings itself if we hand it our
+        # embedder via the ``embedding_function`` arg, but its newer
+        # versions are picky about the shape.  We compute vectors
+        # ourselves and pass them via ``embeddings=`` on add/query —
+        # makes the store backend-agnostic.
+        self._collection = self._client.get_or_create_collection(
+            name=self._collection_name,
+            metadata={"hnsw:space": "cosine"},
+        )
+
+    # ------------------------------------------------------------------
+    # Mutation
+    # ------------------------------------------------------------------
+    def add_chunks(
+        self,
+        chunks: Iterable[object],   # avoid hard import cycle with chunker
+    ) -> int:
+        ids: List[str] = []
+        documents: List[str] = []
+        metadatas: List[dict[str, object]] = []
+        for c in chunks:
+            ids.append(c.chunk_id)         # type: ignore[attr-defined]
+            documents.append(c.text)       # type: ignore[attr-defined]
+            metadatas.append({
+                "path":       c.path,        # type: ignore[attr-defined]
+                "start_line": c.start_line,  # type: ignore[attr-defined]
+                "end_line":   c.end_line,    # type: ignore[attr-defined]
+                "file_sha":   c.file_sha,    # type: ignore[attr-defined]
+            })
+        if not ids:
+            return 0
+        vectors = self.embedder(documents)
+        self._collection.upsert(
+            ids=ids,
+            documents=documents,
+            metadatas=metadatas,      # type: ignore[arg-type]
+            embeddings=vectors,       # type: ignore[arg-type]
+        )
+        return len(ids)
+
+    def delete_by_path(self, path: str) -> int:
+        """Drop every chunk for one source file (e.g. file removed
+        from the repo, or about to be re-indexed)."""
+        try:
+            res = self._collection.get(where={"path": path})
+            ids = res.get("ids", []) if isinstance(res, dict) else []
+            if ids:
+                self._collection.delete(ids=ids)
+            return len(ids or [])
+        except Exception as exc:
+            logger.debug("[rag] delete_by_path %s failed: %s", path, exc)
+            return 0
+
+    def count(self) -> int:
+        try:
+            return int(self._collection.count())
+        except Exception:
+            return 0
+
+    # ------------------------------------------------------------------
+    # Retrieval
+    # ------------------------------------------------------------------
+    def query(self, text: str, *, k: int = 8) -> List[QueryHit]:
+        if not text or k <= 0:
+            return []
+        if self.count() == 0:
+            return []
+        vec = self.embedder([text])[0]
+        try:
+            res = self._collection.query(
+                query_embeddings=[vec],   # type: ignore[arg-type]
+                n_results=max(1, int(k)),
+            )
+        except Exception as exc:
+            logger.debug("[rag] query failed: %s", exc)
+            return []
+
+        # Chroma returns parallel lists per query — we always pass one.
+        ids_list = (res.get("ids") or [[]])[0]
+        docs_list = (res.get("documents") or [[]])[0]
+        metas_list = (res.get("metadatas") or [[]])[0]
+        dists_list = (res.get("distances") or [[]])[0]
+
+        out: List[QueryHit] = []
+        for i, cid in enumerate(ids_list):
+            doc = docs_list[i] if i < len(docs_list) else ""
+            meta = metas_list[i] if i < len(metas_list) and metas_list[i] else {}
+            dist = float(dists_list[i]) if i < len(dists_list) else 1.0
+            score = max(0.0, 1.0 - dist)
+            out.append(
+                QueryHit(
+                    chunk_id=str(cid),
+                    path=str(meta.get("path", "")),
+                    start_line=int(meta.get("start_line", 0) or 0),  # type: ignore[arg-type]
+                    end_line=int(meta.get("end_line", 0) or 0),  # type: ignore[arg-type]
+                    text=str(doc or ""),
+                    score=score,
+                )
+            )
+        return out
+
+
+__all__ = ["RagStore", "QueryHit", "rag_root"]
diff --git a/gitpilot/rag_consent.py b/gitpilot/rag_consent.py
new file mode 100644
index 0000000..f12e949
--- /dev/null
+++ b/gitpilot/rag_consent.py
@@ -0,0 +1,156 @@
+"""Per-repo consent for the local RAG index (Batch B9).
+
+Storage: ``~/.gitpilot/rag/<owner>/<repo>/.consent`` — a small JSON
+file with the grant timestamp and identity.  Branch-agnostic on
+purpose: building a semantic index is a repo-level decision, and we
+want the second branch of the same repo to inherit consent.
+
+Three operations the router needs:
+
+* :func:`has_consent` — fast: returns ``True`` iff the consent file
+  exists and is well-formed.
+* :func:`grant_consent` — writes the file (idempotent).  Called when
+  the user approves an INDEX plan step.
+* :func:`revoke_consent` — deletes the file *and* removes the
+  persisted index directory.  Called from Settings → Provider.
+
+All paths are sanitised via the same helper the RAG store uses, so
+"weird" owner/repo names (with ``/``, spaces, etc.) won't poke outside
+the consent root.
+"""
+from __future__ import annotations
+
+import json
+import logging
+import os
+import shutil
+from dataclasses import asdict, dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Optional
+
+from .rag.store import rag_root
+
+logger = logging.getLogger(__name__)
+
+CONSENT_FILE = ".consent"
+
+
+@dataclass(frozen=True)
+class ConsentRecord:
+    """Round-trip JSON shape for the consent file."""
+    granted_at: str          # ISO-8601 UTC
+    granted_by: Optional[str] = None    # username / actor id if available
+
+
+def _consent_dir(owner: str, repo: str) -> Path:
+    """Consent lives at the repo level (no branch segment) so all
+    branches of the same repo share the answer."""
+    from .rag.store import _sanitize  # local import — sanitiser kept
+                                       # private to store.py
+    return rag_root() / _sanitize(owner) / _sanitize(repo)
+
+
+def _consent_path(owner: str, repo: str) -> Path:
+    return _consent_dir(owner, repo) / CONSENT_FILE
+
+
+# ----------------------------------------------------------------------
+# Public API
+# ----------------------------------------------------------------------
+
+def has_consent(owner: str, repo: str) -> bool:
+    """Return ``True`` iff the user has previously approved indexing
+    for ``owner/repo``.  Malformed / unreadable files count as "no
+    consent" — fail closed."""
+    if not owner or not repo:
+        return False
+    path = _consent_path(owner, repo)
+    if not path.exists():
+        return False
+    try:
+        raw = json.loads(path.read_text(encoding="utf-8"))
+    except Exception:
+        return False
+    # Minimum shape check.  Anything else weird → no consent.
+    return isinstance(raw, dict) and isinstance(raw.get("granted_at"), str)
+
+
+def grant_consent(
+    owner: str,
+    repo: str,
+    *,
+    granted_by: Optional[str] = None,
+) -> ConsentRecord:
+    """Record consent for ``owner/repo``.  Idempotent: calling twice
+    updates the timestamp but doesn't re-prompt the user."""
+    if not owner or not repo:
+        raise ValueError("grant_consent: owner and repo are required")
+    record = ConsentRecord(
+        granted_at=datetime.now(UTC).isoformat(),
+        granted_by=granted_by,
+    )
+    cdir = _consent_dir(owner, repo)
+    cdir.mkdir(parents=True, exist_ok=True)
+    _consent_path(owner, repo).write_text(
+        json.dumps(asdict(record), indent=2),
+        encoding="utf-8",
+    )
+    return record
+
+
+def revoke_consent(owner: str, repo: str) -> bool:
+    """Delete the consent file AND the persisted index for the repo.
+
+    Returns ``True`` if anything was actually deleted, ``False`` if
+    there was nothing to revoke (already absent).  Never raises on a
+    missing path — revocation is intent, not assertion.
+    """
+    if not owner or not repo:
+        return False
+    cdir = _consent_dir(owner, repo)
+    removed = False
+    cpath = _consent_path(owner, repo)
+    if cpath.exists():
+        try:
+            cpath.unlink()
+            removed = True
+        except OSError as exc:
+            logger.debug("[rag-consent] could not unlink %s: %s", cpath, exc)
+    # Wipe every per-branch index directory under this repo.  The
+    # consent record was repo-level, the indexes are per-branch, so
+    # we recurse through immediate subdirectories.
+    if cdir.exists():
+        try:
+            for entry in cdir.iterdir():
+                if entry.is_dir():
+                    shutil.rmtree(entry, ignore_errors=True)
+                    removed = True
+        except OSError as exc:
+            logger.debug("[rag-consent] could not iterate %s: %s", cdir, exc)
+    return removed
+
+
+def load_record(owner: str, repo: str) -> Optional[ConsentRecord]:
+    """Return the persisted ConsentRecord, or ``None`` if absent /
+    malformed.  Callers that need to surface "consented since
+    2025-01-02" can use this without writing their own JSON parsing."""
+    if not has_consent(owner, repo):
+        return None
+    try:
+        raw = json.loads(_consent_path(owner, repo).read_text(encoding="utf-8"))
+        return ConsentRecord(
+            granted_at=str(raw.get("granted_at", "")),
+            granted_by=raw.get("granted_by"),
+        )
+    except Exception:
+        return None
+
+
+__all__ = [
+    "ConsentRecord",
+    "grant_consent",
+    "has_consent",
+    "load_record",
+    "revoke_consent",
+]
diff --git a/gitpilot/repo_map.py b/gitpilot/repo_map.py
new file mode 100644
index 0000000..6f817f5
--- /dev/null
+++ b/gitpilot/repo_map.py
@@ -0,0 +1,375 @@
+"""Hierarchical repository map (Batch B6).
+
+Generates a compact, factual "site map" of a repository — what
+languages, what top-level modules, which files are entry points —
+and persists it so every planner prompt can be primed with the same
+high-level overview without re-discovering it each turn.
+
+Inspired by Aider's repo-map, Cursor's project context, and the
+``AGENTS.md`` convention.  This implementation is fully local: no
+LLM call needed.  We read the file tree, count extensions, identify
+"key" files via well-known names, and emit a markdown blob bounded
+by a hard token budget (default 500).
+
+Storage:
+  ~/.gitpilot/repo_maps/<sha1(owner/repo/branch)>.json
+
+Invalidation:
+  Stored alongside the commit SHA the map was built from.  When the
+  branch's HEAD moves, callers can detect the staleness via
+  ``RepoMap.commit_sha`` and refresh.
+
+Wiring:
+  Phase 6 of the enterprise roadmap.  The next batch will inject
+  ``repo_map.agents_md`` into the planner's backstory through the
+  existing ``context_pack`` slot in ``generate_plan``.
+"""
+from __future__ import annotations
+
+import hashlib
+import json
+import logging
+from collections import Counter
+from dataclasses import asdict, dataclass, field
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Callable, Iterable, List, Optional
+
+from . import flags
+from .context_budget import estimate_tokens
+
+logger = logging.getLogger(__name__)
+
+FLAG_REPO_MAP = "repo_map"
+
+DEFAULT_MAP_TOKEN_BUDGET = 500
+MAP_MAX_KEY_FILES = 10
+MAP_MAX_MODULES = 12
+MAP_MAX_FILES_PER_MODULE = 6
+MAPS_DIR_ENV = "GITPILOT_REPO_MAPS_DIR"
+
+# Files we always lift into "key files" when present.  Ordered by
+# importance — first match wins for ranking.
+_WELL_KNOWN_KEY_FILES: tuple[str, ...] = (
+    "README.md",
+    "README.rst",
+    "README",
+    "AGENTS.md",
+    "CLAUDE.md",
+    "pyproject.toml",
+    "package.json",
+    "Cargo.toml",
+    "go.mod",
+    "pom.xml",
+    "build.gradle",
+    "Dockerfile",
+    "docker-compose.yml",
+    "Makefile",
+    ".github/workflows/ci.yml",
+    "LICENSE",
+    "CHANGELOG.md",
+)
+
+
+def _coerce_int_safe(value: object) -> int:
+    if isinstance(value, bool):
+        return 0
+    if isinstance(value, int):
+        return value
+    if isinstance(value, str):
+        try:
+            return int(value)
+        except ValueError:
+            return 0
+    return 0
+
+
+def _coerce_str_list(value: object) -> List[str]:
+    if not isinstance(value, list):
+        return []
+    return [str(x) for x in value if x is not None]
+
+
+def _coerce_lang_counts(value: object) -> dict[str, int]:
+    if not isinstance(value, dict):
+        return {}
+    return {str(k): _coerce_int_safe(v) for k, v in value.items()}
+
+
+@dataclass
+class ModuleSummary:
+    path: str           # directory path, e.g. "src/util"
+    files: List[str] = field(default_factory=list)
+    file_count: int = 0
+
+
+@dataclass
+class RepoMap:
+    """In-memory + on-disk representation of a repo's site map."""
+    owner: str
+    repo: str
+    branch: str
+    commit_sha: Optional[str] = None
+    generated_at: str = field(
+        default_factory=lambda: datetime.now(UTC).isoformat(),
+    )
+    languages: dict[str, int] = field(default_factory=dict)
+    key_files: List[str] = field(default_factory=list)
+    modules: List[ModuleSummary] = field(default_factory=list)
+    total_files: int = 0
+    agents_md: str = ""
+
+    def to_dict(self) -> dict[str, object]:
+        return asdict(self)
+
+    @classmethod
+    def from_dict(cls, data: dict[str, object]) -> "RepoMap":
+        raw_modules = data.get("modules", []) or []
+        modules: List[ModuleSummary] = []
+        if isinstance(raw_modules, list):
+            for m in raw_modules:
+                if isinstance(m, dict):
+                    modules.append(ModuleSummary(
+                        path=str(m.get("path", "") or ""),
+                        files=_coerce_str_list(m.get("files")),
+                        file_count=_coerce_int_safe(m.get("file_count")),
+                    ))
+        out = cls(
+            owner=str(data.get("owner", "") or ""),
+            repo=str(data.get("repo", "") or ""),
+            branch=str(data.get("branch", "") or ""),
+            commit_sha=(
+                str(data["commit_sha"]) if data.get("commit_sha") is not None else None
+            ),
+            generated_at=str(data.get("generated_at", "") or ""),
+            languages=_coerce_lang_counts(data.get("languages")),
+            key_files=_coerce_str_list(data.get("key_files")),
+            modules=modules,
+            total_files=_coerce_int_safe(data.get("total_files")),
+            agents_md=str(data.get("agents_md", "") or ""),
+        )
+        return out
+
+
+# ----------------------------------------------------------------------
+# Storage helpers
+# ----------------------------------------------------------------------
+
+def _maps_root() -> Path:
+    import os
+
+    override = os.environ.get(MAPS_DIR_ENV)
+    if override:
+        return Path(override)
+    return Path.home() / ".gitpilot" / "repo_maps"
+
+
+def _cache_key(owner: str, repo: str, branch: str) -> str:
+    raw = f"{owner}/{repo}@{branch}".encode("utf-8")
+    return hashlib.sha1(raw).hexdigest()[:24]
+
+
+def _cache_path(owner: str, repo: str, branch: str) -> Path:
+    return _maps_root() / f"{_cache_key(owner, repo, branch)}.json"
+
+
+def load_cached(owner: str, repo: str, branch: str) -> Optional[RepoMap]:
+    path = _cache_path(owner, repo, branch)
+    if not path.exists():
+        return None
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+        return RepoMap.from_dict(data)
+    except Exception as exc:
+        logger.debug("[repo-map] could not load %s: %s", path, exc)
+        return None
+
+
+def save_cached(repo_map: RepoMap) -> None:
+    root = _maps_root()
+    root.mkdir(parents=True, exist_ok=True)
+    path = _cache_path(repo_map.owner, repo_map.repo, repo_map.branch)
+    try:
+        path.write_text(json.dumps(repo_map.to_dict(), indent=2), encoding="utf-8")
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("[repo-map] could not save %s: %s", path, exc)
+
+
+# ----------------------------------------------------------------------
+# Map builder
+# ----------------------------------------------------------------------
+
+def _extension_of(path: str) -> str:
+    name = path.rsplit("/", 1)[-1]
+    if "." not in name:
+        return "(no-ext)"
+    return name.rsplit(".", 1)[-1].lower()
+
+
+def _top_dir_of(path: str) -> str:
+    """Return the first directory segment of a path; '' for root files."""
+    if "/" not in path:
+        return ""
+    return path.split("/", 1)[0]
+
+
+def _group_into_modules(paths: List[str]) -> List[ModuleSummary]:
+    """Group files by top-level directory.  Root-level files form a
+    "(root)" pseudo-module so they're still visible."""
+    buckets: dict[str, List[str]] = {}
+    for p in sorted(paths):
+        top = _top_dir_of(p) or "(root)"
+        buckets.setdefault(top, []).append(p)
+
+    modules: List[ModuleSummary] = []
+    for name, files in buckets.items():
+        modules.append(
+            ModuleSummary(
+                path=name,
+                files=files[:MAP_MAX_FILES_PER_MODULE],
+                file_count=len(files),
+            )
+        )
+    # Sort by file count descending so the most-populated modules
+    # come first — these are the ones the planner most needs to see.
+    modules.sort(key=lambda m: (-m.file_count, m.path))
+    return modules[:MAP_MAX_MODULES]
+
+
+def _select_key_files(paths: List[str]) -> List[str]:
+    """Pick well-known files in priority order, capped at MAP_MAX_KEY_FILES."""
+    path_set = set(paths)
+    by_name = {p.rsplit("/", 1)[-1]: p for p in paths}
+    selected: List[str] = []
+    for well_known in _WELL_KNOWN_KEY_FILES:
+        # Check the cap first so every branch below respects it.
+        if len(selected) >= MAP_MAX_KEY_FILES:
+            break
+        if well_known in path_set:
+            # Exact path match (e.g. a root-level README.md) wins.
+            selected.append(well_known)
+        elif "/" not in well_known and well_known in by_name:
+            # Bare names also match at any depth; entries that contain
+            # a "/" only ever match as exact paths.
+            selected.append(by_name[well_known])
+    return selected
+
+
+def _render_agents_md(repo_map: RepoMap, *, token_budget: int) -> str:
+    """Render the markdown blob that gets pinned into planner prompts.
+
+    Targets ``token_budget``: while the output overshoots, drop the tail
+    (least-populated) module and retry, but never go below three modules.
+    """
+    def _build(modules: List[ModuleSummary]) -> str:
+        lines: list[str] = []
+        lines.append(f"# Repository map — `{repo_map.owner}/{repo_map.repo}` @ `{repo_map.branch}`")
+        lines.append("")
+        lines.append(f"**Total files:** {repo_map.total_files}")
+        if repo_map.languages:
+            top = sorted(repo_map.languages.items(), key=lambda kv: -kv[1])[:8]
+            lines.append(
+                "**Languages:** " + ", ".join(f"`{ext}`={n}" for ext, n in top)
+            )
+        if repo_map.key_files:
+            lines.append("")
+            lines.append("## Key files")
+            for kf in repo_map.key_files:
+                lines.append(f"- `{kf}`")
+        if modules:
+            lines.append("")
+            lines.append("## Modules")
+            for mod in modules:
+                lines.append(f"- **`{mod.path}/`** — {mod.file_count} file(s)")
+                for f in mod.files:
+                    lines.append(f"    - `{f}`")
+        lines.append("")
+        lines.append(
+            "_Use the `Find files matching a pattern`, `Search file contents` "
+            "and `Read file content` tools to drill into anything above._"
+        )
+        return "\n".join(lines)
+
+    modules = list(repo_map.modules)
+    out = _build(modules)
+    while estimate_tokens(out) > token_budget and len(modules) > 3:
+        # Drop the least-populated module and try again.
+        modules = modules[:-1]
+        out = _build(modules)
+    return out
+
+
+def build_repo_map(
+    *,
+    owner: str,
+    repo: str,
+    branch: str,
+    paths: Iterable[str],
+    commit_sha: Optional[str] = None,
+    token_budget: int = DEFAULT_MAP_TOKEN_BUDGET,
+) -> RepoMap:
+    """Deterministically construct a :class:`RepoMap` from a list of
+    repository file paths.  Pure function — no I/O, no network, no
+    LLM call.  Caller is responsible for fetching the paths (today
+    that's ``get_repo_tree`` for GitHub mode or ``Path.rglob`` for
+    local mode).
+    """
+    files = sorted({s for s in (p.strip() for p in paths if isinstance(p, str)) if s})
+    languages = dict(Counter(_extension_of(p) for p in files))
+    key_files = _select_key_files(files)
+    modules = _group_into_modules(files)
+
+    repo_map = RepoMap(
+        owner=owner,
+        repo=repo,
+        branch=branch,
+        commit_sha=commit_sha,
+        languages=languages,
+        key_files=key_files,
+        modules=modules,
+        total_files=len(files),
+    )
+    repo_map.agents_md = _render_agents_md(repo_map, token_budget=token_budget)
+    return repo_map
+
+
+def get_or_build_repo_map(
+    *,
+    owner: str,
+    repo: str,
+    branch: str,
+    paths_provider: Callable[[], Iterable[str]],
+    commit_sha: Optional[str] = None,
+    token_budget: int = DEFAULT_MAP_TOKEN_BUDGET,
+    force: bool = False,
+) -> RepoMap:
+    """Return a cached map if it's still valid for the current commit,
+    otherwise build a fresh one and persist.  ``paths_provider`` is
+    a zero-arg callable that returns ``Iterable[str]`` of repo paths —
+    keeps this function independent of how paths are fetched.
+    """
+    if not force:
+        cached = load_cached(owner, repo, branch)
+        if cached and cached.commit_sha == commit_sha and commit_sha is not None:
+            return cached
+
+    paths = list(paths_provider())
+    fresh = build_repo_map(
+        owner=owner, repo=repo, branch=branch,
+        paths=paths, commit_sha=commit_sha,
+        token_budget=token_budget,
+    )
+    save_cached(fresh)
+    return fresh
+
+
+__all__ = [
+    "FLAG_REPO_MAP",
+    "DEFAULT_MAP_TOKEN_BUDGET",
+    "ModuleSummary",
+    "RepoMap",
+    "build_repo_map",
+    "get_or_build_repo_map",
+    "load_cached",
+    "save_cached",
+]
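Reviewer sketch (not part of the patch): a minimal standalone illustration of the cache-key and staleness rules in `gitpilot/repo_map.py` above, for checking the behaviour without installing the package. The helper names `cache_key` and `is_fresh` are hypothetical and do not exist in the module; the key format mirrors `_cache_key`, and the freshness rule mirrors the comparison in `get_or_build_repo_map`.

```python
import hashlib
from typing import Optional


def cache_key(owner: str, repo: str, branch: str) -> str:
    """Mirror of _cache_key: first 24 hex chars of sha1("owner/repo@branch")."""
    raw = f"{owner}/{repo}@{branch}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()[:24]


def is_fresh(cached_sha: Optional[str], current_sha: Optional[str]) -> bool:
    """A cached map is reused only when both SHAs are known and equal."""
    return current_sha is not None and cached_sha == current_sha


key = cache_key("octo", "demo", "main")
print(len(key))                      # 24
print(is_fresh("abc123", "abc123"))  # True
print(is_fresh("abc123", "def456"))  # False: HEAD moved, so rebuild
print(is_fresh(None, None))          # False: an unknown SHA never hits the cache
```

One consequence worth noting: callers that cannot supply `commit_sha` rebuild the map on every call, since a `None` SHA is never treated as a cache hit.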
diff --git a/gitpilot/sandbox_api.py b/gitpilot/sandbox_api.py
new file mode 100644
index 0000000..54d09fb
--- /dev/null
+++ b/gitpilot/sandbox_api.py
@@ -0,0 +1,666 @@
+"""HTTP surface for the sandbox runtime switch.
+
+Three core endpoints, all additive (MatrixLab lifecycle routes are below):
+
+* ``GET  /api/sandbox/status``   — what's configured, can we reach it?
+* ``PUT  /api/sandbox/config``   — update the persisted SandboxSettings.
+* ``POST /api/sandbox/run``      — execute one ``{language, code}`` snippet
+                                   through the currently-selected backend.
+
+The chat UI uses :func:`run_snippet` to power the per-codeblock Run button
+introduced in the AssistantMessage component, so a user can ask "write a
+hello-world in Python", click Run, and see the output inline without
+leaving GitPilot.  Which sandbox actually executes the snippet (local
+subprocess vs MatrixLab Runner) is controlled by Settings → Sandbox
+Runtime.
+
+Routes are mounted from :mod:`gitpilot.api` via ``app.include_router``.
+"""
+from __future__ import annotations
+
+import asyncio
+import logging
+import os
+import shlex
+import shutil
+import tempfile
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+import httpx
+from fastapi import APIRouter, HTTPException
+from pydantic import BaseModel, Field
+
+from .sandbox import (
+    BACKEND_MATRIXLAB,
+    BACKEND_OFF,
+    BACKEND_SUBPROCESS,
+    DEFAULT_TIMEOUT_SEC,
+    SandboxPolicy,
+    SandboxResult,
+    SandboxUnavailableError,
+    SandboxRunError,
+    MatrixLabSandbox,
+    NullSandbox,
+    SubprocessSandbox,
+)
+from .settings import AppSettings, SandboxSettings, get_settings, update_settings
+
+logger = logging.getLogger(__name__)
+router = APIRouter(prefix="/api/sandbox", tags=["sandbox"])
+
+# How each fenced-code language is launched inside the sandbox.  Anything
+# the user can opt-in to from the Run button has to be listed here; the
+# whitelist keeps random shells (``ruby``, ``perl``, ...) from being
+# silently executed just because an LLM tagged a fence with that name.
+LANGUAGE_RUNNERS: Dict[str, Dict[str, Any]] = {
+    "python": {"suffix": ".py", "argv": ["python3", "{file}"]},
+    "py": {"suffix": ".py", "argv": ["python3", "{file}"]},
+    "javascript": {"suffix": ".js", "argv": ["node", "{file}"]},
+    "js": {"suffix": ".js", "argv": ["node", "{file}"]},
+    "node": {"suffix": ".js", "argv": ["node", "{file}"]},
+    "bash": {"suffix": ".sh", "argv": ["bash", "{file}"]},
+    "sh": {"suffix": ".sh", "argv": ["bash", "{file}"]},
+    "shell": {"suffix": ".sh", "argv": ["bash", "{file}"]},
+}
+
+ALLOWED_BACKENDS = {BACKEND_OFF, BACKEND_SUBPROCESS, BACKEND_MATRIXLAB}
+
+
+# ----------------------------------------------------------------------
+# Request / response models
+# ----------------------------------------------------------------------
+
+class SandboxStatusResponse(BaseModel):
+    backend: str
+    available_backends: list[str]
+    matrixlab_url: str
+    matrixlab_image: str
+    allow_network: bool
+    timeout_sec: int
+    has_token: bool
+    ok: bool
+    error: Optional[str] = None
+    remote: Optional[Dict[str, Any]] = None
+    # Name of the env var currently shadowing the persisted backend
+    # choice, if any.  Used by the Settings panel to render an "env
+    # override" badge so users understand why their UI selection isn't
+    # taking effect.  ``None`` when persistence is authoritative.
+    env_override: Optional[str] = None
+
+
+class SandboxConfigUpdate(BaseModel):
+    backend: Optional[str] = None
+    matrixlab_url: Optional[str] = None
+    matrixlab_token: Optional[str] = None
+    matrixlab_image: Optional[str] = None
+    allow_network: Optional[bool] = None
+    timeout_sec: Optional[int] = Field(default=None, ge=1, le=600)
+
+
+class SandboxRunRequest(BaseModel):
+    language: str
+    code: str
+    timeout_sec: Optional[int] = Field(default=None, ge=1, le=600)
+
+
+class SandboxRunResponse(BaseModel):
+    backend: str
+    language: str
+    command: str
+    exit_code: int
+    stdout: str
+    stderr: str
+    duration_ms: int
+    truncated: bool = False
+    timed_out: bool = False
+    sandbox_id: Optional[str] = None
+
+
+# ----------------------------------------------------------------------
+# Helpers
+# ----------------------------------------------------------------------
+
+def _build_sandbox(cfg: SandboxSettings, *, workspace: Path, timeout: int):
+    """Construct the right sandbox instance from persisted settings.
+
+    Distinct from :func:`gitpilot.sandbox.get_sandbox` because that
+    factory reads ``settings={"tools": {"sandbox": ...}}`` for backwards
+    compatibility with the older shape; here we already have the typed
+    :class:`SandboxSettings` so the indirection isn't needed.
+    """
+    policy = SandboxPolicy(
+        workspace=workspace,
+        timeout_sec=timeout,
+        allow_network=cfg.allow_network,
+        image=cfg.matrixlab_image or None,
+    )
+    backend = (cfg.backend or BACKEND_SUBPROCESS).strip().lower()
+    if backend == BACKEND_OFF:
+        return NullSandbox(policy)
+    if backend == BACKEND_MATRIXLAB:
+        return MatrixLabSandbox(
+            policy,
+            base_url=cfg.matrixlab_url or None,
+            token=cfg.matrixlab_token or None,
+        )
+    return SubprocessSandbox(policy)
+
+
+def _detect_env_override() -> Optional[str]:
+    """Return the env var name currently shadowing the persisted backend
+    choice, or None if persistence wins.  Mirrors the precedence rules
+    in :func:`gitpilot.sandbox._resolve_backend_name` so what we surface
+    in the UI matches what actually executes."""
+    # First match wins: GITPILOT_SANDBOX is checked before the
+    # MatrixLab-specific variables.
+    for name in (
+        "GITPILOT_SANDBOX",
+        "GITPILOT_MATRIXLAB_URL",
+        "GITPILOT_MATRIXLAB_TOKEN",
+        "GITPILOT_MATRIXLAB_IMAGE",
+    ):
+        if os.environ.get(name):
+            return name
+    return None
+
+
+def _status_from(cfg: SandboxSettings, health: Dict[str, Any]) -> SandboxStatusResponse:
+    return SandboxStatusResponse(
+        backend=cfg.backend,
+        available_backends=sorted(ALLOWED_BACKENDS),
+        matrixlab_url=cfg.matrixlab_url,
+        matrixlab_image=cfg.matrixlab_image,
+        allow_network=cfg.allow_network,
+        timeout_sec=cfg.timeout_sec,
+        has_token=bool(cfg.matrixlab_token),
+        ok=bool(health.get("ok")),
+        error=health.get("error"),
+        remote=health.get("remote"),
+        env_override=_detect_env_override(),
+    )
+
+
+# ----------------------------------------------------------------------
+# Endpoints
+# ----------------------------------------------------------------------
+
+@router.get("/status", response_model=SandboxStatusResponse)
+async def api_sandbox_status() -> SandboxStatusResponse:
+    """Report which backend is selected and whether it's reachable."""
+    s: AppSettings = get_settings()
+    cfg = s.sandbox
+    workspace = Path.cwd()
+    sb = _build_sandbox(cfg, workspace=workspace, timeout=cfg.timeout_sec)
+    try:
+        health = await sb.health()
+    finally:
+        # MatrixLabSandbox owns an httpx client; close it so we don't
+        # leak sockets on every status poll from the settings page.
+        aclose = getattr(sb, "aclose", None)
+        if aclose is not None:
+            await aclose()
+    return _status_from(cfg, health)
+
+
+@router.put("/config", response_model=SandboxStatusResponse)
+async def api_sandbox_config(update: SandboxConfigUpdate) -> SandboxStatusResponse:
+    """Persist new sandbox settings and return the resulting status."""
+    if update.backend is not None and update.backend not in ALLOWED_BACKENDS:
+        raise HTTPException(
+            status_code=400,
+            detail=f"unknown sandbox backend {update.backend!r}; "
+                   f"expected one of {sorted(ALLOWED_BACKENDS)}",
+        )
+
+    s: AppSettings = get_settings()
+    merged: Dict[str, Any] = s.sandbox.model_dump()
+    for field, value in update.model_dump(exclude_none=True).items():
+        merged[field] = value
+
+    updated = update_settings({"sandbox": merged})
+    cfg = updated.sandbox
+
+    # Probe the new configuration so the UI can flip its health pill in
+    # one round-trip (mirrors what /status does).
+    sb = _build_sandbox(cfg, workspace=Path.cwd(), timeout=cfg.timeout_sec)
+    try:
+        health = await sb.health()
+    finally:
+        aclose = getattr(sb, "aclose", None)
+        if aclose is not None:
+            await aclose()
+    return _status_from(cfg, health)
+
+
+@router.post("/run", response_model=SandboxRunResponse)
+async def api_sandbox_run(req: SandboxRunRequest) -> SandboxRunResponse:
+    """Execute a fenced-code snippet through the configured sandbox.
+
+    Powers the per-codeblock Run button in AssistantMessage: the chat
+    UI POSTs ``{language, code}`` and renders ``stdout`` / ``stderr`` /
+    ``exit_code`` next to the snippet.  The selected backend (local
+    subprocess vs MatrixLab) is whatever the user picked in Settings.
+    """
+    lang = req.language.strip().lower()
+    spec = LANGUAGE_RUNNERS.get(lang)
+    if spec is None:
+        raise HTTPException(
+            status_code=400,
+            detail=f"language {req.language!r} is not runnable; "
+                   f"allowed: {sorted(set(LANGUAGE_RUNNERS))}",
+        )
+    if not req.code.strip():
+        raise HTTPException(status_code=400, detail="code is empty")
+
+    s: AppSettings = get_settings()
+    cfg = s.sandbox
+    timeout = req.timeout_sec or cfg.timeout_sec or DEFAULT_TIMEOUT_SEC
+
+    # MatrixLab has a purpose-built snippet endpoint (POST /code/run) that
+    # accepts {language, code} directly and dispatches into the right
+    # per-language sandbox image.  When the user selected the matrixlab
+    # backend, route there instead of running the snippet locally and
+    # asking MatrixLab to re-execute the resulting argv via /repo/run —
+    # /code/run is what the Runner is designed to serve for this flow.
+    if (cfg.backend or "").strip().lower() == BACKEND_MATRIXLAB:
+        return await _run_via_matrixlab_code_endpoint(cfg, lang, req.code, timeout)
+
+    # Materialise the snippet in a fresh tempdir so the workspace jail
+    # in SubprocessSandbox has somewhere to point at.  The matrixlab
+    # backend already returned above via /code/run, so only the
+    # remaining backends reach this point; they run the same argv
+    # against the same snippet path.
+    with tempfile.TemporaryDirectory(prefix="gitpilot-run-") as tmp:
+        workspace = Path(tmp)
+        snippet_path = workspace / f"snippet{spec['suffix']}"
+        snippet_path.write_text(req.code, encoding="utf-8")
+        argv = [
+            tok.replace("{file}", str(snippet_path)) for tok in spec["argv"]
+        ]
+        command_str = shlex.join(argv)
+
+        sb = _build_sandbox(cfg, workspace=workspace, timeout=timeout)
+        try:
+            try:
+                result: SandboxResult = await sb.run(
+                    argv, cwd=workspace, timeout=timeout
+                )
+            except SandboxUnavailableError as exc:
+                raise HTTPException(
+                    status_code=503,
+                    detail=f"sandbox backend {cfg.backend!r} is unreachable: {exc}",
+                ) from exc
+            except SandboxRunError as exc:
+                raise HTTPException(
+                    status_code=502,
+                    detail=f"sandbox backend {cfg.backend!r} returned an error: {exc}",
+                ) from exc
+            except PermissionError as exc:
+                raise HTTPException(status_code=400, detail=str(exc)) from exc
+        finally:
+            aclose = getattr(sb, "aclose", None)
+            if aclose is not None:
+                await aclose()
+
+    return SandboxRunResponse(
+        backend=result.backend,
+        language=lang,
+        command=command_str,
+        exit_code=result.exit_code,
+        stdout=result.stdout,
+        stderr=result.stderr,
+        duration_ms=result.duration_ms,
+        truncated=result.truncated,
+        timed_out=result.timed_out,
+        sandbox_id=result.sandbox_id,
+    )
+
+
+# MatrixLab's CodeRunRequest only accepts these literals; aliases ("py",
+# "js", "node", "sh", "shell") get normalised before the call.
+_MATRIXLAB_LANGUAGE = {
+    "python": "python",
+    "py": "python",
+    "javascript": "javascript",
+    "js": "javascript",
+    "node": "javascript",
+    "bash": "bash",
+    "sh": "bash",
+    "shell": "bash",
+}
+
+
+async def _run_via_matrixlab_code_endpoint(
+    cfg: SandboxSettings, lang: str, code: str, timeout: int
+) -> SandboxRunResponse:
+    """POST /code/run on the MatrixLab Runner.
+
+    Direct call (not via :class:`MatrixLabSandbox`) because the snippet
+    endpoint takes ``{language, code}`` rather than the
+    command-with-mounted-workspace shape that ``/repo/run`` expects.
+    """
+    target_lang = _MATRIXLAB_LANGUAGE.get(lang)
+    if target_lang is None:
+        raise HTTPException(
+            status_code=400,
+            detail=f"language {lang!r} is not supported by MatrixLab /code/run",
+        )
+    base_url = (cfg.matrixlab_url or "http://localhost:8000").rstrip("/")
+    headers = {"Content-Type": "application/json"}
+    if cfg.matrixlab_token:
+        headers["Authorization"] = f"Bearer {cfg.matrixlab_token}"
+    body = {
+        "language": target_lang,
+        "code": code,
+        "timeout": timeout,
+        "allow_network": cfg.allow_network,
+    }
+    if cfg.matrixlab_image:
+        body["image"] = cfg.matrixlab_image
+
+    start = time.monotonic()
+    try:
+        async with httpx.AsyncClient(timeout=timeout + 5) as client:
+            resp = await client.post(f"{base_url}/code/run", json=body, headers=headers)
+    except httpx.HTTPError as exc:
+        raise HTTPException(
+            status_code=503,
+            detail=f"sandbox backend 'matrixlab' is unreachable: {exc}",
+        ) from exc
+    duration_ms = int((time.monotonic() - start) * 1000)
+
+    if resp.status_code >= 400:
+        raise HTTPException(
+            status_code=502,
+            detail=f"MatrixLab /code/run returned {resp.status_code}: {resp.text[:400]}",
+        )
+    data = resp.json()
+    return SandboxRunResponse(
+        backend=BACKEND_MATRIXLAB,
+        language=lang,
+        command=f"{target_lang} <snippet>",
+        exit_code=int(data.get("exit_code", -1)),
+        stdout=str(data.get("stdout", "")),
+        stderr=str(data.get("stderr", "")),
+        duration_ms=int(data.get("duration_ms", duration_ms)),
+        truncated=bool(data.get("truncated", False)),
+        timed_out=bool(data.get("timed_out", False)),
+        sandbox_id=data.get("sandbox_id"),
+    )
+
+
+# ----------------------------------------------------------------------
+# MatrixLab lifecycle (install / start) — opt-in via env flag
+# ----------------------------------------------------------------------
+#
+# Lifecycle endpoints shell out to the host (``docker pull``,
+# ``docker run``), so they are gated behind ``GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE=1``.
+# When the gate is off, GET /lifecycle still works — it just reports
+# the inventory and surfaces a clear "operator must enable" message
+# on the action booleans, and POST /install / /start return 403.  This
+# keeps the default GitPilot deployment honest: no shell from a web
+# endpoint unless the operator opted in.
+
+ENV_LIFECYCLE = "GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE"
+# Image the Runner ships under (matches matrixlab/Makefile's
+# $(REGISTRY)/$(DOCKERHUB_NAMESPACE)/matrixlab-runner).  Operator can
+# override via env when running a custom build.
+DEFAULT_RUNNER_IMAGE = os.environ.get(
+    "GITPILOT_MATRIXLAB_RUNNER_IMAGE",
+    "ruslanmv/matrixlab-runner:latest",
+)
+# Sandbox images the Runner spawns per language.  Pulling these at
+# install time means the first /code/run from the chat UI doesn't
+# stall on a multi-hundred-MB image fetch.
+DEFAULT_SANDBOX_IMAGES = [
+    "matrix-lab-sandbox-python:latest",
+    "matrix-lab-sandbox-node:latest",
+    "matrix-lab-sandbox-utils:latest",
+]
+DEFAULT_CONTAINER_NAME = os.environ.get(
+    "GITPILOT_MATRIXLAB_CONTAINER",
+    "gitpilot-matrixlab",
+)
+
+
+class _StepResult(BaseModel):
+    cmd: str
+    exit_code: int
+    stdout: str = ""
+    stderr: str = ""
+    duration_ms: int = 0
+
+
+class MatrixLabLifecycleResponse(BaseModel):
+    docker_available: bool
+    installed: bool
+    running: bool
+    lifecycle_enabled: bool
+    runner_image: str
+    sandbox_images: List[str]
+    container_name: str
+    matrixlab_url: str
+    instructions: Optional[str] = None
+    error: Optional[str] = None
+    steps: List[_StepResult] = Field(default_factory=list)
+
+
+def _lifecycle_enabled() -> bool:
+    return os.environ.get(ENV_LIFECYCLE, "").strip().lower() in {"1", "true", "yes", "on"}
+
+
+async def _run_shell(cmd: List[str], *, timeout: int = 600) -> _StepResult:
+    """Run a host command, capture stdout/stderr, never raise.
+
+    Lifecycle responses always carry the step transcript, even on
+    failure, matching the agent loop's "errors are first-class signals" UX.
+    """
+    start = time.monotonic()
+    try:
+        proc = await asyncio.create_subprocess_exec(
+            *cmd,
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.PIPE,
+        )
+        try:
+            stdout_b, stderr_b = await asyncio.wait_for(proc.communicate(), timeout=timeout)
+        except asyncio.TimeoutError:
+            proc.kill()
+            await proc.wait()  # reap the killed child so it doesn't linger
+            return _StepResult(
+                cmd=shlex.join(cmd),
+                exit_code=-1,
+                stderr=f"timed out after {timeout}s",
+                duration_ms=int((time.monotonic() - start) * 1000),
+            )
+        return _StepResult(
+            cmd=shlex.join(cmd),
+            exit_code=proc.returncode or 0,
+            stdout=stdout_b.decode("utf-8", errors="replace")[:8_000],
+            stderr=stderr_b.decode("utf-8", errors="replace")[:8_000],
+            duration_ms=int((time.monotonic() - start) * 1000),
+        )
+    except OSError as exc:  # missing binary, permission denied, etc.
+        return _StepResult(
+            cmd=shlex.join(cmd),
+            exit_code=-2,
+            stderr=str(exc),
+            duration_ms=int((time.monotonic() - start) * 1000),
+        )
+
+
+def _docker_available() -> bool:
+    return shutil.which("docker") is not None
+
+
+async def _docker_image_present(name: str) -> bool:
+    """True when ``docker images -q <name>`` returns at least one ID."""
+    if not _docker_available():
+        return False
+    step = await _run_shell(["docker", "images", "-q", name], timeout=10)
+    return step.exit_code == 0 and bool(step.stdout.strip())
+
+
+async def _matrixlab_running() -> bool:
+    """Probe the configured Runner URL for a healthy /health response."""
+    cfg = get_settings().sandbox
+    base = (cfg.matrixlab_url or "http://localhost:8000").rstrip("/")
+    try:
+        async with httpx.AsyncClient(timeout=3.0) as client:
+            resp = await client.get(f"{base}/health")
+            return resp.status_code == 200
+    except httpx.HTTPError:
+        return False
+
+
+async def _gather_lifecycle_status(steps: Optional[List[_StepResult]] = None) -> MatrixLabLifecycleResponse:
+    cfg = get_settings().sandbox
+    docker_ok = _docker_available()
+    runner_installed = await _docker_image_present(DEFAULT_RUNNER_IMAGE)
+    running = await _matrixlab_running()
+    enabled = _lifecycle_enabled()
+    instructions: Optional[str] = None
+    if not docker_ok:
+        instructions = (
+            "Docker is not installed or not on PATH on the GitPilot host. "
+            "Install Docker (https://docs.docker.com/get-docker/) before "
+            "the Install / Start buttons can do anything."
+        )
+    elif not enabled:
+        instructions = (
+            "Lifecycle automation is off. To let GitPilot pull and start "
+            f"MatrixLab from the Settings panel set the {ENV_LIFECYCLE}=1 "
+            "environment variable on the GitPilot backend and restart. "
+            "Until then, run 'docker compose up -d' from a MatrixLab "
+            "checkout (https://github.com/agent-matrix/matrixlab) yourself."
+        )
+    return MatrixLabLifecycleResponse(
+        docker_available=docker_ok,
+        installed=runner_installed,
+        running=running,
+        lifecycle_enabled=enabled,
+        runner_image=DEFAULT_RUNNER_IMAGE,
+        sandbox_images=DEFAULT_SANDBOX_IMAGES,
+        container_name=DEFAULT_CONTAINER_NAME,
+        matrixlab_url=cfg.matrixlab_url,
+        instructions=instructions,
+        steps=steps or [],
+    )
+
+
+@router.get("/matrixlab/lifecycle", response_model=MatrixLabLifecycleResponse)
+async def api_matrixlab_lifecycle() -> MatrixLabLifecycleResponse:
+    """Report whether MatrixLab is installed locally and running.
+
+    Used by the Settings panel to decide which button to show:
+    Install (when no runner image present) → Start (image present but
+    URL unreachable) → Running (URL healthy).  Always safe to call —
+    pure inspection, no side effects.
+    """
+    return await _gather_lifecycle_status()
+
+
+@router.post("/matrixlab/install", response_model=MatrixLabLifecycleResponse)
+async def api_matrixlab_install() -> MatrixLabLifecycleResponse:
+    """Pull the MatrixLab runner + sandbox images.
+
+    Gated by GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE.  Each pull is a
+    distinct step in the response so the UI can show a per-image
+    progress strip (and the operator can re-read failures verbatim).
+    """
+    if not _lifecycle_enabled():
+        raise HTTPException(
+            status_code=403,
+            detail=(
+                f"set {ENV_LIFECYCLE}=1 on the GitPilot backend to enable "
+                "the Install button"
+            ),
+        )
+    if not _docker_available():
+        raise HTTPException(status_code=503, detail="docker is not on PATH")
+    steps: List[_StepResult] = []
+    images = [DEFAULT_RUNNER_IMAGE, *DEFAULT_SANDBOX_IMAGES]
+    for image in images:
+        steps.append(await _run_shell(["docker", "pull", image], timeout=900))
+    return await _gather_lifecycle_status(steps=steps)
+
+
+@router.post("/matrixlab/start", response_model=MatrixLabLifecycleResponse)
+async def api_matrixlab_start() -> MatrixLabLifecycleResponse:
+    """Start the MatrixLab runner as a detached container.
+
+    Gated by GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE.  The container name
+    is deterministic (``gitpilot-matrixlab`` by default) so repeated
+    clicks reuse it — ``docker start`` an existing stopped container,
+    or ``docker run`` if it doesn't exist yet.  The Docker socket is
+    bind-mounted so the runner can spawn per-language sandbox
+    containers — matches what 'make run' inside a MatrixLab checkout
+    does.
+    """
+    if not _lifecycle_enabled():
+        raise HTTPException(
+            status_code=403,
+            detail=(
+                f"set {ENV_LIFECYCLE}=1 on the GitPilot backend to enable "
+                "the Start button"
+            ),
+        )
+    if not _docker_available():
+        raise HTTPException(status_code=503, detail="docker is not on PATH")
+
+    steps: List[_StepResult] = []
+    # Determine which port to expose locally — derive it from the
+    # configured matrixlab_url so 'Start' agrees with /status.
+    cfg = get_settings().sandbox
+    port = 8000
+    try:
+        from urllib.parse import urlparse
+
+        parsed = urlparse(cfg.matrixlab_url)
+        if parsed.port:
+            port = int(parsed.port)
+    except Exception:  # noqa: BLE001
+        port = 8000
+
+    # Does a container with the canonical name already exist?
+    inspect = await _run_shell(
+        ["docker", "inspect", "--format", "{{.State.Status}}", DEFAULT_CONTAINER_NAME],
+        timeout=15,
+    )
+    steps.append(inspect)
+    if inspect.exit_code == 0:
+        # Container exists: `docker start` is a no-op when it is already running.
+        steps.append(await _run_shell(["docker", "start", DEFAULT_CONTAINER_NAME], timeout=60))
+    else:
+        run_cmd = [
+            "docker", "run", "-d",
+            "--name", DEFAULT_CONTAINER_NAME,
+            "-p", f"{port}:8000",
+            "-v", "/var/run/docker.sock:/var/run/docker.sock",
+            "--restart", "unless-stopped",
+            DEFAULT_RUNNER_IMAGE,
+        ]
+        steps.append(await _run_shell(run_cmd, timeout=120))
+
+    return await _gather_lifecycle_status(steps=steps)
+
+
+@router.post("/matrixlab/stop", response_model=MatrixLabLifecycleResponse)
+async def api_matrixlab_stop() -> MatrixLabLifecycleResponse:
+    """Stop the GitPilot-managed MatrixLab container.
+
+    Only affects the deterministic ``gitpilot-matrixlab`` container —
+    won't touch containers an operator launched manually.  Gated by
+    the same env flag as install/start.
+    """
+    if not _lifecycle_enabled():
+        raise HTTPException(
+            status_code=403,
+            detail=f"set {ENV_LIFECYCLE}=1 to enable lifecycle actions",
+        )
+    if not _docker_available():
+        raise HTTPException(status_code=503, detail="docker is not on PATH")
+    step = await _run_shell(["docker", "stop", DEFAULT_CONTAINER_NAME], timeout=30)
+    return await _gather_lifecycle_status(steps=[step])
diff --git a/gitpilot/session.py b/gitpilot/session.py
index 6b5f907..6915967 100644
--- a/gitpilot/session.py
+++ b/gitpilot/session.py
@@ -42,6 +42,35 @@ class Checkpoint:
     snapshot_path: str | None = None
 
 
+@dataclass
+class Task:
+    """One AI invocation recorded for the right-sidebar Tasks panel.
+
+    Append-only.  Created with ``status="running"`` at the start of a
+    user-facing operation (Plan or Execute), mutated in place once on
+    completion, and never edited again.  The shape intentionally
+    mirrors what Claude Code surfaces in its tasks list: title + kind
+    + status + duration + token usage.
+
+    Cost / cache / payload size are deferred to a later cut — v1 ships
+    only what GitPilot can compute honestly across every supported
+    provider.
+    """
+    id: str = field(default_factory=lambda: uuid.uuid4().hex)
+    kind: str = "plan"           # plan | execute | (future: explore, code_write…)
+    title: str = ""
+    status: str = "running"      # running | completed | failed
+    started_at: str = field(
+        default_factory=lambda: datetime.now(UTC).isoformat(),
+    )
+    completed_at: str | None = None
+    duration_ms: int | None = None
+    prompt_tokens: int | None = None
+    completion_tokens: int | None = None
+    error: str | None = None
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+
 @dataclass
 class Session:
     id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
@@ -70,6 +99,11 @@ class Session:
     repos: list[dict[str, Any]] = field(default_factory=list)
     active_repo: str | None = None  # full_name of the write-target repo
 
+    # Right-sidebar Tasks panel (Claude-Code-style trace of every AI
+    # invocation in this session).  Append-only.  Backwards-compatible
+    # default for sessions that pre-date this field.
+    tasks: list[Task] = field(default_factory=list)
+
     def add_message(self, role: str, content: str, **meta):
         self.messages.append(Message(role=role, content=content, metadata=meta))
         self.updated_at = datetime.now(UTC).isoformat()
@@ -82,6 +116,9 @@ def from_dict(cls, data: dict[str, Any]) -> Session:
         data = dict(data)  # shallow copy
         data["messages"] = [Message(**m) for m in data.get("messages", [])]
         data["checkpoints"] = [Checkpoint(**c) for c in data.get("checkpoints", [])]
+        # Backwards-compatible: sessions saved before the tasks field
+        # existed simply load with an empty list.
+        data["tasks"] = [Task(**t) for t in data.get("tasks", [])]
 
         # Backwards-compatible migration: populate repos from legacy single-repo
         if not data.get("repos") and data.get("repo_full_name"):
diff --git a/gitpilot/settings.py b/gitpilot/settings.py
index 66cccc5..96252b2 100644
--- a/gitpilot/settings.py
+++ b/gitpilot/settings.py
@@ -64,6 +64,31 @@ class OllaBridgeConfig(BaseModel):
     api_key: str = Field(default="")  # Optional: for authenticated endpoints
 
 
+class SandboxSettings(BaseModel):
+    """Where code/commands generated by GitPilot run.
+
+    ``subprocess`` is the safe local default — host subprocess with a cwd jail,
+    secret-scrubbing, and the destructive-pattern denylist from
+    :mod:`gitpilot.sandbox`.  Switch to ``matrixlab`` to delegate execution to
+    a MatrixLab Runner (containerised, ephemeral, resource-limited) for
+    enterprise-grade isolation.  ``off`` is the pass-through backend
+    (:class:`gitpilot.sandbox.NullSandbox`) — same as subprocess but without
+    the cwd jail; intended for local dev only.
+
+    Persisted fields mirror the env vars the sandbox module already honours
+    (``GITPILOT_SANDBOX``, ``GITPILOT_MATRIXLAB_URL``, ...).  Env vars still
+    take precedence at sandbox-resolution time so deployments can override
+    user settings without touching disk.
+    """
+
+    backend: str = Field(default="subprocess")
+    matrixlab_url: str = Field(default="http://localhost:8000")
+    matrixlab_token: str = Field(default="")
+    matrixlab_image: str = Field(default="")
+    allow_network: bool = Field(default=False)
+    timeout_sec: int = Field(default=120)
+
+
 class AppSettings(BaseModel):
     provider: LLMProvider = Field(default=LLMProvider.ollabridge)
 
@@ -73,6 +98,13 @@ class AppSettings(BaseModel):
     ollama: OllamaConfig = Field(default_factory=OllamaConfig)
     ollabridge: OllaBridgeConfig = Field(default_factory=OllaBridgeConfig)
 
+    # Sandbox runtime for "Run code" actions in the chat UI.  Defaults to a
+    # local subprocess so trying simple snippets works out of the box; switch
+    # to MatrixLab from the Settings modal when an enterprise-grade isolated
+    # runner is needed.  See :class:`SandboxSettings` for the field shape and
+    # :mod:`gitpilot.sandbox` for the resolution precedence.
+    sandbox: SandboxSettings = Field(default_factory=SandboxSettings)
+
     # Lite Mode: optimized for small LLMs (< 7B parameters).
     # Uses simplified prompts, single-agent execution, and pre-fetched context
     # instead of multi-agent pipelines with tool-calling.
@@ -149,6 +181,18 @@ def from_disk(cls) -> AppSettings:
         if os.getenv("GITPILOT_LANGFLOW_PLAN_FLOW_ID"):
             settings.langflow_plan_flow_id = os.getenv("GITPILOT_LANGFLOW_PLAN_FLOW_ID")
 
+        # Sandbox runtime — env always wins (same precedence the runtime
+        # resolution in :mod:`gitpilot.sandbox` already enforces), so an
+        # operator can pin the backend without editing settings.json.
+        if os.getenv("GITPILOT_SANDBOX"):
+            settings.sandbox.backend = os.environ["GITPILOT_SANDBOX"]
+        if os.getenv("GITPILOT_MATRIXLAB_URL"):
+            settings.sandbox.matrixlab_url = os.environ["GITPILOT_MATRIXLAB_URL"]
+        if os.getenv("GITPILOT_MATRIXLAB_TOKEN"):
+            settings.sandbox.matrixlab_token = os.environ["GITPILOT_MATRIXLAB_TOKEN"]
+        if os.getenv("GITPILOT_MATRIXLAB_IMAGE"):
+            settings.sandbox.matrixlab_image = os.environ["GITPILOT_MATRIXLAB_IMAGE"]
+
         # Lite mode may be intentionally controlled by env in CI or deployments.
         env_lite = os.getenv("GITPILOT_LITE_MODE", "").strip().lower()
         if env_lite in ("1", "true", "yes", "on"):
@@ -463,6 +507,10 @@ def update_settings(updates: dict[str, Any]) -> AppSettings:
         merged = _merge_model_config(_settings.ollabridge, updates["ollabridge"])
         _settings.ollabridge = OllaBridgeConfig(**merged)
 
+    if "sandbox" in updates:
+        merged = _merge_model_config(_settings.sandbox, updates["sandbox"])
+        _settings.sandbox = SandboxSettings(**merged)
+
     if "lite_mode" in updates:
         _settings.lite_mode = bool(updates["lite_mode"])
 
diff --git a/gitpilot/task_recorder.py b/gitpilot/task_recorder.py
new file mode 100644
index 0000000..01a6e85
--- /dev/null
+++ b/gitpilot/task_recorder.py
@@ -0,0 +1,118 @@
+"""Right-sidebar Tasks panel — recorder helpers.
+
+Implements the smallest possible contract that lets the chat UI trace
+every user-facing AI invocation (Plan, Execute) the way Claude Code's
+right-pane tasks list does.
+
+Design notes:
+
+* **Append-only.**  Once a task lands in ``Session.tasks`` it is
+  mutated exactly once (on completion) and never edited again.  This
+  matches the audit-trail philosophy of the rest of the session
+  format.
+* **Endpoint-level wrap, not deep in agentic.py.**  ``begin_task`` is
+  called at the start of an endpoint, ``finish_task`` in its finally —
+  the agent stack itself is untouched, so the cut is trivially
+  revertible.
+* **Best-effort persistence.**  A failure to write the task back to
+  disk must never block the user-facing endpoint — the agent already
+  ran, the user already has their result.  We log and move on.
+* **Flag-gated.**  When ``tasks_sidebar`` is off, ``begin_task``
+  returns ``None`` and ``finish_task`` is a no-op so the backend is
+  byte-identical to today.
+"""
+from __future__ import annotations
+
+import logging
+from datetime import UTC, datetime
+from time import perf_counter
+from typing import Optional
+
+from . import flags
+from .session import SessionManager, Task
+
+logger = logging.getLogger(__name__)
+
+FLAG_TASKS_SIDEBAR = "tasks_sidebar"
+
+
+def begin_task(
+    session_mgr: SessionManager,
+    session_id: Optional[str],
+    *,
+    kind: str,
+    title: str,
+) -> Optional[Task]:
+    """Append a ``running`` Task to the session and persist.
+
+    Returns the in-flight Task so the caller can pass it back to
+    :func:`finish_task` later.  Returns ``None`` when recording is
+    disabled or the session can't be loaded — callers must tolerate
+    the absent task gracefully.
+    """
+    if not flags.is_on(FLAG_TASKS_SIDEBAR, default=True):
+        return None
+    if not session_id:
+        return None
+    try:
+        session = session_mgr.load(session_id)
+    except Exception as exc:
+        logger.debug("[tasks] session %s not loadable: %s", session_id, exc)
+        return None
+
+    task = Task(kind=kind, title=title, status="running")
+    # Attach a perf-counter start tick on the in-memory object so
+    # finish_task can compute duration_ms without scanning timestamps.
+    task.metadata["_perf_t0"] = perf_counter()
+    session.tasks.append(task)
+    try:
+        session_mgr.save(session)
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("[tasks] could not persist initial task: %s", exc)
+    return task
+
+
+def finish_task(
+    session_mgr: SessionManager,
+    session_id: Optional[str],
+    task: Optional[Task],
+    *,
+    status: str = "completed",
+    error: Optional[str] = None,
+    prompt_tokens: Optional[int] = None,
+    completion_tokens: Optional[int] = None,
+) -> None:
+    """Mark a previously-begun task as completed/failed and persist.
+
+    No-op when ``task`` is ``None`` (``begin_task`` returns ``None`` if
+    the flag is off, the session id is missing, or the session can't load).
+    """
+    if task is None or not session_id:
+        return
+    t0 = task.metadata.pop("_perf_t0", None)
+    if isinstance(t0, (int, float)):
+        task.duration_ms = int((perf_counter() - t0) * 1000)
+    task.status = status
+    if error is not None:
+        task.error = error[:500]
+    if prompt_tokens is not None:
+        task.prompt_tokens = prompt_tokens
+    if completion_tokens is not None:
+        task.completion_tokens = completion_tokens
+    task.completed_at = datetime.now(UTC).isoformat()
+
+    # Reload the session before saving so we don't clobber any writes
+    # the agent stack made in the meantime (e.g. branch persistence on
+    # execute) — the running-task entry was already there, we only need
+    # to swap its final state in.
+    try:
+        fresh = session_mgr.load(session_id)
+        for i, existing in enumerate(fresh.tasks):
+            if existing.id == task.id:
+                fresh.tasks[i] = task
+                break
+        else:
+            fresh.tasks.append(task)
+        session_mgr.save(fresh)
+    except Exception as exc:  # pragma: no cover - defensive
+        logger.debug("[tasks] could not persist completed task: %s", exc)
diff --git a/mypy.ini b/mypy.ini
index 9904baa..0bcbcf3 100644
--- a/mypy.ini
+++ b/mypy.ini
@@ -46,7 +46,22 @@ files =
     gitpilot/init_wizard.py,
     gitpilot/_deprecation.py,
     gitpilot/plan_guards.py,
-    gitpilot/context_meter.py
+    gitpilot/context_meter.py,
+    gitpilot/task_recorder.py,
+    gitpilot/grep_backend.py,
+    gitpilot/auto_compact.py,
+    gitpilot/explorer_summary.py,
+    gitpilot/repo_map.py,
+    gitpilot/rag/__init__.py,
+    gitpilot/rag/chunker.py,
+    gitpilot/rag/embedder.py,
+    gitpilot/rag/indexer.py,
+    gitpilot/rag/retriever.py,
+    gitpilot/rag/store.py,
+    gitpilot/edit_backend.py,
+    gitpilot/rag_consent.py,
+    gitpilot/query_router.py,
+    gitpilot/agent_prompts.py
 
 # The minimal in-tree skill front-matter parser uses dynamic typing for
 # its returned dict; keep it permissive without weakening the gate.
diff --git a/tests/test_agent_prompts.py b/tests/test_agent_prompts.py
new file mode 100644
index 0000000..ff91b7d
--- /dev/null
+++ b/tests/test_agent_prompts.py
@@ -0,0 +1,243 @@
+"""Tests for the lean-prompt module (Batch B12).
+
+Pin five properties so a future "let me add one more rule" edit can't
+silently regress the small-model context budget:
+
+1. Every prompt is within its declared character budget.
+2. None of the forbidden small-model keywords appear anywhere in the
+   rendered prompts.
+3. The "Known facts" block is in the bottom 300 chars of every
+   render_plan_task() output (last-segment-attention principle).
+4. The intent-routed rule block is the correct one for each intent
+   (create / modify / fix / delete / find / info / unknown / None).
+5. The flag can be toggled without breaking imports.
+"""
+from __future__ import annotations
+
+import pytest
+
+from gitpilot import flags
+from gitpilot.agent_prompts import (
+    CODE_WRITER_BACKSTORY,
+    CODE_WRITER_BACKSTORY_BUDGET,
+    CREATE_FILE_TASK_CHAR_BUDGET,
+    EXPLORER_BACKSTORY,
+    EXPLORER_BACKSTORY_BUDGET,
+    EXPLORER_TASK_CHAR_BUDGET,
+    FLAG_LEAN_PROMPTS,
+    FORBIDDEN_KEYWORDS,
+    MODIFY_FILE_TASK_CHAR_BUDGET,
+    PLAN_TASK_CHAR_BUDGET,
+    PLANNER_BACKSTORY,
+    PLANNER_BACKSTORY_BUDGET,
+    SPECIALIST_BACKSTORIES,
+    SPECIALIST_BACKSTORY_BUDGET,
+    lean_prompts_enabled,
+    render_create_file_task,
+    render_explorer_task,
+    render_modify_file_task,
+    render_plan_task,
+)
+
+
+# ----------------------------------------------------------------------
+# Per-prompt budget enforcement
+# ----------------------------------------------------------------------
+
+SAMPLE_FILE_LIST = ["README.md", "src/main.py", "src/util.py", "tests/test_main.py"]
+
+
+@pytest.mark.parametrize(
+    "intent",
+    ["create", "modify", "fix", "delete", "find", "info", "unknown", None],
+)
+def test_plan_task_within_budget_for_every_intent(intent: str | None) -> None:
+    rendered = render_plan_task(
+        goal="do something specific that takes a few words to describe",
+        repo_full_name="owner/repo-with-a-longish-name",
+        active_ref="some/branch-name-that-is-longer",
+        file_list=SAMPLE_FILE_LIST,
+        intent=intent,
+    )
+    assert len(rendered) <= PLAN_TASK_CHAR_BUDGET, (
+        f"intent={intent}: {len(rendered)} > {PLAN_TASK_CHAR_BUDGET}"
+    )
+
+
+def test_explorer_task_within_budget() -> None:
+    rendered = render_explorer_task(
+        repo_full_name="owner/repo", active_ref="main",
+    )
+    assert len(rendered) <= EXPLORER_TASK_CHAR_BUDGET
+
+
+def test_create_file_task_within_budget() -> None:
+    rendered = render_create_file_task(
+        file_path="src/very/deep/path/file.py",
+        goal="generate a thing",
+        step_description="step that does the thing",
+    )
+    assert len(rendered) <= CREATE_FILE_TASK_CHAR_BUDGET
+
+
+def test_modify_file_task_within_budget() -> None:
+    rendered = render_modify_file_task(
+        file_path="src/util.py",
+        goal="fix the bug",
+        step_description="patch the validator",
+        current_content="def x():\n    return 1\n",
+    )
+    # The content varies; we cap only the framing.  Subtract the
+    # length of the current content to test just the rules + format.
+    framing = len(rendered) - len("def x():\n    return 1\n")
+    assert framing <= MODIFY_FILE_TASK_CHAR_BUDGET
+
+
+def test_backstories_within_budget() -> None:
+    assert len(EXPLORER_BACKSTORY) <= EXPLORER_BACKSTORY_BUDGET
+    assert len(PLANNER_BACKSTORY) <= PLANNER_BACKSTORY_BUDGET
+    assert len(CODE_WRITER_BACKSTORY) <= CODE_WRITER_BACKSTORY_BUDGET
+    for name, body in SPECIALIST_BACKSTORIES.items():
+        assert len(body) <= SPECIALIST_BACKSTORY_BUDGET, name
+
+
+# ----------------------------------------------------------------------
+# Forbidden-keyword scrub
+# ----------------------------------------------------------------------
+
+def _all_rendered_prompts() -> str:
+    """Every prompt the lean module produces, concatenated.  Used as
+    the corpus for forbidden-keyword greps."""
+    parts: list[str] = [
+        EXPLORER_BACKSTORY, PLANNER_BACKSTORY, CODE_WRITER_BACKSTORY,
+        *SPECIALIST_BACKSTORIES.values(),
+        render_explorer_task(repo_full_name="o/r", active_ref="m"),
+        render_create_file_task(file_path="a.py", goal="g", step_description="d"),
+        render_modify_file_task(
+            file_path="a.py", goal="g", step_description="d", current_content="",
+        ),
+    ]
+    for intent in (
+        "create", "modify", "fix", "delete", "find", "info", "unknown", None,
+    ):
+        parts.append(
+            render_plan_task(
+                goal="g", repo_full_name="o/r", active_ref="m",
+                file_list=["x.py"], intent=intent,
+            )
+        )
+    return "\n".join(parts)
+
+
+@pytest.mark.parametrize("keyword", FORBIDDEN_KEYWORDS)
+def test_no_forbidden_keyword_in_any_rendered_prompt(keyword: str) -> None:
+    corpus = _all_rendered_prompts()
+    assert keyword not in corpus, (
+        f"Forbidden keyword {keyword!r} still appears in a rendered prompt"
+    )
+
+
+# ----------------------------------------------------------------------
+# Facts block is at the bottom
+# ----------------------------------------------------------------------
+
+def test_facts_block_lives_near_end_of_plan_task() -> None:
+    """Small models over-weight the final segment of the prompt.  The
+    "Known facts" block must live in the last 300 chars so the
+    file-list ground truth gets that attention."""
+    rendered = render_plan_task(
+        goal="do thing",
+        repo_full_name="o/r",
+        active_ref="main",
+        file_list=["README.md"],
+        intent="create",
+    )
+    tail = rendered[-300:]
+    assert "Known facts:" in tail
+    assert "does NOT exist" in tail
+
+
+# ----------------------------------------------------------------------
+# Intent → rule block routing
+# ----------------------------------------------------------------------
+
+_INTENT_RULE_MARKERS = {
+    "create":  "at least one CREATE",
+    "modify":  "Use MODIFY only",
+    "fix":     "Use MODIFY only",            # fix aliases to modify
+    "delete":  "Use DELETE only",
+    "find":    "Plan READ actions",
+    "info":    "Empty steps is fine",
+    "unknown": "Match the action to what",
+}
+
+
+@pytest.mark.parametrize("intent, marker", list(_INTENT_RULE_MARKERS.items()))
+def test_each_intent_pulls_its_own_rules(intent: str, marker: str) -> None:
+    rendered = render_plan_task(
+        goal="x", repo_full_name="o/r", active_ref="m",
+        file_list=["a.py"], intent=intent,
+    )
+    assert marker in rendered, f"intent={intent} missing marker {marker!r}"
+
+
+def test_no_intent_falls_back_to_unknown_block() -> None:
+    """When intent is None (router skipped / disabled) we don't want a
+    crash — pick the generic rule block."""
+    rendered = render_plan_task(
+        goal="x", repo_full_name="o/r", active_ref="m",
+        file_list=["a.py"], intent=None,
+    )
+    assert _INTENT_RULE_MARKERS["unknown"] in rendered
+
+
+def test_create_intent_does_not_carry_delete_rules() -> None:
+    """The whole point of intent routing — small models stop seeing
+    the deletion rule block when the goal isn't a deletion."""
+    rendered = render_plan_task(
+        goal="g", repo_full_name="o/r", active_ref="m",
+        file_list=["a.py"], intent="create",
+    )
+    assert "Use DELETE only" not in rendered
+    assert "Use MODIFY only" not in rendered
+
+
+# ----------------------------------------------------------------------
+# Flag plumbing
+# ----------------------------------------------------------------------
+
+def test_flag_default_on() -> None:
+    assert lean_prompts_enabled() is True
+
+
+def test_flag_can_be_turned_off() -> None:
+    flags.set_override(FLAG_LEAN_PROMPTS, False)
+    try:
+        assert lean_prompts_enabled() is False
+    finally:
+        flags.clear_override(FLAG_LEAN_PROMPTS)
+
+
+# ----------------------------------------------------------------------
+# Total prompt-stack budget for the canonical failure scenario
+# ----------------------------------------------------------------------
+
+def test_total_planner_stack_under_3k_chars_on_tiny_repo() -> None:
+    """The original llama3:8b failure trace happened with a planner
+    stack of ~4.5k chars (12-15 KB including tool schemas).  After
+    B12 the stack — backstory + task description — fits in 3 KB on
+    a single-file repo, leaving room for the tool-schema preamble
+    inside an 8 k context window."""
+    stack = (
+        PLANNER_BACKSTORY
+        + render_plan_task(
+            goal="create a simple python code about what says the README.md",
+            repo_full_name="INFN-GE/Nuclear-Physics",
+            active_ref="master",
+            file_list=["README.md"],
+            intent="create",
+        )
+    )
+    assert len(stack) < 3000, (
+        f"planner stack is {len(stack)} chars — small-model budget regression"
+    )
diff --git a/tests/test_agent_tools_contract.py b/tests/test_agent_tools_contract.py
new file mode 100644
index 0000000..590f0b8
--- /dev/null
+++ b/tests/test_agent_tools_contract.py
@@ -0,0 +1,50 @@
+"""Regression tests for the default CrewAI repository tool contract."""
+from __future__ import annotations
+
+import ast
+from pathlib import Path
+
+AGENT_TOOLS = Path(__file__).resolve().parents[1] / "gitpilot" / "agent_tools.py"
+
+
+def _module() -> ast.Module:
+    return ast.parse(AGENT_TOOLS.read_text())
+
+
+def _function(name: str) -> ast.FunctionDef:
+    for node in _module().body:
+        if isinstance(node, ast.FunctionDef) and node.name == name:
+            return node
+    raise AssertionError(f"function {name!r} not found")
+
+
+def test_primary_read_tool_keeps_single_argument_schema() -> None:
+    """Keep the common read tool simple for smaller ReAct models."""
+    read_file = _function("read_file")
+
+    assert [arg.arg for arg in read_file.args.args] == ["file_path"]
+    assert read_file.args.defaults == []
+
+
+def test_default_repository_tools_use_stable_explorer_surface() -> None:
+    """The explorer's default tools should match the pre-B1 safe set."""
+    module = _module()
+    assignments = [
+        node
+        for node in module.body
+        if isinstance(node, ast.Assign)
+        and any(
+            isinstance(target, ast.Name) and target.id == "REPOSITORY_TOOLS"
+            for target in node.targets
+        )
+    ]
+    assert assignments, "REPOSITORY_TOOLS assignment not found"
+
+    value = assignments[-1].value
+    assert isinstance(value, ast.List)
+    assert [elt.id for elt in value.elts if isinstance(elt, ast.Name)] == [
+        "list_repository_files",
+        "get_directory_structure",
+        "read_file",
+        "get_repository_summary",
+    ]
diff --git a/tests/test_auto_compact.py b/tests/test_auto_compact.py
new file mode 100644
index 0000000..1c3e6ec
--- /dev/null
+++ b/tests/test_auto_compact.py
@@ -0,0 +1,172 @@
+"""Tests for the auto-compaction hook (Batch B3).
+
+Pin three things:
+
+* below threshold → no-op (we don't fold prematurely)
+* above threshold → fold older non-essential turns into a single
+  summary system message, keep the last N recent turns
+* idempotency — running compaction twice on already-compacted history
+  doesn't fold the summary into another summary
+"""
+from __future__ import annotations
+
+import pytest
+
+from gitpilot import api as api_module
+from gitpilot import flags
+from gitpilot.auto_compact import (
+    COMPACTED_FLAG,
+    DEFAULT_KEEP_RECENT_TURNS,
+    FLAG_AUTO_COMPACT,
+    SUMMARY_LABEL,
+    maybe_compact_session,
+)
+from gitpilot.context_budget import estimate_tokens
+from gitpilot.session import Message
+
+
+def _make_session_with_history(messages: list[tuple[str, str]]):
+    """Build and save a session with the supplied (role, content) pairs."""
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="compact-test"
+    )
+    for role, content in messages:
+        session.messages.append(Message(role=role, content=content))
+    api_module._session_mgr.save(session)
+    return session
+
+
+def test_no_op_below_threshold() -> None:
+    """A handful of short messages must not trigger compaction even on
+    an 8 k window — that would be ridiculous."""
+    session = _make_session_with_history([
+        ("user", "do thing"),
+        ("assistant", "ok"),
+        ("user", "thanks"),
+    ])
+    report = maybe_compact_session(
+        api_module._session_mgr, session.id, context_window=8_192
+    )
+    assert report.compacted is False
+    reloaded = api_module._session_mgr.load(session.id)
+    assert len(reloaded.messages) == 3   # untouched
+
+
+def test_folds_above_threshold_and_preserves_recent_turns() -> None:
+    """A long history must fold older non-essential turns into a
+    single summary, keeping the last N recent messages verbatim."""
+    chunk = "lorem ipsum " * 200   # ~400 tokens per message under heuristic
+    messages = [("user" if i % 2 == 0 else "assistant", chunk) for i in range(20)]
+    session = _make_session_with_history(messages)
+
+    report = maybe_compact_session(
+        api_module._session_mgr, session.id, context_window=8_192
+    )
+    assert report.compacted is True
+    assert report.before_tokens > report.after_tokens
+    assert report.messages_folded >= 1
+
+    reloaded = api_module._session_mgr.load(session.id)
+    # Recent turns preserved exactly.
+    assert len(reloaded.messages) == 1 + DEFAULT_KEEP_RECENT_TURNS
+    summary = reloaded.messages[0]
+    assert summary.role == "system"
+    assert SUMMARY_LABEL in summary.content
+    assert summary.metadata.get(COMPACTED_FLAG) == "1"
+    # Last N recent turns are still verbatim.
+    for m in reloaded.messages[1:]:
+        assert m.content == chunk
+
+
+def test_idempotent_no_recompact_of_summary() -> None:
+    """Running compaction twice must not fold the summary itself."""
+    chunk = "lorem ipsum " * 200
+    messages = [("user" if i % 2 == 0 else "assistant", chunk) for i in range(20)]
+    session = _make_session_with_history(messages)
+
+    first = maybe_compact_session(
+        api_module._session_mgr, session.id, context_window=8_192
+    )
+    assert first.compacted is True
+
+    second = maybe_compact_session(
+        api_module._session_mgr, session.id, context_window=8_192
+    )
+    # Either: stayed below threshold post-fold (compacted=False), OR
+    # found nothing new to fold (compacted=False with that reason).
+    # Either way: no further folding of an already-summary entry.
+    assert second.compacted is False
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.messages[0].metadata.get(COMPACTED_FLAG) == "1"
+
+
+def test_flag_off_is_a_noop() -> None:
+    session = _make_session_with_history([
+        ("user", "x" * 8000) for _ in range(20)
+    ])
+    flags.set_override(FLAG_AUTO_COMPACT, False)
+    try:
+        report = maybe_compact_session(
+            api_module._session_mgr, session.id, context_window=8_192
+        )
+    finally:
+        flags.clear_override(FLAG_AUTO_COMPACT)
+    assert report.compacted is False
+    assert report.reason == "flag off"
+
+
+def test_missing_session_returns_clean_report() -> None:
+    report = maybe_compact_session(
+        api_module._session_mgr, "does-not-exist", context_window=8_192
+    )
+    assert report.compacted is False
+
+
+def test_no_session_id_is_noop() -> None:
+    report = maybe_compact_session(
+        api_module._session_mgr, None, context_window=8_192
+    )
+    assert report.compacted is False
+
+
+def test_reserved_response_lowers_effective_budget() -> None:
+    """A bigger reserved_response should make compaction fire SOONER —
+    less effective budget means the threshold is hit earlier."""
+    chunk = "lorem ipsum " * 100   # ~200 tok each
+    msgs = [("user" if i % 2 == 0 else "assistant", chunk) for i in range(20)]
+
+    s1 = _make_session_with_history(msgs)
+    s2 = _make_session_with_history(msgs)
+
+    # Low reservation: budget is bigger → less likely to fire.
+    r_low = maybe_compact_session(
+        api_module._session_mgr, s1.id, context_window=8_192,
+        reserved_response=0,
+    )
+    # High reservation: budget is tight → much more likely to fire.
+    r_high = maybe_compact_session(
+        api_module._session_mgr, s2.id, context_window=8_192,
+        reserved_response=6_000,
+    )
+    # The high-reservation run must compact at least as aggressively
+    # as the low-reservation one: if the roomy budget (r_low) fired,
+    # the tight budget (r_high) must have fired as well.
+    assert r_high.compacted or not r_low.compacted, (
+        "compaction fired with reserved_response=0 but not with "
+        "reserved_response=6_000"
+    )
+
+
+def test_summary_actually_shrinks_token_total() -> None:
+    """The whole point of compaction is to free budget."""
+    chunk = "lorem ipsum " * 400  # ~800 tok each
+    session = _make_session_with_history([
+        ("user" if i % 2 == 0 else "assistant", chunk) for i in range(20)
+    ])
+    before = sum(estimate_tokens(m.content) for m in session.messages)
+    report = maybe_compact_session(
+        api_module._session_mgr, session.id, context_window=8_192
+    )
+    assert report.compacted is True
+    assert report.after_tokens < report.before_tokens
+    assert report.before_tokens == before
diff --git a/tests/test_chat_plan_friendly_errors.py b/tests/test_chat_plan_friendly_errors.py
index ca1a59d..63425b5 100644
--- a/tests/test_chat_plan_friendly_errors.py
+++ b/tests/test_chat_plan_friendly_errors.py
@@ -40,10 +40,10 @@ def _mount_failing_planners(
     """Replace both planner entry points so we can drive the error path
     deterministically — no LLM calls, no GitHub network."""
 
-    async def _bad_main(goal, repo_full_name, token=None, branch_name=None):
+    async def _bad_main(goal, repo_full_name, token=None, branch_name=None, **_kw):
         raise RuntimeError(main_error)
 
-    async def _bad_lite(goal, repo_full_name, token=None, branch_name=None):
+    async def _bad_lite(goal, repo_full_name, token=None, branch_name=None, **_kw):
         if lite_error is None:
             return {"goal": goal, "summary": "lite ok", "steps": []}
         raise RuntimeError(lite_error)
@@ -140,7 +140,7 @@ def test_unknown_runtime_error_is_wrapped_as_500_with_detail(
 def test_planner_success_passes_through(
     client: TestClient, monkeypatch: pytest.MonkeyPatch,
 ) -> None:
-    async def _ok(goal, repo_full_name, token=None, branch_name=None):
+    async def _ok(goal, repo_full_name, token=None, branch_name=None, **_kw):
         return {"goal": goal, "summary": "real plan", "steps": []}
 
     monkeypatch.setattr(api_module, "generate_plan", _ok)
diff --git a/tests/test_edit_backend.py b/tests/test_edit_backend.py
new file mode 100644
index 0000000..ed86948
--- /dev/null
+++ b/tests/test_edit_backend.py
@@ -0,0 +1,327 @@
+"""Tests for the surgical edit backend (Batch B8).
+
+Pin every safety property the executor relies on:
+
+* ``apply_edit`` refuses ambiguous matches by default (the contract
+  that makes Claude Code's ``Edit`` tool reliable across models).
+* Zero matches → clear error with a stripped-form hint when the only
+  difference is indentation.
+* Identical old/new → refused so the planner can't accidentally
+  commit a no-op masquerading as a fix.
+* Unified diffs apply by *context match*, so stale line numbers
+  don't matter.
+* Multi-hunk diffs track a running offset so the second hunk lands
+  in the right place even after the first changed the file's length.
+* Multi-file diffs (more than one ``diff --git`` header) are
+  refused — single file at a time.
+* Trailing newline state is preserved in both directions.
+* Pathologically big inputs (2 000-line file, one-line edit) apply
+  in well under a second.
+"""
+from __future__ import annotations
+
+import time
+
+import pytest
+
+from gitpilot.edit_backend import (
+    EditError,
+    EditReport,
+    apply_edit,
+    apply_unified_diff,
+)
+
+
+# ----------------------------------------------------------------------
+# apply_edit — happy paths
+# ----------------------------------------------------------------------
+
+def test_apply_edit_single_match() -> None:
+    src = "alpha\nbeta\ngamma\n"
+    new, rpt = apply_edit(src, old_string="beta", new_string="BETA")
+    assert new == "alpha\nBETA\ngamma\n"
+    assert rpt.occurrences_replaced == 1
+    assert rpt.bytes_before == len(src)
+    assert rpt.bytes_after == len(new)
+
+
+def test_apply_edit_multiline_with_indentation() -> None:
+    src = (
+        "def foo():\n"
+        "    x = 1\n"
+        "    y = 2\n"
+        "    return x + y\n"
+    )
+    new, rpt = apply_edit(
+        src,
+        old_string="    x = 1\n    y = 2",
+        new_string="    x = 10\n    y = 20",
+    )
+    assert "x = 10" in new and "y = 20" in new
+    assert rpt.occurrences_replaced == 1
+
+
+def test_apply_edit_deletes_with_empty_new_string() -> None:
+    src = "keep\nremove\nkeep\n"
+    new, rpt = apply_edit(src, old_string="remove\n", new_string="")
+    assert new == "keep\nkeep\n"
+    assert rpt.occurrences_replaced == 1
+
+
+def test_apply_edit_expected_occurrences_n() -> None:
+    src = "a\na\nb\n"
+    new, rpt = apply_edit(src, old_string="a", new_string="X", expected_occurrences=2)
+    assert new == "X\nX\nb\n"
+    assert rpt.occurrences_replaced == 2
+
+
+def test_apply_edit_expected_minus_one_replace_all() -> None:
+    src = "x\nx\nx\nx\n"
+    new, rpt = apply_edit(src, old_string="x", new_string="Y", expected_occurrences=-1)
+    assert new == "Y\nY\nY\nY\n"
+    assert rpt.occurrences_replaced == 4
+
+
+# ----------------------------------------------------------------------
+# apply_edit — refusal paths
+# ----------------------------------------------------------------------
+
+def test_apply_edit_refuses_zero_matches() -> None:
+    with pytest.raises(EditError, match="not found"):
+        apply_edit("a\nb\nc\n", old_string="DOESNOTEXIST", new_string="X")
+
+
+def test_apply_edit_zero_match_hint_on_indentation_mismatch() -> None:
+    """Most common cause of a missing match in Python: the agent
+    copied the line WITH the wrong leading indentation.  The error
+    must hint at that so the agent can recover.
+
+    Here the file indents with 4 spaces but the agent's edit uses a
+    tab: the bare substring isn't in the file, but stripped(old) is.
+    """
+    # File uses spaces for indentation; agent's edit accidentally
+    # used a tab — substring no longer matches, but stripped form does.
+    src = "def foo():\n    return value\n"
+    try:
+        apply_edit(
+            src,
+            old_string="\treturn value",
+            new_string="\treturn new_value",
+        )
+    except EditError as e:
+        assert "indentation" in str(e).lower()
+    else:
+        pytest.fail("expected EditError")
+
+
+def test_apply_edit_refuses_ambiguous_match() -> None:
+    with pytest.raises(EditError, match="occurs 3 time"):
+        apply_edit("a\na\na\nb\n", old_string="a", new_string="X")
+
+
+def test_apply_edit_refuses_identical_old_new() -> None:
+    with pytest.raises(EditError, match="identical"):
+        apply_edit("x\n", old_string="x", new_string="x")
+
+
+def test_apply_edit_refuses_empty_old_string() -> None:
+    with pytest.raises(EditError, match="empty"):
+        apply_edit("x\n", old_string="", new_string="y")
+
+
+def test_apply_edit_unexpected_count_at_expected_2_but_3_present() -> None:
+    with pytest.raises(EditError, match="occurs 3 time"):
+        apply_edit("a\na\na\n", old_string="a", new_string="X", expected_occurrences=2)
+
+
+# ----------------------------------------------------------------------
+# apply_edit — performance on big files
+# ----------------------------------------------------------------------
+
+def test_apply_edit_on_2000_line_file_is_fast() -> None:
+    big = "\n".join(f"line {i}" for i in range(1, 2001)) + "\n"
+    t0 = time.perf_counter()
+    new, rpt = apply_edit(big, old_string="line 1482", new_string="line FIXED")
+    elapsed = time.perf_counter() - t0
+    assert rpt.occurrences_replaced == 1
+    assert "line FIXED" in new
+    assert "line 1481" in new and "line 1483" in new
+    # Must be near-instant; if this ever drifts to seconds the algo
+    # has regressed.
+    assert elapsed < 0.5
+
+
+# ----------------------------------------------------------------------
+# apply_unified_diff — happy paths
+# ----------------------------------------------------------------------
+
+def test_apply_unified_diff_single_hunk() -> None:
+    content = "line A\nline B\nline C\nline D\n"
+    diff = (
+        "@@ -1,3 +1,3 @@\n"
+        " line A\n"
+        "-line B\n"
+        "+line BB\n"
+        " line C\n"
+    )
+    new, rpt = apply_unified_diff(content, diff)
+    assert new == "line A\nline BB\nline C\nline D\n"
+    assert rpt.occurrences_replaced == 1
+
+
+def test_apply_unified_diff_multiple_hunks_with_offset_tracking() -> None:
+    content = (
+        "header\n"
+        "alpha\n"
+        "beta\n"
+        "gamma\n"
+        "delta\n"
+        "epsilon\n"
+        "footer\n"
+    )
+    diff = (
+        "@@ -1,3 +1,4 @@\n"
+        " header\n"
+        " alpha\n"
+        "+inserted-1\n"
+        " beta\n"
+        "@@ -5,3 +6,3 @@\n"
+        " delta\n"
+        "-epsilon\n"
+        "+EPSILON\n"
+        " footer\n"
+    )
+    new, rpt = apply_unified_diff(content, diff)
+    assert "inserted-1" in new
+    assert "EPSILON" in new
+    assert rpt.occurrences_replaced == 2
+
+
+def test_apply_unified_diff_tolerates_stale_line_numbers() -> None:
+    """The whole point of context-match: if some earlier edit moved
+    the target lines, the hunk header's line numbers are wrong, but
+    the context is still correct.  We match by context, not numbers."""
+    content = "line A\nline B\nline C\n"
+    diff = (
+        "@@ -999,2 +999,2 @@\n"
+        " line A\n"
+        "-line B\n"
+        "+line BB\n"
+    )
+    new, _ = apply_unified_diff(content, diff)
+    assert new == "line A\nline BB\nline C\n"
+
+
+def test_apply_unified_diff_picks_nearest_match_when_ambiguous() -> None:
+    """When the context appears twice, we land near the line number
+    the hunk advertised."""
+    content = (
+        "X\n"   # 1
+        "Y\n"   # 2
+        "X\n"   # 3
+        "Y\n"   # 4
+        "X\n"   # 5
+        "Y\n"   # 6
+        "X\n"   # 7
+    )
+    diff = (
+        "@@ -5,3 +5,3 @@\n"
+        " X\n"
+        "-Y\n"
+        "+Z\n"
+        " X\n"
+    )
+    new, _ = apply_unified_diff(content, diff)
+    # The hunk targeted lines 5-7 ("X Y X" near line 5), so the
+    # third occurrence wins.  Lines 5,6,7 become X,Z,X — earlier
+    # occurrences are untouched.
+    assert new.splitlines() == ["X", "Y", "X", "Y", "X", "Z", "X"]
+
+
+def test_apply_unified_diff_with_file_headers_ignored() -> None:
+    """Real-world diffs from git often come with --- / +++ headers
+    before the @@ hunks; we must ignore the preamble."""
+    content = "line A\nline B\n"
+    diff = (
+        "--- a/foo.py\n"
+        "+++ b/foo.py\n"
+        "@@ -1,2 +1,2 @@\n"
+        " line A\n"
+        "-line B\n"
+        "+line BB\n"
+    )
+    new, _ = apply_unified_diff(content, diff)
+    assert new == "line A\nline BB\n"
+
+
+def test_apply_unified_diff_preserves_trailing_newline() -> None:
+    """No-newline-at-EOF state must be preserved."""
+    a = "x\ny\nz"   # no trailing newline
+    diff = "@@ -1,3 +1,3 @@\n x\n-y\n+Y\n z\n"
+    new, _ = apply_unified_diff(a, diff)
+    assert not new.endswith("\n")
+
+
+# ----------------------------------------------------------------------
+# apply_unified_diff — refusal paths
+# ----------------------------------------------------------------------
+
+def test_apply_unified_diff_refuses_missing_context() -> None:
+    content = "alpha\nbeta\n"
+    diff = (
+        "@@ -1,2 +1,2 @@\n"
+        " WRONG_CONTEXT\n"
+        "-beta\n"
+        "+BETA\n"
+    )
+    with pytest.raises(EditError, match="locate the hunk"):
+        apply_unified_diff(content, diff)
+
+
+def test_apply_unified_diff_refuses_empty_diff() -> None:
+    with pytest.raises(EditError, match="empty"):
+        apply_unified_diff("x\n", "")
+
+
+def test_apply_unified_diff_refuses_multi_file_patch() -> None:
+    """Two ``diff --git`` headers → caller must split the patch."""
+    diff = (
+        "diff --git a/x b/x\n"
+        "@@ -1,1 +1,1 @@\n"
+        "-x\n"
+        "+y\n"
+        "diff --git a/y b/y\n"
+        "@@ -1,1 +1,1 @@\n"
+        "-a\n"
+        "+b\n"
+    )
+    with pytest.raises(EditError, match="multi-file"):
+        apply_unified_diff("x\n", diff)
+
+
+def test_apply_unified_diff_refuses_malformed_hunk_line() -> None:
+    diff = (
+        "@@ -1,2 +1,2 @@\n"
+        " alpha\n"
+        "GARBAGE_LINE_NO_PREFIX\n"
+        "-beta\n"
+        "+BETA\n"
+    )
+    with pytest.raises(EditError, match="malformed hunk"):
+        apply_unified_diff("alpha\nbeta\n", diff)
+
+
+def test_apply_unified_diff_refuses_no_hunks() -> None:
+    diff = "--- a/foo\n+++ b/foo\n"
+    with pytest.raises(EditError, match="no @@ hunks"):
+        apply_unified_diff("x\n", diff)
+
+
+# ----------------------------------------------------------------------
+# EditReport shape
+# ----------------------------------------------------------------------
+
+def test_edit_report_is_frozen_dataclass() -> None:
+    rpt = EditReport(occurrences_replaced=1, bytes_before=10, bytes_after=12)
+    with pytest.raises(Exception):
+        rpt.occurrences_replaced = 99  # type: ignore[misc]
diff --git a/tests/test_execute_persists_session_branch.py b/tests/test_execute_persists_session_branch.py
new file mode 100644
index 0000000..cc09d27
--- /dev/null
+++ b/tests/test_execute_persists_session_branch.py
@@ -0,0 +1,219 @@
+"""Regression tests for session-branch persistence on /api/chat/execute.
+
+Bug:
+    A session was created on ``master``, the user approved a plan, the
+    executor created a fresh ``gitpilot-<goal>-<ts>`` branch and pushed
+    to it.  ``Session.branch`` was *never updated* to that new branch —
+    so reopening the session next day jumped back to ``master`` and the
+    user couldn't find their work.
+
+Fix:
+    ``/api/chat/execute`` now accepts an optional ``session_id``.  When
+    supplied AND the executor returns a branch name, the endpoint loads
+    the session, writes the new branch onto both ``session.branch`` and
+    (if present) the matching ``session.repos[i].branch``, and saves.
+
+These tests pin every branch of that contract.
+"""
+from __future__ import annotations
+
+from typing import Any, Iterator
+
+import pytest
+from fastapi.testclient import TestClient
+
+from gitpilot import api as api_module
+
+
+@pytest.fixture()
+def client() -> Iterator[TestClient]:
+    yield TestClient(api_module.app)
+
+
+def _stub_executor(branch: str) -> None:
+    """Replace the real execute path with a deterministic stub that
+    returns a fixed branch and step count."""
+
+    async def _fake_execute(plan, repo_full_name, token=None, branch_name=None, **_kw):
+        return {
+            "status": "completed",
+            "message": "ok",
+            "branch": branch,
+            "executionLog": [{"step": 1, "title": "noop"}],
+        }
+
+    api_module.execute_plan = _fake_execute            # type: ignore[assignment]
+    api_module.execute_plan_lite = _fake_execute       # type: ignore[assignment]
+
+
+def _stub_auth_and_context(monkeypatch: pytest.MonkeyPatch) -> None:
+    """Bypass the GitHub-token + execution_context plumbing so the
+    test doesn't need a network or credentials."""
+    from contextlib import contextmanager
+
+    @contextmanager
+    def _noop_ctx(*_a: Any, **_kw: Any) -> Iterator[None]:
+        yield
+
+    monkeypatch.setattr(api_module, "execution_context", _noop_ctx)
+    monkeypatch.setattr(api_module, "get_github_token", lambda *_a, **_kw: None)
+    monkeypatch.setattr(api_module, "_is_lite_mode_active", lambda: True)
+
+
+def _plan_payload() -> dict:
+    """Minimum PlanResult-shaped dict the endpoint will accept."""
+    return {
+        "goal": "do thing",
+        "summary": "noop",
+        "steps": [],
+    }
+
+
+# ----------------------------------------------------------------------
+# Branch is written through to the session record
+# ----------------------------------------------------------------------
+
+def test_execute_persists_new_branch_on_session(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    _stub_auth_and_context(monkeypatch)
+    _stub_executor(branch="gitpilot-do-thing-123456")
+
+    session = api_module._session_mgr.create(
+        repo_full_name="owner/repo", branch="master", name="branch-persist"
+    )
+
+    resp = client.post(
+        "/api/chat/execute",
+        json={
+            "repo_owner": "owner",
+            "repo_name": "repo",
+            "plan": _plan_payload(),
+            "branch_name": None,
+            "session_id": session.id,
+        },
+    )
+    assert resp.status_code == 200, resp.text
+    assert resp.json()["branch"] == "gitpilot-do-thing-123456"
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.branch == "gitpilot-do-thing-123456", (
+        "session.branch must be updated to the branch the executor wrote to"
+    )
+
+
+def test_execute_updates_matching_repos_entry(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """Multi-repo sessions: repos[i].branch for the matching full_name
+    is updated alongside the legacy session.branch field."""
+    _stub_auth_and_context(monkeypatch)
+    _stub_executor(branch="gitpilot-multi-987654")
+
+    session = api_module._session_mgr.create(
+        repo_full_name="owner/repo", branch="master", name="multi-repo"
+    )
+    session.repos = [
+        {"full_name": "owner/repo", "branch": "master", "mode": "write"},
+        {"full_name": "owner/other", "branch": "trunk", "mode": "read"},
+    ]
+    session.active_repo = "owner/repo"
+    api_module._session_mgr.save(session)
+
+    resp = client.post(
+        "/api/chat/execute",
+        json={
+            "repo_owner": "owner",
+            "repo_name": "repo",
+            "plan": _plan_payload(),
+            "branch_name": None,
+            "session_id": session.id,
+        },
+    )
+    assert resp.status_code == 200, resp.text
+
+    reloaded = api_module._session_mgr.load(session.id)
+    write_entry = next(r for r in reloaded.repos if r["full_name"] == "owner/repo")
+    read_entry = next(r for r in reloaded.repos if r["full_name"] == "owner/other")
+    assert write_entry["branch"] == "gitpilot-multi-987654"
+    # Untouched second repo retains its original branch.
+    assert read_entry["branch"] == "trunk"
+
+
+# ----------------------------------------------------------------------
+# Backwards compatibility: no session_id, no error
+# ----------------------------------------------------------------------
+
+def test_execute_without_session_id_is_backwards_compatible(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    _stub_auth_and_context(monkeypatch)
+    _stub_executor(branch="gitpilot-anon-111111")
+
+    resp = client.post(
+        "/api/chat/execute",
+        json={
+            "repo_owner": "owner",
+            "repo_name": "repo",
+            "plan": _plan_payload(),
+            "branch_name": None,
+        },
+    )
+    assert resp.status_code == 200, resp.text
+    # Result still carries the executor's branch — that's the
+    # frontend's only signal in the legacy path.
+    assert resp.json()["branch"] == "gitpilot-anon-111111"
+
+
+def test_execute_unknown_session_id_does_not_500(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """A stale session id from the frontend cache must not poison the
+    execute result — the user already has their commit published."""
+    _stub_auth_and_context(monkeypatch)
+    _stub_executor(branch="gitpilot-stale-222222")
+
+    resp = client.post(
+        "/api/chat/execute",
+        json={
+            "repo_owner": "owner",
+            "repo_name": "repo",
+            "plan": _plan_payload(),
+            "branch_name": None,
+            "session_id": "does-not-exist",
+        },
+    )
+    assert resp.status_code == 200, resp.text
+
+
+# ----------------------------------------------------------------------
+# Sticky mode: caller already on the session branch
+# ----------------------------------------------------------------------
+
+def test_execute_sticky_mode_keeps_branch(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """When the request specifies branch_name (sticky mode), the
+    session must record the same branch — this is what reopening the
+    session lands on next time."""
+    _stub_auth_and_context(monkeypatch)
+    _stub_executor(branch="gitpilot-sticky-333333")
+
+    session = api_module._session_mgr.create(
+        repo_full_name="owner/repo", branch="master", name="sticky"
+    )
+
+    resp = client.post(
+        "/api/chat/execute",
+        json={
+            "repo_owner": "owner",
+            "repo_name": "repo",
+            "plan": _plan_payload(),
+            "branch_name": "gitpilot-sticky-333333",
+            "session_id": session.id,
+        },
+    )
+    assert resp.status_code == 200, resp.text
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.branch == "gitpilot-sticky-333333"
diff --git a/tests/test_explorer_summary.py b/tests/test_explorer_summary.py
new file mode 100644
index 0000000..9dc12e7
--- /dev/null
+++ b/tests/test_explorer_summary.py
@@ -0,0 +1,159 @@
+"""Tests for the explorer-report compressor (Batch B5).
+
+Pin the contract the planner relies on:
+
+* Header is preserved so the planner's existing prompt template
+  matches.
+* Every file path the explorer listed survives compression OR is
+  replaced by an honest "…N more files" marker.
+* Compression is a strict no-op when:
+    - the flag is off
+    - the report is already under budget
+    - the report is empty
+* On a pathologically large input, the compressed output is always
+  ≤ the budget plus a small slack — never an order of magnitude over.
+"""
+from __future__ import annotations
+
+import pytest
+
+from gitpilot import flags
+from gitpilot.context_budget import estimate_tokens
+from gitpilot.explorer_summary import (
+    DEFAULT_TOKEN_BUDGET,
+    FLAG_SUBAGENT_EXPLORER,
+    MAX_FILES_LISTED,
+    compress_exploration_report,
+)
+
+
+SHORT_REPORT = """\
+REPOSITORY EXPLORATION REPORT
+=============================
+
+Files Found:
+  - README.md
+  - src/main.py
+  - tests/test_main.py
+
+Key Files:
+  - README.md
+
+Directory Structure:
+README.md
+src/main.py
+tests/test_main.py
+
+File Types: md=1, py=2
+"""
+
+
+def _make_large_report(n_files: int) -> str:
+    """Synthesise an explorer report listing N files under src/."""
+    files = [f"  - src/mod_{i:04d}/file_{j}.py" for i in range(n_files // 5) for j in range(5)]
+    return (
+        "REPOSITORY EXPLORATION REPORT\n"
+        "=============================\n"
+        "\n"
+        "Files Found:\n"
+        + "\n".join(files)
+        + "\n\nKey Files:\n  - README.md\n  - src/__init__.py\n"
+    )
+
+
+# ----------------------------------------------------------------------
+# No-op paths
+# ----------------------------------------------------------------------
+
+def test_short_report_passes_through_unchanged() -> None:
+    out, metrics = compress_exploration_report(SHORT_REPORT)
+    assert out == SHORT_REPORT
+    assert metrics.reason == "under budget"
+    assert metrics.compressed_tokens == metrics.original_tokens
+
+
+def test_empty_report_is_returned_as_is() -> None:
+    out, metrics = compress_exploration_report("")
+    assert out == ""
+    assert metrics.reason == "empty report"
+
+
+def test_flag_off_is_a_noop() -> None:
+    big = _make_large_report(500)
+    flags.set_override(FLAG_SUBAGENT_EXPLORER, False)
+    try:
+        out, metrics = compress_exploration_report(big)
+    finally:
+        flags.clear_override(FLAG_SUBAGENT_EXPLORER)
+    assert out == big
+    assert metrics.reason == "flag off"
+
+
+# ----------------------------------------------------------------------
+# Actual compression
+# ----------------------------------------------------------------------
+
+def test_large_report_shrinks_under_budget() -> None:
+    big = _make_large_report(500)
+    out, metrics = compress_exploration_report(big, token_budget=800)
+    # Token count must drop, hard.
+    assert metrics.original_tokens > metrics.compressed_tokens
+    # Budget honoured within a small slack: the target is 800 tokens
+    # and this test tolerates an overshoot of up to 100.
+    assert metrics.compressed_tokens <= 900
+
+
+def test_compressed_report_preserves_header_for_planner() -> None:
+    """The planner template injects the report under a fixed header.
+    Compression must keep that header so the template still works."""
+    big = _make_large_report(500)
+    out, _ = compress_exploration_report(big, token_budget=800)
+    assert "REPOSITORY EXPLORATION REPORT" in out
+    assert "Files Found:" in out
+
+
+def test_compression_caps_files_listed() -> None:
+    big = _make_large_report(500)   # 500 files
+    out, metrics = compress_exploration_report(big, token_budget=800)
+    # Should never list more than the documented hard cap under the
+    # "Files Found:" section (which is what the planner counts).
+    # Filter to lines that look like the synthetic ``mod_XXXX/file_X.py``
+    # entries so the Key Files entry (``src/__init__.py``) doesn't
+    # accidentally inflate the count.
+    listed = [ln for ln in out.splitlines() if "mod_" in ln]
+    assert len(listed) <= MAX_FILES_LISTED
+    assert metrics.truncated is True
+    assert metrics.files_in_original == 500
+    # Honest "N more files" marker present.
+    assert "more files" in out
+
+
+def test_compression_keeps_first_files() -> None:
+    """Order should be preserved when we trim — first N files survive
+    so the planner sees the "earliest discovered" ones (typically the
+    most important: top-level, then breadth-first by directory)."""
+    big = _make_large_report(500)
+    out, _ = compress_exploration_report(big, token_budget=800)
+    # First file in the synthetic list is src/mod_0000/file_0.py
+    assert "src/mod_0000/file_0.py" in out
+
+
+def test_compression_preserves_key_files_section() -> None:
+    """Key Files is the planner's anchor — it must survive even under
+    aggressive compression."""
+    big = _make_large_report(500)
+    out, _ = compress_exploration_report(big, token_budget=400)
+    assert "Key Files:" in out
+
+
+def test_pathological_input_does_not_explode() -> None:
+    """A 10 000-file report should still produce a bounded output.
+    No quadratic blow-ups, no recursion errors."""
+    big = _make_large_report(10_000)
+    out, metrics = compress_exploration_report(big, token_budget=800)
+    assert metrics.compressed_tokens <= 1_200   # generous slack
+    assert "more files" in out
+
+
+def test_default_budget_matches_documented_value() -> None:
+    assert DEFAULT_TOKEN_BUDGET == 800
diff --git a/tests/test_glob_and_windowed_read.py b/tests/test_glob_and_windowed_read.py
new file mode 100644
index 0000000..a672a5b
--- /dev/null
+++ b/tests/test_glob_and_windowed_read.py
@@ -0,0 +1,137 @@
+"""Tests for the Glob / windowed-Read tools (Batch B1).
+
+Pin both:
+
+* The pure-Python ``_glob_match`` semantics — `/`-aware, `**`-as-any.
+  These mirror Claude Code / ripgrep / bash, NOT the looser ``fnmatch``
+  default.
+* The windowed-Read helpers (``_coerce_int`` and the default/max
+  constants) so future refactors don't silently widen them.
+"""
+from __future__ import annotations
+
+import pytest
+
+from gitpilot.agent_tools import (
+    GLOB_DEFAULT_MAX_RESULTS,
+    GLOB_HARD_MAX_RESULTS,
+    READ_DEFAULT_LIMIT,
+    READ_MAX_LIMIT,
+    _coerce_int,
+    _glob_match,
+    _glob_to_regex,
+)
+
+
+PATHS = [
+    "README.md",
+    "LICENSE",
+    "src/main.py",
+    "src/util/io.py",
+    "src/util/__init__.py",
+    "tests/test_main.py",
+    "tests/unit/test_util.py",
+    "docs/intro.md",
+    "docs/guide/setup.md",
+    ".github/workflows/ci.yml",
+]
+
+
+# ----------------------------------------------------------------------
+# Glob semantics
+# ----------------------------------------------------------------------
+
+@pytest.mark.parametrize(
+    "pattern, expected",
+    [
+        ("**/*.py", [
+            "src/main.py",
+            "src/util/__init__.py",
+            "src/util/io.py",
+            "tests/test_main.py",
+            "tests/unit/test_util.py",
+        ]),
+        ("src/**/*.py", [
+            "src/main.py",
+            "src/util/__init__.py",
+            "src/util/io.py",
+        ]),
+        ("**/test_*.py", [
+            "tests/test_main.py",
+            "tests/unit/test_util.py",
+        ]),
+        ("*.md", ["README.md"]),                # top-level only
+        ("**/*.md", ["README.md", "docs/guide/setup.md", "docs/intro.md"]),
+        ("docs/*.md", ["docs/intro.md"]),       # immediate child of docs
+        ("docs/**/*.md", ["docs/guide/setup.md", "docs/intro.md"]),
+        ("README*", ["README.md"]),
+        ("LICENSE", ["LICENSE"]),
+        (".github/**/*.yml", [".github/workflows/ci.yml"]),
+        ("nope/*.py", []),
+    ],
+)
+def test_glob_match_matches_expected_paths(pattern: str, expected: list[str]) -> None:
+    got = sorted(_glob_match(PATHS, pattern))
+    assert got == sorted(expected), f"pattern={pattern!r}"
+
+
+def test_star_does_not_cross_slash() -> None:
+    """`*` must NOT consume `/` — this is the contract that
+    distinguishes shell glob from fnmatch.  ``src/*`` therefore matches
+    only direct children of ``src``, never nested ones."""
+    got = sorted(_glob_match(PATHS, "src/*"))
+    assert got == ["src/main.py"]
+
+
+def test_double_star_crosses_slash() -> None:
+    """`**` must consume any number of segments, including zero."""
+    rx = _glob_to_regex("**/foo.py")
+    assert rx.match("foo.py")          # zero-segment case
+    assert rx.match("src/foo.py")
+    assert rx.match("a/b/c/foo.py")
+    assert not rx.match("foo.txt")
+
+
+def test_character_class_passes_through() -> None:
+    rx = _glob_to_regex("test_[ab].py")
+    assert rx.match("test_a.py")
+    assert rx.match("test_b.py")
+    assert not rx.match("test_c.py")
+
+
+def test_question_mark_matches_single_non_slash() -> None:
+    rx = _glob_to_regex("a?.py")
+    assert rx.match("ab.py")
+    assert not rx.match("a/b.py")
+    assert not rx.match("abc.py")
+
+
+# ----------------------------------------------------------------------
+# Defaults / bounds
+# ----------------------------------------------------------------------
+
+def test_constants_have_sensible_bounds() -> None:
+    assert 1 <= GLOB_DEFAULT_MAX_RESULTS <= GLOB_HARD_MAX_RESULTS
+    assert READ_DEFAULT_LIMIT < READ_MAX_LIMIT
+    assert READ_MAX_LIMIT <= 100_000   # belt-and-braces ceiling
+
+
+# ----------------------------------------------------------------------
+# Coercion (CrewAI passes ints as strings, sometimes as schema dicts)
+# ----------------------------------------------------------------------
+
+@pytest.mark.parametrize(
+    "value, default, expected",
+    [
+        (5, 999, 5),
+        ("5", 999, 5),
+        ("  12 ", 999, 12),
+        (None, 7, 7),
+        ("nope", 7, 7),
+        ({"description": "...", "type": "int"}, 7, 7),
+        (True, 7, 7),         # bool must be ignored, not interpreted as 1
+        (3.7, 7, 3),
+    ],
+)
+def test_coerce_int_handles_crewai_quirks(value, default, expected) -> None:
+    assert _coerce_int(value, default) == expected
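Reviewer note on the glob contract pinned above: `*` and `?` stay inside one path segment, while `**/` spans zero or more segments. The sketch below is one way to get that behaviour. `glob_to_regex_sketch` is a hypothetical name used only for illustration; it is not the project's `_glob_to_regex`, whose implementation is not shown in this diff. Negated classes such as `[!a]` are deliberately left out.

```python
import re


def glob_to_regex_sketch(pattern: str) -> re.Pattern:
    """Translate a shell-style glob into an anchored regex (illustrative only)."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if pattern.startswith("**/", i):
            # `**/` spans zero or more whole directory segments.
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            # A bare `**` matches anything, slashes included.
            out.append(r".*")
            i += 2
        elif ch == "*":
            # A single `*` never crosses a slash.
            out.append(r"[^/]*")
            i += 1
        elif ch == "?":
            # `?` is exactly one non-slash character.
            out.append(r"[^/]")
            i += 1
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                # Unterminated class: treat `[` as a literal.
                out.append(re.escape(ch))
                i += 1
            else:
                # Pass the character class through unchanged.
                out.append(pattern[i:end + 1])
                i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out) + r"\Z")
```

Anchoring with `\Z` and calling `.match()` makes the whole path participate in the match, which is what keeps `src/*` from accepting `src/util/io.py`.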
diff --git a/tests/test_grep_backend.py b/tests/test_grep_backend.py
new file mode 100644
index 0000000..67f39f3
--- /dev/null
+++ b/tests/test_grep_backend.py
@@ -0,0 +1,196 @@
+"""Tests for the Grep backend (Batch B2).
+
+Pin both:
+
+* The in-memory ``grep`` (used by the GitHub-backed agent tool — files
+  arrive as a dict because there's no local checkout).
+* The ``grep_local`` backend, which prefers ripgrep and falls back to
+  a Python walk (used by local-git and folder modes).  The fallback is
+  forced by patching ``shutil.which`` to return ``None``; the ripgrep
+  happy path is only exercised when ``rg`` is available on $PATH.
+"""
+from __future__ import annotations
+
+import shutil
+from pathlib import Path
+
+import pytest
+
+from gitpilot.agent_tools import _glob_to_regex
+from gitpilot.grep_backend import (
+    GREP_DEFAULT_MAX_RESULTS,
+    GREP_HARD_MAX_RESULTS,
+    GrepResult,
+    format_result,
+    grep,
+    grep_local,
+)
+
+
+# ----------------------------------------------------------------------
+# In-memory backend (GitHub mode)
+# ----------------------------------------------------------------------
+
+FILES = {
+    "README.md": "GitPilot README\nSee CONTRIBUTING.md\n",
+    "src/main.py": (
+        "import asyncio\n"
+        "def main():\n"
+        "    print('hello')\n"
+        "\n"
+        "if __name__ == '__main__':\n"
+        "    main()\n"
+    ),
+    "src/util/io.py": (
+        "from typing import Any\n"
+        "def read(path: str) -> str:\n"
+        "    return ''\n"
+    ),
+    "tests/test_main.py": (
+        "from src.main import main\n"
+        "def test_main():\n"
+        "    main()\n"
+    ),
+}
+
+
+def test_grep_finds_basic_token() -> None:
+    r = grep(FILES, "asyncio")
+    assert not r.error
+    assert r.backend == "python"
+    assert len(r.hits) == 1
+    assert r.hits[0].path == "src/main.py"
+    assert r.hits[0].line == 1
+
+
+def test_grep_returns_multiple_hits_sorted() -> None:
+    r = grep(FILES, "def ")
+    # def main (main.py), def read (io.py), def test_main (test_main.py)
+    assert [h.path for h in r.hits] == [
+        "src/main.py",
+        "src/util/io.py",
+        "tests/test_main.py",
+    ]
+
+
+def test_grep_case_insensitive_flag() -> None:
+    assert grep(FILES, "ASYNC").hits == []
+    r = grep(FILES, "ASYNC", case_insensitive=True)
+    assert len(r.hits) == 1
+    assert r.hits[0].path == "src/main.py"
+
+
+def test_grep_invalid_regex_returns_error() -> None:
+    r = grep(FILES, "[unterminated")
+    assert r.error and "invalid regex" in r.error
+    assert r.hits == []
+
+
+def test_grep_path_filter_via_glob() -> None:
+    rx = _glob_to_regex("**/*.py")
+    r = grep(FILES, "import", path_filter=rx)
+    assert r.hits, "filter must not drop the .py fixtures"
+    assert all(h.path.endswith(".py") for h in r.hits)
+    # README.md contains no "import", so the filter is belt-and-braces.
+
+
+def test_grep_truncates_above_cap() -> None:
+    big = {f"f{i}.txt": "needle\n" * 50 for i in range(20)}
+    r = grep(big, "needle", max_results=10)
+    assert r.truncated is True
+    assert len(r.hits) == 10
+
+
+def test_grep_respects_hard_cap_even_when_caller_asks_more() -> None:
+    big = {f"f{i}.txt": "needle\n" * 10 for i in range(GREP_HARD_MAX_RESULTS + 50)}
+    r = grep(big, "needle", max_results=10_000)
+    # capped at GREP_HARD_MAX_RESULTS, regardless of the caller value.
+    assert len(r.hits) == GREP_HARD_MAX_RESULTS
+    assert r.truncated is True
+
+
+def test_grep_empty_pattern_match_is_handled_gracefully() -> None:
+    """A pattern that matches every line should still respect the cap
+    rather than spinning on a multi-MB file."""
+    big = {"big.txt": "x\n" * 10_000}
+    r = grep(big, ".", max_results=5)
+    assert len(r.hits) == 5
+    assert r.truncated is True
+
+
+def test_grep_skips_empty_files_quietly() -> None:
+    r = grep({"empty.txt": "", "main.py": "import os\n"}, "import")
+    assert [h.path for h in r.hits] == ["main.py"]
+
+
+# ----------------------------------------------------------------------
+# Local backend (folder / local-git mode)
+# ----------------------------------------------------------------------
+
+def _seed_local_repo(tmp_path: Path) -> Path:
+    (tmp_path / "README.md").write_text("docs\n")
+    (tmp_path / "src").mkdir()
+    (tmp_path / "src" / "main.py").write_text("import asyncio\nprint('hi')\n")
+    (tmp_path / "src" / "util.py").write_text("def helper():\n    return 1\n")
+    (tmp_path / "tests").mkdir()
+    (tmp_path / "tests" / "test_main.py").write_text("def test_x():\n    assert True\n")
+    return tmp_path
+
+
+def test_grep_local_python_fallback_finds_matches(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    _seed_local_repo(tmp_path)
+    # Force the Python walk regardless of ripgrep availability.
+    monkeypatch.setattr("gitpilot.grep_backend.shutil.which", lambda _: None)
+    r = grep_local(str(tmp_path), "import")
+    assert r.backend == "python"
+    assert any(h.path == "src/main.py" and h.line == 1 for h in r.hits)
+
+
+def test_grep_local_respects_glob_filter(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    _seed_local_repo(tmp_path)
+    monkeypatch.setattr("gitpilot.grep_backend.shutil.which", lambda _: None)
+    r = grep_local(str(tmp_path), "def ", glob_filter="src/**/*.py")
+    paths = {h.path for h in r.hits}
+    assert paths == {"src/util.py"}
+
+
+@pytest.mark.skipif(shutil.which("rg") is None, reason="ripgrep not installed")
+def test_grep_local_uses_ripgrep_when_available(tmp_path: Path) -> None:
+    _seed_local_repo(tmp_path)
+    r = grep_local(str(tmp_path), "import")
+    assert r.backend == "ripgrep"
+    assert any(h.path == "src/main.py" for h in r.hits)
+
+
+# ----------------------------------------------------------------------
+# Formatter
+# ----------------------------------------------------------------------
+
+def test_format_result_no_hits() -> None:
+    out = format_result(GrepResult(hits=[]), pattern="xyz")
+    assert "No matches" in out
+
+
+def test_format_result_with_hits() -> None:
+    r = grep(FILES, "def ")
+    out = format_result(r, pattern="def ")
+    assert "def main" in out
+    assert "src/main.py" in out
+
+
+def test_format_result_emits_truncation_hint() -> None:
+    big = {f"f{i}.txt": "needle\n" for i in range(10)}
+    r = grep(big, "needle", max_results=3)
+    out = format_result(r, pattern="needle")
+    assert "truncated" in out.lower()
+
+
+def test_grep_default_cap_is_reasonable() -> None:
+    """The default cap must be small enough to fit in any model's
+    context, big enough to be useful on realistic repos."""
+    assert 50 <= GREP_DEFAULT_MAX_RESULTS <= 200
+    assert GREP_HARD_MAX_RESULTS >= GREP_DEFAULT_MAX_RESULTS
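Reviewer note on the cap semantics exercised above: the caller's `max_results` is clamped to a hard ceiling, and `truncated` is set only when a further match exists beyond the cap. A minimal sketch of that behaviour follows, assuming a plain line-by-line scan. `grep_sketch`, `Hit`, `Result`, and the `HARD_CAP` value are hypothetical and do not reflect `gitpilot.grep_backend` internals.

```python
import re
from dataclasses import dataclass, field

HARD_CAP = 500  # hypothetical ceiling; the real constant's value is not shown


@dataclass
class Hit:
    path: str
    line: int
    text: str


@dataclass
class Result:
    hits: list = field(default_factory=list)
    truncated: bool = False
    error: str = ""


def grep_sketch(files, pattern, max_results=100, case_insensitive=False):
    """Search an in-memory {path: content} mapping, stopping at the cap."""
    try:
        rx = re.compile(pattern, re.IGNORECASE if case_insensitive else 0)
    except re.error as exc:
        # Report bad patterns as data instead of raising.
        return Result(error=f"invalid regex: {exc}")
    # Clamp the caller's request into [1, HARD_CAP].
    cap = max(1, min(max_results, HARD_CAP))
    result = Result()
    for path in sorted(files):  # sorted paths give a stable hit order
        for lineno, text in enumerate(files[path].splitlines(), start=1):
            if not rx.search(text):
                continue
            if len(result.hits) >= cap:
                # One more match exists than we may return: flag and stop.
                result.truncated = True
                return result
            result.hits.append(Hit(path, lineno, text))
    return result
```

Returning as soon as the first over-cap match is seen is what keeps a match-everything pattern from scanning the rest of a multi-megabyte file.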
diff --git a/tests/test_query_router.py b/tests/test_query_router.py
new file mode 100644
index 0000000..50fa0c3
--- /dev/null
+++ b/tests/test_query_router.py
@@ -0,0 +1,290 @@
+"""Tests for the deterministic query router (Batch B9).
+
+The router is the auto-strategy layer that picks tools / triggers
+RAG before the LLM even runs.  Tests pin:
+
+* Intent classification on representative dev prompts (fix / find /
+  info / create / delete / modify / unknown).
+* Target-file extraction: quoted, bareword, path-with-slash, with
+  / without repo verification.
+* Fuzzy-query detection (natural language vs. symbol-shaped).
+* RAG / auto-index decisions: only fired for read-leaning intents
+  on big enough repos.
+* File-type policy: ``.py`` ⇒ surgical, ``.lock`` ⇒ reject.
+* ``force_no_rag`` overrides the RAG recommendation.
+* Hint rendering survives every combination without crashing.
+"""
+from __future__ import annotations
+
+import pytest
+
+from gitpilot.query_router import (
+    DEFAULT_FUZZY_REPO_SIZE_FOR_RAG,
+    RouterDecision,
+    classify,
+    render_planner_hint,
+)
+
+
+# A medium-size synthetic repo (69 files: 9 named + 60 generated),
+# above the RAG-recommendation threshold.
+REPO = (
+    ["README.md", "pyproject.toml", "src/main.py", "src/auth.py",
+     "src/util.py", "tests/test_main.py", "Dockerfile",
+     "poetry.lock", "package-lock.json"]
+    + [f"src/mod_{i:02d}.py" for i in range(60)]
+)
+
+
+# ----------------------------------------------------------------------
+# Intent classification
+# ----------------------------------------------------------------------
+
+@pytest.mark.parametrize(
+    "goal, expected",
+    [
+        ("Fix the TypeError in src/auth.py", "fix"),
+        ("fix bug in main()", "fix"),
+        ("the app crashes on startup", "fix"),
+        ("a regression happened after the last commit", "fix"),
+
+        ("find calls to authenticate_user", "find"),
+        ("where do we handle login?", "find"),
+        ("search for the cache layer", "find"),
+
+        ("what is this project about", "info"),
+        ("explain how does the planner work", "info"),
+        ("describe the agent pipeline", "info"),
+
+        ("create a new helper for date formatting", "create"),
+        ("add a CLI command for export", "create"),
+        ("generate a Dockerfile for production", "create"),
+
+        ("delete all the .lock files", "delete"),
+        ("remove the unused vendor folder", "delete"),
+
+        ("modify pyproject.toml dependencies", "modify"),
+        ("update README.md installation steps", "modify"),
+        ("rename foo_bar to baz_qux", "modify"),
+        ("refactor the auth module", "modify"),
+
+        ("plonk into snargs", "unknown"),     # nonsense
+        ("", "unknown"),                       # empty
+    ],
+)
+def test_intent_classification(goal: str, expected: str) -> None:
+    d = classify(goal, repo_files=REPO)
+    assert d.intent == expected, f"goal={goal!r}"
+
+
+# ----------------------------------------------------------------------
+# Target-file extraction
+# ----------------------------------------------------------------------
+
+def test_extracts_path_with_slash() -> None:
+    d = classify("Fix bug in src/auth.py", repo_files=REPO)
+    assert d.target_files == ["src/auth.py"]
+
+
+def test_extracts_bareword_file_with_extension() -> None:
+    d = classify("Modify pyproject.toml", repo_files=REPO)
+    assert "pyproject.toml" in d.target_files
+
+
+def test_extracts_quoted_path() -> None:
+    d = classify("Fix bug in `src/util.py`", repo_files=REPO)
+    assert "src/util.py" in d.target_files
+
+
+def test_does_not_invent_files_absent_from_repo() -> None:
+    """A user typo like ``src/aut.py`` must not survive verification."""
+    d = classify("Fix bug in src/aut.py", repo_files=REPO)
+    assert d.target_files == []
+
+
+def test_no_repo_files_keeps_raw_candidates() -> None:
+    """When the caller can't supply the repo file list (offline / first
+    request) we surface the raw candidates so the planner can still
+    use them as hints."""
+    d = classify("Fix bug in src/auth.py", repo_files=None)
+    assert "src/auth.py" in d.target_files
+
+
+# ----------------------------------------------------------------------
+# Fuzzy detection + RAG recommendation
+# ----------------------------------------------------------------------
+
+def test_fuzzy_query_with_big_repo_and_no_index_triggers_auto_index() -> None:
+    d = classify(
+        "where do we handle session tokens and auth refresh",
+        repo_files=REPO,
+        rag_index_exists=False,
+    )
+    assert d.intent == "find"
+    assert d.rag_recommended is True
+    assert d.auto_index_repo is True
+
+
+def test_fuzzy_query_with_existing_index_recommends_rag_but_not_rebuild() -> None:
+    d = classify(
+        "where do we handle session tokens and auth refresh",
+        repo_files=REPO,
+        rag_index_exists=True,
+    )
+    assert d.rag_recommended is True
+    assert d.auto_index_repo is False
+
+
+def test_force_no_rag_suppresses_rag_recommendation() -> None:
+    """The post-Reject retry path sets force_no_rag=True; router
+    must respect it."""
+    d = classify(
+        "where do we handle session tokens",
+        repo_files=REPO,
+        force_no_rag=True,
+    )
+    assert d.rag_recommended is False
+    assert d.auto_index_repo is False
+
+
+def test_symbol_query_uses_grep_not_rag() -> None:
+    """A query that names a real symbol (camelCase / snake_case /
+    SCREAMING_SNAKE_CASE) is exact-search territory, not embeddings."""
+    d = classify("find calls to authenticate_user", repo_files=REPO)
+    assert d.rag_recommended is False
+    assert "Search file contents" in d.tool_priority
+
+
+def test_small_repo_skips_auto_index() -> None:
+    small = ["README.md", "main.py"]
+    d = classify(
+        "where do we handle authentication tokens",
+        repo_files=small,
+    )
+    assert d.repo_too_small_for_rag is True
+    assert d.auto_index_repo is False
+
+
+def test_info_intent_does_not_auto_index() -> None:
+    """'What is this project' is answered from the repo map; no need
+    to spin up the embedding pipeline."""
+    d = classify("what does this project do", repo_files=REPO)
+    assert d.intent == "info"
+    assert d.auto_index_repo is False
+
+
+def test_delete_intent_does_not_auto_index() -> None:
+    d = classify("delete all the unused tests", repo_files=REPO)
+    assert d.intent == "delete"
+    assert d.auto_index_repo is False
+
+
+# ----------------------------------------------------------------------
+# File-type policy
+# ----------------------------------------------------------------------
+
+def test_python_target_marks_surgical_with_indentation_hint() -> None:
+    d = classify("Fix bug in src/auth.py", repo_files=REPO)
+    assert d.edit_strategy == "surgical"
+    assert "indentation-sensitive" in d.file_policy_notes
+
+
+def test_lockfile_target_marks_reject() -> None:
+    """Generated lock files are off-limits — edit the manifest."""
+    d = classify("Modify poetry.lock", repo_files=REPO)
+    assert d.edit_strategy == "reject"
+    assert "lock" in d.file_policy_notes.lower()
+
+
+def test_markdown_target_marks_surgical() -> None:
+    d = classify("Update README.md installation steps", repo_files=REPO)
+    assert d.edit_strategy == "surgical"
+
+
+# ----------------------------------------------------------------------
+# Tool priority
+# ----------------------------------------------------------------------
+
+def test_fix_with_target_recommends_read_then_edit() -> None:
+    d = classify("Fix bug in src/auth.py", repo_files=REPO)
+    assert d.tool_priority[:2] == [
+        "Read file content",
+        "Edit a section of a file",
+    ]
+
+
+def test_fix_without_target_recommends_grep_first() -> None:
+    """No specific file mentioned: a search tool (grep or semantic) goes first."""
+    d = classify("fix authenticate_user bug", repo_files=REPO)
+    assert d.tool_priority[0] in (
+        "Search file contents", "Find code by semantic search",
+    )
+
+
+def test_fuzzy_fix_without_target_prefers_semantic_search() -> None:
+    d = classify(
+        "fix the issue where login tokens expire too quickly",
+        repo_files=REPO,
+    )
+    # Fuzzy + no target → RAG first.
+    assert d.tool_priority[0] == "Find code by semantic search"
+
+
+def test_delete_intent_uses_glob_first() -> None:
+    d = classify("delete all the unused tests", repo_files=REPO)
+    assert d.tool_priority[0] == "Find files matching a pattern"
+
+
+def test_info_intent_recommends_read_only() -> None:
+    d = classify("what does this project do", repo_files=REPO)
+    # No write tools — info queries are read-only.
+    write_tools = {"Edit a section of a file", "Write or update a file in the repository"}
+    assert not (set(d.tool_priority) & write_tools)
+
+
+# ----------------------------------------------------------------------
+# Hint rendering
+# ----------------------------------------------------------------------
+
+def test_render_planner_hint_contains_intent_and_files() -> None:
+    d = classify("Fix bug in src/auth.py", repo_files=REPO)
+    hint = render_planner_hint(d)
+    assert "Intent" in hint and "fix" in hint
+    assert "src/auth.py" in hint
+
+
+def test_render_planner_hint_mentions_index_step_when_auto_index() -> None:
+    d = classify(
+        "where do we handle session tokens",
+        repo_files=REPO,
+        rag_index_exists=False,
+    )
+    hint = render_planner_hint(d)
+    assert "INDEX" in hint  # the planner must know to include the step
+    assert "one-time" in hint
+
+
+def test_render_planner_hint_skips_index_step_when_index_exists() -> None:
+    d = classify(
+        "where do we handle session tokens",
+        repo_files=REPO,
+        rag_index_exists=True,
+    )
+    hint = render_planner_hint(d)
+    assert "INDEX" not in hint
+
+
+def test_router_decision_is_frozen() -> None:
+    d = classify("info", repo_files=REPO)
+    with pytest.raises(Exception):
+        d.intent = "fix"  # type: ignore[misc]
+
+
+# ----------------------------------------------------------------------
+# DEFAULT_FUZZY_REPO_SIZE_FOR_RAG sanity
+# ----------------------------------------------------------------------
+
+def test_threshold_is_sensible() -> None:
+    """The RAG threshold must be high enough to skip toy repos and
+    low enough that mid-size projects benefit."""
+    assert 10 <= DEFAULT_FUZZY_REPO_SIZE_FOR_RAG <= 200
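Reviewer note on the target-file extraction tests above: candidates are pulled from the goal text, then kept only if they appear in the repo listing; with no listing, the raw candidates survive as hints. The sketch below shows one way to get that behaviour. `extract_targets` and `_PATH_TOKEN` are hypothetical names, and the regex is an assumption, not the router's actual pattern.

```python
import re

# A path-like token: optional directories, a stem, a dot, an extension.
_PATH_TOKEN = re.compile(r"[A-Za-z0-9_./-]*[A-Za-z0-9_-]\.[A-Za-z0-9]+")


def extract_targets(goal, repo_files=None):
    """Return file paths mentioned in `goal`, verified against the repo if known."""
    candidates = []
    for token in _PATH_TOKEN.findall(goal):
        if token not in candidates:  # de-duplicate, keep first-seen order
            candidates.append(token)
    if repo_files is None:
        # No listing available: surface raw candidates as hints.
        return candidates
    known = set(repo_files)
    # Drop anything the repo does not contain, e.g. a typo'd path.
    return [c for c in candidates if c in known]
```

Verifying against the listing is what stops a typo such as `src/aut.py` from reaching the planner as if it were a real file.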
diff --git a/tests/test_rag_consent.py b/tests/test_rag_consent.py
new file mode 100644
index 0000000..e2eb9b2
--- /dev/null
+++ b/tests/test_rag_consent.py
@@ -0,0 +1,146 @@
+"""Tests for per-repo RAG consent (Batch B9).
+
+Pin every contract the router and executor rely on:
+
+* Initial state: no consent.
+* Grant writes a readable, well-formed JSON file with a UTC
+  timestamp.  Idempotent.
+* Revoke deletes the consent file *and* wipes every per-branch
+  index directory under the same repo.  Returns ``True`` only when
+  something was actually removed; subsequent revokes return False.
+* Malformed consent files count as "no consent" (fail closed).
+* Paths with unsafe characters are sanitised — owner/repo of the
+  form ``"my org/weird repo"`` cannot escape the consent root.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from gitpilot.rag_consent import (
+    CONSENT_FILE,
+    ConsentRecord,
+    grant_consent,
+    has_consent,
+    load_record,
+    revoke_consent,
+)
+
+
+@pytest.fixture(autouse=True)
+def _isolated_rag_root(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path:
+    monkeypatch.setenv("GITPILOT_RAG_ROOT", str(tmp_path))
+    return tmp_path
+
+
+# ----------------------------------------------------------------------
+# Round-trip
+# ----------------------------------------------------------------------
+
+def test_initial_state_is_no_consent() -> None:
+    assert has_consent("o", "r") is False
+    assert load_record("o", "r") is None
+
+
+def test_grant_creates_well_formed_file(tmp_path: Path) -> None:
+    rec = grant_consent("o", "r", granted_by="alice")
+    assert isinstance(rec, ConsentRecord)
+    assert rec.granted_by == "alice"
+    assert has_consent("o", "r") is True
+    # Verify the on-disk shape so the router can read it from a
+    # different process.
+    cpath = tmp_path / "o" / "r" / CONSENT_FILE
+    raw = json.loads(cpath.read_text(encoding="utf-8"))
+    assert "granted_at" in raw
+    assert raw["granted_by"] == "alice"
+
+
+def test_grant_is_idempotent() -> None:
+    grant_consent("o", "r")
+    grant_consent("o", "r", granted_by="other")
+    # Second call must not error; the record reflects the latest grant.
+    rec = load_record("o", "r")
+    assert rec is not None
+    assert rec.granted_by == "other"
+
+
+def test_grant_without_granted_by_works() -> None:
+    rec = grant_consent("o", "r")
+    assert rec.granted_by is None
+
+
+def test_revoke_returns_true_when_something_existed() -> None:
+    grant_consent("o", "r")
+    assert revoke_consent("o", "r") is True
+    assert has_consent("o", "r") is False
+
+
+def test_revoke_returns_false_when_nothing_to_revoke() -> None:
+    assert revoke_consent("o", "r") is False
+
+
+def test_revoke_wipes_per_branch_index_directories(tmp_path: Path) -> None:
+    """Revoke must remove every <branch>/ directory under the consent
+    root — that's the index data we shouldn't keep once consent is
+    withdrawn."""
+    grant_consent("o", "r")
+    # Simulate a built index on two branches.
+    (tmp_path / "o" / "r" / "main").mkdir(parents=True)
+    (tmp_path / "o" / "r" / "feature_x").mkdir(parents=True)
+    (tmp_path / "o" / "r" / "main" / "data.bin").write_text("x")
+    (tmp_path / "o" / "r" / "feature_x" / "data.bin").write_text("y")
+
+    assert revoke_consent("o", "r") is True
+    assert not (tmp_path / "o" / "r" / "main").exists()
+    assert not (tmp_path / "o" / "r" / "feature_x").exists()
+
+
+# ----------------------------------------------------------------------
+# Failure modes — fail closed
+# ----------------------------------------------------------------------
+
+def test_malformed_consent_file_counts_as_no_consent(tmp_path: Path) -> None:
+    cdir = tmp_path / "o" / "r"
+    cdir.mkdir(parents=True)
+    (cdir / CONSENT_FILE).write_text("not json at all", encoding="utf-8")
+    assert has_consent("o", "r") is False
+    assert load_record("o", "r") is None
+
+
+def test_consent_file_with_wrong_shape_counts_as_no_consent(
+    tmp_path: Path,
+) -> None:
+    cdir = tmp_path / "o" / "r"
+    cdir.mkdir(parents=True)
+    (cdir / CONSENT_FILE).write_text(json.dumps(["wrong", "shape"]),
+                                     encoding="utf-8")
+    assert has_consent("o", "r") is False
+
+
+def test_missing_owner_or_repo_is_no_consent() -> None:
+    assert has_consent("", "r") is False
+    assert has_consent("o", "") is False
+    assert revoke_consent("", "r") is False
+
+
+def test_grant_rejects_empty_args() -> None:
+    with pytest.raises(ValueError):
+        grant_consent("", "r")
+
+
+# ----------------------------------------------------------------------
+# Path sanitisation — owner/repo with unsafe characters
+# ----------------------------------------------------------------------
+
+def test_unsafe_owner_chars_sanitised(tmp_path: Path) -> None:
+    """A malicious-looking owner string must not escape the
+    consent root via ``..`` or absolute paths."""
+    grant_consent("../etc", "r")
+    # The consent file is somewhere under tmp_path, not outside it.
+    matches = list(tmp_path.rglob(CONSENT_FILE))
+    assert matches, "consent file not created under the sanitised root"
+    for m in matches:
+        # Resolved-path check: a string prefix would also accept siblings.
+        assert tmp_path.resolve() in m.resolve().parents
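Reviewer note on the sanitisation test above: the property that matters is that no owner or repo string can introduce a path separator or a `..` segment. A minimal sketch of that idea follows, assuming a replace-everything-unsafe policy. `safe_segment`, `consent_path_sketch`, and the `consent.json` file name are hypothetical; the real `CONSENT_FILE` value and helper names are not shown in this diff.

```python
import re
from pathlib import Path


def safe_segment(value: str) -> str:
    """Collapse a user-supplied name into a single safe path segment."""
    # Anything outside [A-Za-z0-9_-] becomes "_", so "/" and "." vanish.
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", value)
    return cleaned or "_"


def consent_path_sketch(root: Path, owner: str, repo: str) -> Path:
    """Build <root>/<owner>/<repo>/consent.json without escaping `root`."""
    if not owner or not repo:
        raise ValueError("owner and repo are required")
    return root / safe_segment(owner) / safe_segment(repo) / "consent.json"
```

Because dots are replaced as well as slashes, `..` can never survive as a segment, so the result always has `root` among its parents.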
diff --git a/tests/test_rag_pipeline.py b/tests/test_rag_pipeline.py
new file mode 100644
index 0000000..a15bd70
--- /dev/null
+++ b/tests/test_rag_pipeline.py
@@ -0,0 +1,374 @@
+"""Tests for the local RAG pipeline (Batch B7).
+
+All tests use the dependency-free ``HashingEmbedder`` so the suite
+doesn't need to download the 80 MB MiniLM ONNX model.  The
+production path (DefaultEmbedder via ChromaDB's bundled MiniLM) is
+exercised by a single smoke test that's skipped when the binary
+isn't available.
+
+Coverage:
+
+* Chunker: line-window with overlap, binary skip, oversize skip,
+  determinism, empty/whitespace edge cases.
+* HashingEmbedder: deterministic, dimension respected, similar text
+  ranks higher than unrelated text.
+* RagStore: round-trip add → count → query → delete_by_path; the
+  ChromaDB persistence path survives a fresh process restart.
+* Indexer: incremental build (unchanged files skipped), embedder-
+  change triggers full rebuild, deleted files drop from the index.
+* Retriever: top-k respected, MMR diversifies same-file hits,
+  empty index returns [] not error.
+"""
+from __future__ import annotations
+
+import hashlib
+from pathlib import Path
+
+import pytest
+
+from gitpilot.rag import (
+    HashingEmbedder,
+    IndexMeta,
+    build_index_from_files,
+    chunk_file,
+    retrieve_top_k,
+)
+from gitpilot.rag.chunker import (
+    CHUNK_LINES,
+    CHUNK_OVERLAP,
+    MAX_FILE_BYTES,
+    Chunk,
+)
+from gitpilot.rag.embedder import cosine_similarity
+from gitpilot.rag.indexer import build_index_from_files as build_idx
+from gitpilot.rag.store import RagStore, _collection_name, _sanitize
+
+
+# ----------------------------------------------------------------------
+# Chunker
+# ----------------------------------------------------------------------
+
+def test_chunk_file_basic_window_with_overlap() -> None:
+    content = "\n".join(f"line {i}" for i in range(1, 101))   # 100 lines
+    chunks = chunk_file("a.py", content)
+    # 100 lines, 40-line windows, 5-line overlap, step=35: windows
+    # start at 1, 36, 71 (last one extends to line 100).
+    assert len(chunks) == 3
+    starts = [c.start_line for c in chunks]
+    assert starts == [1, 36, 71]
+    ends = [c.end_line for c in chunks]
+    assert ends == [40, 75, 100]
+    # IDs are deterministic and prefixed by a hash of the path.
+    for c in chunks:
+        assert ":" in c.chunk_id
+
+
+def test_chunk_file_skips_binary_content() -> None:
+    binary = "PNG\x00\x00\x00\x00" + ("A" * 1000)
+    assert chunk_file("img.png", binary) == []
+
+
+def test_chunk_file_skips_oversize_content() -> None:
+    huge = "x" * (MAX_FILE_BYTES + 1)
+    assert chunk_file("huge.txt", huge) == []
+
+
+def test_chunk_file_empty_input_returns_empty() -> None:
+    assert chunk_file("empty.py", "") == []
+    assert chunk_file("ws.py", "   \n  \n") != []  # whitespace lines count
+
+
+def test_chunk_file_is_deterministic() -> None:
+    c = "import os\nimport sys\n" * 30
+    a = chunk_file("x.py", c)
+    b = chunk_file("x.py", c)
+    assert [ch.chunk_id for ch in a] == [ch.chunk_id for ch in b]
+    assert [ch.start_line for ch in a] == [ch.start_line for ch in b]
+
+
+def test_chunk_file_respects_default_constants() -> None:
+    assert CHUNK_LINES == 40
+    assert CHUNK_OVERLAP == 5
+
+
+# ----------------------------------------------------------------------
+# HashingEmbedder
+# ----------------------------------------------------------------------
+
+def test_hashing_embedder_dimension() -> None:
+    e = HashingEmbedder()
+    vecs = e(["hello world", "another sentence"])
+    assert len(vecs) == 2
+    assert len(vecs[0]) == e.dim
+
+
+def test_hashing_embedder_deterministic() -> None:
+    e = HashingEmbedder()
+    a = e(["the quick brown fox"])
+    b = e(["the quick brown fox"])
+    assert a == b
+
+
+def test_hashing_embedder_similar_text_ranks_above_unrelated() -> None:
+    e = HashingEmbedder()
+    query = "authentication middleware"
+    related = "user authentication middleware in flask"
+    unrelated = "matplotlib pyplot bar chart"
+    qv, rv, uv = e([query, related, unrelated])
+    sim_related = cosine_similarity(qv, rv)
+    sim_unrelated = cosine_similarity(qv, uv)
+    assert sim_related > sim_unrelated
+
+
+# ----------------------------------------------------------------------
+# Store names / sanitisation
+# ----------------------------------------------------------------------
+
+def test_collection_name_is_chromadb_safe() -> None:
+    name = _collection_name("My Org", "weird/repo", "feature/ai-stuff")
+    # Alphanumeric / underscore / hyphen only.  Starts and ends
+    # alphanumeric.  Within length bounds.
+    assert 3 <= len(name) <= 512
+    assert name[0].isalnum() and name[-1].isalnum()
+    import re
+    assert re.fullmatch(r"[A-Za-z0-9_-]+", name)
+
+
+def test_sanitize_strips_unsafe_chars() -> None:
+    assert _sanitize("foo bar/baz") == "foo_bar_baz"
+
+
+# ----------------------------------------------------------------------
+# Store round-trip
+# ----------------------------------------------------------------------
+
+def test_store_round_trip_add_count_query(tmp_path: Path) -> None:
+    chunks = [
+        Chunk(
+            chunk_id="A:1", path="src/auth.py",
+            start_line=1, end_line=20,
+            text="def authenticate(user, password):\n    return check_credentials(user, password)\n",
+            file_sha="aaa",
+        ),
+        Chunk(
+            chunk_id="B:1", path="src/util.py",
+            start_line=1, end_line=10,
+            text="def helper(x):\n    return x * 2\n",
+            file_sha="bbb",
+        ),
+    ]
+    s = RagStore(
+        owner="o", repo="r", branch="main",
+        embedder=HashingEmbedder(),
+        persist_dir=tmp_path,
+    )
+    n = s.add_chunks(chunks)
+    assert n == 2
+    assert s.count() == 2
+
+    hits = s.query("authentication password", k=2)
+    assert hits, "query returned no hits"
+    # Best hit should be the auth chunk.
+    assert hits[0].path == "src/auth.py"
+
+
+def test_store_delete_by_path_removes_only_that_file(tmp_path: Path) -> None:
+    s = RagStore(
+        owner="o", repo="r", branch="main",
+        embedder=HashingEmbedder(),
+        persist_dir=tmp_path,
+    )
+    chunks = [
+        Chunk(
+            chunk_id="A:1", path="a.py", start_line=1, end_line=5,
+            text="content of a", file_sha="aaa",
+        ),
+        Chunk(
+            chunk_id="B:1", path="b.py", start_line=1, end_line=5,
+            text="content of b", file_sha="bbb",
+        ),
+    ]
+    s.add_chunks(chunks)
+    assert s.count() == 2
+    removed = s.delete_by_path("a.py")
+    assert removed == 1
+    assert s.count() == 1
+
+
+def test_store_query_empty_index_returns_empty(tmp_path: Path) -> None:
+    s = RagStore(
+        owner="o", repo="r", branch="main",
+        embedder=HashingEmbedder(),
+        persist_dir=tmp_path,
+    )
+    assert s.query("anything", k=5) == []
+
+
+# ----------------------------------------------------------------------
+# Indexer
+# ----------------------------------------------------------------------
+
+def test_indexer_first_run_indexes_all_files(tmp_path: Path) -> None:
+    files = [
+        ("src/auth.py", "def authenticate(user, password):\n    return True\n"),
+        ("README.md", "# Project\nThis is about astronomy and stars.\n"),
+    ]
+    report = build_idx(
+        files, owner="o", repo="r", branch="main",
+        embedder=HashingEmbedder(),
+        persist_dir=tmp_path,
+    )
+    assert report.files_seen == 2
+    assert report.files_indexed == 2
+    assert report.files_skipped == 0
+    assert report.chunks_added >= 2
+
+    meta = IndexMeta.load(tmp_path)
+    assert meta is not None
+    assert set(meta.indexed_files.keys()) == {"src/auth.py", "README.md"}
+
+
+def test_indexer_skips_unchanged_files_on_second_run(tmp_path: Path) -> None:
+    files = [
+        ("src/auth.py", "def authenticate(): pass\n"),
+        ("README.md", "# Project\n"),
+    ]
+    emb = HashingEmbedder()
+    build_idx(files, owner="o", repo="r", branch="main", embedder=emb, persist_dir=tmp_path)
+
+    # Re-run with the same content.
+    report = build_idx(
+        files, owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    assert report.files_seen == 2
+    assert report.files_indexed == 0
+    assert report.files_skipped == 2
+
+
+def test_indexer_reindexes_changed_file(tmp_path: Path) -> None:
+    emb = HashingEmbedder()
+    build_idx(
+        [("src/x.py", "old content\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    report = build_idx(
+        [("src/x.py", "new content with totally different words\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    assert report.files_indexed == 1
+    assert report.files_skipped == 0
+
+
+def test_indexer_embedder_change_triggers_full_rebuild(tmp_path: Path) -> None:
+    """If the embedder name changes, all old vectors are incomparable
+    and we must rebuild from scratch."""
+    emb_a = HashingEmbedder()
+    emb_a.name = "fake-A"   # type: ignore[attr-defined]
+    build_idx(
+        [("src/x.py", "hello\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb_a, persist_dir=tmp_path,
+    )
+    emb_b = HashingEmbedder()
+    emb_b.name = "fake-B"   # type: ignore[attr-defined]
+    report = build_idx(
+        [("src/x.py", "hello\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb_b, persist_dir=tmp_path,
+    )
+    # Forced full rebuild → file indexed again even though sha matches.
+    assert report.files_indexed == 1
+    assert report.files_skipped == 0
+
+
+# ----------------------------------------------------------------------
+# Retriever
+# ----------------------------------------------------------------------
+
+def test_retrieve_top_k_finds_relevant_chunk(tmp_path: Path) -> None:
+    emb = HashingEmbedder()
+    build_idx(
+        [
+            ("README.md", "# Nuclear shell model\nA tool for nuclear physics.\n"),
+            ("src/util.py", "def helper(x):\n    return x * 2\n"),
+        ],
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    hits = retrieve_top_k(
+        "nuclear physics shell",
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+        k=2, mmr=False,
+    )
+    assert hits
+    assert hits[0].path == "README.md"
+
+
+def test_retrieve_top_k_respects_k(tmp_path: Path) -> None:
+    emb = HashingEmbedder()
+    files = [
+        (f"src/f_{i}.py", f"def f_{i}(): return {i}\n")
+        for i in range(10)
+    ]
+    build_idx(
+        files, owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    hits = retrieve_top_k(
+        "function definition",
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+        k=3, mmr=False,
+    )
+    assert len(hits) == 3
+
+
+def test_retrieve_top_k_empty_query_returns_empty(tmp_path: Path) -> None:
+    emb = HashingEmbedder()
+    build_idx(
+        [("a.py", "def foo(): pass\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    assert retrieve_top_k(
+        "",
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    ) == []
+
+
+def test_retrieve_top_k_missing_index_returns_empty(tmp_path: Path) -> None:
+    """When the persist dir doesn't exist yet the retriever must
+    silently return [], so agents fall back to grep instead of crashing."""
+    assert retrieve_top_k(
+        "anything",
+        owner="o", repo="r", branch="main",
+        embedder=HashingEmbedder(),
+        persist_dir=tmp_path / "never-built",
+    ) == []
+
+
+# ----------------------------------------------------------------------
+# Determinism across process restarts (ChromaDB persistence)
+# ----------------------------------------------------------------------
+
+def test_index_persists_across_store_recreations(tmp_path: Path) -> None:
+    emb = HashingEmbedder()
+    build_idx(
+        [("src/main.py", "import asyncio\ndef main():\n    pass\n")],
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    # Construct a fresh store object pointing at the same dir.
+    s = RagStore(
+        owner="o", repo="r", branch="main",
+        embedder=emb, persist_dir=tmp_path,
+    )
+    assert s.count() >= 1
+    hits = s.query("import asyncio", k=1)
+    assert hits
+    assert hits[0].path == "src/main.py"
diff --git a/tests/test_repo_map.py b/tests/test_repo_map.py
new file mode 100644
index 0000000..425d704
--- /dev/null
+++ b/tests/test_repo_map.py
@@ -0,0 +1,249 @@
+"""Tests for the auto-generated repo map (Batch B6).
+
+Pin the contract the planner relies on:
+
+* Pure function: ``build_repo_map`` is deterministic and side-effect
+  free.  Same input → identical output, except for ``generated_at``.
+* Key-files heuristic surfaces the well-known anchors (README.md,
+  pyproject.toml, etc.) when they exist.
+* Modules are sorted by file count descending so the planner sees
+  the most-populated dirs first.
+* The rendered AGENTS.md blob honours a token budget — even on a
+  2 000-file synthetic repo the output stays bounded.
+* Round-trip through the on-disk cache reconstructs the same object.
+"""
+from __future__ import annotations
+
+import json
+import os
+from pathlib import Path
+
+import pytest
+
+from gitpilot import flags
+from gitpilot.context_budget import estimate_tokens
+from gitpilot.repo_map import (
+    DEFAULT_MAP_TOKEN_BUDGET,
+    FLAG_REPO_MAP,
+    MAP_MAX_MODULES,
+    MAPS_DIR_ENV,
+    RepoMap,
+    build_repo_map,
+    get_or_build_repo_map,
+    load_cached,
+    save_cached,
+)
+
+
+# ----------------------------------------------------------------------
+# Sample inputs
+# ----------------------------------------------------------------------
+
+SMALL = [
+    "README.md",
+    "pyproject.toml",
+    "src/main.py",
+    "src/util/io.py",
+    "src/util/__init__.py",
+    "tests/test_main.py",
+    "docs/intro.md",
+    "Dockerfile",
+    "LICENSE",
+]
+
+
+def _large_paths(n: int) -> list[str]:
+    paths = ["README.md", "pyproject.toml", "LICENSE"]
+    for i in range(n):
+        mod = f"src/mod_{i % 20:02d}"
+        paths.append(f"{mod}/file_{i:04d}.py")
+    return paths
+
+
+# ----------------------------------------------------------------------
+# Determinism / facts
+# ----------------------------------------------------------------------
+
+def test_build_is_deterministic() -> None:
+    a = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    b = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    # generated_at differs (it's the timestamp); every other field
+    # must match byte-for-byte.
+    a_dict, b_dict = a.to_dict(), b.to_dict()
+    a_dict.pop("generated_at", None)
+    b_dict.pop("generated_at", None)
+    assert a_dict == b_dict
+
+
+def test_well_known_files_promoted_to_key() -> None:
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    assert "README.md" in m.key_files
+    assert "pyproject.toml" in m.key_files
+    assert "Dockerfile" in m.key_files
+    assert "LICENSE" in m.key_files
+
+
+def test_modules_sorted_by_file_count_desc() -> None:
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    counts = [mod.file_count for mod in m.modules]
+    assert counts == sorted(counts, reverse=True)
+
+
+def test_root_files_appear_as_root_pseudo_module() -> None:
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    root_modules = [mod for mod in m.modules if mod.path == "(root)"]
+    assert len(root_modules) == 1
+    # README, pyproject, Dockerfile, LICENSE are all root-level.
+    assert root_modules[0].file_count >= 4
+
+
+def test_languages_histogram_counts_extensions() -> None:
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    assert m.languages.get("py", 0) >= 3
+    assert m.languages.get("md", 0) >= 1
+    assert m.languages.get("toml", 0) == 1
+
+
+def test_total_files_dedupes() -> None:
+    """Duplicate paths must not double-count."""
+    m = build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths=SMALL + SMALL,
+    )
+    assert m.total_files == len(set(SMALL))
+
+
+# ----------------------------------------------------------------------
+# Budget / scaling
+# ----------------------------------------------------------------------
+
+def test_agents_md_under_budget_on_large_repo() -> None:
+    m = build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths=_large_paths(2000),
+        token_budget=500,
+    )
+    assert estimate_tokens(m.agents_md) <= 700  # generous slack
+    assert "Total files" in m.agents_md
+    assert "Modules" in m.agents_md
+
+
+def test_modules_capped_at_hard_max() -> None:
+    """We never want the map to grow unbounded with module count."""
+    paths = [f"mod_{i}/file.py" for i in range(MAP_MAX_MODULES * 3)]
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=paths)
+    assert len(m.modules) <= MAP_MAX_MODULES
+
+
+def test_agents_md_mentions_tools_for_drill_down() -> None:
+    """The map's footer must point the agent to the B1/B2 tools so it
+    knows what to do when it needs more detail than the map shows."""
+    m = build_repo_map(owner="o", repo="r", branch="main", paths=SMALL)
+    assert "Find files matching a pattern" in m.agents_md
+    assert "Search file contents" in m.agents_md
+    assert "Read file content" in m.agents_md
+
+
+# ----------------------------------------------------------------------
+# Persistence
+# ----------------------------------------------------------------------
+
+def test_round_trip_through_cache(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
+    monkeypatch.setenv(MAPS_DIR_ENV, str(tmp_path))
+    a = build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths=SMALL, commit_sha="deadbeef",
+    )
+    save_cached(a)
+    b = load_cached("o", "r", "main")
+    assert b is not None
+    assert b.owner == a.owner
+    assert b.repo == a.repo
+    assert b.branch == a.branch
+    assert b.commit_sha == a.commit_sha
+    assert b.total_files == a.total_files
+    assert b.key_files == a.key_files
+    assert [m.path for m in b.modules] == [m.path for m in a.modules]
+
+
+def test_load_cached_returns_none_for_missing(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
+    monkeypatch.setenv(MAPS_DIR_ENV, str(tmp_path))
+    assert load_cached("nope", "nope", "main") is None
+
+
+# ----------------------------------------------------------------------
+# get_or_build (cache-aware)
+# ----------------------------------------------------------------------
+
+def test_get_or_build_uses_cache_when_sha_matches(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    monkeypatch.setenv(MAPS_DIR_ENV, str(tmp_path))
+
+    calls = {"n": 0}
+    def _provider():
+        calls["n"] += 1
+        return SMALL
+
+    a = get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="abc123",
+    )
+    b = get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="abc123",
+    )
+    assert a.commit_sha == b.commit_sha == "abc123"
+    # Provider should have been called only on the first build.
+    assert calls["n"] == 1
+
+
+def test_get_or_build_refreshes_on_new_sha(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    monkeypatch.setenv(MAPS_DIR_ENV, str(tmp_path))
+
+    calls = {"n": 0}
+    def _provider():
+        calls["n"] += 1
+        return SMALL
+
+    get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="aaa",
+    )
+    get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="bbb",
+    )
+    assert calls["n"] == 2
+
+
+def test_force_rebuilds_even_with_same_sha(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    monkeypatch.setenv(MAPS_DIR_ENV, str(tmp_path))
+
+    calls = {"n": 0}
+    def _provider():
+        calls["n"] += 1
+        return SMALL
+
+    get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="aaa",
+    )
+    get_or_build_repo_map(
+        owner="o", repo="r", branch="main",
+        paths_provider=_provider, commit_sha="aaa",
+        force=True,
+    )
+    assert calls["n"] == 2
+
+
+# ----------------------------------------------------------------------
+# Sanity
+# ----------------------------------------------------------------------
+
+def test_default_budget_is_documented_value() -> None:
+    assert DEFAULT_MAP_TOKEN_BUDGET == 500
diff --git a/tests/test_sandbox_api.py b/tests/test_sandbox_api.py
new file mode 100644
index 0000000..4dae2be
--- /dev/null
+++ b/tests/test_sandbox_api.py
@@ -0,0 +1,228 @@
+"""Tests for the /api/sandbox/* HTTP surface.
+
+These tests pin down the persistence + transparency contract so a
+future change can't silently regress the production behaviour:
+
+- backend persistence: PUT /config writes through, GET /status reads back
+- env override: GITPILOT_SANDBOX is reported in /status.env_override
+- token redaction: GET /settings never returns matrixlab_token
+- snippet execution: POST /run with the default subprocess backend
+  produces stdout the agent can read
+- language whitelist: unknown languages return 400 not 500
+- MatrixLab routing: backend=matrixlab POSTs to /code/run (verified
+  with httpx.MockTransport rather than a live runner)
+- lifecycle gating: mutating endpoints 403 when the env flag is off
+
+These run without Docker, without Ollama, and without MatrixLab — they
+exercise our code path only.
+"""
+from __future__ import annotations
+
+import json
+import os
+from typing import Iterator
+
+import httpx
+import pytest
+from fastapi.testclient import TestClient
+
+from gitpilot.settings import reload_settings
+
+
+@pytest.fixture()
+def client(tmp_path, monkeypatch) -> Iterator[TestClient]:
+    """Spin up the FastAPI app with an isolated config dir so the
+    tests don't read/write the real ~/.gitpilot/settings.json."""
+    # CONFIG_DIR / CONFIG_FILE are module constants captured at import
+    # time from the env, so a plain setenv after import is too late.
+    # Patch the module attributes directly so reload_settings() reads
+    # from our tmp dir.
+    from gitpilot import settings as settings_module
+
+    cfg_dir = tmp_path / "cfg"
+    cfg_dir.mkdir(parents=True, exist_ok=True)
+    monkeypatch.setattr(settings_module, "CONFIG_DIR", cfg_dir)
+    monkeypatch.setattr(settings_module, "CONFIG_FILE", cfg_dir / "settings.json")
+    monkeypatch.setenv("GITPILOT_CONFIG_DIR", str(cfg_dir))
+    # Clear any env var that would override the persisted backend
+    # so each test starts from the on-disk default.
+    for name in (
+        "GITPILOT_SANDBOX",
+        "GITPILOT_MATRIXLAB_URL",
+        "GITPILOT_MATRIXLAB_TOKEN",
+        "GITPILOT_MATRIXLAB_IMAGE",
+        "GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE",
+    ):
+        monkeypatch.delenv(name, raising=False)
+    reload_settings()
+    from gitpilot.api import app
+
+    with TestClient(app) as c:
+        yield c
+    reload_settings()
+
+
+def test_status_defaults_to_subprocess(client: TestClient) -> None:
+    r = client.get("/api/sandbox/status")
+    assert r.status_code == 200
+    data = r.json()
+    assert data["backend"] == "subprocess"
+    assert data["ok"] is True
+    assert data["has_token"] is False
+    assert data["env_override"] is None
+    assert sorted(data["available_backends"]) == ["matrixlab", "off", "subprocess"]
+
+
+def test_config_persists_across_calls(client: TestClient) -> None:
+    r = client.put(
+        "/api/sandbox/config",
+        json={"backend": "matrixlab", "matrixlab_url": "http://matrix.example:9000"},
+    )
+    assert r.status_code == 200
+    data = r.json()
+    assert data["backend"] == "matrixlab"
+    assert data["matrixlab_url"] == "http://matrix.example:9000"
+    # And GET /status now reflects it.
+    r = client.get("/api/sandbox/status")
+    assert r.json()["backend"] == "matrixlab"
+    assert r.json()["matrixlab_url"] == "http://matrix.example:9000"
+
+
+def test_config_rejects_unknown_backend(client: TestClient) -> None:
+    r = client.put("/api/sandbox/config", json={"backend": "docker"})
+    assert r.status_code == 400
+    assert "unknown sandbox backend" in r.json()["detail"]
+
+
+def test_token_redacted_from_settings_response(client: TestClient) -> None:
+    """The secret matrixlab_token must never round-trip back to the
+    browser.  GET /settings returns has_token instead."""
+    client.put(
+        "/api/sandbox/config",
+        json={"backend": "matrixlab", "matrixlab_token": "very-secret"},
+    )
+    r = client.get("/api/settings")
+    sandbox_block = r.json()["sandbox"]
+    assert sandbox_block["has_token"] is True
+    assert "matrixlab_token" not in sandbox_block
+
+
+def test_env_var_surfaces_as_override(client: TestClient, monkeypatch) -> None:
+    """Operators sometimes pin the backend with an env var on the host.
+    The UI must be able to render an "env override" badge so users
+    don't think their UI choice was silently lost."""
+    monkeypatch.setenv("GITPILOT_SANDBOX", "subprocess")
+    reload_settings()  # pick up the env after fixture cleanup
+    r = client.get("/api/sandbox/status")
+    assert r.json()["env_override"] == "GITPILOT_SANDBOX"
+
+
+def test_run_python_via_subprocess_backend(client: TestClient) -> None:
+    """Default backend executes a snippet and surfaces stdout."""
+    r = client.post(
+        "/api/sandbox/run",
+        json={"language": "python", "code": "print('it works'); print(7 + 5)"},
+    )
+    assert r.status_code == 200
+    data = r.json()
+    assert data["backend"] == "subprocess"
+    assert data["exit_code"] == 0
+    assert "it works" in data["stdout"]
+    assert "12" in data["stdout"]
+
+
+def test_run_surfaces_python_traceback(client: TestClient) -> None:
+    """Error retrieval contract: stderr comes back verbatim, exit
+    code non-zero.  This is what the agent's run_in_sandbox tool
+    reads to plan a fix."""
+    r = client.post(
+        "/api/sandbox/run",
+        json={"language": "python", "code": "raise ValueError('boom')"},
+    )
+    assert r.status_code == 200
+    data = r.json()
+    assert data["exit_code"] != 0
+    assert "ValueError" in data["stderr"]
+    assert "boom" in data["stderr"]
+
+
+def test_run_rejects_unknown_language(client: TestClient) -> None:
+    r = client.post(
+        "/api/sandbox/run",
+        json={"language": "ruby", "code": "puts 'hi'"},
+    )
+    assert r.status_code == 400
+    detail = r.json()["detail"]
+    assert "ruby" in detail
+
+
+def test_run_rejects_empty_code(client: TestClient) -> None:
+    r = client.post("/api/sandbox/run", json={"language": "python", "code": "   "})
+    assert r.status_code == 400
+
+
+def test_matrixlab_backend_calls_code_run(client: TestClient, monkeypatch) -> None:
+    """When the user picks matrixlab, GitPilot must POST to /code/run
+    (the snippet endpoint), not /repo/run (which expects a repo_url).
+    Use httpx.MockTransport so we don't need a real runner."""
+    captured: dict = {}
+
+    def handler(request: httpx.Request) -> httpx.Response:
+        captured["url"] = str(request.url)
+        captured["body"] = json.loads(request.content)
+        return httpx.Response(
+            200,
+            json={
+                "sandbox_id": "test-uuid",
+                "exit_code": 0,
+                "stdout": "matrixlab response",
+                "stderr": "",
+                "duration_ms": 123,
+            },
+        )
+
+    mock_transport = httpx.MockTransport(handler)
+    real_async_client = httpx.AsyncClient
+
+    def _patched(*args, **kwargs):
+        return real_async_client(transport=mock_transport, **{k: v for k, v in kwargs.items() if k != "transport"})
+
+    monkeypatch.setattr(httpx, "AsyncClient", _patched)
+
+    client.put("/api/sandbox/config", json={"backend": "matrixlab"})
+    r = client.post(
+        "/api/sandbox/run",
+        json={"language": "py", "code": "print('hi')"},
+    )
+    assert r.status_code == 200, r.text
+    data = r.json()
+    assert data["backend"] == "matrixlab"
+    assert data["stdout"] == "matrixlab response"
+    assert data["sandbox_id"] == "test-uuid"
+    assert captured["url"].endswith("/code/run")
+    # Alias normalisation: "py" → "python" before hitting the runner.
+    assert captured["body"]["language"] == "python"
+    assert captured["body"]["code"] == "print('hi')"
+
+
+def test_lifecycle_mutating_endpoints_gated(client: TestClient) -> None:
+    """Without GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE, install/start/stop
+    must 403 — never silently execute docker on behalf of a browser
+    POST."""
+    for path in ("/api/sandbox/matrixlab/install",
+                 "/api/sandbox/matrixlab/start",
+                 "/api/sandbox/matrixlab/stop"):
+        r = client.post(path)
+        assert r.status_code == 403, f"{path} must require the env flag"
+        assert "GITPILOT_ENABLE_MATRIXLAB_LIFECYCLE" in r.json()["detail"]
+
+
+def test_lifecycle_status_always_safe(client: TestClient) -> None:
+    """GET /lifecycle is read-only; safe to call even when the env
+    flag is off.  Reports lifecycle_enabled=False so the UI can render
+    the right hint."""
+    r = client.get("/api/sandbox/matrixlab/lifecycle")
+    assert r.status_code == 200
+    data = r.json()
+    assert data["lifecycle_enabled"] is False
+    assert "instructions" in data
diff --git a/tests/test_task_recorder.py b/tests/test_task_recorder.py
new file mode 100644
index 0000000..199148b
--- /dev/null
+++ b/tests/test_task_recorder.py
@@ -0,0 +1,291 @@
+"""Tests for the right-sidebar Tasks recorder.
+
+Pin the contract used by ``/api/chat/plan`` and ``/api/chat/execute``
+to record each top-level AI invocation as a Task on the active
+session, and the read endpoint ``/api/sessions/{sid}/tasks`` that
+surfaces them to the chat UI.
+"""
+from __future__ import annotations
+
+from typing import Iterator
+
+import pytest
+from fastapi.testclient import TestClient
+
+from gitpilot import api as api_module
+from gitpilot import flags
+from gitpilot.session import Task
+from gitpilot.task_recorder import (
+    FLAG_TASKS_SIDEBAR,
+    begin_task,
+    finish_task,
+)
+
+
+@pytest.fixture()
+def client() -> Iterator[TestClient]:
+    yield TestClient(api_module.app)
+
+
+# ----------------------------------------------------------------------
+# Recorder primitives
+# ----------------------------------------------------------------------
+
+def test_begin_task_appends_running_entry() -> None:
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="rec-begin"
+    )
+    task = begin_task(api_module._session_mgr, session.id, kind="plan", title="hello")
+    assert task is not None
+    assert task.status == "running"
+    assert task.title == "hello"
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert len(reloaded.tasks) == 1
+    assert reloaded.tasks[0].status == "running"
+
+
+def test_finish_task_marks_completed_and_records_duration() -> None:
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="rec-finish"
+    )
+    task = begin_task(api_module._session_mgr, session.id, kind="plan", title="t")
+    assert task is not None
+    finish_task(
+        api_module._session_mgr,
+        session.id,
+        task,
+        status="completed",
+        prompt_tokens=120,
+        completion_tokens=40,
+    )
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.tasks[0].status == "completed"
+    assert reloaded.tasks[0].completed_at is not None
+    assert reloaded.tasks[0].duration_ms is not None
+    assert reloaded.tasks[0].prompt_tokens == 120
+    assert reloaded.tasks[0].completion_tokens == 40
+
+
+def test_finish_task_failed_path_records_error() -> None:
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="rec-fail"
+    )
+    task = begin_task(api_module._session_mgr, session.id, kind="plan", title="t")
+    finish_task(
+        api_module._session_mgr,
+        session.id,
+        task,
+        status="failed",
+        error="boom: something exploded",
+    )
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.tasks[0].status == "failed"
+    assert reloaded.tasks[0].error is not None
+    assert "boom" in reloaded.tasks[0].error
+
+
+def test_begin_task_no_session_id_is_noop() -> None:
+    """Calling without a session id must not error — older frontends
+    don't send one, and behaviour must stay unchanged for them."""
+    assert begin_task(api_module._session_mgr, None, kind="plan", title="x") is None
+    # finish_task with no task is also a no-op.
+    finish_task(api_module._session_mgr, None, None)
+
+
+def test_begin_task_flag_off_is_noop() -> None:
+    """The kill switch turns off recording entirely."""
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="rec-flag-off"
+    )
+    flags.set_override(FLAG_TASKS_SIDEBAR, False)
+    try:
+        task = begin_task(
+            api_module._session_mgr, session.id, kind="plan", title="t"
+        )
+    finally:
+        flags.clear_override(FLAG_TASKS_SIDEBAR)
+    assert task is None
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.tasks == []
+
+
+def test_finish_task_preserves_concurrent_writes_to_session() -> None:
+    """Real-world race: a task is in flight; another endpoint writes a
+    new branch onto the session; finish_task must not clobber that."""
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="rec-race"
+    )
+    task = begin_task(api_module._session_mgr, session.id, kind="execute", title="t")
+
+    # Simulate a concurrent write (the execute handler persisting the
+    # new branch).
+    concurrent = api_module._session_mgr.load(session.id)
+    concurrent.branch = "gitpilot-foo-123456"
+    api_module._session_mgr.save(concurrent)
+
+    finish_task(api_module._session_mgr, session.id, task, status="completed")
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert reloaded.branch == "gitpilot-foo-123456"  # concurrent write preserved
+    assert reloaded.tasks[0].status == "completed"
+
+
+# ----------------------------------------------------------------------
+# Read endpoint
+# ----------------------------------------------------------------------
+
+def test_tasks_endpoint_returns_documented_shape(
+    client: TestClient,
+) -> None:
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="endpoint-shape"
+    )
+    task = begin_task(
+        api_module._session_mgr, session.id, kind="plan", title="hello world"
+    )
+    finish_task(
+        api_module._session_mgr, session.id, task,
+        status="completed", prompt_tokens=12, completion_tokens=4,
+    )
+
+    r = client.get(f"/api/sessions/{session.id}/tasks")
+    assert r.status_code == 200, r.text
+    body = r.json()
+    assert body["session_id"] == session.id
+    assert len(body["tasks"]) == 1
+    row = body["tasks"][0]
+    for key in (
+        "id", "kind", "title", "status",
+        "started_at", "completed_at", "duration_ms",
+        "prompt_tokens", "completion_tokens", "error",
+    ):
+        assert key in row, f"missing key {key}"
+    assert row["status"] == "completed"
+    assert row["title"] == "hello world"
+
+
+def test_tasks_endpoint_returns_404_for_unknown_session(client: TestClient) -> None:
+    r = client.get("/api/sessions/does-not-exist/tasks")
+    assert r.status_code == 404
+
+
+def test_tasks_endpoint_returns_404_when_flag_off(client: TestClient) -> None:
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="endpoint-flag"
+    )
+    flags.set_override(FLAG_TASKS_SIDEBAR, False)
+    try:
+        r = client.get(f"/api/sessions/{session.id}/tasks")
+    finally:
+        flags.clear_override(FLAG_TASKS_SIDEBAR)
+    assert r.status_code == 404
+
+
+# ----------------------------------------------------------------------
+# Endpoint wiring — Plan + Execute land a task on the session
+# ----------------------------------------------------------------------
+
+def test_chat_plan_records_a_task(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    """End-to-end: POST /api/chat/plan with a session_id must leave a
+    completed Task entry on that session."""
+    from contextlib import contextmanager
+
+    @contextmanager
+    def _noop_ctx(*_a, **_kw):
+        yield
+
+    async def _ok(goal, repo_full_name, token=None, branch_name=None, **_kw):
+        return {"goal": goal, "summary": "ok", "steps": []}
+
+    monkeypatch.setattr(api_module, "execution_context", _noop_ctx)
+    monkeypatch.setattr(api_module, "generate_plan", _ok)
+    monkeypatch.setattr(api_module, "generate_plan_lite", _ok)
+    monkeypatch.setattr(api_module, "_is_lite_mode_active", lambda: False)
+    monkeypatch.setattr(api_module, "get_github_token", lambda *_a, **_kw: None)
+
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="plan-track"
+    )
+
+    r = client.post(
+        "/api/chat/plan",
+        json={
+            "repo_owner": "o",
+            "repo_name": "r",
+            "goal": "Make a thing",
+            "branch_name": "main",
+            "session_id": session.id,
+        },
+    )
+    assert r.status_code == 200, r.text
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert len(reloaded.tasks) == 1
+    assert reloaded.tasks[0].kind == "plan"
+    assert reloaded.tasks[0].status == "completed"
+    assert reloaded.tasks[0].title == "Make a thing"
+
+
+def test_chat_plan_records_failed_task_on_error(
+    client: TestClient, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    """Errors from the planner produce a failed Task entry — the trace
+    must capture failures, not just successes."""
+    from contextlib import contextmanager
+
+    @contextmanager
+    def _noop_ctx(*_a, **_kw):
+        yield
+
+    async def _bad(goal, repo_full_name, token=None, branch_name=None, **_kw):
+        raise RuntimeError("completely unrelated boom")
+
+    monkeypatch.setattr(api_module, "execution_context", _noop_ctx)
+    monkeypatch.setattr(api_module, "generate_plan", _bad)
+    monkeypatch.setattr(api_module, "generate_plan_lite", _bad)
+    monkeypatch.setattr(api_module, "_is_lite_mode_active", lambda: False)
+    monkeypatch.setattr(api_module, "get_github_token", lambda *_a, **_kw: None)
+
+    session = api_module._session_mgr.create(
+        repo_full_name="o/r", branch="main", name="plan-fail-track"
+    )
+
+    r = client.post(
+        "/api/chat/plan",
+        json={
+            "repo_owner": "o",
+            "repo_name": "r",
+            "goal": "Make it fail",
+            "session_id": session.id,
+        },
+    )
+    assert r.status_code >= 400
+
+    reloaded = api_module._session_mgr.load(session.id)
+    assert len(reloaded.tasks) == 1
+    assert reloaded.tasks[0].status == "failed"
+    assert reloaded.tasks[0].error is not None
+
+
+# ----------------------------------------------------------------------
+# Session backwards-compatibility
+# ----------------------------------------------------------------------
+
+def test_session_loads_without_tasks_field(tmp_path) -> None:
+    """Session files written before this feature must still load."""
+    from gitpilot.session import Session
+
+    raw = {
+        "id": "abc123",
+        "messages": [],
+        "checkpoints": [],
+        "repos": [],
+        # "tasks" key deliberately omitted, as in pre-feature session files.
+    }
+    session = Session.from_dict(raw)
+    assert session.tasks == []