diff --git a/.specs/IMPLEMENTATION-PRIORITY-PLAN.md b/.specs/IMPLEMENTATION-PRIORITY-PLAN.md new file mode 100644 index 00000000..f870bba8 --- /dev/null +++ b/.specs/IMPLEMENTATION-PRIORITY-PLAN.md @@ -0,0 +1,324 @@ +# Implementation Priority Plan + +> **Created**: 2026-06-03 +> **Scope**: Review of `.specs/` (171 files) with recommended implementation order +> **Method**: Value × readiness × leverage on the agent reasoning loop + +## How priorities were chosen + +Each candidate was scored on four axes: + +| Axis | Question | +|------|----------| +| **Pain** | Does lack of this cause lost work, bad decisions, or trust failures? | +| **Leverage** | Does shipping this unlock or amplify other specs? | +| **Readiness** | Is the spec decision-complete, with clear files and acceptance criteria? | +| **Fit** | Does it align with what is already in the repo (Code Mode, Supabase, web app)? | + +Specs marked **research-only** (e.g. `agent-governance-substrate/`) were excluded from implementation tracks unless the user explicitly reopens them through HDD. + +Specs for **Letta/DGM** (`letta-specific/`) were deprioritized: valuable for a different product surface, not the hosted Thoughtbox MCP + web app spine. + +--- + +## Landscape snapshot + +### Already substantial in code (do not re-plan from zero) + +| Area | Evidence | Spec status | +|------|----------|-------------| +| **Code Mode** (`thoughtbox_search` + `thoughtbox_execute`) | `src/code-mode/*`, tests | `SPEC-CORE-002`, `code-mode/target-state.md` — alignment mostly done; progressive disclosure removed | +| **LangSmith evaluation stack** | `src/evaluation/*`, `initEvaluation()` in `src/index.ts` | `SPEC-EVAL-001` — Layers 1–5 implemented; **operational gap** is datasets + env + DGM feed | +| **Structured session UI** | `apps/web/src/components/session-area/thought-card.tsx`, view-models | `auditability/SPEC-AUD-001` — **largely satisfied in web app**; legacy Observatory HTML may still lag | +| **Peer notebook control plane (Part 1)** | Merged per `mcp-peer-notebooks/NEXT-IMPLEMENTATION-HANDOFF.md` | ADR-022 unit 1/5 complete | + +### Spec-ready but not implemented (highest opportunity) + +| Spec | Priority in spec | Gap | +|------|------------------|-----| +| `SPEC-SRC-006` | P0 | No `mcpRootUri` in persistence | +| `SPEC-GW-011` | High | No first-call-only schema dedup; `thoughtbox_operations` referenced in docs/tests but not registered in `src/` | +| `SPEC-CHX-001` | High | Quick wins (#1–#3) untouched; no `tb.t`, no mid-session recall ops | +| `harness-optimization/SPEC-HARNESS-T1` | Tier 1 | Plugin hooks / interleaved-thinking prompt not updated | +| `harness-optimization/SPEC-HARNESS-T2` | Tier 2 | Defaults + auto-resume overlap with SRC-006 / CHX | + +### Stale or inconsistent (treat carefully) + +| Item | Issue | +|------|--------| +| **OODA Loops MCP** | `IMPLEMENTATION-READY.md` and `CHANGELOG.md` claim `embed-loops.ts` and loop resources; **those files are not present** in the current tree. Either restore from history or re-implement from the spec narrative in `README.md` / `IMPLEMENTATION-READY.md`. | +| **`loops-mcp-*.md`** | Referenced by README but **missing** from `.specs/` — only summary docs remain. | +| **`SPEC-WRK-001`** | Assumes Stage 2 progressive disclosure; **hosted server exposes all tools at connect** — spec needs revision before build. | + +--- + +## Recommended implementation tracks + +### Track 1 — Session recovery via MCP root (SPEC-SRC-006) + +**What**: Persist `mcpRootUri` on session create; make `load_context` recover the latest session for the current MCP root when `sessionId` is omitted. + +**Why this is #1** + +- Directly addresses a **documented data-loss incident** (67 thoughts orphaned). +- Small, bounded change: persistence types, storage filter, init handler. +- No dependencies; improves every long reasoning run immediately. +- Complements `SPEC-HARNESS-T2` auto-resume — same user story, server-side truth. + +**Reasoning for scope** + +- **In**: `mcpRootUri` on metadata, `listSessions({ mcpRootUri })`, optional `load_context`, capture root on `start_new`. +- **Out**: UI for session picker, cross-root migration tools (defer until recovery works). + +**Phases** + +1. Schema + storage filter + migration/backfill strategy for existing sessions (nullable `mcpRootUri`). +2. Init handler: omit `sessionId` → `findSessionForRoot`. +3. Integration test: two MCP sessions, same root, second call resumes first session. +4. Update onboarding skill to prefer `load_context` without ID after compaction. + +**Success**: Agent reconnecting after ~15m MCP timeout continues the same thoughtbox session without remembering UUIDs. + +**Branch**: `cursor/fix-session-recovery-mcp-root-4cb5` + +--- + +### Track 2 — Gateway schema surfacing completion (SPEC-GW-011) + +**What** + +1. **First-call-only** operation schema embedding in the gateway/observability handler (stops ~50k chars/session repeat). +2. Register **`thoughtbox_operations`** tool (`list` / `get` / `search`) aggregating existing operation catalogs. + +**Why this is #2** + +- Pure **token economics** with measured waste (~5k JSON per `thought` call). +- Spec is tight (4–6h estimate); maps to `src/observatory/gateway-handler.ts` and existing `*/operations.ts` modules. +- Unblocks agents on Code Mode who discover via search but still need schemas without invoking heavy ops. + +**Reasoning for choices** + +- Implement **R1 (dedup)** before **R2 (tool)** — dedup gives instant savings to all clients; tool fixes discoverability for Code Mode-only paths. +- Keep resource templates (spec says out of scope to remove) for MCP clients that use `resources/read`. +- Wire `cleanupSession` to clear `sessionSchemasSeen` — prevents unbounded memory and wrong dedup across sessions. + +**Phases** + +1. `sessionSchemasSeen: Map>` in gateway handler. +2. Unit tests: first response has schema block; second does not; cleanup resets. +3. `thoughtbox_operations` tool in `server-factory.ts` Stage 0. +4. Align `.agents/skills/test-suite` — tests already expect this tool. + +**Success**: 10-thought session returns schema once per operation, not ten times; `thoughtbox_operations { operation: "get", args: { name: "thought" } }` works on `/mcp`. + +**Branch**: `cursor/feat-gateway-schema-surfacing-4cb5` + +--- + +### Track 3 — Cognitive harness quick wins (SPEC-CHX-001, buckets A + B) + +**What** (four PR-sized slices, ordered) + +| Slice | Item | Effort (spec) | +|-------|------|----------------| +| A1 | #1 Auto-numbering docs/examples (SDK comment, remove `thoughtNumber: 1` from canonical example, onboard skill) | ~1h | +| A2 | #2 `tb.t()` / `tb.end()` shorthand in Code Mode | 3–4h | +| B1 | #3 `session_get_thought`, `session_recent_thoughts`, `session_search_within` | ~6–8h | +| A3 | #6 Cipher mode toggle (optional — after A1–B1) | spec §#6 | + +**Why this is #3** + +- Grounded in a **real 146-thought session** — not hypothetical UX. +- **A1 + A2** are lowest risk, highest frequency (every thought). +- **B1** fixes a structural gap: agents cannot cheaply recall thought N in the *current* session without loading the full blob. +- Defers heavier items (#4 subagent attach, #10 KG shortcut, #5 hook suppression) until the write path is cheaper. + +**Reasoning for deferring bucket C/D** + +- **#4 subagent.attach** needs protocol design + Task tool integration — high value but not quick. +- **#5 hook suppression** touches plugin hooks — ship after server recall exists so agents rely on Thoughtbox, not Claude Tasks. +- **#7–#8 audit framing** needs product decision on research vs ops sessions. + +**Success**: New session uses `tb.t("...")` without `thoughtNumber`; agent calls `session_recent_thoughts` with `n: 5` under 500 tokens round-trip. + +**Branches**: one branch per slice (`cursor/chx-auto-numbering-4cb5`, etc.) + +--- + +### Track 4 — Harness optimization Tier 1 (SPEC-HARNESS-T1) + +**What** + +1. SessionStart **Thoughtbox primer** in plugin `session_start.sh` (gated: top-level agent + Thoughtbox in `.mcp.json`). +2. Rewrite **interleaved-thinking** prompt to OODA loop (fix stale `mcp__thoughtbox__thoughtbox` references → `thoughtbox_execute`). +3. PostToolUse **session ID tracker** for tooling continuity. + +**Why this matters** + +- Moves interleaved thinking from **opt-in prompt** to **default rhythm** — product thesis: Think → Act → Think. +- Plugin-only; no server deploy required for primer + prompt. +- Pairs with Track 1 (server knows which session) and Track 3 A1 (agents stop over-tracking numbers). + +**Reasoning** + +- Do **after Track 1** so PostToolUse tracker and `load_context` recovery tell a consistent story. +- Primer capped at ~55 words per spec — avoids context bloat that triggered #5 in CHX. + +**Branch**: `cursor/feat-harness-t1-primer-4cb5` (plugin repo path: `plugins/thoughtbox-claude-code/`) + +--- + +### Track 5 — Evaluation closed loop operationalization (SPEC-EVAL-001 + `evaluation/thoughtbox-eval-strategy.md`) + +**What** (not greenfield — wire what exists) + +1. Document + script **LANGSMITH_API_KEY** setup for staging/prod. +2. Seed **six datasets** from eval strategy (`thoughtbox_core_outcomes`, `thoughtbox_reasoning_patterns`, etc.) via `DatasetManager`. +3. Run one **baseline experiment**: model-only vs Thoughtbox-full vs one ablation. +4. Connect **DGM fitness**: replace 0.0 fitness stubs with evaluator output from `dgmFitnessEvaluator` on experiment completion. + +**Why this is top-tier strategically but #5 in sequence** + +- Code exists (`trace-listener`, `experiment-runner`, `online-monitor`); value is **proving causal lift**, not writing modules. +- Depends on stable agent behavior (Tracks 1–3) so experiments are reproducible. +- Informs every future architectural bet (branch workers, memory, hub). + +**Reasoning** + +- Follow **eval strategy §1** (four-way comparison) literally — avoids "one big benchmark" that hides affordance-specific wins. +- Label examples with `should_use_thoughtbox`, `should_branch`, etc. — precision/recall per affordance beats one scalar score. +- Keep fire-and-forget tracing — never block MCP on LangSmith. + +**Success**: One experiment URL in LangSmith comparing conditions; at least one DGM archive entry with non-zero fitness from real scores. + +**Branch**: `cursor/feat-eval-datasets-baseline-4cb5` + +--- + +### Track 6 — MCP peer notebooks: manifest lifecycle (ADR-022 Part 2) + +**What**: Per `mcp-peer-notebooks/NEXT-IMPLEMENTATION-HANDOFF.md` — compile manifests from notebook sources, draft/approve/activate/retire, enforce active hash at invoke. + +**Why included in top tier** + +- Explicit **next committed unit** after merged Part 1; breaking the sequence wastes durable control-plane work. +- **Governance invariant**: notebook edits must not silently change active capabilities — required before production runtime (Part 3–4). + +**Reasoning for position #6** + +- Larger than Tracks 1–3; needs Supabase migrations + peer-notebook-delivery-guard skill. +- **Do not** parallel with Track 2 unless separate owners — both touch MCP surface area. + +**Out of scope for this track**: web inspection UI (`thoughtbox-2ot`), `local-process`, smolvm — per handoff. + +**Branch**: `cursor/feat-peer-manifest-lifecycle-4cb5` + +--- + +### Track 7 — Parallel branch workers (SPEC-BRANCH-WORKERS) + +**What**: Stateless `tb-branch` Supabase Edge Function + main MCP `branch_spawn` / `branch_merge` / `branch_list` / `branch_get`; Postgres `branches` table; HMAC worker URLs. + +**Why high value but later** + +- Unlocks Thoughtbox’s **core differentiator** (parallel exploration) safely — today `ThoughtHandler` singleton races on concurrent branches. +- Requires Supabase edge deploy, auth signing, and migration discipline — infra-heavy. + +**Reasoning** + +- Ship **after Track 1** — branch workers assume durable session/branch IDs agents can return to. +- Ship **before** full Srcbook Observatory stack — branch exploration benefits all domains, not only notebooks. +- Consider a **thin MVP**: spawn + write + list only; defer merge synthesis polish. + +**Branch**: `cursor/feat-branch-workers-mvp-4cb5` + +--- + +## Secondary tracks (after top 7) + +| Spec / area | Value | Why deferred | +|-------------|-------|--------------| +| **Srcbook Observatory** (`SPEC-SRC-001`–`005`) | High for notebook product | Large vertical (WebSocket channel, preview lifecycle, AI cells); depends on notebook execution substrate stabilizing | +| **OODA Loops MCP** | High for codebase learning | Spec body missing; implementation absent despite CHANGELOG — needs archaeology or rewrite | +| **SPEC-SUM-001** subagent summarize modes | Medium | Useful for handoffs; no data-loss risk | +| **SPEC-OBS-001** MCP sidecar metrics | Medium | Ops/diagnostic; doesn’t improve agent reasoning | +| **SPEC-RLM-001** recursive REPL | Medium–high | Researchy; isolate security review before MVP | +| **Auditability AUD-002–005** | Medium | AUD-001 largely done in web; remaining specs are filters, blast radius, fault attribution | +| **Code Mode Canonical IR / TBX-C1** (`SPEC-CORE-002` layers B–D) | High long-term | Explicitly deferred in `code-mode/target-state.md` — transport alignment first | +| **Agent governance substrate** | Strategic | Research-only; requires HDD + user approval | +| **GCP/terraform specs** | Ops | Stabilization, not product loop | + +--- + +## Suggested sequencing (dependency-aware) + +```mermaid +graph TD + T1[Track 1: MCP root recovery] + T2[Track 2: Schema surfacing] + T3[Track 3: CHX quick wins] + T4[Track 4: Harness T1 plugin] + T5[Track 5: Eval operationalization] + T6[Track 6: Peer manifest lifecycle] + T7[Track 7: Branch workers MVP] + + T1 --> T4 + T1 --> T5 + T3 --> T5 + T2 --> T3 + T1 --> T7 + T6 --> T7 +``` + +**Parallelization safe pairs** + +- Track 2 (server) ∥ Track 4 (plugin) — different repos/paths. +- Track 3 A1/A2 (docs + SDK) ∥ Track 1 — no conflict if branches separate. + +**Avoid parallelizing** + +- Track 6 ∥ major MCP handler refactors (Track 2/3 B1) — merge conflict risk in `server-factory.ts`. + +--- + +## Per-track HDD / delivery notes + +| Track | HDD? | Skill / guard | +|-------|------|----------------| +| 1, 2, 3 | Light ADR optional | `implement` + TDD profile | +| 6 | **Required** | `peer-notebook-delivery-guard`, ADR-022 | +| 7 | **Required** | New staging ADR for edge auth + branch table | +| 5 | Spec update only if dataset schema changes | `eval` skill | +| 4 | Docs + plugin version bump | Plugin release process | + +--- + +## What we explicitly did not prioritize + +1. **Re-implementing Code Mode from SPEC-CORE-002** — already shipped; future work is IR/TBX-C1, not another gateway migration. +2. **SPEC-WRK-001 workflows tool** until spec is rewritten for full-surface (no Stage 2 gating). +3. **Legacy Observatory HTML audit cards** unless product still ships `src/observatory` UI — web app `ThoughtCard` is the product surface. +4. **Letta DGM archive automation** — separate initiative under `letta-specific/`. + +--- + +## Immediate next actions (if starting this week) + +1. **Track 1** — open PR for `mcpRootUri` + auto `load_context` (smallest blast radius, highest reliability win). +2. **Track 2 R1** — first-call-only schema dedup (single-file change + tests). +3. **Track 3 A1** — documentation-only PR for auto-numbering (can merge in hours). +4. File tracker items for **loops MCP spec restoration** and **SPEC-WRK-001 revision** — blocked on doc hygiene, not code. + +--- + +## References (primary specs) + +- `SPEC-SRC-006-session-recovery-via-mcp-root.md` +- `SPEC-GW-011-gateway-schema-surfacing.md` +- `SPEC-CHX-001-cognitive-harness-improvements.md` +- `harness-optimization/SPEC-HARNESS-T1.md`, `SPEC-HARNESS-T2.md` +- `SPEC-EVAL-001-unified-evaluation-system.md`, `evaluation/thoughtbox-eval-strategy.md` +- `mcp-peer-notebooks/NEXT-IMPLEMENTATION-HANDOFF.md`, `mcp-peer-notebooks/SPEC-CONTROL-PLANE.md` +- `SPEC-BRANCH-WORKERS.md` +- `code-mode/target-state.md` (alignment baseline) +- `inventory.md` (Srcbook + OBS inventory)