DemOnJR · DemOnJR · Jun 27, 2026 · Jun 26, 2026 · Jun 26, 2026 · Jun 26, 2026
diff --git a/AUTORESEARCH.md b/AUTORESEARCH.md
@@ -0,0 +1,135 @@
+# Autoresearch — the self-improving "learn a skill" loop
+
+When the agent needs to do something it doesn't know how to do, it researches the
+topic on the public web, synthesizes a reusable `SKILL.md` *grounded only in the
+pages it read*, saves it (quarantined), and applies it — learning the capability
+itself instead of guessing. Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
+(an autonomous loop that produces lightweight steering artifacts; here the artifact
+is a skill).
+
+This matters most for the **local model** (qwen3.5:9b via Ollama): a 9B confidently
+answers niche DevOps questions from memory — often subtly wrong, which is dangerous
+when commands run on real servers.
+
+## How it triggers (the important part)
+
+A weak local model will **not** reliably pick a rarely-used `learn_skill` tool out of
+~15 on its own. Measured trigger recall across every prompt wording we tried was ~0 —
+even for a *fictional* tool it had never heard of, it answered in prose rather than
+admitting the gap.
+
+So the reliable trigger is **not** the model self-selecting the tool. It is a
+**pre-turn classifier** (`autoresearch::assess_gap`): one cheap, temperature-0
+question — *"does this need specific commands/config for a named tool you're unsure
+of? name the topic, or say NONE."* A 9B answers a focused, direct question far more
+reliably than it spontaneously reaches for a rare tool. Measured: **recall ~0.75,
+precision 1.00** (zero false positives on `ls` / math / file edits).
+
+### The autopilot (agent.rs)
+
+On every local, tool-capable, non-casual turn (gated by `agent.learn_autopilot`,
+default on):
+
+1. **Classify** — `assess_gap` runs once. If it returns `NONE`, nothing happens (no
+   latency beyond one tiny call).
+2. **Research** — on a detected gap with no covering skill, `autoresearch::learn`
+   runs the full loop (below). The expensive web research only runs on a genuine gap.
+3. **Inject** — the resulting skill is appended to the system prompt as
+   *"Just-researched skill for this task — APPLY IT"*, and the user sees a
+   *"Learned a skill for X — applying it"* status.
+4. **Answer** — the model answers using the injected, verified-against-sources steps.
+
+The model can also call the `learn_skill` tool directly, and the reflection pass
+writes a `[gap]` memory bullet when the agent visibly declines — but the autopilot
+is what makes it dependable.
+
+## The research loop (`autoresearch::learn`)
+
+1. **Dedup** — if an installed skill already covers the topic, return it; skip research.
+2. **Sanitize the query** — private IPs, internal hostnames (`.internal`/`.local`/
+   `.lan`), the user's own VPS hostnames, credential markers, and high-entropy tokens
+   are stripped *before* the query reaches DuckDuckGo. The search topic is the generic
+   capability, never the specific incident.
+3. **Gather sources** — search, then **fetch the top 1–2 result pages** (load-bearing:
+   snippets alone are too thin to ground real commands). All fetches reuse the
+   SSRF-guarded `web_tools` path.
+4. **Synthesize** — one low-temperature (0.15) call fills a fixed `SKILL.md` skeleton
+   **using only the fetched source text**, with an explicit `# TODO: not found in
+   sources` escape hatch so it leaves gaps blank instead of confabulating.
+5. **Validate, de-fang, scan, save** (`process_synthesized`, a pure function):
+   - structural gate (real `description:` front-matter, ≥1 command, cited sources that
+     match pages actually fetched, no model prompt-leakage);
+   - **de-fang** destructive commands (`rm -rf`, `mkfs`, `dd`, `chmod 777 /`, …) by
+     rewriting the line to `# REQUIRES APPROVAL:` — kept, never silently deleted;
+   - **security scan** (`commit_candidate`): **NVIDIA SkillSpector** is the primary
+     static analyzer when installed (68 patterns / 17 categories: prompt injection,
+     exfiltration, privilege escalation, supply chain, dangerous-code AST, YARA, …),
+     run static-only (`--no-llm`, no API key); the built-in heuristic is the always-on
+     backstop. Both gate at a **stricter threshold** than user-chosen installs (≥40 /
+     `is_blocking`, vs 60 for `skill_install`) — a researched skill is more untrusted
+     than one the user picked, so pipe-to-shell (`curl … | sh`) is refused outright;
+   - **quarantine** under the `unverified/` category with server-authored provenance
+     front-matter (`status: draft`, `origin: autoresearch`, `verified: false`,
+     `sources: […]`) and an UNVERIFIED banner, **never overwriting** an existing skill.
+
+## Why this is safe
+
+A skill is a file the agent later *follows as trusted instructions*, so web text
+laundered into a `SKILL.md` is a prompt-injection / RCE vector. The laundering is
+closed at every step: the query never carries private context out; synthesis is
+grounded and cold; the output is validated, de-fanged, and scanned at a stricter bar
+than installs; it lands in a distinct `unverified/` namespace with a banner so the
+distrust label is re-attached every time it's re-injected; and the agent is told never
+to run a destructive command from a learned skill without the user's approval.
+
+## The security scanner (NVIDIA SkillSpector)
+
+Every skill — researched, downloaded (`skill_install`), or otherwise — is scanned
+before it's saved or installed, because a `SKILL.md` is followed as trusted
+instructions. The scanner is **NVIDIA SkillSpector** ([github.com/NVIDIA/skillspector](https://github.com/NVIDIA/skillspector))
+when installed, falling back to a built-in pure-Rust heuristic otherwise.
+
+- **Install** (one click in Settings → Security, or):
+  `uv tool install git+https://github.com/NVIDIA/skillspector.git` (uv provisions
+  the required Python 3.12 automatically). The app finds the executable via
+  `uv tool dir --bin` even when it isn't on `PATH`.
+- Runs **static-only** by default (`scan … -f json --no-llm`) — no API key, no network
+  beyond the optional OSV.dev dependency check.
+- **Deep analysis (opt-in)**: Settings → Skills → "Deep analysis with the local model"
+  adds SkillSpector's LLM semantic checks against your local Ollama (OpenAI-compatible
+  endpoint; nothing leaves the machine). Use a **non-thinking instruct model** (or a
+  cloud model) — *thinking* models (qwen3.x) emit long `<think>` traces that overrun
+  SkillSpector's completion budget, so a deep scan with them fails and **falls back to
+  the static SkillSpector scan** (never down to the weak built-in heuristic). Stored in
+  `skills.scanner_deep` / `skills.scanner_model`; the endpoint/model derive from the
+  active Ollama provider.
+- Verdict: `risk_assessment.{score,severity,recommendation}` + an `issues[]` list.
+  Blocking on score ≥ threshold, HIGH/CRITICAL severity, or a `DO_NOT_INSTALL`
+  recommendation. Verify with `xconsole-bench scanner` (a malicious sample scores
+  71/HIGH/DO_NOT_INSTALL → blocked; a clean one 0/LOW → allowed); `--deep` exercises the
+  LLM path.
+
+## Settings
+
+- `agent.learn_autopilot` — pre-turn gap detection + auto-research (default **on**).
+- `agent.self_improve` — the reflection pass that writes `[lesson]`/`[gap]` memory
+  bullets (default **on**).
+- **Skill scanner** — Settings → Security shows whether SkillSpector is active and
+  installs it in one click (`skill_scanner_status` / `install_skill_scanner` commands).
+
+## Tested
+
+`xconsole-bench` modes exercise every layer:
+
+- `selftest` — pure, no model/network: injection refused, destructive de-fanged,
+  quarantine + no-overwrite, query sanitization, structural validation, classifier
+  reply parsing (59 checks).
+- `learnclassify` — the gap classifier as a TP/FP/TN/FN confusion matrix.
+- `learntune` — A/B sweep of guidance/tool-description variants (how we learned that
+  prompt-only triggering doesn't work).
+- `learn` — the live full loop on a real topic **and** the autopilot end-to-end
+  (gate → research → inject → grounded answer).
+
+Deferred to a future "overnight" pass (v2): promoting `draft → verified` from
+execution outcomes, refining a skill that failed in use, proactive research of
+recurring `[gap]`s, and a skills dedup/merge pass.
diff --git a/bench/README.md b/bench/README.md
@@ -73,6 +73,79 @@ With **no hooks configured the loop skips the hook path entirely (0 ms)** — ho
 opt-in, so they cost nothing until you add one. The `live_hook_ms` figure is dominated
 by process-spawn latency (lower on Unix `sh -c`); a hook that does real work adds its own time.
 
+## 1a. Harder suites — `hard` and `recall`
+
+The core `agent` suite saturates at 100% on `qwen3.5:9b`, so it no longer
+discriminates. Two harder, **scored** suites add headroom (so the history can show
+learning/regressions):
+
+```bash
+# Discriminative agent suite — tool-boundary routing traps + adversarial
+# action-vs-explain restraint (a 9B does NOT ace these). Reports an overall pass-rate
+# (with a Wilson CI) and a per-tier breakdown (hard / expert).
+./src-tauri/target/release/xconsole-bench.exe hard --samples 3
+
+# Reasoning-unlocks-recall experiment — single-hop factual questions answered three
+# ways: direct, reason-first, and a dummy "Let me think" buffer.
+./src-tauri/target/release/xconsole-bench.exe recall --samples 3
+
+# Closed learning loop — answer unfamiliar-tool tasks COLD (memory), let the
+# autoresearch loop build a skill for each, then answer WARM (skill injected).
+# Records cold/warm as two history points so the dashboard shows the before/after.
+./src-tauri/target/release/xconsole-bench.exe learnloop --samples 3
+```
+
+`learnloop` is the experiment that asks "does the agent actually get better by
+learning?" — cold vs warm pass-rate, plus a persistence check (a second `learn()`
+must dedup to the existing skill, no re-research). It also honestly catches the
+failure mode: a low-quality researched skill can *regress* a task, which is why
+draft skills stay quarantined until execution-verified.
+
+`recall` tests Google Research's *"Thinking to Recall: how reasoning unlocks parametric
+knowledge in LLMs"* on our local model: does a reasoning trace surface facts the model
+has in its weights but can't recall when answering directly? It reports `direct`,
+`reason`, and `buffer` accuracy and the **reasoning gain** (`reason − direct`). Per the
+paper, a large positive gain means reasoning unlocks recall (factual priming); a gain
+from the dummy `buffer` condition isolates the pure compute-buffer effect; a *negative*
+gain flags the paper's failure mode (a hallucinated intermediate fact derailing the
+answer). The `hard`/`recall` scenarios were generated and adversarially fact-checked by
+a multi-agent workflow so their expected answers are correct.
+
+## 1b. Benchmark history — scores over time (HTML dashboard + OKF bundle)
+
+Every **scored** run (`agent`, `hard`, `recall`, `ablation`, `learn`, `llm`, `all`) is
+appended to `bench/results/history.jsonl` and rendered two ways automatically:
+
+- **`bench/results/history.html`** — a self-contained dashboard (open it in any
+  browser; no server, no external assets) charting pass-rate and latency over time,
+  with a **Wilson 95% confidence interval** on every pass-rate.
+- **`bench/history/`** — the same history as an **[Open Knowledge Format](https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf)**
+  bundle (Google's portable markdown+YAML standard): one typed concept per run
+  (`runs/*.md`), a chronological `log.md`, and an `index.md`. Portable, vendor-neutral,
+  readable in any editor and on GitHub.
+
+```bash
+# Rebuild the dashboard + OKF bundle from the existing history (no model needed):
+./src-tauri/target/release/xconsole-bench.exe report
+
+# Skip recording a run (e.g. a throwaway/tuning run):
+./src-tauri/target/release/xconsole-bench.exe agent --no-history
+```
+
+**Methodology** (applied + cited in the dashboard footer):
+
+- **Confidence intervals, not point estimates.** A pass-rate from a few samples is
+  noisy — 3–5 samples is *often insufficient* and the same source can wander ±1 pass.
+  Each pass-rate is reported with a Wilson 95% CI; when two runs' intervals overlap,
+  the difference isn't real. (Google Research, *"Building better AI benchmarks: how
+  many raters are enough?"* — more items beats more samples for an accuracy metric.)
+- **`time for 100 output tokens` = TTFT + 100 / (tok/s)** — one comparable latency
+  number across runs. (Artificial Analysis methodology.)
+- **Revealed behavior vs. self-report.** The learn-loop eval measures what the model
+  *does* (does it route to `learn_skill`?) against what it *claims* (the classifier's
+  self-assessment) — the gap is the model's overconfidence. (Google Research,
+  *"Evaluating alignment of behavioral dispositions in LLMs."*)
+
 ## 2. `ollama_latency.ps1` — zero-build latency probe
 
 Quick TTFT / tok/s read without compiling, straight against `/api/chat`:

diff --git a/bench/history/index.md b/bench/history/index.md
@@ -0,0 +1,26 @@
+---
+type: index
+title: xConsole benchmark history
+description: Scores and latency of the local-model agent over time, as an Open Knowledge Format bundle.
+tags: [benchmark, index]
+---
+
+# xConsole benchmark history
+
+A portable [Open Knowledge Format](https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf) bundle: one markdown concept per run, a chronological [log](log.md), and the dashboard at [`../results/history.html`](../results/history.html).
+
+## Runs (newest first)
+
+- [Jun 27 2026 16:28 — learnloop-warm](runs/2026-06-27T14-28-07.174123100-00-00-learnloop-warm.md) — unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%]
+- [Jun 27 2026 16:28 — learnloop-cold](runs/2026-06-27T14-28-07.173732500-00-00-learnloop-cold.md) — unfamiliar-tool: memory only: 67% (4/6) [95% CI 30–90%]
+- [Jun 27 2026 16:09 — learnloop-warm](runs/2026-06-27T14-09-49.628854400-00-00-learnloop-warm.md) — unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%]
+- [Jun 27 2026 16:09 — learnloop-cold](runs/2026-06-27T14-09-49.628492-00-00-learnloop-cold.md) — unfamiliar-tool: memory only: 67% (4/6) [95% CI 30–90%]
+- [Jun 27 2026 15:09 — learnloop-warm](runs/2026-06-27T13-09-35.916153600-00-00-learnloop-warm.md) — unfamiliar-tool: after learning: 80% (4/5) [95% CI 38–96%]
+- [Jun 27 2026 15:09 — learnloop-cold](runs/2026-06-27T13-09-35.915794-00-00-learnloop-cold.md) — unfamiliar-tool: memory only: 80% (4/5) [95% CI 38–96%]
+- [Jun 27 2026 14:01 — recall](runs/2026-06-27T12-01-49.604207400-00-00-recall.md) — recall accuracy (direct): 89% (48/54) [95% CI 78–95%]
+- [Jun 27 2026 03:26 — hard](runs/2026-06-27T01-26-04.406853-00-00-hard.md) — hard-suite pass-rate: 86% (12/14) [95% CI 60–96%]
+- [Jun 27 2026 03:02 — learn](runs/2026-06-27T01-02-38.235947400-00-00-learn.md) — gap-routing accuracy: 33% (4/12) [95% CI 14–61%]
+- [Jun 27 2026 02:59 — ablation](runs/2026-06-27T00-59-48.523315-00-00-ablation.md) — full-prompt pass-rate: 100% (7/7) [95% CI 65–100%]
+- [Jun 27 2026 02:55 — agent](runs/2026-06-27T00-55-00.556526200-00-00-agent.md) — scenario pass-rate: 100% (11/11) [95% CI 74–100%]
+- [Jun 27 2026 02:53 — agent](runs/2026-06-27T00-53-47.450689500-00-00-agent.md) — scenario pass-rate: 100% (11/11) [95% CI 74–100%]
+- [Jun 27 2026 02:52 — llm](runs/2026-06-27T00-52-32.133470100-00-00-llm.md) — latency t100=4124ms, 44.0 tok/s
diff --git a/bench/history/log.md b/bench/history/log.md
@@ -0,0 +1,17 @@
+---
+type: log
+title: Benchmark run log
+---
+
+# Benchmark run log
+
+- Jun 27 2026 02:52 — **llm** latency t100=4124ms, 44.0 tok/s (model qwen3.5:9b)
+- Jun 27 2026 02:53 — **agent** scenario pass-rate: 100% (11/11) [95% CI 74–100%] (model qwen3.5:9b)
+- Jun 27 2026 02:55 — **agent** scenario pass-rate: 100% (11/11) [95% CI 74–100%] (model qwen3.5:9b)
+- Jun 27 2026 02:59 — **ablation** full-prompt pass-rate: 100% (7/7) [95% CI 65–100%] (model qwen3.5:9b)
+- Jun 27 2026 03:02 — **learn** gap-routing accuracy: 33% (4/12) [95% CI 14–61%] (model qwen3.5:9b)
+- Jun 27 2026 03:26 — **hard** hard-suite pass-rate: 86% (12/14) [95% CI 60–96%] (model qwen3.5:9b)
+- Jun 27 2026 14:01 — **recall** recall accuracy (direct): 89% (48/54) [95% CI 78–95%] (model qwen3.5:9b)
+- Jun 27 2026 15:09 — **learnloop-warm** unfamiliar-tool: after learning: 80% (4/5) [95% CI 38–96%] (model qwen3.5:9b)
+- Jun 27 2026 16:09 — **learnloop-warm** unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%] (model qwen3.5:9b)
+- Jun 27 2026 16:28 — **learnloop-warm** unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%] (model qwen3.5:9b)
diff --git a/bench/history/runs/2026-06-27T00-52-32.133470100-00-00-llm.md b/bench/history/runs/2026-06-27T00-52-32.133470100-00-00-llm.md
@@ -0,0 +1,29 @@
+---
+type: benchmark-run
+title: llm — Jun 27 2026 02:52
+mode: llm
+model: qwen3.5:9b
+timestamp: 2026-06-27T00:52:32.133470100+00:00
+samples: 1
+metric: null
+metric_label: latency only
+ci_low: 0
+ci_high: 1
+tags: [benchmark, llm]
+---
+
+# llm run — Jun 27 2026 02:52
+
+latency t100=4124ms, 44.0 tok/s
+
+| metric | value |
+|---|---|
+| model | qwen3.5:9b |
+| samples (N) | 1 |
+| prompt tokens | 4860 |
+| TTFT (ms) | 1853 |
+| total/turn (ms) | 5329 |
+| gen tok/s | 44 |
+| time for 100 tok (ms) | 4124 |
+
+Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient — Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
diff --git a/bench/history/runs/2026-06-27T00-53-47.450689500-00-00-agent.md b/bench/history/runs/2026-06-27T00-53-47.450689500-00-00-agent.md
@@ -0,0 +1,29 @@
+---
+type: benchmark-run
+title: agent — Jun 27 2026 02:53
+mode: agent
+model: qwen3.5:9b
+timestamp: 2026-06-27T00:53:47.450689500+00:00
+samples: 3
+metric: 1.0
+metric_label: scenario pass-rate
+ci_low: 0.741
+ci_high: 1
+tags: [benchmark, agent]
+---
+
+# agent run — Jun 27 2026 02:53
+
+scenario pass-rate: 100% (11/11) [95% CI 74–100%]
+
+| metric | value |
+|---|---|
+| model | qwen3.5:9b |
+| samples (N) | 3 |
+| prompt tokens | 3413 |
+| TTFT (ms) | 1699 |
+| total/turn (ms) | 2197 |
+| gen tok/s | 45.4 |
+| time for 100 tok (ms) | 3899 |
+
+Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient — Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
diff --git a/bench/history/runs/2026-06-27T00-55-00.556526200-00-00-agent.md b/bench/history/runs/2026-06-27T00-55-00.556526200-00-00-agent.md
@@ -0,0 +1,29 @@
+---
+type: benchmark-run
+title: agent — Jun 27 2026 02:55
+mode: agent
+model: qwen3.5:9b
+timestamp: 2026-06-27T00:55:00.556526200+00:00
+samples: 3
+metric: 1.0
+metric_label: scenario pass-rate
+ci_low: 0.741
+ci_high: 1
+tags: [benchmark, agent]
+---
+
+# agent run — Jun 27 2026 02:55
+
+scenario pass-rate: 100% (11/11) [95% CI 74–100%]
+
+| metric | value |
+|---|---|
+| model | qwen3.5:9b |
+| samples (N) | 3 |
+| prompt tokens | 3413 |
+| TTFT (ms) | 1718 |
+| total/turn (ms) | 2168 |
+| gen tok/s | 45.8 |
+| time for 100 tok (ms) | 3900 |
+
+Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient — Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
diff --git a/bench/history/runs/2026-06-27T00-59-48.523315-00-00-ablation.md b/bench/history/runs/2026-06-27T00-59-48.523315-00-00-ablation.md
@@ -0,0 +1,29 @@
+---
+type: benchmark-run
+title: ablation — Jun 27 2026 02:59
+mode: ablation
+model: qwen3.5:9b
+timestamp: 2026-06-27T00:59:48.523315+00:00
+samples: 3
+metric: 1.0
+metric_label: full-prompt pass-rate
+ci_low: 0.646
+ci_high: 1
+tags: [benchmark, ablation]
+---
+
+# ablation run — Jun 27 2026 02:59
+
+full-prompt pass-rate: 100% (7/7) [95% CI 65–100%]
+
+| metric | value |
+|---|---|
+| model | qwen3.5:9b |
+| samples (N) | 3 |
+| prompt tokens | 4802 |
+| TTFT (ms) | 1539 |
+| total/turn (ms) | 3476 |
+| gen tok/s | 55.6 |
+| time for 100 tok (ms) | 3337 |
+
+Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient — Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).