135 changes: 135 additions & 0 deletions AUTORESEARCH.md
@@ -0,0 +1,135 @@
# Autoresearch: the self-improving "learn a skill" loop

When the agent needs to do something it doesn't know how to do, it researches the
topic on the public web, synthesizes a reusable `SKILL.md` *grounded only in the
pages it read*, saves it (quarantined), and applies it, learning the capability
itself instead of guessing. Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
(an autonomous loop that produces lightweight steering artifacts; here the artifact
is a skill).

This matters most for the **local model** (qwen3.5:9b via Ollama): a 9B model confidently
answers niche DevOps questions from memory, and is often subtly wrong, which is dangerous
when commands run on real servers.

## How it triggers (the important part)

A weak local model will **not** reliably pick a rarely-used `learn_skill` tool out of
~15 on its own. Measured trigger recall across every prompt wording we tried was ~0:
even for a *fictional* tool it had never heard of, it answered in prose rather than
admitting the gap.

So the reliable trigger is **not** the model self-selecting the tool. It is a
**pre-turn classifier** (`autoresearch::assess_gap`): one cheap, temperature-0
question, *"does this need specific commands/config for a named tool you're unsure
of? name the topic, or say NONE."* A 9B model answers a focused, direct question far more
reliably than it spontaneously reaches for a rare tool. Measured: **recall ~0.75,
precision 1.00** (zero false positives on `ls` / math / file edits).
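
To make the classifier step concrete, here is a minimal Python sketch of how such a reply could be parsed and how recall and precision fall out of a confusion matrix. It is illustrative only and is not the project's Rust code: `parse_gap_reply`, `precision_recall`, and the example counts are assumptions chosen to reproduce the reported ratios, and the real `autoresearch::assess_gap` may handle replies differently.

```python
from typing import Optional


def parse_gap_reply(reply: str) -> Optional[str]:
    """Return the topic named by the classifier, or None for a NONE/empty reply.

    Hypothetical parser: assumes the model answers with one short line.
    """
    for line in reply.splitlines():
        # Drop surrounding whitespace, quotes, markdown emphasis and a trailing dot.
        candidate = line.strip().strip("\"'`*. ")
        if not candidate:
            continue
        if candidate.upper() == "NONE":
            return None
        return candidate
    return None


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only: 3 gaps caught, 1 missed, no false alarms.
print(parse_gap_reply("NONE"))                 # None
print(parse_gap_reply("caddy reverse proxy"))  # caddy reverse proxy
print(precision_recall(tp=3, fp=0, fn=1))      # (1.0, 0.75)
```

Treating an empty or unparseable reply as "no gap" keeps the cheap path cheap; a stricter variant could treat it as a classifier failure instead.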

### The autopilot (agent.rs)

On every local, tool-capable, non-casual turn (gated by `agent.learn_autopilot`,
default on):

1. **Classify**: `assess_gap` runs once. If it returns `NONE`, nothing happens (no
latency beyond one tiny call).
2. **Research**: on a detected gap with no covering skill, `autoresearch::learn`
runs the full loop (below). The expensive web research only runs on a genuine gap.
3. **Inject**: the resulting skill is appended to the system prompt as
*"Just-researched skill for this task - APPLY IT"*, and the user sees a
*"Learned a skill for X - applying it"* status.
4. **Answer**: the model answers using the injected, verified-against-sources steps.

The model can also call the `learn_skill` tool directly, and the reflection pass
writes a `[gap]` memory bullet when the agent visibly declines, but the autopilot
is what makes it dependable.
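
The four autopilot steps can be sketched as a single function. This is a hedged Python illustration, not the code in `agent.rs`: `autopilot_turn` and its callable parameters are hypothetical stand-ins for the classifier and the research loop, and the prompt and status strings are approximations of the ones quoted above.

```python
from typing import Callable, Optional


def autopilot_turn(
    user_message: str,
    system_prompt: str,
    assess_gap: Callable[[str], Optional[str]],
    learn: Callable[[str], Optional[str]],
    enabled: bool = True,
) -> tuple[str, Optional[str]]:
    """Return (system prompt for this turn, status line for the user or None)."""
    if not enabled:
        return system_prompt, None

    # 1. Classify: one cheap call; None means the classifier answered NONE.
    topic = assess_gap(user_message)
    if topic is None:
        return system_prompt, None

    # 2. Research: assumed to dedup internally and to return None when the
    #    loop fails or the candidate skill is blocked by the scanner.
    skill = learn(topic)
    if skill is None:
        return system_prompt, None

    # 3. Inject: append the skill to the system prompt and tell the user.
    injected = (
        f"{system_prompt}\n\n"
        f"Just-researched skill for this task - APPLY IT\n{skill}"
    )
    status = f"Learned a skill for {topic} - applying it"

    # 4. Answer: the caller now runs the model with the injected prompt.
    return injected, status


prompt, status = autopilot_turn(
    "set up caddy as a reverse proxy",
    "You are a DevOps agent.",
    assess_gap=lambda msg: "caddy reverse proxy",
    learn=lambda topic: "# UNVERIFIED\n1. Edit the Caddyfile ...",
)
print(status)
```

Passing the classifier and the research loop in as callables keeps the control flow testable without a model or network, which mirrors how the pure `selftest` layer is described.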

## The research loop (`autoresearch::learn`)

1. **Dedup**: if an installed skill already covers the topic, return it; skip research.
2. **Sanitize the query**: private IPs, internal hostnames (`.internal`/`.local`/
`.lan`), the user's own VPS hostnames, credential markers, and high-entropy tokens
are stripped *before* the query reaches DuckDuckGo. The search topic is the generic
capability, never the specific incident.
3. **Gather sources**: search, then **fetch the top 1–2 result pages** (load-bearing:
snippets alone are too thin to ground real commands). All fetches reuse the
SSRF-guarded `web_tools` path.
4. **Synthesize**: one low-temperature (0.15) call fills a fixed `SKILL.md` skeleton
**using only the fetched source text**, with an explicit `# TODO: not found in
sources` escape hatch so it leaves gaps blank instead of confabulating.
5. **Validate, de-fang, scan, save** (`process_synthesized`, a pure function):
   - structural gate (real `description:` front-matter, ≥1 command, cited sources that
     match pages actually fetched, no model prompt-leakage);
   - **de-fang** destructive commands (`rm -rf`, `mkfs`, `dd`, `chmod 777 /`, …) by
     rewriting the line to `# REQUIRES APPROVAL:`, so the command is kept, never silently deleted;
   - **security scan** (`commit_candidate`): **NVIDIA SkillSpector** is the primary
     static analyzer when installed (68 patterns / 17 categories: prompt injection,
     exfiltration, privilege escalation, supply chain, dangerous-code AST, YARA, …),
     run static-only (`--no-llm`, no API key); the built-in heuristic is the always-on
     backstop. Both gate at a **stricter threshold** than user-chosen installs (score ≥ 40 /
     `is_blocking`, vs 60 for `skill_install`): a researched skill is less trusted
     than one the user picked, so pipe-to-shell (`curl … | sh`) is refused outright;
   - **quarantine** under the `unverified/` category with server-authored provenance
     front-matter (`status: draft`, `origin: autoresearch`, `verified: false`,
     `sources: […]`) and an UNVERIFIED banner, **never overwriting** an existing skill.
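
As an illustration of step 2, the sketch below shows one way query sanitization could work. It is written in Python for brevity and is an assumption throughout: the regexes, the entropy threshold (20+ characters at 4.0 bits per character), and the names `sanitize_query` and `looks_like_secret` are hypothetical, and the project's actual rules may be broader or stricter.

```python
import math
import re
from collections import Counter

# RFC 1918 private ranges: 10/8, 172.16/12, 192.168/16.
PRIVATE_IP = re.compile(
    r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2}\b"
)
INTERNAL_HOST = re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.(?:internal|local|lan)\b", re.I)
CREDENTIAL = re.compile(
    r"\b(?:password|passwd|token|secret|api[_-]?key)\s*[=:]\s*\S+", re.I
)


def shannon_entropy(token: str) -> float:
    """Bits per character of the token's character distribution."""
    counts = Counter(token)
    total = len(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_len: int = 20, min_bits: float = 4.0) -> bool:
    """Long, high-entropy tokens are treated as probable keys (assumed thresholds)."""
    return len(token) >= min_len and shannon_entropy(token) >= min_bits


def sanitize_query(query: str, own_hosts: list[str]) -> str:
    """Strip private context so only the generic capability reaches the search engine."""
    text = CREDENTIAL.sub(" ", query)
    text = PRIVATE_IP.sub(" ", text)
    text = INTERNAL_HOST.sub(" ", text)
    for host in own_hosts:
        text = re.sub(re.escape(host), " ", text, flags=re.I)
    kept = [tok for tok in text.split() if not looks_like_secret(tok)]
    return " ".join(kept)


raw = (
    "fix nginx 502 on 10.0.3.7 via db01.internal and vps-42.example.net "
    "api_key=abc123 ghp_x9Qa7LmZ2pR8vK4tW1yB6nC3"
)
print(sanitize_query(raw, own_hosts=["vps-42.example.net"]))
# fix nginx 502 on via and
```

The output is deliberately lossy: a slightly awkward query is an acceptable price for never sending an address, hostname, or credential to a third party.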

## Why this is safe

A skill is a file the agent later *follows as trusted instructions*, so web text
laundered into a `SKILL.md` is a prompt-injection / RCE vector. The laundering is
closed at every step: the query never carries private context out; synthesis is
grounded and cold; the output is validated, de-fanged, and scanned at a stricter bar
than installs; it lands in a distinct `unverified/` namespace with a banner so the
distrust label is re-attached every time it's re-injected; and the agent is told never
to run a destructive command from a learned skill without the user's approval.
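
The de-fang step mentioned above can be sketched in a few lines. This Python version is a simplified assumption, not `process_synthesized` itself: the pattern list covers only the four commands named in this document, and the `defang` helper is hypothetical.

```python
import re

APPROVAL_PREFIX = "# REQUIRES APPROVAL: "

# Simplified patterns for the commands named in the text; a real list would be longer.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-\w*(?:r\w*f|f\w*r)\w*", re.I),   # rm -rf, rm -fr, rm -Rf
    re.compile(r"\bmkfs(?:\.\w+)?\b"),                    # mkfs, mkfs.ext4
    re.compile(r"\bdd\s+if="),                            # dd if=... of=...
    re.compile(r"\bchmod\s+(?:-R\s+)?777\s+/(?:\s|$)"),   # chmod 777 on the root only
]


def defang(skill_text: str) -> str:
    """Comment out destructive lines instead of deleting them."""
    out = []
    for line in skill_text.splitlines():
        already_marked = line.lstrip().startswith(APPROVAL_PREFIX.strip())
        if not already_marked and any(p.search(line) for p in DESTRUCTIVE):
            line = APPROVAL_PREFIX + line.strip()
        out.append(line)
    return "\n".join(out)


sample = "systemctl reload nginx\nrm -rf /var/cache/app\ndd if=/dev/zero of=/dev/sdb"
print(defang(sample))
```

Keeping the original command behind the marker preserves the information the user needs to approve or reject it, and skipping already-marked lines makes the pass idempotent.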

## The security scanner (NVIDIA SkillSpector)

Every skill, whether researched, downloaded (`skill_install`), or otherwise, is scanned
before it's saved or installed, because a `SKILL.md` is followed as trusted
instructions. The scanner is **NVIDIA SkillSpector** ([github.com/NVIDIA/skillspector](https://github.com/NVIDIA/skillspector))
when installed, falling back to a built-in pure-Rust heuristic otherwise.

- **Install** (one click in Settings → Security, or):
`uv tool install git+https://github.com/NVIDIA/skillspector.git` (uv provisions
the required Python 3.12 automatically). The app finds the executable via
`uv tool dir --bin` even when it isn't on `PATH`.
- Runs **static-only** by default (`scan … -f json --no-llm`): no API key, no network
beyond the optional OSV.dev dependency check.
- **Deep analysis (opt-in)**: Settings → Skills → "Deep analysis with the local model"
adds SkillSpector's LLM semantic checks against your local Ollama (OpenAI-compatible
endpoint; nothing leaves the machine). Use a **non-thinking instruct model** (or a
cloud model): *thinking* models (qwen3.x) emit long `<think>` traces that overrun
SkillSpector's completion budget, so a deep scan with them fails and **falls back to
the static SkillSpector scan** (never down to the weak built-in heuristic). Stored in
`skills.scanner_deep` / `skills.scanner_model`; the endpoint/model derive from the
active Ollama provider.
- Verdict: `risk_assessment.{score,severity,recommendation}` + an `issues[]` list.
Blocking on score ≥ threshold, HIGH/CRITICAL severity, or a `DO_NOT_INSTALL`
recommendation. Verify with `xconsole-bench scanner` (a malicious sample scores
71/HIGH/DO_NOT_INSTALL → blocked; a clean one 0/LOW → allowed); `--deep` exercises the
LLM path.
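
The blocking rule in the last bullet can be written as a small predicate over the scanner's JSON verdict. This is a Python sketch under assumptions: only the `risk_assessment.{score,severity,recommendation}` field names come from this document, while the sample reports, the `MEDIUM` severity value, and the fail-closed handling of a missing verdict are illustrative choices.

```python
import json

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def is_blocking(report: dict, threshold: int) -> bool:
    """Block on score >= threshold, HIGH/CRITICAL severity, or DO_NOT_INSTALL."""
    risk = report.get("risk_assessment")
    if not isinstance(risk, dict):
        return True  # assumption: a missing verdict fails closed
    score = risk.get("score", 0)
    severity = str(risk.get("severity", "")).upper()
    recommendation = str(risk.get("recommendation", "")).upper()
    return (
        score >= threshold
        or severity in BLOCKING_SEVERITIES
        or recommendation == "DO_NOT_INSTALL"
    )


# Hypothetical reports shaped like the verdict described above.
malicious = json.loads(
    '{"risk_assessment": {"score": 71, "severity": "HIGH",'
    ' "recommendation": "DO_NOT_INSTALL"}, "issues": []}'
)
clean = {"risk_assessment": {"score": 0, "severity": "LOW"}, "issues": []}
borderline = {"risk_assessment": {"score": 45, "severity": "MEDIUM"}, "issues": []}

RESEARCHED_THRESHOLD = 40  # stricter bar for autoresearch output
INSTALL_THRESHOLD = 60     # bar for user-chosen installs

print(is_blocking(malicious, INSTALL_THRESHOLD))      # True
print(is_blocking(clean, RESEARCHED_THRESHOLD))       # False
print(is_blocking(borderline, RESEARCHED_THRESHOLD))  # True
print(is_blocking(borderline, INSTALL_THRESHOLD))     # False
```

The borderline case shows why the two thresholds differ: the same score is acceptable for a skill the user chose but not for one synthesized from web pages.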

## Settings

- `agent.learn_autopilot`: pre-turn gap detection + auto-research (default **on**).
- `agent.self_improve`: the reflection pass that writes `[lesson]`/`[gap]` memory
bullets (default **on**).
- **Skill scanner**: Settings → Security shows whether SkillSpector is active and
installs it in one click (`skill_scanner_status` / `install_skill_scanner` commands).

## Tested

`xconsole-bench` modes exercise every layer:

- `selftest` (pure, no model/network): injection refused, destructive commands de-fanged,
quarantine + no-overwrite, query sanitization, structural validation, classifier
reply parsing (59 checks).
- `learnclassify`: the gap classifier as a TP/FP/TN/FN confusion matrix.
- `learntune`: A/B sweep of guidance/tool-description variants (how we learned that
prompt-only triggering doesn't work).
- `learn`: the live full loop on a real topic **and** the autopilot end-to-end
(gate → research → inject → grounded answer).

Deferred to a future "overnight" pass (v2): promoting `draft → verified` from
execution outcomes, refining a skill that failed in use, proactive research of
recurring `[gap]`s, and a skills dedup/merge pass.
73 changes: 73 additions & 0 deletions bench/README.md
@@ -73,6 +73,79 @@ With **no hooks configured the loop skips the hook path entirely (0 ms)** - ho
opt-in, so they cost nothing until you add one. The `live_hook_ms` figure is dominated
by process-spawn latency (lower on Unix `sh -c`); a hook that does real work adds its own time.

## 1a. Harder suites: `hard` and `recall`

The core `agent` suite saturates at 100% on `qwen3.5:9b`, so it no longer
discriminates. Two harder, **scored** suites add headroom (so the history can show
learning/regressions):

```bash
# Discriminative agent suite: tool-boundary routing traps + adversarial
# action-vs-explain restraint (a 9B does NOT ace these). Reports an overall pass-rate
# (with a Wilson CI) and a per-tier breakdown (hard / expert).
./src-tauri/target/release/xconsole-bench.exe hard --samples 3

# Reasoning-unlocks-recall experiment: single-hop factual questions answered three
# ways: direct, reason-first, and a dummy "Let me think" buffer.
./src-tauri/target/release/xconsole-bench.exe recall --samples 3

# Closed learning loop: answer unfamiliar-tool tasks COLD (memory), let the
# autoresearch loop build a skill for each, then answer WARM (skill injected).
# Records cold/warm as two history points so the dashboard shows the before/after.
./src-tauri/target/release/xconsole-bench.exe learnloop --samples 3
```

`learnloop` is the experiment that asks "does the agent actually get better by
learning?" It compares cold vs warm pass-rate and adds a persistence check (a second `learn()`
must dedup to the existing skill, no re-research). It also honestly catches the
failure mode: a low-quality researched skill can *regress* a task, which is why
draft skills stay quarantined until execution-verified.

`recall` tests Google Research's *"Thinking to Recall: how reasoning unlocks parametric
knowledge in LLMs"* on our local model: does a reasoning trace surface facts the model
has in its weights but can't recall when answering directly? It reports `direct`,
`reason`, and `buffer` accuracy and the **reasoning gain** (`reason − direct`). Per the
paper, a large positive gain means reasoning unlocks recall (factual priming); a gain
from the dummy `buffer` condition isolates the pure compute-buffer effect; a *negative*
gain flags the paper's failure mode (a hallucinated intermediate fact derailing the
answer). The `hard`/`recall` scenarios were generated and adversarially fact-checked by
a multi-agent workflow so their expected answers are correct.
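
The gain arithmetic is simple enough to show directly. In the Python sketch below, only the direct figure (48/54) comes from the recorded `recall` run; the `reason` and `buffer` counts are made-up placeholders, and `recall_gains` is a hypothetical helper rather than part of `xconsole-bench`.

```python
def recall_gains(direct: float, reason: float, buffer: float) -> dict[str, float]:
    """Gains relative to answering directly, as accuracy differences."""
    return {
        "reasoning_gain": reason - direct,  # factual priming plus extra compute
        "buffer_gain": buffer - direct,     # the pure compute-buffer effect
    }


total = 54
direct_acc = 48 / total   # from the recorded recall run
reason_acc = 51 / total   # placeholder count, not a measured result
buffer_acc = 49 / total   # placeholder count, not a measured result

gains = recall_gains(direct_acc, reason_acc, buffer_acc)
for name, value in gains.items():
    print(f"{name}: {value:+.3f}")
```

A reasoning gain well above the buffer gain would point to factual priming rather than extra compute, and a negative reasoning gain would flag the derailment failure mode described above.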

## 1b. Benchmark history: scores over time (HTML dashboard + OKF bundle)

Every **scored** run (`agent`, `hard`, `recall`, `ablation`, `learn`, `llm`, `all`) is
appended to `bench/results/history.jsonl` and rendered two ways automatically:

- **`bench/results/history.html`**: a self-contained dashboard (open it in any
browser; no server, no external assets) charting pass-rate and latency over time,
with a **Wilson 95% confidence interval** on every pass-rate.
- **`bench/history/`**: the same history as an **[Open Knowledge Format](https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf)**
bundle (Google's portable markdown+YAML standard): one typed concept per run
(`runs/*.md`), a chronological `log.md`, and an `index.md`. Portable, vendor-neutral,
readable in any editor and on GitHub.

```bash
# Rebuild the dashboard + OKF bundle from the existing history (no model needed):
./src-tauri/target/release/xconsole-bench.exe report

# Skip recording a run (e.g. a throwaway/tuning run):
./src-tauri/target/release/xconsole-bench.exe agent --no-history
```

**Methodology** (applied + cited in the dashboard footer):

- **Confidence intervals, not point estimates.** A pass-rate from a few samples is
noisy: 3–5 samples is *often insufficient* and the same source can wander ±1 pass.
Each pass-rate is reported with a Wilson 95% CI; when two runs' intervals overlap,
treat the difference as possible noise rather than a real change. (Google Research, *"Building better AI benchmarks: how
many raters are enough?"*; more items beats more samples for an accuracy metric.)
- **`time for 100 output tokens` = TTFT + 100 / (tok/s)**: one comparable latency
number across runs, reported in milliseconds (so the token term is converted from
seconds). (Artificial Analysis methodology.)
- **Revealed behavior vs. self-report.** The learn-loop eval measures what the model
*does* (does it route to `learn_skill`?) against what it *claims* (the classifier's
self-assessment); the gap is the model's overconfidence. (Google Research,
*"Evaluating alignment of behavioral dispositions in LLMs."*)
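
Both formulas in the methodology list are short enough to check by hand. The Python sketch below is illustrative and is not the benchmark's Rust code: it uses the standard Wilson score interval without continuity correction and z = 1.96, which reproduces the intervals recorded in the history files. The function names and the n = 0 fallback to (0, 1) are assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass-rate (no continuity correction)."""
    if n == 0:
        return 0.0, 1.0  # assumption: no data means the widest possible interval
    p = passes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, (center - margin) / denom), min(1.0, (center + margin) / denom)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two confidence intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def time_for_100_tokens_ms(ttft_ms: float, tokens_per_s: float) -> float:
    """TTFT plus the time to generate 100 tokens, in milliseconds."""
    return ttft_ms + 100.0 / tokens_per_s * 1000.0


def as_percent(interval: tuple[float, float]) -> tuple[int, int]:
    return round(interval[0] * 100), round(interval[1] * 100)


cold = wilson_interval(4, 6)   # learnloop-cold: 67% (4/6)
warm = wilson_interval(5, 6)   # learnloop-warm: 83% (5/6)
print(as_percent(cold))                # (30, 90)
print(as_percent(warm))                # (44, 97)
print(intervals_overlap(cold, warm))   # True
print(round(time_for_100_tokens_ms(1539, 55.6)))  # 3338
```

With six items per condition the cold and warm intervals overlap almost entirely, so the recorded 67% to 83% change cannot yet be distinguished from noise. The latency value lands within a millisecond of the ablation run's recorded 3337 ms; the small difference is consistent with the table showing a rounded tok/s figure.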

## 2. `ollama_latency.ps1` β€” zero-build latency probe

Quick TTFT / tok/s read without compiling, straight against `/api/chat`:
26 changes: 26 additions & 0 deletions bench/history/index.md
@@ -0,0 +1,26 @@
---
type: index
title: xConsole benchmark history
description: Scores and latency of the local-model agent over time, as an Open Knowledge Format bundle.
tags: [benchmark, index]
---

# xConsole benchmark history

A portable [Open Knowledge Format](https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf) bundle: one markdown concept per run, a chronological [log](log.md), and the dashboard at [`../results/history.html`](../results/history.html).

## Runs (newest first)

- [Jun 27 2026 16:28 - learnloop-warm](runs/2026-06-27T14-28-07.174123100-00-00-learnloop-warm.md) - unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%]
- [Jun 27 2026 16:28 - learnloop-cold](runs/2026-06-27T14-28-07.173732500-00-00-learnloop-cold.md) - unfamiliar-tool: memory only: 67% (4/6) [95% CI 30–90%]
- [Jun 27 2026 16:09 - learnloop-warm](runs/2026-06-27T14-09-49.628854400-00-00-learnloop-warm.md) - unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%]
- [Jun 27 2026 16:09 - learnloop-cold](runs/2026-06-27T14-09-49.628492-00-00-learnloop-cold.md) - unfamiliar-tool: memory only: 67% (4/6) [95% CI 30–90%]
- [Jun 27 2026 15:09 - learnloop-warm](runs/2026-06-27T13-09-35.916153600-00-00-learnloop-warm.md) - unfamiliar-tool: after learning: 80% (4/5) [95% CI 38–96%]
- [Jun 27 2026 15:09 - learnloop-cold](runs/2026-06-27T13-09-35.915794-00-00-learnloop-cold.md) - unfamiliar-tool: memory only: 80% (4/5) [95% CI 38–96%]
- [Jun 27 2026 14:01 - recall](runs/2026-06-27T12-01-49.604207400-00-00-recall.md) - recall accuracy (direct): 89% (48/54) [95% CI 78–95%]
- [Jun 27 2026 03:26 - hard](runs/2026-06-27T01-26-04.406853-00-00-hard.md) - hard-suite pass-rate: 86% (12/14) [95% CI 60–96%]
- [Jun 27 2026 03:02 - learn](runs/2026-06-27T01-02-38.235947400-00-00-learn.md) - gap-routing accuracy: 33% (4/12) [95% CI 14–61%]
- [Jun 27 2026 02:59 - ablation](runs/2026-06-27T00-59-48.523315-00-00-ablation.md) - full-prompt pass-rate: 100% (7/7) [95% CI 65–100%]
- [Jun 27 2026 02:55 - agent](runs/2026-06-27T00-55-00.556526200-00-00-agent.md) - scenario pass-rate: 100% (11/11) [95% CI 74–100%]
- [Jun 27 2026 02:53 - agent](runs/2026-06-27T00-53-47.450689500-00-00-agent.md) - scenario pass-rate: 100% (11/11) [95% CI 74–100%]
- [Jun 27 2026 02:52 - llm](runs/2026-06-27T00-52-32.133470100-00-00-llm.md) - latency t100=4124ms, 44.0 tok/s
17 changes: 17 additions & 0 deletions bench/history/log.md
@@ -0,0 +1,17 @@
---
type: log
title: Benchmark run log
---

# Benchmark run log

- Jun 27 2026 02:52 - **llm** latency t100=4124ms, 44.0 tok/s (model qwen3.5:9b)
- Jun 27 2026 02:53 - **agent** scenario pass-rate: 100% (11/11) [95% CI 74–100%] (model qwen3.5:9b)
- Jun 27 2026 02:55 - **agent** scenario pass-rate: 100% (11/11) [95% CI 74–100%] (model qwen3.5:9b)
- Jun 27 2026 02:59 - **ablation** full-prompt pass-rate: 100% (7/7) [95% CI 65–100%] (model qwen3.5:9b)
- Jun 27 2026 03:02 - **learn** gap-routing accuracy: 33% (4/12) [95% CI 14–61%] (model qwen3.5:9b)
- Jun 27 2026 03:26 - **hard** hard-suite pass-rate: 86% (12/14) [95% CI 60–96%] (model qwen3.5:9b)
- Jun 27 2026 14:01 - **recall** recall accuracy (direct): 89% (48/54) [95% CI 78–95%] (model qwen3.5:9b)
- Jun 27 2026 15:09 - **learnloop-warm** unfamiliar-tool: after learning: 80% (4/5) [95% CI 38–96%] (model qwen3.5:9b)
- Jun 27 2026 16:09 - **learnloop-warm** unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%] (model qwen3.5:9b)
- Jun 27 2026 16:28 - **learnloop-warm** unfamiliar-tool: after learning (cautious): 83% (5/6) [95% CI 44–97%] (model qwen3.5:9b)
29 changes: 29 additions & 0 deletions bench/history/runs/2026-06-27T00-52-32.133470100-00-00-llm.md
@@ -0,0 +1,29 @@
---
type: benchmark-run
title: llm - Jun 27 2026 02:52
mode: llm
model: qwen3.5:9b
timestamp: 2026-06-27T00:52:32.133470100+00:00
samples: 1
metric: null
metric_label: latency only
ci_low: 0
ci_high: 1
tags: [benchmark, llm]
---

# llm run - Jun 27 2026 02:52

latency t100=4124ms, 44.0 tok/s

| metric | value |
|---|---|
| model | qwen3.5:9b |
| samples (N) | 1 |
| prompt tokens | 4860 |
| TTFT (ms) | 1853 |
| total/turn (ms) | 5329 |
| gen tok/s | 44 |
| time for 100 tok (ms) | 4124 |

Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient; see Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
29 changes: 29 additions & 0 deletions bench/history/runs/2026-06-27T00-53-47.450689500-00-00-agent.md
@@ -0,0 +1,29 @@
---
type: benchmark-run
title: agent - Jun 27 2026 02:53
mode: agent
model: qwen3.5:9b
timestamp: 2026-06-27T00:53:47.450689500+00:00
samples: 3
metric: 1.0
metric_label: scenario pass-rate
ci_low: 0.741
ci_high: 1
tags: [benchmark, agent]
---

# agent run - Jun 27 2026 02:53

scenario pass-rate: 100% (11/11) [95% CI 74–100%]

| metric | value |
|---|---|
| model | qwen3.5:9b |
| samples (N) | 3 |
| prompt tokens | 3413 |
| TTFT (ms) | 1699 |
| total/turn (ms) | 2197 |
| gen tok/s | 45.4 |
| time for 100 tok (ms) | 3899 |

Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient; see Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
29 changes: 29 additions & 0 deletions bench/history/runs/2026-06-27T00-55-00.556526200-00-00-agent.md
@@ -0,0 +1,29 @@
---
type: benchmark-run
title: agent - Jun 27 2026 02:55
mode: agent
model: qwen3.5:9b
timestamp: 2026-06-27T00:55:00.556526200+00:00
samples: 3
metric: 1.0
metric_label: scenario pass-rate
ci_low: 0.741
ci_high: 1
tags: [benchmark, agent]
---

# agent run - Jun 27 2026 02:55

scenario pass-rate: 100% (11/11) [95% CI 74–100%]

| metric | value |
|---|---|
| model | qwen3.5:9b |
| samples (N) | 3 |
| prompt tokens | 3413 |
| TTFT (ms) | 1718 |
| total/turn (ms) | 2168 |
| gen tok/s | 45.8 |
| time for 100 tok (ms) | 3900 |

Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient; see Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).
29 changes: 29 additions & 0 deletions bench/history/runs/2026-06-27T00-59-48.523315-00-00-ablation.md
@@ -0,0 +1,29 @@
---
type: benchmark-run
title: ablation - Jun 27 2026 02:59
mode: ablation
model: qwen3.5:9b
timestamp: 2026-06-27T00:59:48.523315+00:00
samples: 3
metric: 1.0
metric_label: full-prompt pass-rate
ci_low: 0.646
ci_high: 1
tags: [benchmark, ablation]
---

# ablation run - Jun 27 2026 02:59

full-prompt pass-rate: 100% (7/7) [95% CI 65–100%]

| metric | value |
|---|---|
| model | qwen3.5:9b |
| samples (N) | 3 |
| prompt tokens | 4802 |
| TTFT (ms) | 1539 |
| total/turn (ms) | 3476 |
| gen tok/s | 55.6 |
| time for 100 tok (ms) | 3337 |

Methodology: pass-rates carry a Wilson 95% CI (small N is often insufficient; see Google "how many raters are enough?"); latency uses "time for 100 output tokens" (Artificial Analysis). See [the log](../log.md) and [index](../index.md).