Self-improving local agent: autoresearch learn-loop, SkillSpector scanning, benchmark history + harder evals#10

Open
DemOnJR wants to merge 8 commits into main from remaining-branch-work

DemOnJR (Owner) commented Jun 27, 2026

Remaining work on this branch, merged for review. Headlines: the local agent can now research a capability it lacks and build its own (security-scanned, quarantined) skill, skills are gated by NVIDIA SkillSpector, and there's a benchmark history dashboard plus harder, research-grounded evals — which uncovered a real character-dropping bug in the Ollama streaming path.

Everything was verified on the GNU toolchain (cargo +stable-x86_64-pc-windows-gnu check clean, tsc clean, xconsole-bench selftest 64/64). Model work tested live against local qwen3.5:9b.

🐛 Bug fix — Ollama stream dropped characters (affects the app, not just tests)

OllamaProvider::append_content_delta's cumulative-vs-incremental heuristic treated an incremental token that equals or prefixes the accumulated tail (repeated characters: the 2nd 2 of 22, a 4 in 443/8446, the 2nd l of hello) as a duplicate and dropped it, silently clipping characters from every agent reply. Surfaced by the new recall benchmark, where the answer is the first token (443 → 43). Fixed to treat a chunk as cumulative only when it is strictly longer than the accumulated content. Regression test added to selftest.

🤖 Autoresearch "learn_skill" loop

When the agent needs something it doesn't know, it researches the web and synthesizes a reusable SKILL.md, then applies it. Key finding: a 9B won't self-select a rare tool (trigger recall ~0 across prompt wordings), so the reliable trigger is a pre-turn classifier (assess_gap, recall ~0.75, precision ~1.0) driving an autopilot. Security is the load-bearing part — query sanitized before egress, synthesis grounded only in fetched pages, destructive commands de-fanged, output scanned + quarantined (unverified/) with provenance, never overwriting. Full write-up in AUTORESEARCH.md.

🛡️ NVIDIA SkillSpector security scanner

Every skill (researched or installed) is scanned before save. Fixed the previously-broken parser (it read the wrong schema and would always report "safe"), wired SkillSpector as the primary engine with the built-in heuristic as backstop, added one-click install + an opt-in deep LLM scan via local Ollama (falls back to static for thinking models), surfaced in Settings → Skills. Verified live (malicious skill → 71/HIGH/DO_NOT_INSTALL blocked).

📊 Benchmark history + methodology

Every scored run is recorded and rendered as a self-contained HTML dashboard (bench/results/history.html) and a Google Open Knowledge Format bundle (bench/history/). Four sources inform the output: the Open Knowledge Format for the bundle, Wilson 95% confidence intervals on pass-rates (Google "how many raters are enough?": small N is noisy), time-for-100-tokens latency (Artificial Analysis), and revealed-vs-self-report framing (Google behavioral-dispositions).

🎯 Harder, discriminative evals

The core agent suite saturated at 100%, so:

  • hard — 14 workflow-generated + adversarially-verified scenarios (tool-boundary routing + adversarial restraint). Result 12/14 (86%) — discriminative; found the model fails cross-machine file-transfer routing (upload_file/download_file) while passing all destructive "explain only" restraint traps.
  • recall — tests Google's "Thinking to Recall": direct vs reason-first vs dummy-buffer. Result direct 89% → reason 96% (+7pts); reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching the paper.

Also on this branch

Ablation bench (soul/memory/skills/brief cost vs quality), Claude Code–style lifecycle hooks, and installer hardening (Windows manifest for AV false positives, single-exe stub, MinGW/corepack fixes).

🤖 Generated with Claude Code

DemOnJR and others added 8 commits June 26, 2026 23:38
New `xconsole-bench ablation` mode seeds realistic content into a
dedicated agent home per variant and toggles one of the four prompt
systems (soul, memory, skills index, project brief) off at a time on
the real build_system_prompt path, then re-runs the scenario set on the
local model.

6 variants (full, -soul, -memory, -skills, -brief, bare) x 7 scenarios
(tool routing, persona, deploy/pkgmgr knowledge, math control), with a
per-system contribution table (full - without) for Δpass / Δtokens /
Δlatency. Adds Expect::ContainsAny, BenchEnv::build_prompt_with, and
seed_variant_home.
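
The per-system contribution table (full - without) can be illustrated with a minimal sketch. The struct and function names below are hypothetical and are not taken from bench.rs; only the subtraction rule comes from the description above.

```rust
/// Hypothetical per-variant totals; field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct VariantStats {
    passes: i32,
    prompt_tokens: i32,
    latency_ms: i32,
}

/// One row of the contribution table: what a prompt system adds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Contribution {
    d_pass: i32,
    d_tokens: i32,
    d_latency_ms: i32,
}

/// Contribution of a system = stats of the full prompt minus stats of the
/// variant that has that one system switched off.
fn contribution(full: VariantStats, without: VariantStats) -> Contribution {
    Contribution {
        d_pass: full.passes - without.passes,
        d_tokens: full.prompt_tokens - without.prompt_tokens,
        d_latency_ms: full.latency_ms - without.latency_ms,
    }
}
```

A positive d_pass with a small d_tokens means the system earns its prompt cost; a d_pass of zero with a large d_tokens marks dead weight.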

Key finding (qwen3.5:9b): the four systems are only ~700 of ~4,500
prompt tokens; the tool JSON schema (~3,000 tok) is the dominant cost
and latency leak. The systems buy +3 passes, all on knowledge grounding
(deploy/pkgmgr go 0/3 without them); skills index is ~dead weight for
coding/VPS tasks; memory and brief are redundant for overlapping facts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h it, build a quarantined skill, apply it

When the local agent needs to do something it doesn't know, it now researches
the web and synthesizes a reusable SKILL.md itself, then applies it — instead of
guessing. Inspired by karpathy/autoresearch. Design hardened by a 3-critic
adversarial review before building (AUTORESEARCH.md documents the full system).

The reliable trigger is NOT the model self-selecting a tool: measured trigger
recall for a 9B was ~0 across every prompt wording (it answers from memory even
for a fictional tool). The reliable mechanism is a pre-turn classifier
(autoresearch::assess_gap) — one cheap temp-0 question "named tool you're unsure
of? topic or NONE" — which a 9B answers well (recall ~0.75, precision ~1.0). On a
detected gap the autopilot (agent.rs) researches and injects the skill so the
model applies it that turn. Verified end-to-end: fail2ban ask -> gap detected ->
skill built -> grounded answer (jail.local, maxretry=3, bantime=1h).
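
As a rough illustration of the "topic or NONE" contract, the classifier reply could be parsed along these lines. The function name parse_gap_reply and its trimming rules are assumptions made for this sketch; the real parsing lives in autoresearch::assess_gap and may differ.

```rust
/// Hypothetical parser for the pre-turn classifier reply.
/// Returns Some(topic) when the model names a gap, None for "NONE" or
/// an empty reply.
fn parse_gap_reply(reply: &str) -> Option<String> {
    // Only the first non-empty line is considered; anything after it is
    // treated as chatter.
    let first = reply.lines().map(str::trim).find(|l| !l.is_empty())?;
    // Strip surrounding quotes and a trailing period (assumed noise).
    let topic = first
        .trim_matches(|c: char| c == '"' || c == '\'' || c == '.')
        .trim();
    if topic.is_empty() || topic.eq_ignore_ascii_case("none") {
        None
    } else {
        Some(topic.to_string())
    }
}
```

Treating anything unparseable as "no gap" keeps the autopilot from researching on a malformed reply, which fits the high-precision behavior reported above.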

Security (a researched skill is later FOLLOWED as trusted instructions, so web
text is an injection/RCE laundering vector): the search query is sanitized before
egress (private IPs, internal hosts, the user's own VPS names, credential markers
stripped); synthesis is grounded only in fetched source text at low temp with a
'# TODO: not found in sources' escape hatch; output is structurally validated,
destructive commands de-fanged to '# REQUIRES APPROVAL:' lines, scanned with the
skill_scan engine at a STRICTER threshold than skill_install (>=40 vs 60, so
curl|sh ~55 is refused), and quarantined under unverified/ with provenance
front-matter and an UNVERIFIED banner, never overwriting.
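
The de-fanging step can be sketched as a line-by-line rewrite. The marker list below is a hypothetical placeholder, not the list used in autoresearch.rs; only the '# REQUIRES APPROVAL:' prefix comes from the description above.

```rust
/// Hypothetical substrings that mark a line as destructive.
const DESTRUCTIVE_MARKERS: [&str; 3] = ["rm -rf", "mkfs", "dd if="];

/// Rewrites every destructive line into a comment that needs explicit
/// approval, leaving all other lines untouched.
fn defang(script: &str) -> String {
    script
        .lines()
        .map(|line| {
            let trimmed = line.trim_start();
            // Lines that are already comments are left alone, so running
            // defang twice does not stack prefixes.
            let is_comment = trimmed.starts_with('#');
            let destructive = DESTRUCTIVE_MARKERS.iter().any(|m| line.contains(m));
            if destructive && !is_comment {
                format!("# REQUIRES APPROVAL: {}", trimmed)
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Substring matching is deliberately crude in this sketch. It over-flags, which is the safer failure mode for text that will later be followed as instructions.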

- src-tauri/src/ai/autoresearch.rs: new module (assess_gap, learn,
  process_synthesized pure pipeline, sanitize_query, defang, validate, scan).
- agent.rs: pre-turn autopilot (gated by agent.learn_autopilot, default on).
- tools.rs: learn_skill tool (def + dispatch + ollama tool lists + label).
- context.rs: short LEARN_GUIDANCE backup note (the classifier is the trigger).
- reflection.rs: [gap] detection primes the next turn.
- web_tools.rs: public fetch_text/research_sources + DDG result-URL parser.
- bench.rs: learn / learntune / learnclassify modes + 59-check pure selftest
  (injection refused, defang, quarantine, no-overwrite, query sanitization,
  validation, classifier parsing) — runs with no model/network.

Live web research depends on DuckDuckGo availability (intermittent under load);
the loop degrades safely to 'I'm not certain' when sources can't be fetched.
v2 (deferred): execution-outcome draft->verified promotion, skill refine edge,
proactive research of recurring gaps, skills dedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The skill_scan.scan_with_skillspector stub parsed the WRONG schema (flat
risk_score/risk_severity/filtered_findings), so if a user had SkillSpector
installed it would read zeros and always report 'safe' — a silent hole.
Fixed to the real schema (risk_assessment.{score,severity,recommendation}
+ issues[]), split out a pure testable parse_skillspector_json, and made
is_blocking() honor the DO_NOT_INSTALL recommendation.
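
A minimal sketch of the blocking rule, assuming the parsed report reduces to a score and an optional recommendation string. The struct layout, constant names, and the use of >= for both thresholds are assumptions; severity and issues[] are omitted for brevity.

```rust
/// Hypothetical slice of the parsed SkillSpector output
/// (risk_assessment.score and risk_assessment.recommendation).
#[derive(Debug, Clone, PartialEq)]
struct ScanReport {
    score: u32,
    recommendation: Option<String>,
}

/// Assumed names for the two thresholds mentioned in this PR.
const AUTORESEARCH_THRESHOLD: u32 = 40;
const INSTALL_THRESHOLD: u32 = 60;

impl ScanReport {
    /// Blocks when the scanner says DO_NOT_INSTALL, regardless of score,
    /// or when the score reaches the caller's threshold.
    fn is_blocking(&self, threshold: u32) -> bool {
        self.recommendation.as_deref() == Some("DO_NOT_INSTALL")
            || self.score >= threshold
    }
}
```

Under these assumptions a score of 55 is refused for a researched skill (threshold 40) but would pass a manual install (threshold 60), while a DO_NOT_INSTALL recommendation blocks in both paths.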

- Discovery: skillspector on PATH, else `uv tool dir --bin`/skillspector
  (uv tool run does not work for git-installed tools). Invoked static-only
  (scan -f json --no-llm) — no API key.
- autoresearch::commit_candidate now runs SkillSpector as the PRIMARY scanner
  (external_scan -> scan_skill) with the built-in heuristic as an always-on
  backstop, both at the stricter autoresearch threshold. skill_install already
  routed through scan_skill, so it now works too.
- App commands skill_scanner_status / install_skill_scanner (install via
  `uv tool install git+https://github.com/NVIDIA/skillspector.git`; uv
  provisions Python 3.12). Settings -> Security SkillScannerCard shows the
  active engine and installs in one click.
- bench `scanner` mode + skill_scan unit tests verify it end-to-end.

Verified live (SkillSpector v2.3.7 installed via uv): a malicious SKILL.md
scores 71/HIGH/DO_NOT_INSTALL and is blocked (Data Exfiltration, Privilege
Escalation, Prompt Injection, Supply Chain); a clean one 0/LOW is allowed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…canner UI to Skills tab

Extends the SkillSpector layer with an optional deep scan that runs its
LLM semantic analysis against the local model (OpenAI-compatible Ollama
endpoint — no API key, nothing leaves the machine), plus a Skills-tab UI.

- ScanOptions{deep,base_url,model} + scan_options_from_db (reads
  skills.scanner_deep / skills.scanner_model; endpoint+model derive from the
  active Ollama provider). scan_skill -> scan_skill_with(opts), threaded through
  autoresearch::learn/external_scan, skill_install, scan_skill_path (now takes
  State<Db>), and the bench.
- Deep scan sets SKILLSPECTOR_PROVIDER/OPENAI_BASE_URL/OPENAI_API_KEY/
  SKILLSPECTOR_MODEL and drops --no-llm.
- ROBUSTNESS: a deep scan that fails or times out (90s) falls back to the STATIC
  SkillSpector scan — never down to the weak built-in heuristic — so enabling
  deep is never worse than static. Verified live.
- UI: moved the scanner card from Security to the Skills tab; it shows the active
  engine, installs SkillSpector in one click, and toggles deep analysis + model.
- bench `scanner [--deep]` exercises both paths.
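
The fallback rule (deep failure drops to static, never further) might look like the sketch below. The Engine enum and the closure-based signature are hypothetical. The 90s timeout is not modeled; the deep closure is assumed to return Err when it fails or times out.

```rust
/// Which scan produced the report (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Engine {
    Deep,
    Static,
}

/// Runs the deep scan when enabled and falls back to the static scan on
/// any deep failure. The static scan's own error is passed through.
fn scan_with_fallback<T, E>(
    deep_enabled: bool,
    deep: impl FnOnce() -> Result<T, E>,
    static_scan: impl FnOnce() -> Result<T, E>,
) -> Result<(Engine, T), E> {
    if deep_enabled {
        if let Ok(report) = deep() {
            return Ok((Engine::Deep, report));
        }
        // Deep failed or timed out: fall through to the static scan.
    }
    static_scan().map(|report| (Engine::Static, report))
}
```

Returning the engine alongside the report lets the UI show which path actually ran, which matters when a thinking model silently forces the static fallback.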

Finding: local THINKING models (qwen3.x) are unsuitable for the deep scan —
their <think> traces exhaust SkillSpector's completion budget so LLM batches
fail; the run then falls back to static SkillSpector. Use a non-thinking instruct
or cloud model for deep. The static scan is the always-on workhorse (default).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unded methodology

Every scored bench run (agent/ablation/learn/llm/all) is now recorded to
bench/results/history.jsonl and rendered two ways, applying methodology from
four sources the user asked me to evaluate:

- bench/results/history.html — a self-contained dashboard (inline CSS + vanilla-JS
  SVG charts, data embedded, no external assets) showing pass-rate and latency over
  time. Every pass-rate carries a WILSON 95% CONFIDENCE INTERVAL — the rater paper's
  lesson (3-5 samples is often insufficient; overlapping CIs aren't a real difference;
  even 11/11 shows CI 74-100%). Latency uses "time for 100 output tokens" = TTFT +
  100/(tok/s) (Artificial Analysis). Footer cites all sources.
- bench/history/ — the same history as a Google OPEN KNOWLEDGE FORMAT v0.1 bundle
  (markdown + YAML frontmatter, one typed concept per run, a chronological log.md and
  an index.md). Portable, vendor-neutral, GitHub-renderable. OKF verdict: it fits our
  use case exactly — and our SKILL.md files are already proto-OKF.
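
For reference, the two dashboard metrics can be computed as in this minimal sketch. The function names are illustrative rather than the ones in bench.rs; the formulas are the standard Wilson score interval with z = 1.96 and the TTFT + 100/(tok/s) latency definition quoted above.

```rust
/// Wilson score interval for `passes` out of `n` trials at roughly 95%
/// confidence (z = 1.96). Returns (lower, upper) as fractions in [0, 1].
fn wilson_95(passes: u32, n: u32) -> (f64, f64) {
    if n == 0 {
        // No data: the interval is uninformative.
        return (0.0, 1.0);
    }
    let z = 1.96_f64;
    let n = n as f64;
    let p = passes as f64 / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}

/// "Time for 100 output tokens": time to first token plus the time to
/// stream 100 tokens at the measured rate.
fn time_for_100_tokens(ttft_secs: f64, tokens_per_sec: f64) -> f64 {
    ttft_secs + 100.0 / tokens_per_sec
}
```

With 11 passes out of 11, wilson_95 returns a lower bound of about 0.74, which reproduces the "11/11 shows CI 74-100%" figure.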

New `report` mode regenerates both from history (no model); `--no-history` skips
recording. The learn eval's routing vs. classifier captures "revealed behavior vs.
self-report" (Google behavioral-dispositions paper) — the model's overconfidence.

Seeded with real runs (llm, agent x2, ablation, learn). bench/README.md documents it.

Sources: research.google "how many raters are enough?" + "behavioral dispositions";
artificialanalysis.ai/methodology; Google Cloud Open Knowledge Format.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ecall' reasoning experiment

Two new scored benchmarks (the core `agent` eval saturates at 100%, so it no
longer discriminates) — and a real app bug they uncovered.

BUG FIX (affects the app, not just the bench): OllamaProvider::append_content_delta
silently DROPPED characters. Its cumulative-vs-incremental heuristic treated an
incremental token that equals/prefixes the accumulated tail (repeated chars — the
2nd "2" of "22", a "4" in "443"/"8446", the 2nd "l" of "hello") as a duplicate and
dropped it, clipping characters from every agent reply. Only visible in `recall`,
where the answer is the first token ("443"→"43", "22"→"2"). Fix: treat a chunk as
cumulative only when STRICTLY longer than the accumulated content; otherwise append
verbatim. Regression test added to selftest (64 checks pass).
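
The fixed rule can be sketched as a free function. This is an illustration of the described behavior, not the code in OllamaProvider; the extra starts_with check is an assumption about how a cumulative snapshot is recognized.

```rust
/// `acc` is the text accumulated so far; `chunk` is the newest streamed
/// content field.
fn append_content_delta(acc: &mut String, chunk: &str) {
    // Cumulative only when the chunk is STRICTLY longer than what we have
    // and begins with it (assumed check). Then only the new tail is added.
    if chunk.len() > acc.len() && chunk.starts_with(acc.as_str()) {
        let tail = &chunk[acc.len()..];
        acc.push_str(tail);
    } else {
        // Everything else is an incremental token and is appended verbatim,
        // even if it repeats the end of `acc` (the second "4" of "443").
        acc.push_str(chunk);
    }
}
```

Feeding the tokens "4", "4", "3" now yields "443", and a provider that sends growing snapshots ("he", "hell", "hello") still yields "hello".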

- `hard` — 14 workflow-generated + adversarially-verified scenarios (tool-boundary
  routing traps + adversarial action-vs-explain restraint), tiered, reported with a
  Wilson CI. Result 12/14 (86%) — finally discriminative; surfaced that the model
  fails cross-machine file-transfer routing (download_file/upload_file), while all
  six destructive "explain only" restraint traps passed.
- `recall` — tests Google's "Thinking to Recall: how reasoning unlocks parametric
  knowledge": each single-hop fact answered direct / reason-first / dummy-buffer.
  Result direct 89% → reason 96% (+7pts); CIs overlap (easy facts saturate) but
  reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching
  the paper. Per-item unlocked/regressed counting in the verdict.
- run_scenario_suite() refactor with per-tier reporting; both modes feed the history
  dashboard + OKF bundle. bench/README documents them.
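
The per-item unlocked/regressed count can be illustrated as below. Reducing each item to one pass/fail flag per condition is a simplifying assumption for this sketch (the bench runs several trials per item), and the names are hypothetical.

```rust
/// Hypothetical per-item outcome under the two prompting conditions.
#[derive(Debug, Clone, Copy)]
struct ItemResult {
    direct_pass: bool,
    reason_pass: bool,
}

/// Returns (unlocked, regressed):
/// unlocked  = failed when answered directly, passed with reasoning first;
/// regressed = passed when answered directly, failed with reasoning first.
fn unlocked_regressed(items: &[ItemResult]) -> (usize, usize) {
    let unlocked = items
        .iter()
        .filter(|i| !i.direct_pass && i.reason_pass)
        .count();
    let regressed = items
        .iter()
        .filter(|i| i.direct_pass && !i.reason_pass)
        .count();
    (unlocked, regressed)
}
```

Counting per item complements the aggregate pass-rate: two conditions can have overlapping confidence intervals while one still fixes exactly the items the other missed.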

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scan_or_none / external_scan created their scratch dir keyed only on
process::id(). cargo runs unit tests in parallel within one process, so two
concurrent process_synthesized() calls shared the same dir — one wiped the
other's SKILL.md mid-scan, the scanner read nothing, scored 0, and an injection
skill was wrongly Saved instead of Refused (process_refuses_injection_skill).
Single-threaded runs (the bench selftest) never hit it. Fix: unique per-call
scratch dir (pid + a process-wide atomic counter), which also makes concurrent
learning on multiple agent turns safe.
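
A minimal sketch of the per-call scratch directory naming, assuming the directory lives under the system temp dir. The function name, static name, and path layout are hypothetical.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter; every call gets a distinct sequence number even
/// when tests run in parallel threads of the same process.
static SCRATCH_SEQ: AtomicU64 = AtomicU64::new(0);

/// Builds a scratch path keyed on pid AND a per-call counter, so two
/// concurrent scans never share a directory. The path is only computed
/// here; creating the directory is left to the caller.
fn unique_scratch_dir(prefix: &str) -> PathBuf {
    let seq = SCRATCH_SEQ.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}-{}-{}", prefix, std::process::id(), seq);
    std::env::temp_dir().join(name)
}
```

Relaxed ordering is enough here because the counter only needs unique values, not ordering relative to other memory operations.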

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…heck

`learnloop` measures whether the agent actually improves by learning: answer
unfamiliar-tool tasks COLD (memory only), run the autoresearch loop to build a
skill for each, then answer WARM (researched skill injected, as the autopilot
does). Records cold + warm as two history points so the dashboard shows the
before/after, and confirms skills PERSIST (a second learn() dedups instead of
re-researching).

First live run (5 tasks, all via real web research): persistence 5/5; net lift
0 (cold 4/5 → warm 4/5). 4/5 tasks the 9B already knew (no headroom), and the
one genuine gap (caddy unix-socket syntax) REGRESSED 1/3→0/3 when given the
researched skill — a real instance of the skill-quality risk. Honest outcome:
the learning plumbing works end-to-end, but skill quality (and the deferred
execution-verified draft→verified promotion) is the bottleneck, and harder
genuine-gap tasks are needed to measure outcome lift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>