Self-improving local agent: autoresearch learn-loop, SkillSpector scanning, benchmark history + harder evals by DemOnJR · Pull Request #10 · DemOnJR/xConsole

DemOnJR · 2026-06-27T12:08:32Z

Remaining work on this branch, merged for review. Headlines: the local agent can now research a capability it lacks and build its own (security-scanned, quarantined) skill, skills are gated by NVIDIA SkillSpector, and there's a benchmark history dashboard plus harder, research-grounded evals — which uncovered a real character-dropping bug in the Ollama streaming path.

Everything was verified on the GNU toolchain (cargo +stable-x86_64-pc-windows-gnu check clean, tsc clean, xconsole-bench selftest 64/64). Model work tested live against local qwen3.5:9b.

🐛 Bug fix — Ollama stream dropped characters (affects the app, not just tests)

OllamaProvider::append_content_delta's cumulative-vs-incremental heuristic treated an incremental token that equals/prefixes the accumulated tail (repeated chars — the 2nd 2 of 22, a 4 in 443/8446, the 2nd l of hello) as a duplicate and dropped it, silently clipping characters from every agent reply. Surfaced by the new recall benchmark (where the answer is the first token: 443→43). Fixed to only treat a chunk as cumulative when strictly longer than the accumulated content. Regression test added to selftest.
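The fixed rule can be illustrated with a minimal sketch. This is not the actual `OllamaProvider` code: the free function, its signature, and the `starts_with` prefix check are assumptions made for illustration. Only the "strictly longer than the accumulated content" condition comes from the description above.

```rust
/// Hypothetical stand-in for `OllamaProvider::append_content_delta`.
/// `acc` is the text accumulated so far; `chunk` is the next streamed piece,
/// which may be incremental (a new token) or cumulative (the full text so far).
fn append_content_delta(acc: &mut String, chunk: &str) {
    // Assumption: a cumulative chunk repeats everything in `acc` and adds more,
    // so it must be strictly longer than `acc` and start with it.
    let is_cumulative = chunk.len() > acc.len() && chunk.starts_with(acc.as_str());
    if is_cumulative {
        // Replace the buffer with the newer, longer snapshot.
        acc.clear();
        acc.push_str(chunk);
    } else {
        // Anything else, including a token that repeats the tail, is appended verbatim.
        acc.push_str(chunk);
    }
}
```

Under this sketch, feeding the tokens "4", "4", "3" yields "443" rather than "43". An incremental token that happens to start with the whole accumulated text would still be misread as cumulative, so a residual ambiguity remains when the buffer is very short.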

🤖 Autoresearch "learn_skill" loop

When the agent needs something it doesn't know, it researches the web and synthesizes a reusable SKILL.md, then applies it. Key finding: a 9B won't self-select a rare tool (trigger recall ~0 across prompt wordings), so the reliable trigger is a pre-turn classifier (assess_gap, recall ~0.75, precision ~1.0) driving an autopilot. Security is the load-bearing part — query sanitized before egress, synthesis grounded only in fetched pages, destructive commands de-fanged, output scanned + quarantined (unverified/) with provenance, never overwriting. Full write-up in AUTORESEARCH.md.
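As a rough illustration of the classifier step, the sketch below parses a one-line "topic or NONE" reply into an optional gap topic. The function name `parse_gap_reply` and its cleanup rules are hypothetical; the real `assess_gap` parsing is not shown in this PR text.

```rust
/// Hypothetical parser for the pre-turn classifier's one-line reply.
/// Returns `Some(topic)` when the model names a gap, `None` when it says NONE.
fn parse_gap_reply(reply: &str) -> Option<String> {
    // Use only the first non-empty line; small models sometimes add chatter after it.
    let line = reply.lines().map(str::trim).find(|l| !l.is_empty())?;
    // Strip surrounding quotes, backticks, and trailing full stops.
    let topic = line
        .trim_matches(|c: char| c == '"' || c == '\'' || c == '`' || c == '.')
        .trim();
    if topic.is_empty() || topic.eq_ignore_ascii_case("none") {
        None
    } else {
        Some(topic.to_string())
    }
}
```

Returning `None` for empty or unparseable replies keeps the autopilot conservative: when in doubt, no research is triggered.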

🛡️ NVIDIA SkillSpector security scanner

Every skill (researched or installed) is scanned before save. Fixed the previously-broken parser (it read the wrong schema and would always report "safe"), wired SkillSpector as the primary engine with the built-in heuristic as backstop, added one-click install + an opt-in deep LLM scan via local Ollama (falls back to static for thinking models), surfaced in Settings → Skills. Verified live (malicious skill → 71/HIGH/DO_NOT_INSTALL blocked).

📊 Benchmark history + methodology

Every scored run is recorded and rendered as a self-contained HTML dashboard (bench/results/history.html) and a Google Open Knowledge Format bundle (bench/history/). Methodology draws on four sources: Wilson 95% confidence intervals on pass rates (Google "how many raters are enough?": small N is noisy), time-for-100-tokens latency (Artificial Analysis), revealed-vs-self-report framing (Google behavioral-dispositions), and the Google Cloud Open Knowledge Format for the portable bundle.
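For reference, the Wilson score interval used on pass rates can be computed as in the sketch below. This is the standard formula, not code taken from the bench; the function name and signature are illustrative.

```rust
/// Wilson score interval for `passes` successes out of `n` trials.
/// `z` is the normal quantile (1.96 for a 95% interval).
/// Returns `(lower, upper)` as fractions in [0, 1]; `n` must be > 0.
fn wilson_interval(passes: u32, n: u32, z: f64) -> (f64, f64) {
    assert!(n > 0 && passes <= n);
    let n_f = n as f64;
    let p = passes as f64 / n_f;
    let z2 = z * z;
    let denom = 1.0 + z2 / n_f;
    // Centre of the interval is pulled toward 0.5 for small n.
    let center = (p + z2 / (2.0 * n_f)) / denom;
    let half = z * (p * (1.0 - p) / n_f + z2 / (4.0 * n_f * n_f)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}
```

With `z = 1.96`, a perfect 11/11 run gives a lower bound of about 0.74, which is why a small perfect score still carries a wide interval.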

🎯 Harder, discriminative evals

The core agent suite saturated at 100%, so:

- `hard`: 14 workflow-generated + adversarially-verified scenarios (tool-boundary routing + adversarial restraint). Result 12/14 (86%), which is discriminative; it found that the model fails cross-machine file-transfer routing (upload_file/download_file) while passing all destructive "explain only" restraint traps.
- `recall`: tests Google's "Thinking to Recall": direct vs reason-first vs dummy-buffer. Result direct 89% → reason 96% (+7pts); reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching the paper.

Also on this branch

Ablation bench (soul/memory/skills/brief cost vs quality), Claude Code–style lifecycle hooks, and installer hardening (Windows manifest for AV false positives, single-exe stub, MinGW/corepack fixes).

🤖 Generated with Claude Code

New `xconsole-bench ablation` mode seeds realistic content into a dedicated agent home per variant and toggles one of the four prompt systems (soul, memory, skills index, project brief) off at a time on the real build_system_prompt path, then re-runs the scenario set on the local model.

6 variants (full, -soul, -memory, -skills, -brief, bare) x 7 scenarios (tool routing, persona, deploy/pkgmgr knowledge, math control), with a per-system contribution table (full - without) for Δpass / Δtokens / Δlatency. Adds Expect::ContainsAny, BenchEnv::build_prompt_with, and seed_variant_home.

Key finding (qwen3.5:9b): the four systems are only ~700 of ~4,500 prompt tokens; the tool JSON schema (~3,000 tok) is the dominant cost and latency leak. The systems buy +3 passes, all on knowledge grounding (deploy/pkgmgr go 0/3 without them); skills index is ~dead weight for coding/VPS tasks; memory and brief are redundant for overlapping facts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…h it, build a quarantined skill, apply it

When the local agent needs to do something it doesn't know, it now researches the web and synthesizes a reusable SKILL.md itself, then applies it, instead of guessing. Inspired by karpathy/autoresearch. Design hardened by a 3-critic adversarial review before building (AUTORESEARCH.md documents the full system).

The reliable trigger is NOT the model self-selecting a tool: measured trigger recall for a 9B was ~0 across every prompt wording (it answers from memory even for a fictional tool). The reliable mechanism is a pre-turn classifier (autoresearch::assess_gap), one cheap temp-0 question "named tool you're unsure of? topic or NONE", which a 9B answers well (recall ~0.75, precision ~1.0). On a detected gap the autopilot (agent.rs) researches and injects the skill so the model applies it that turn. Verified end-to-end: fail2ban ask -> gap detected -> skill built -> grounded answer (jail.local, maxretry=3, bantime=1h).

Security (a researched skill is later FOLLOWED as trusted instructions, so web text is an injection/RCE laundering vector):

- the search query is sanitized before egress (private IPs, internal hosts, the user's own VPS names, credential markers stripped);
- synthesis is grounded only in fetched source text at low temp with a '# TODO: not found in sources' escape hatch;
- output is structurally validated, destructive commands de-fanged to '# REQUIRES APPROVAL:' lines, scanned with the skill_scan engine at a STRICTER threshold than skill_install (>=40 vs 60, so curl|sh ~55 is refused), and quarantined under unverified/ with provenance front-matter and an UNVERIFIED banner, never overwriting.

Changes:

- src-tauri/src/ai/autoresearch.rs: new module (assess_gap, learn, process_synthesized pure pipeline, sanitize_query, defang, validate, scan).
- agent.rs: pre-turn autopilot (gated by agent.learn_autopilot, default on).
- tools.rs: learn_skill tool (def + dispatch + ollama tool lists + label).
- context.rs: short LEARN_GUIDANCE backup note (the classifier is the trigger).
- reflection.rs: [gap] detection primes the next turn.
- web_tools.rs: public fetch_text/research_sources + DDG result-URL parser.
- bench.rs: learn / learntune / learnclassify modes + 59-check pure selftest (injection refused, defang, quarantine, no-overwrite, query sanitization, validation, classifier parsing); runs with no model/network.

Live web research depends on DuckDuckGo availability (intermittent under load); the loop degrades safely to 'I'm not certain' when sources can't be fetched.

v2 (deferred): execution-outcome draft->verified promotion, skill refine edge, proactive research of recurring gaps, skills dedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
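The de-fanging step described in this commit can be sketched as a pure line-by-line rewrite. The marker list below is hypothetical (the real `defang` rules are not shown in this PR text); only the `# REQUIRES APPROVAL:` prefix comes from the commit message.

```rust
/// Hypothetical marker list, for illustration only.
const DESTRUCTIVE_MARKERS: [&str; 4] = ["rm -rf", "mkfs", "dd if=", "shutdown"];

/// Rewrites any line containing a destructive marker into a comment that
/// requires explicit approval, leaving all other lines untouched.
fn defang(skill_body: &str) -> String {
    skill_body
        .lines()
        .map(|line| {
            if DESTRUCTIVE_MARKERS.iter().any(|m| line.contains(*m)) {
                format!("# REQUIRES APPROVAL: {}", line.trim())
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping this step pure (string in, string out) is what lets it run in a selftest with no model or network, as the commit describes for the rest of the pipeline.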

The skill_scan.scan_with_skillspector stub parsed the WRONG schema (flat risk_score/risk_severity/filtered_findings), so if a user had SkillSpector installed it would read zeros and always report 'safe': a silent hole. Fixed to the real schema (risk_assessment.{score,severity,recommendation} + issues[]), split out a pure testable parse_skillspector_json, and made is_blocking() honor the DO_NOT_INSTALL recommendation.

- Discovery: skillspector on PATH, else `uv tool dir --bin`/skillspector (uv tool run does not work for git-installed tools). Invoked static-only (scan -f json --no-llm), so no API key is needed.
- autoresearch::commit_candidate now runs SkillSpector as the PRIMARY scanner (external_scan -> scan_skill) with the built-in heuristic as an always-on backstop, both at the stricter autoresearch threshold. skill_install already routed through scan_skill, so it now works too.
- App commands skill_scanner_status / install_skill_scanner (install via `uv tool install git+https://github.com/NVIDIA/skillspector.git`; uv provisions Python 3.12). Settings -> Security SkillScannerCard shows the active engine and installs in one click.
- bench `scanner` mode + skill_scan unit tests verify it end-to-end.

Verified live (SkillSpector v2.3.7 installed via uv): a malicious SKILL.md scores 71/HIGH/DO_NOT_INSTALL and is blocked (Data Exfiltration, Privilege Escalation, Prompt Injection, Supply Chain); a clean one 0/LOW is allowed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
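A minimal sketch of the blocking decision, assuming the rule is "block on a DO_NOT_INSTALL recommendation or on a score at or above the caller's threshold". The struct layout, field types, and the `>=` comparison are assumptions; the field names and the 40 vs 60 thresholds come from the commit messages.

```rust
/// Minimal mirror of the `risk_assessment` object described above.
/// Field types are assumptions; only the field names come from the commit message.
#[allow(dead_code)] // `severity` is carried for reporting, not for the decision.
struct RiskAssessment {
    score: u32,
    severity: String,
    recommendation: String,
}

/// Thresholds quoted in the autoresearch commit (stricter for researched skills).
const AUTORESEARCH_THRESHOLD: u32 = 40;
const INSTALL_THRESHOLD: u32 = 60;

/// Blocks when the scanner says DO_NOT_INSTALL or the score reaches `threshold`.
fn is_blocking(risk: &RiskAssessment, threshold: u32) -> bool {
    risk.recommendation == "DO_NOT_INSTALL" || risk.score >= threshold
}
```

Under these assumptions a skill scoring around 55 passes the install threshold but is refused on the autoresearch path, which matches the curl|sh example given for the stricter threshold.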

…canner UI to Skills tab

Extends the SkillSpector layer with an optional deep scan that runs its LLM semantic analysis against the local model (OpenAI-compatible Ollama endpoint: no API key, nothing leaves the machine), plus a Skills-tab UI.

- ScanOptions{deep,base_url,model} + scan_options_from_db (reads skills.scanner_deep / skills.scanner_model; endpoint+model derive from the active Ollama provider). scan_skill -> scan_skill_with(opts), threaded through autoresearch::learn/external_scan, skill_install, scan_skill_path (now takes State<Db>), and the bench.
- Deep scan sets SKILLSPECTOR_PROVIDER/OPENAI_BASE_URL/OPENAI_API_KEY/SKILLSPECTOR_MODEL and drops --no-llm.
- ROBUSTNESS: a deep scan that fails or times out (90s) falls back to the STATIC SkillSpector scan, never down to the weak built-in heuristic, so enabling deep is never worse than static. Verified live.
- UI: moved the scanner card from Security to the Skills tab; it shows the active engine, installs SkillSpector in one click, and toggles deep analysis + model.
- bench `scanner [--deep]` exercises both paths.

Finding: local THINKING models (qwen3.x) are unsuitable for the deep scan. Their <think> traces exhaust SkillSpector's completion budget so LLM batches fail; the run then falls back to static SkillSpector. Use a non-thinking instruct or cloud model for deep. The static scan is the always-on workhorse (default).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
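The robustness rule (a failed or timed-out deep scan degrades to the static scan) can be sketched with two closures standing in for the real scanner invocations. `ScanReport` and `scan_with_fallback` are hypothetical names; the actual `scan_skill_with` signature is not shown in this PR text.

```rust
/// Hypothetical result of one scan pass.
#[derive(Debug, PartialEq)]
struct ScanReport {
    engine: &'static str,
    score: u32,
}

/// Runs the deep (LLM) scan when enabled; on any error, including a timeout
/// reported by the caller, falls back to the static scan rather than failing.
fn scan_with_fallback<D, S>(
    deep_enabled: bool,
    deep: D,
    static_scan: S,
) -> Result<ScanReport, String>
where
    D: FnOnce() -> Result<ScanReport, String>,
    S: FnOnce() -> Result<ScanReport, String>,
{
    if deep_enabled {
        if let Ok(report) = deep() {
            return Ok(report);
        }
        // Deep scan failed or timed out: degrade to static, never skip scanning.
    }
    static_scan()
}
```

Because the static closure is the only fallback, enabling the deep option in this sketch can never produce a weaker result than leaving it off.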

…unded methodology

Every scored bench run (agent/ablation/learn/llm/all) is now recorded to bench/results/history.jsonl and rendered two ways, applying methodology from four sources the user asked me to evaluate:

- bench/results/history.html: a self-contained dashboard (inline CSS + vanilla-JS SVG charts, data embedded, no external assets) showing pass-rate and latency over time. Every pass-rate carries a WILSON 95% CONFIDENCE INTERVAL, the rater paper's lesson (3-5 samples is often insufficient; overlapping CIs aren't a real difference; even 11/11 shows CI 74-100%). Latency uses "time for 100 output tokens" = TTFT + 100/(tok/s) (Artificial Analysis). Footer cites all sources.
- bench/history/: the same history as a Google OPEN KNOWLEDGE FORMAT v0.1 bundle (markdown + YAML frontmatter, one typed concept per run, a chronological log.md and an index.md). Portable, vendor-neutral, GitHub-renderable. OKF verdict: it fits our use case exactly, and our SKILL.md files are already proto-OKF.

New `report` mode regenerates both from history (no model); `--no-history` skips recording. The learn eval's routing vs. classifier captures "revealed behavior vs. self-report" (Google behavioral-dispositions paper): the model's overconfidence. Seeded with real runs (llm, agent x2, ablation, learn). bench/README.md documents it.

Sources: research.google "how many raters are enough?" + "behavioral dispositions"; artificialanalysis.ai/methodology; Google Cloud Open Knowledge Format.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
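The latency metric quoted above, TTFT + 100/(tok/s), is small enough to show directly. The function name and the `Option` return for invalid throughput are illustrative choices, not taken from the bench code.

```rust
/// "Time for 100 output tokens" in seconds: TTFT + 100 / (tokens per second).
/// Returns `None` when throughput is not a positive, finite number.
fn time_for_100_tokens(ttft_secs: f64, tokens_per_sec: f64) -> Option<f64> {
    if !tokens_per_sec.is_finite() || tokens_per_sec <= 0.0 {
        return None;
    }
    Some(ttft_secs + 100.0 / tokens_per_sec)
}
```

With made-up inputs of a 0.5 s time to first token and 50 tok/s, the metric comes to 2.5 s. Folding both terms into one number keeps a fast-starting but slow-generating model from looking better than it feels in use.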

…ecall' reasoning experiment

Two new scored benchmarks (the core `agent` eval saturates at 100%, so it no longer discriminates), and a real app bug they uncovered.

BUG FIX (affects the app, not just the bench): OllamaProvider::append_content_delta silently DROPPED characters. Its cumulative-vs-incremental heuristic treated an incremental token that equals/prefixes the accumulated tail (repeated chars: the 2nd "2" of "22", a "4" in "443"/"8446", the 2nd "l" of "hello") as a duplicate and dropped it, clipping characters from every agent reply. Only visible in `recall`, where the answer is the first token ("443"→"43", "22"→"2"). Fix: treat a chunk as cumulative only when STRICTLY longer than the accumulated content; otherwise append verbatim. Regression test added to selftest (64 checks pass).

- `hard`: 14 workflow-generated + adversarially-verified scenarios (tool-boundary routing traps + adversarial action-vs-explain restraint), tiered, reported with a Wilson CI. Result 12/14 (86%), finally discriminative; surfaced that the model fails cross-machine file-transfer routing (download_file/upload_file), while all six destructive "explain only" restraint traps passed.
- `recall`: tests Google's "Thinking to Recall: how reasoning unlocks parametric knowledge": each single-hop fact answered direct / reason-first / dummy-buffer. Result direct 89% → reason 96% (+7pts); CIs overlap (easy facts saturate) but reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching the paper. Per-item unlocked/regressed counting in the verdict.
- run_scenario_suite() refactor with per-tier reporting; both modes feed the history dashboard + OKF bundle. bench/README documents them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
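The per-item unlocked/regressed counting mentioned for `recall` can be sketched as a comparison of two pass/fail vectors. The function name and the boolean-slice representation are assumptions; the bench's actual data structures are not shown here.

```rust
/// Counts items that reasoning fixed (`unlocked`) and items it broke (`regressed`).
/// Each slice holds one pass/fail flag per item, in the same item order.
fn unlocked_regressed(direct: &[bool], reason: &[bool]) -> (usize, usize) {
    assert_eq!(direct.len(), reason.len(), "both conditions must cover the same items");
    // Unlocked: failed when answered directly, passed with reason-first.
    let unlocked = direct.iter().zip(reason).filter(|(d, r)| !**d && **r).count();
    // Regressed: passed when answered directly, failed with reason-first.
    let regressed = direct.iter().zip(reason).filter(|(d, r)| **d && !**r).count();
    (unlocked, regressed)
}
```

Counting per item rather than comparing aggregate pass rates is what lets the verdict distinguish a real shift from one that nets out, which matters when the confidence intervals overlap.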

scan_or_none / external_scan created their scratch dir keyed only on process::id(). cargo runs unit tests in parallel within one process, so two concurrent process_synthesized() calls shared the same dir: one wiped the other's SKILL.md mid-scan, the scanner read nothing, scored 0, and an injection skill was wrongly Saved instead of Refused (process_refuses_injection_skill). Single-threaded runs (the bench selftest) never hit it.

Fix: unique per-call scratch dir (pid + a process-wide atomic counter), which also makes concurrent learning on multiple agent turns safe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
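A minimal sketch of the "pid + process-wide atomic counter" naming scheme follows. The names `SCRATCH_SEQ` and `unique_scratch_dir`, the directory-name layout, and the use of the system temp dir are assumptions for illustration.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter; each call takes the next value, even across threads.
static SCRATCH_SEQ: AtomicU64 = AtomicU64::new(0);

/// Builds a scratch path that is unique per call within this process
/// (counter) and across concurrently running processes (pid).
/// The directory itself is not created here.
fn unique_scratch_dir(prefix: &str) -> PathBuf {
    let seq = SCRATCH_SEQ.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}-{}-{}", prefix, std::process::id(), seq);
    std::env::temp_dir().join(name)
}
```

`Ordering::Relaxed` is enough here because only the uniqueness of the counter value matters, not its ordering relative to other memory operations.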

…heck

`learnloop` measures whether the agent actually improves by learning: answer unfamiliar-tool tasks COLD (memory only), run the autoresearch loop to build a skill for each, then answer WARM (researched skill injected, as the autopilot does). Records cold + warm as two history points so the dashboard shows the before/after, and confirms skills PERSIST (a second learn() dedups instead of re-researching).

First live run (5 tasks, all via real web research): persistence 5/5; net lift 0 (cold 4/5 → warm 4/5). 4/5 tasks the 9B already knew (no headroom), and the one genuine gap (caddy unix-socket syntax) REGRESSED 1/3→0/3 when given the researched skill, a real instance of the skill-quality risk.

Honest outcome: the learning plumbing works end-to-end, but skill quality (and the deferred execution-verified draft→verified promotion) is the bottleneck, and harder genuine-gap tasks are needed to measure outcome lift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

DemOnJR and others added 8 commits June 26, 2026 23:38

DemOnJR wants to merge 8 commits into main from remaining-branch-work
