Self-improving local agent: autoresearch learn-loop, SkillSpector scanning, benchmark history + harder evals #10
Open
DemOnJR wants to merge 8 commits into
New `xconsole-bench ablation` mode seeds realistic content into a dedicated agent home per variant and toggles one of the four prompt systems (soul, memory, skills index, project brief) off at a time on the real build_system_prompt path, then re-runs the scenario set on the local model.
6 variants (full, -soul, -memory, -skills, -brief, bare) x 7 scenarios (tool routing, persona, deploy/pkgmgr knowledge, math control), with a per-system contribution table (full - without) for Δpass / Δtokens / Δlatency.
Adds Expect::ContainsAny, BenchEnv::build_prompt_with, and seed_variant_home.
Key finding (qwen3.5:9b): the four systems are only ~700 of ~4,500 prompt tokens; the tool JSON schema (~3,000 tok) is the dominant cost and latency leak. The systems buy +3 passes, all on knowledge grounding (deploy/pkgmgr go 0/3 without them); skills index is ~dead weight for coding/VPS tasks; memory and brief are redundant for overlapping facts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h it, build a quarantined skill, apply it
When the local agent needs to do something it doesn't know, it now researches the web and synthesizes a reusable SKILL.md itself, then applies it, instead of guessing. Inspired by karpathy/autoresearch. Design hardened by a 3-critic adversarial review before building (AUTORESEARCH.md documents the full system).
The reliable trigger is NOT the model self-selecting a tool: measured trigger recall for a 9B was ~0 across every prompt wording (it answers from memory even for a fictional tool). The reliable mechanism is a pre-turn classifier (autoresearch::assess_gap), one cheap temp-0 question ("named tool you're unsure of? topic or NONE"), which a 9B answers well (recall ~0.75, precision ~1.0). On a detected gap the autopilot (agent.rs) researches and injects the skill so the model applies it that turn. Verified end-to-end: fail2ban ask -> gap detected -> skill built -> grounded answer (jail.local, maxretry=3, bantime=1h).
Security (a researched skill is later FOLLOWED as trusted instructions, so web text is an injection/RCE laundering vector): the search query is sanitized before egress (private IPs, internal hosts, the user's own VPS names, credential markers stripped); synthesis is grounded only in fetched source text at low temp with a '# TODO: not found in sources' escape hatch; output is structurally validated, destructive commands de-fanged to '# REQUIRES APPROVAL:' lines, scanned with the skill_scan engine at a STRICTER threshold than skill_install (>=40 vs 60, so curl|sh ~55 is refused), and quarantined under unverified/ with provenance front-matter and an UNVERIFIED banner, never overwriting.
- src-tauri/src/ai/autoresearch.rs: new module (assess_gap, learn, process_synthesized pure pipeline, sanitize_query, defang, validate, scan).
- agent.rs: pre-turn autopilot (gated by agent.learn_autopilot, default on).
- tools.rs: learn_skill tool (def + dispatch + ollama tool lists + label).
- context.rs: short LEARN_GUIDANCE backup note (the classifier is the trigger).
- reflection.rs: [gap] detection primes the next turn.
- web_tools.rs: public fetch_text/research_sources + DDG result-URL parser.
- bench.rs: learn / learntune / learnclassify modes + 59-check pure selftest (injection refused, defang, quarantine, no-overwrite, query sanitization, validation, classifier parsing), which runs with no model/network.
Live web research depends on DuckDuckGo availability (intermittent under load); the loop degrades safely to 'I'm not certain' when sources can't be fetched.
v2 (deferred): execution-outcome draft->verified promotion, skill refine edge, proactive research of recurring gaps, skills dedup.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
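The two text-level safeguards named in this commit, query sanitization before egress and de-fanging of destructive commands, can be sketched roughly as follows. This is a minimal illustration and not the code in autoresearch.rs: the signatures, the `private_hosts` parameter, and both marker lists are assumptions made for the example. Only the function names and the `# REQUIRES APPROVAL:` prefix come from the commit message.

```rust
use std::net::Ipv4Addr;

// Hypothetical marker lists; the real module's lists are not shown in this PR.
const CREDENTIAL_MARKERS: [&str; 3] = ["password=", "token=", "api_key="];
const DESTRUCTIVE_MARKERS: [&str; 3] = ["rm -rf", "mkfs", "dd if="];
const APPROVAL_PREFIX: &str = "# REQUIRES APPROVAL:";

/// Drop tokens that would leak private details into a web search query.
/// `private_hosts` stands in for the user's own VPS / internal host names.
fn sanitize_query(query: &str, private_hosts: &[&str]) -> String {
    query
        .split_whitespace()
        .filter(|tok| {
            let lower = tok.to_lowercase();
            // Private, loopback, and link-local IPv4 addresses never leave the machine.
            let is_private_ip = tok
                .parse::<Ipv4Addr>()
                .map(|ip| ip.is_private() || ip.is_loopback() || ip.is_link_local())
                .unwrap_or(false);
            let is_known_host = private_hosts.iter().any(|h| lower == h.to_lowercase());
            let has_credential = CREDENTIAL_MARKERS.iter().any(|m| lower.contains(*m));
            !(is_private_ip || is_known_host || has_credential)
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Turn destructive command lines into inert comments that need human approval.
/// Lines that are already de-fanged are left alone, so the pass is idempotent.
fn defang(skill_body: &str) -> String {
    skill_body
        .lines()
        .map(|line| {
            let trimmed = line.trim_start();
            let already_defanged = trimmed.starts_with(APPROVAL_PREFIX);
            let is_destructive = DESTRUCTIVE_MARKERS.iter().any(|m| trimmed.contains(*m));
            if is_destructive && !already_defanged {
                format!("{} {}", APPROVAL_PREFIX, trimmed)
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Under these assumptions, `sanitize_query("fail2ban setup on 10.0.0.5 prod-vps token=abc123", &["prod-vps"])` keeps only `fail2ban setup on`, and a line such as `rm -rf /var/lib/fail2ban` comes back as a comment prefixed with `# REQUIRES APPROVAL:`. Substring matching is deliberately crude here; a production filter would need a more careful command parser.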
The skill_scan.scan_with_skillspector stub parsed the WRONG schema (flat
risk_score/risk_severity/filtered_findings), so if a user had SkillSpector
installed it would read zeros and always report 'safe' — a silent hole.
Fixed to the real schema (risk_assessment.{score,severity,recommendation}
+ issues[]), split out a pure testable parse_skillspector_json, and made
is_blocking() honor the DO_NOT_INSTALL recommendation.
- Discovery: skillspector on PATH, else `uv tool dir --bin`/skillspector
(uv tool run does not work for git-installed tools). Invoked static-only
(scan -f json --no-llm) — no API key.
- autoresearch::commit_candidate now runs SkillSpector as the PRIMARY scanner
(external_scan -> scan_skill) with the built-in heuristic as an always-on
backstop, both at the stricter autoresearch threshold. skill_install already
routed through scan_skill, so it now works too.
- App commands skill_scanner_status / install_skill_scanner (install via
`uv tool install git+https://github.com/NVIDIA/skillspector.git`; uv
provisions Python 3.12). Settings -> Security SkillScannerCard shows the
active engine and installs in one click.
- bench `scanner` mode + skill_scan unit tests verify it end-to-end.
Verified live (SkillSpector v2.3.7 installed via uv): a malicious SKILL.md
scores 71/HIGH/DO_NOT_INSTALL and is blocked (Data Exfiltration, Privilege
Escalation, Prompt Injection, Supply Chain); a clean one 0/LOW is allowed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
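The blocking rule described in this commit (a score threshold plus the DO_NOT_INSTALL recommendation) might look like the sketch below. The struct, its field types, and the `is_blocking` signature are assumptions for illustration; the 40 and 60 thresholds are the ones quoted in the autoresearch commit, and treating both comparisons as inclusive is also an assumption.

```rust
/// Subset of the SkillSpector `risk_assessment` object named in the commit message.
/// Field types are guesses; the real parser is not shown in this PR text.
#[allow(dead_code)]
struct RiskAssessment {
    score: u32,
    severity: String,
    recommendation: String,
}

/// Stricter threshold applied to researched (autoresearch) skills.
const AUTORESEARCH_THRESHOLD: u32 = 40;
/// Threshold applied to skills the user installs explicitly.
const INSTALL_THRESHOLD: u32 = 60;

/// A skill is blocked when the scanner says DO_NOT_INSTALL,
/// or when its score reaches the caller's threshold.
fn is_blocking(risk: &RiskAssessment, threshold: u32) -> bool {
    risk.recommendation == "DO_NOT_INSTALL" || risk.score >= threshold
}
```

With these definitions, a 71/HIGH/DO_NOT_INSTALL result is blocked at either threshold, while a score of 55 is blocked only under `AUTORESEARCH_THRESHOLD`. Checking the recommendation first means a DO_NOT_INSTALL verdict blocks even if the numeric score is low or missing.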
…canner UI to Skills tab
Extends the SkillSpector layer with an optional deep scan that runs its
LLM semantic analysis against the local model (OpenAI-compatible Ollama
endpoint — no API key, nothing leaves the machine), plus a Skills-tab UI.
- ScanOptions{deep,base_url,model} + scan_options_from_db (reads
skills.scanner_deep / skills.scanner_model; endpoint+model derive from the
active Ollama provider). scan_skill -> scan_skill_with(opts), threaded through
autoresearch::learn/external_scan, skill_install, scan_skill_path (now takes
State<Db>), and the bench.
- Deep scan sets SKILLSPECTOR_PROVIDER/OPENAI_BASE_URL/OPENAI_API_KEY/
SKILLSPECTOR_MODEL and drops --no-llm.
- ROBUSTNESS: a deep scan that fails or times out (90s) falls back to the STATIC
SkillSpector scan — never down to the weak built-in heuristic — so enabling
deep is never worse than static. Verified live.
- UI: moved the scanner card from Security to the Skills tab; it shows the active
engine, installs SkillSpector in one click, and toggles deep analysis + model.
- bench `scanner [--deep]` exercises both paths.
Finding: local THINKING models (qwen3.x) are unsuitable for the deep scan —
their <think> traces exhaust SkillSpector's completion budget so LLM batches
fail; the run then falls back to static SkillSpector. Use a non-thinking instruct
or cloud model for deep. The static scan is the always-on workhorse (default).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unded methodology
Every scored bench run (agent/ablation/learn/llm/all) is now recorded to bench/results/history.jsonl and rendered two ways, applying methodology from four sources the user asked me to evaluate:
- bench/results/history.html: a self-contained dashboard (inline CSS + vanilla-JS SVG charts, data embedded, no external assets) showing pass-rate and latency over time. Every pass-rate carries a WILSON 95% CONFIDENCE INTERVAL, the rater paper's lesson (3-5 samples is often insufficient; overlapping CIs aren't a real difference; even 11/11 shows CI 74-100%). Latency uses "time for 100 output tokens" = TTFT + 100/(tok/s) (Artificial Analysis). Footer cites all sources.
- bench/history/: the same history as a Google OPEN KNOWLEDGE FORMAT v0.1 bundle (markdown + YAML frontmatter, one typed concept per run, a chronological log.md and an index.md). Portable, vendor-neutral, GitHub-renderable. OKF verdict: it fits our use case exactly, and our SKILL.md files are already proto-OKF.
New `report` mode regenerates both from history (no model); `--no-history` skips recording. The learn eval's routing vs. classifier captures "revealed behavior vs. self-report" (Google behavioral-dispositions paper): the model's overconfidence.
Seeded with real runs (llm, agent x2, ablation, learn). bench/README.md documents it.
Sources: research.google "how many raters are enough?" + "behavioral dispositions"; artificialanalysis.ai/methodology; Google Cloud Open Knowledge Format.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
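The two formulas used by the dashboard can be written out as a short sketch. The Wilson score interval below is the standard textbook form with z = 1.96 for roughly 95% coverage; the function names and signatures are hypothetical and are not taken from bench.rs.

```rust
/// Wilson score interval for `passes` successes out of `n` trials.
/// `z` is the normal quantile (1.96 for an approximately 95% interval).
/// Returns (lower, upper) as fractions in [0, 1].
fn wilson_interval(passes: u32, n: u32, z: f64) -> (f64, f64) {
    if n == 0 {
        // No data: the interval is uninformative.
        return (0.0, 1.0);
    }
    let n = n as f64;
    let p = passes as f64 / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half_width = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    ((center - half_width).max(0.0), (center + half_width).min(1.0))
}

/// "Time for 100 output tokens": time to first token plus 100 tokens
/// at the measured generation rate. Assumes `tokens_per_sec` is positive.
fn time_for_100_tokens(ttft_secs: f64, tokens_per_sec: f64) -> f64 {
    ttft_secs + 100.0 / tokens_per_sec
}
```

For 11/11 this gives a lower bound of about 0.74, which matches the "CI 74-100%" quoted above; for 12/14 it gives roughly 60% to 96%. The wide intervals are the point: at these sample sizes a few percentage points of difference between runs is not evidence of a real change.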
…ecall' reasoning experiment
Two new scored benchmarks (the core `agent` eval saturates at 100%, so it no
longer discriminates) — and a real app bug they uncovered.
BUG FIX (affects the app, not just the bench): OllamaProvider::append_content_delta
silently DROPPED characters. Its cumulative-vs-incremental heuristic treated an
incremental token that equals/prefixes the accumulated tail (repeated chars — the
2nd "2" of "22", a "4" in "443"/"8446", the 2nd "l" of "hello") as a duplicate and
dropped it, clipping characters from every agent reply. Only visible in `recall`,
where the answer is the first token ("443"→"43", "22"→"2"). Fix: treat a chunk as
cumulative only when STRICTLY longer than the accumulated content; otherwise append
verbatim. Regression test added to selftest (64 checks pass).
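A minimal sketch of the fixed merge rule, assuming the accumulated reply is a plain `String` and that a cumulative chunk repeats everything received so far. The free-function signature and the `starts_with` check are illustrative assumptions; the actual method on OllamaProvider is not shown in this PR text.

```rust
/// Merge one streamed chunk into the accumulated reply.
/// A chunk counts as cumulative only when it is STRICTLY longer than what
/// we already have (and, in this sketch, also begins with it).
/// Anything else is an incremental token and is appended verbatim.
fn append_content_delta(accumulated: &mut String, chunk: &str) {
    let is_cumulative =
        chunk.len() > accumulated.len() && chunk.starts_with(accumulated.as_str());
    if is_cumulative {
        // The chunk already contains the full reply so far: replace.
        accumulated.clear();
        accumulated.push_str(chunk);
    } else {
        // Incremental token, including repeated characters such as the
        // second "4" of "443": never treated as a duplicate.
        accumulated.push_str(chunk);
    }
}
```

Feeding the tokens `"4"`, `"4"`, `"3"` yields `"443"`, and a cumulative stream `"he"`, `"hel"`, `"hello"` yields `"hello"`. A heuristic of this shape can still misread an incremental chunk that happens to begin with the entire accumulated text, so it is a sketch of the rule rather than a complete protocol detector.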
- `hard` — 14 workflow-generated + adversarially-verified scenarios (tool-boundary
routing traps + adversarial action-vs-explain restraint), tiered, reported with a
Wilson CI. Result 12/14 (86%) — finally discriminative; surfaced that the model
fails cross-machine file-transfer routing (download_file/upload_file), while all
six destructive "explain only" restraint traps passed.
- `recall` — tests Google's "Thinking to Recall: how reasoning unlocks parametric
knowledge": each single-hop fact answered direct / reason-first / dummy-buffer.
Result direct 89% → reason 96% (+7pts); CIs overlap (easy facts saturate) but
reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching
the paper. Per-item unlocked/regressed counting in the verdict.
- run_scenario_suite() refactor with per-tier reporting; both modes feed the history
dashboard + OKF bundle. bench/README documents them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scan_or_none / external_scan created their scratch dir keyed only on process::id(). cargo runs unit tests in parallel within one process, so two concurrent process_synthesized() calls shared the same dir: one wiped the other's SKILL.md mid-scan, the scanner read nothing, scored 0, and an injection skill was wrongly Saved instead of Refused (process_refuses_injection_skill). Single-threaded runs (the bench selftest) never hit it.
Fix: unique per-call scratch dir (pid + a process-wide atomic counter), which also makes concurrent learning on multiple agent turns safe.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
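The fix described here (pid plus a process-wide atomic counter) can be sketched as below. The function name, the `prefix` parameter, and the directory naming scheme are assumptions for illustration, not the project's actual code.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide sequence number; every call gets a distinct value,
/// even when calls race on different threads.
static SCRATCH_SEQ: AtomicU64 = AtomicU64::new(0);

/// Build a scratch directory path that is unique per call, not just per process.
/// This only computes the path; creating and cleaning up the directory
/// is left to the caller.
fn unique_scratch_dir(prefix: &str) -> PathBuf {
    let seq = SCRATCH_SEQ.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!("{}-{}-{}", prefix, std::process::id(), seq))
}
```

`Ordering::Relaxed` is enough in this sketch because the counter only needs to hand out distinct numbers; it does not synchronize any other memory. The pid keeps paths distinct across processes, and the counter keeps them distinct across parallel test threads within one process.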
…heck
`learnloop` measures whether the agent actually improves by learning: answer unfamiliar-tool tasks COLD (memory only), run the autoresearch loop to build a skill for each, then answer WARM (researched skill injected, as the autopilot does). Records cold + warm as two history points so the dashboard shows the before/after, and confirms skills PERSIST (a second learn() dedups instead of re-researching).
First live run (5 tasks, all via real web research): persistence 5/5; net lift 0 (cold 4/5 → warm 4/5). 4/5 tasks the 9B already knew (no headroom), and the one genuine gap (caddy unix-socket syntax) REGRESSED 1/3→0/3 when given the researched skill, a real instance of the skill-quality risk.
Honest outcome: the learning plumbing works end-to-end, but skill quality (and the deferred execution-verified draft→verified promotion) is the bottleneck, and harder genuine-gap tasks are needed to measure outcome lift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Remaining work on this branch, merged for review. Headlines: the local agent can now research a capability it lacks and build its own (security-scanned, quarantined) skill, skills are gated by NVIDIA SkillSpector, and there's a benchmark history dashboard plus harder, research-grounded evals — which uncovered a real character-dropping bug in the Ollama streaming path.
Everything was verified on the GNU toolchain (`cargo +stable-x86_64-pc-windows-gnu check` clean, `tsc` clean, `xconsole-bench selftest` 64/64). Model work tested live against local `qwen3.5:9b`.
🐛 Bug fix: Ollama stream dropped characters (affects the app, not just tests)
`OllamaProvider::append_content_delta`'s cumulative-vs-incremental heuristic treated an incremental token that equals or prefixes the accumulated tail (repeated characters: the 2nd `2` of `22`, a `4` in `443`/`8446`, the 2nd `l` of `hello`) as a duplicate and dropped it, silently clipping characters from every agent reply. Surfaced by the new `recall` benchmark (where the answer is the first token: `443`→`43`). Fixed to treat a chunk as cumulative only when it is strictly longer than the accumulated content. Regression test added to selftest.
🤖 Autoresearch "learn_skill" loop
When the agent needs something it doesn't know, it researches the web and synthesizes a reusable `SKILL.md`, then applies it. Key finding: a 9B won't self-select a rare tool (trigger recall ~0 across prompt wordings), so the reliable trigger is a pre-turn classifier (`assess_gap`, recall ~0.75, precision ~1.0) driving an autopilot. Security is the load-bearing part: the query is sanitized before egress, synthesis is grounded only in fetched pages, destructive commands are de-fanged, and output is scanned and quarantined (`unverified/`) with provenance, never overwriting. Full write-up in `AUTORESEARCH.md`.
🛡️ NVIDIA SkillSpector security scanner
Every skill (researched or installed) is scanned before save. Fixed the previously-broken parser (it read the wrong schema and would always report "safe"), wired SkillSpector as the primary engine with the built-in heuristic as backstop, added one-click install + an opt-in deep LLM scan via local Ollama (falls back to static for thinking models), surfaced in Settings → Skills. Verified live (malicious skill → 71/HIGH/DO_NOT_INSTALL blocked).
📊 Benchmark history + methodology
Every scored run is recorded and rendered as a self-contained HTML dashboard (`bench/results/history.html`) and a Google Open Knowledge Format bundle (`bench/history/`). Methodology is drawn from four sources: Wilson 95% confidence intervals on pass-rates (Google "how many raters are enough?": small N is noisy), time-for-100-tokens latency (Artificial Analysis), revealed-vs-self-report framing (Google behavioral-dispositions), and the Open Knowledge Format used for the bundle.
🎯 Harder, discriminative evals
The core `agent` suite saturated at 100%, so:
- `hard`: 14 workflow-generated + adversarially-verified scenarios (tool-boundary routing + adversarial restraint). Result 12/14 (86%), which is discriminative; found the model fails cross-machine file-transfer routing (`upload_file`/`download_file`) while passing all destructive "explain only" restraint traps.
- `recall`: tests Google's "Thinking to Recall" (direct vs reason-first vs dummy-buffer). Result direct 89% → reason 96% (+7pts); reasoning unlocked exactly the items direct got wrong (chmod 0/3→3/3), matching the paper.
Also on this branch
Ablation bench (soul/memory/skills/brief cost vs quality), Claude Code–style lifecycle hooks, and installer hardening (Windows manifest for AV false positives, single-exe stub, MinGW/corepack fixes).
🤖 Generated with Claude Code