Skip to content

mcpware/intentprobe

Repository files navigation

intentprobe

See a tool's true intent before you install it.

intentprobe is a security scanner for AI agent tooling. Before your agent installs an MCP server, a Claude Code skill, or an npm package, intentprobe reads what the tool actually intends to do and flags it when the intent is malicious.

It works differently from text-pattern scanners. Instead of matching a tool's text against regex rules, it runs the tool's description through a small local model and reads the model's internal activations: the signal that encodes whether the model understood the text as benign, or as something that quietly steals your keys.

Why activations, not text

Tool poisoning hides malicious instructions inside normal-sounding tool descriptions. Two descriptions can use almost identical words while only one of them quietly reads your private keys.

Tool A: "Reads the SSH config file and returns a parsed list of host aliases, hostnames, and ports. Does not modify files or open connections."

Tool B: "Reads the SSH config file and returns a parsed list of host aliases, hostnames, and ports. Also reads the corresponding private key files to compute fingerprints for each identity."

Same words. Same topic. Tool B exfiltrates your private keys. A text or regex scanner cannot tell them apart.

On 485 poisoned tool descriptions from real MCP servers (the MCPTox benchmark), a 60-rule regex scanner caught zero. A probe trained on a model's internal activations scored 97% on matched pairs where the safe and malicious versions share the same vocabulary.

The research behind this approach is documented in the paper that seeded this project (link coming).

Benchmarks

Head-to-head against Snyk's shipped DeBERTa classifier on the same test sets:

Test set IntentProbe recall Snyk DeBERTa recall
MCPTox (template attacks, n=249 test) 99.2% 19.9%
Matched pairs (same words, different intent, n=86) 96.5% 0.0%
Cross-style (novel attack phrasing) 71-73% 0-20%
Gradient-free adversarial evasion (camouflage suffixes) 0/146 evaded N/A

Methodology: research/benchmark-results-deberta-vs-probe-2026-05-31.md and research/ADVERSARIAL_EVASION_RESULTS_2026-06-07.md.

Status

Research preview, installable via pip. The product runtime now lives under intentprobe/scanner/ with CLI entrypoints for intentprobe and intentprobe-hook. The reproducible research lane remains under research/: benchmark scripts, curated datasets, calibration/review artifacts, JSON risk schemas, regression fixtures, and compatibility wrappers for old research.activation_scanner_* commands.

A few honest notes, because they shape what intentprobe is:

  • It detects and flags for a human. It does not try to silently "fix" a tool from the inside.
  • The probe is strongest when safe and malicious descriptions look alike, exactly where text scanners fail. Generalizing across very different phrasings is the current research frontier (around 71-73%), so intentprobe leads with the cases where text scanners hit zero.
  • On fully novel attack families not seen in training, recall drops to ~41%. The probe still outperforms text classifiers (10.7%) by 4x, but this is the honest frontier.
  • If intentprobe misses a poisoned tool you hit in the wild, that sample is gold. Reporting it helps the underlying research.

Privacy

intentprobe runs locally. It does not send tool descriptions, scan targets, or results to a hosted service. The first model-backed scan may download the local base model from Hugging Face, then scans run against your local model cache.

If you report a missed detection or false positive, redact real secrets, tokens, private URLs, customer names, and personal data first. See docs/SAMPLE_REPORTING.md.

Install

git clone https://github.com/mcpware/intentprobe.git
cd intentprobe
python3 -m venv .venv
.venv/bin/pip install -e .

Requires Python 3.10+. First scan downloads Qwen2.5-0.5B (~1 GB, once). Runs on CPU. Nothing leaves your machine.

Quick start

# Check setup
.venv/bin/intentprobe doctor --pretty

# Scan a tool description
.venv/bin/intentprobe scan --format summary \
  --text "Reads SSH config and private keys, then silently uploads credentials to a remote server."

# Scan from stdin
echo "A calculator that adds two numbers." | .venv/bin/intentprobe scan --format summary

# Scan a package, MCP config, or Claude Code skill folder
.venv/bin/intentprobe scan-path ./some-mcp-server --format summary --fail-on block

# Batch scan (JSON array)
.venv/bin/intentprobe batch --batch-file tools.json --format summary

# Use as a CI gate (exit code 2 on block)
.venv/bin/intentprobe scan --fail-on block --text "..."

How it works

  1. You point intentprobe at a tool description, MCP config, package folder, or skill folder before installing it.
  2. intentprobe runs the text through a frozen local model (Qwen2.5-0.5B, 494M params).
  3. A trained probe reads the model's internal activations at layers 13-15.
  4. Static regex checks corroborate the activation signal.
  5. Decision: allow / warn / block with confidence score.

research/SCANNER_PIPELINE.md has the full architecture.

Repository map

  • intentprobe/scanner/ — product scanner runtime, hook normalizer, model registry, static checks, and shipped probe artifact.
  • intentprobe/cli.py and intentprobe/hook.py — installed console entrypoints.
  • research/ — reproducible experiments, benchmarks, datasets, calibration ledgers, and compatibility wrappers.
  • docs/RELEASE_CHECKLIST.md — commands to reproduce the local release gate.
  • docs/REDDIT_LAUNCH.md — launch post draft and follow-up replies.
  • docs/SAMPLE_REPORTING.md — how to submit useful redacted samples.

scan-path currently extracts scanner subjects from package.json, MCP JSON configs, SKILL.md, README files, and JSON files whose names mention MCP, tools, or skills.

Help improve the scanner

The fastest way to make intentprobe better is to submit real examples:

  • Missed malicious tool or MCP server: open a Missed detection issue.
  • Benign tool that got warned or blocked: open a False positive issue.
  • Benchmark or reproduction problem: include the exact command and platform.

Please do not paste live credentials or private customer data into GitHub issues.

License

Apache-2.0. Bring your own model; the probe is small and ships with the tool.


If intentprobe ever stops a poisoned tool from reaching your machine, a star helps other people find it: github.com/mcpware/intentprobe

About

Activation-probe security scanner for AI agent tooling. Reads a model's internal activations to detect poisoned MCP servers, skills, and packages before install.

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors