A minimal Python + PyTorch + Transformers research project for Anthropic-style activation probes on open-weight models.
It does three things:
- Collects hidden-state activations from a target model.
- Trains a tiny linear probe to classify safe vs unsafe exchanges.
- Runs a guarded generation loop that probes hidden states during generation and pauses/escalates when risk rises.
This is a research MVP, not production-ready safety infrastructure.
Install uv if you do not already have it:
curl -LsSf https://astral.sh/uv/install.sh | shThen install the project dependencies:
uv syncThe default model is Gemma 4 E2B. It is smaller than many frontier models, but still needs substantial disk and memory:
uv run train-probe \
--model_id google/gemma-4-E2B-it \
--data_path data/training_data.jsonl \
--layer -4 \
--out_dir ./probe_out_public_safetyThen run guarded generation:
uv run guarded-generate \
--model_id google/gemma-4-E2B-it \
--probe_path ./probe_out_public_safety/probe.pt \
--config_path ./probe_out_public_safety/config.json \
--prompt "Explain SQL injection at a high level"Use Modal when the model download or GPU memory is too large for your local machine. The Modal runner uses an A100 GPU by default and persists Hugging Face downloads plus probe outputs in Modal Volumes.
flowchart LR
LocalRepo["Local repo"] --> ModalRun["uv run modal run modal_train.py"]
ModalRun --> ModalImage["Modal image"]
ModalImage --> A100Job["A100 training job"]
HFSecret["huggingface-secret HF_TOKEN"] --> A100Job
HFCache["open-constitution-hf-cache"] --> A100Job
A100Job --> HFCache
A100Job --> Outputs["open-constitution-outputs"]
Outputs --> ModalGenerate["uv run modal run modal_train.py::generate"]
HFCache --> ModalGenerate
Outputs --> Download["modal volume get"]
Download --> LocalProbe["local probe_out_public_safety"]
Confirm that Modal is authenticated:
uv run modal profile currentCreate the Hugging Face token secret once so model downloads are authenticated:
uv run modal secret create huggingface-secret HF_TOKEN=$HF_TOKENRun Gemma 4 probe training on Modal:
uv run modal run modal_train.pyThe default Modal training run uses a deterministic label-balanced sample of 20,000 examples so
activation extraction finishes in a practical window. To run on the full dataset, pass
--max-examples 0, but expect a much longer job.
Run guarded generation on Modal using the saved probe, without downloading the model or probe locally:
uv run modal run modal_train.py::generate \
--prompt "Explain SQL injection at a high level"Download the saved probe outputs after the job finishes:
uv run modal volume get open-constitution-outputs probe_out_public_safety ./probe_out_public_safetyThe first run downloads the model into the open-constitution-hf-cache Modal Volume. Later
runs reuse that cache. If your Modal workspace supports larger GPUs and you want extra
headroom, change gpu="A100" to gpu="A100-80GB" in modal_train.py.
data/training_data.jsonl expects one JSON object per line:
{"prompt":"...", "response":"...", "label":0}
{"prompt":"...", "response":"...", "label":1}Where:
label: 0= safe / allowedlabel: 1= restricted / unsafe
For a real probe, you need thousands of examples across allowed and disallowed policy boundaries.
data/training_data.jsonl contains metadata-rich prompt/response examples for probe training.
It stays compatible with the current loader because every row includes prompt, response,
and binary label.
Additional metadata fields include split, topic, prompt_intent, response_behavior, and
dataset_version.
For this dataset, label: 1 means unsafe assistant behavior, not merely unsafe user intent.
Malicious prompts paired with safe refusals are labeled 0; this is intentional so the probe
learns not to flag every refusal about a dangerous topic as unsafe.
To prepare the training data from the configured public safety sources, run:
uv run prepare-training-data \
--output_path data/training_data.jsonl \
--summary_path data/dataset_summary.jsonTrain on the converted dataset with:
uv run train-probe \
--model_id google/gemma-4-E2B-it \
--data_path data/training_data.jsonl \
--layer -4 \
--out_dir ./probe_out_public_safety \
--max_examples 20000flowchart TD
Exchange["prompt + response"] --> ForwardPass["target model forward pass with output_hidden_states=True"]
ForwardPass --> HiddenState["selected layer hidden state at final token"]
HiddenState --> Probe["linear probe"]
Probe --> RiskScore["risk score"]
During generation:
flowchart TD
Generate["generate N tokens"] --> ReadHiddenState["read selected hidden state"]
ReadHiddenState --> ProbeRisk["probe risk score"]
ProbeRisk --> SmoothScore["smooth score over a small window"]
SmoothScore --> GuardDecision["continue / pause / escalate / refuse"]
This MVP:
- Uses the final token hidden state only.
- Trains a simple linear probe.
- Uses tiny example data.
- Does not implement Anthropic's exact training tricks.
- Does not patch vLLM.
- Does not replace a real exchange classifier.
Recommended next steps:
- Generate a serious policy dataset.
- Test multiple layers and layer combinations.
- Add token-level labels or soft token weighting.
- Add score smoothing and calibration curves.
- Add a second-stage exchange classifier.
- Integrate into vLLM once the probe is validated.
The default model is now:
google/gemma-4-E2B-itGemma 4 on Hugging Face uses AutoProcessor + AutoModelForImageTextToText, while many text-only models use AutoTokenizer + AutoModelForCausalLM. The MVP handles both.
The repo also uses model chat templates by default through:
tokenizer_or_processor.apply_chat_template(...)or, if the processor wraps a tokenizer:
tokenizer_or_processor.tokenizer.apply_chat_template(...)Disable chat templates with:
--no_chat_templateExample Gemma 4 run:
uv run train-probe \
--model_id google/gemma-4-E2B-it \
--data_path data/training_data.jsonl \
--layer -4 \
--out_dir ./probe_out_public_safetyThen:
uv run guarded-generate \
--model_id google/gemma-4-E2B-it \
--probe_path ./probe_out_public_safety/probe.pt \
--config_path ./probe_out_public_safety/config.json \
--prompt "Explain SQL injection at a high level"Layer sweep:
uv run sweep-layers \
--model_id google/gemma-4-E2B-it \
--data_path data/training_data.jsonl \
--layers="-2,-4,-6,-8,-10,-12"Note: Gemma models may require accepting Google's license terms on Hugging Face before download. google/gemma-4-E2B-it downloads about 10 GB of weights, so use a machine or cloud runtime with enough free disk space.