LatentLens interprets what hidden representations in LLM-based models (LLMs, VLMs, ...) encode by finding their nearest neighbors in a bank of contextual text embeddings. Unlike Logit Lens (which projects to vocabulary space), LatentLens compares against embeddings in context — yielding highly interpretable results.
Works with any HuggingFace model (LLMs, VLMs, etc.). No training required.
## Installation

```bash
pip install latentlens
```

## Getting Started

**Option A:** Build your own index of contextual embeddings — point to any HuggingFace model + a text corpus:
```python
import latentlens

# Use the bundled concepts.txt (117k sentences from 23k concepts)
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="concepts.txt")
index.save("llama3_index/")

# Or use your own domain-specific corpus
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="my_texts.txt")
```

**Option B:** Load a pre-built index (we provide indices for popular models):
```python
index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-llama3.1-8b")
```

**Search** — pass any hidden states `[num_tokens, hidden_dim]` and get back interpretable nearest neighbors:
```python
results = index.search(hidden_states, top_k=5)
# results[i] = [Neighbor(token_str=' dog', similarity=0.42, contextual_layer=27), ...]

# Or search only specific contextual layers:
results = index.search(hidden_states, top_k=5, layers=[8, 27])
```
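Conceptually, this search is a cosine-similarity nearest-neighbor lookup over the embedding bank. Here is a minimal, dependency-free sketch of that operation — an illustration only, not the library's actual implementation (the `Neighbor` tuple here is a simplified stand-in):

```python
import math
from typing import NamedTuple

class Neighbor(NamedTuple):
    token_str: str
    similarity: float

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def search(query, bank, top_k=5):
    """Return the top_k bank entries ranked by cosine similarity to the query."""
    q = normalize(query)
    scored = [Neighbor(tok, sum(a * b for a, b in zip(q, normalize(emb))))
              for tok, emb in bank]
    scored.sort(key=lambda n: n.similarity, reverse=True)
    return scored[:top_k]

# Tiny toy bank of (token, embedding) pairs
bank = [(" dog", [1.0, 0.1]), (" cat", [0.9, 0.4]), (" car", [-1.0, 0.2])]
print(search([1.0, 0.0], bank, top_k=2))
```

A real index additionally batches queries and searches per contextual layer, but the ranking principle is the same.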
## Full Example: Interpret Hidden States

```python
import latentlens

# Load any HuggingFace model
model, tokenizer = latentlens.load_model("Qwen/Qwen2.5-7B")

# Load a pre-built index (or build your own — see Getting Started)
index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-qwen2.5-7b")

# Get hidden states from your input
# hidden_states[0] = input embeddings
# hidden_states[1] through hidden_states[N] = transformer layer outputs (N = num_hidden_layers)
inputs = tokenizer("a photo of a dog", return_tensors="pt").to("cuda")
hidden_states = latentlens.get_hidden_states(model, inputs["input_ids"])

# Interpret layer 27 — search auto-normalizes the query
results = index.search(hidden_states[27].squeeze(0), top_k=5)
for i, neighbors in enumerate(results):
    token = tokenizer.decode(inputs["input_ids"][0, i])
    nn = neighbors[0]
    print(f"{token:>10} → {nn.token_str!r} (sim={nn.similarity:.2f}, layer={nn.contextual_layer})")
```

For VLMs like Qwen2.5-VL, use `apply_chat_template()` to format the input — this inserts
the image placeholder tokens that tell the model where visual features go:
```python
import torch
import latentlens
from transformers import AutoProcessor
from PIL import Image

# Load a VLM and its processor
model, tokenizer = latentlens.load_model("Qwen/Qwen2.5-VL-7B-Instruct", dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load pre-built index for this VLM
index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-qwen2.5-vl-7b")

# Process image + text — apply_chat_template inserts the image placeholder tokens
image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image."},
]}]
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], text=[text_prompt], return_tensors="pt", padding=True).to("cuda")

# Extract hidden states — pass all processor outputs
hidden_states = latentlens.get_hidden_states(model, **inputs)

# Interpret layer 27's hidden states for all tokens
results = index.search(hidden_states[27].squeeze(0), top_k=5)
```

| Component | What it is | How to get it |
|---|---|---|
| A model | Any HuggingFace LLM or VLM | `latentlens.load_model("model_name")` |
| A contextual index | Bank of text embeddings from that model | `build_index(model, corpus)` or `from_pretrained()` |
| Hidden states to interpret | Your tokens of interest | `latentlens.get_hidden_states(model, input_ids)` |
The index is built once and reused. See Bundled Corpus below for details on the included corpus, or provide your own domain-specific text.
**Tip:** If you already have a loaded model, pass it directly to avoid loading twice: `build_index("meta-llama/Meta-Llama-3-8B", corpus="concepts.txt", model=model, tokenizer=tokenizer)`
## Bundled Corpus

We include `concepts.txt` — a general-purpose corpus of 117k sentences covering ~23k concepts (5 sentences per concept at varying lengths). Concepts are derived by intersecting WordNet lemmas with Brown Corpus vocabulary to obtain a broad, common-usage set; an LLM then generates sentences for each concept. All pre-built indices are built from this corpus using prefix deduplication (identical prefixes produce identical embeddings in causal LMs, so we store each unique prefix only once).
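Prefix deduplication works because in a causal LM the hidden state at position *i* depends only on tokens 0..*i*, so two sentences sharing a prefix yield identical embeddings over that prefix. A rough sketch of the idea over tokenized sentences (illustrative only, not the library's implementation):

```python
def dedup_prefixes(tokenized_sentences):
    """Collect each unique token prefix exactly once.

    In a causal LM, the hidden state at position i is a function of
    tokens[0..i] alone, so each unique prefix needs only one stored
    embedding in the bank, no matter how many sentences share it.
    """
    seen = set()
    unique_prefixes = []
    for tokens in tokenized_sentences:
        for i in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:i])
            if prefix not in seen:
                seen.add(prefix)
                unique_prefixes.append(prefix)
    return unique_prefixes

# Two sentences sharing the prefix "the dog": its embeddings are stored once.
sents = [["the", "dog", "barks"], ["the", "dog", "sleeps"]]
positions = dedup_prefixes(sents)
```

With 6 token positions across the two sentences, only 4 unique prefixes remain, which is where the storage savings come from at corpus scale.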
You can also provide your own domain-specific corpus as a `.txt` file (one sentence per line), a `.csv` file (first column), or a Python list of strings:
```python
# Custom corpus
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="my_domain_texts.txt")
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus=["sentence 1", "sentence 2"])
```

**Relation to the paper:** The original LatentLens paper (Krojer et al., 2026) used ~3M Visual Genome phrases (`reproduce/vg_phrases.txt`) — a corpus tailored to interpreting visual tokens in VLMs. The library instead uses `concepts.txt`, which provides broad coverage of concepts humans care about, making it suitable for interpreting any LLM-based model, not just VLMs. We validate this improved corpus and extraction pipeline in a forthcoming companion paper. To reproduce the original paper results exactly, see Reproducing Paper Results.
## Pre-built Indices

We provide pre-computed contextual embeddings for popular LLMs and VLMs, built from the bundled `concepts.txt` corpus across 8 layers each. Browse all indices in our HuggingFace Collection.
**LLMs:**

| Model | HuggingFace Repo | Layers | Size |
|---|---|---|---|
| Llama-3.1-8B | `McGill-NLP/contextual_embeddings-llama3.1-8b` | 1, 2, 4, 8, 16, 24, 30, 31 | 32 GB |
| Gemma-2-9B | `McGill-NLP/contextual_embeddings-gemma2-9b` | 1, 2, 4, 8, 16, 24, 40, 41 | 15 GB |
| Qwen2.5-7B | `McGill-NLP/contextual_embeddings-qwen2.5-7b` | 1, 2, 4, 8, 16, 24, 26, 27 | 28 GB |
**VLMs** (text-only forward passes through the VLM's finetuned LLM backbone):

| Model | HuggingFace Repo | Layers | Size |
|---|---|---|---|
| Qwen2.5-VL-7B | `McGill-NLP/contextual_embeddings-qwen2.5-vl-7b` | 1, 2, 4, 8, 16, 24, 26, 27 | 28 GB |
| Qwen3-VL-8B | `McGill-NLP/contextual_embeddings-qwen3-vl-8b` | 1, 2, 4, 8, 16, 24, 34, 35 | 32 GB |
| Molmo2-8B | `McGill-NLP/contextual_embeddings-molmo2-8b` | 1, 2, 4, 8, 16, 24, 34, 35 | 32 GB |
Load specific layers to save memory and download time:
```python
# Load only early + late layers (~2 layers instead of 8)
index = latentlens.ContextualIndex.from_pretrained(
    "McGill-NLP/contextual_embeddings-llama3.1-8b", layers=[1, 31]
)
```

## Quickstart

For a self-contained demo interpreting visual tokens in Qwen2-VL (no library install needed):

```bash
python quickstart.py                            # uses bundled example.png
python quickstart.py --image path/to/image.jpg  # your own image
```

Pre-computed contextual embeddings are downloaded automatically from HuggingFace. Requires a GPU with >=24GB VRAM.
## Reproducing Paper Results

**Note:** This section reproduces the original paper, which uses a different corpus (~3M Visual Genome phrases) and extraction pipeline (reservoir sampling, float8 storage) from the library above. The library's `build_index()` and pre-built indices use `concepts.txt` instead.
This section walks through reproducing our main results on visual token interpretability in VLMs.
We study how frozen LLMs process visual tokens from vision encoders. We train MLP connectors mapping visual tokens to LLM embedding space, then analyze interpretability using three methods:
| Method | What it does |
|---|---|
| EmbeddingLens | Nearest neighbors in LLM input embedding matrix |
| LogitLens | Apply LM head to intermediate representations |
| LatentLens (ours) | Nearest neighbors in contextual text embeddings |
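In toy form, the three methods differ only in what a hidden state is compared against. The sketch below uses made-up 2-d vectors, not real model weights, and omits details like the final layer norm that LogitLens normally applies before the unembedding:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def argmax_entry(h, rows):
    """Return the name of the (name, vector) row scoring highest against h."""
    return max(rows, key=lambda r: dot(h, r[1]))[0]

# Toy "model": 2-d hidden states, 3-token vocabulary
embed_matrix = [("dog", [1.0, 0.0]), ("cat", [0.0, 1.0]), ("car", [-1.0, 0.0])]
unembed_matrix = embed_matrix  # many LMs tie these; untied in general
contextual_bank = [("dog (in 'the dog ran')", [0.9, 0.3]),
                   ("cat (in 'a cat sat')", [0.1, 0.8])]

h = [0.8, 0.4]  # a hidden state to interpret

print("EmbeddingLens:", argmax_entry(h, embed_matrix))    # NN in input embedding matrix
print("LogitLens:   ", argmax_entry(h, unembed_matrix))   # argmax of unembedding logits
print("LatentLens:  ", argmax_entry(h, contextual_bank))  # NN in contextual embeddings
```

The key difference LatentLens exploits is that its bank entries are embeddings of tokens *in context*, so a match carries contextual meaning rather than a bare vocabulary item.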
```bash
git clone https://github.com/McGill-NLP/latentlens.git
cd latentlens

# Install with uv (recommended)
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"
```

### Step 1: Download Models

This step downloads our trained MLP connectors from HuggingFace, then downloads and converts the base LLMs and vision encoders to Molmo's weight format (~50GB total).
9 model configurations (3 LLMs × 3 vision encoders):
| LLM / Vision | ViT-L/14-336 (CLIP) | DINOv2-L-336 | SigLIP-L |
|---|---|---|---|
| OLMo-7B | `olmo-vit` | `olmo-dino` | `olmo-siglip` |
| LLaMA3-8B | `llama-vit` | `llama-dino` | `llama-siglip` |
| Qwen2-7B | `qwen-vit` | `qwen-dino` | `qwen-siglip` |
We also support Qwen2-VL-7B-Instruct (off-the-shelf VLM, no connector needed).
```bash
# Download all models (connectors + base models)
./reproduce/step1_download.sh

# Or download just connector weights (~3GB)
./reproduce/step1_download.sh --connectors-only
```

What gets downloaded and converted:
| Component | Source | Size |
|---|---|---|
| MLP Connectors (9) | `McGill-NLP/latentlens-connectors` | ~350MB each |
| OLMo-7B | `allenai/OLMo-7B-1024-preview` | ~14GB |
| LLaMA3-8B | `meta-llama/Meta-Llama-3-8B` | ~16GB |
| Qwen2-7B | `Qwen/Qwen2-7B` | ~14GB |
| ViT-L/14-336 | `openai/clip-vit-large-patch14-336` | ~1GB |
| DINOv2-L-336 | `facebook/dinov2-large` | ~1GB |
| SigLIP-L | `google/siglip-so400m-patch14-384` | ~1GB |
Directory structure after download:
```
checkpoints/                  # Connector weights + model configs
├── olmo-vit/
│   ├── model.pt              # Connector weights (~350MB)
│   └── config.yaml           # Model architecture config
├── olmo-dino/
│   └── ...
└── ...

pretrained/                   # Converted base models (Molmo format)
├── olmo-1024-preview.pt
├── llama3-8b.pt
├── qwen2-7b.pt
├── vit-l-14-336.pt
├── dinov2-large-336.pt
└── siglip-so400m-14-384.pt
```
### Step 0 (Optional): Train Connectors from Scratch

If you want to train the connectors from scratch instead of using our pretrained weights:
Prerequisites:

- Multi-GPU setup recommended (configs default to 4 GPUs; adjustable via `NPROC_PER_NODE`)
- PixMo-Cap dataset (see setup below)
Setup:

```bash
# Install training dependencies
pip install -e ".[train]"

# Set data directory (images + processed dataset will be stored here)
export MOLMO_DATA_DIR=/path/to/data

# Download the PixMo-Cap dataset (downloads images, may take a while)
python -c "from molmo.data.pixmo_datasets import PixMoCap; PixMoCap.download(n_procs=8)"

# Download base models (LLMs + vision encoders)
./reproduce/step1_download.sh
```

Train:
```bash
# Train all 9 models
./reproduce/step0_train.sh

# Or train a single model
./reproduce/step0_train.sh --model olmo-vit

# Or use a different GPU count
NPROC_PER_NODE=8 ./reproduce/step0_train.sh --model olmo-vit
```

Each model trains for 12,000 steps on PixMo-Cap with the LLM and vision encoder frozen — only the MLP connector is trained.
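This frozen-backbone setup can be sketched in plain PyTorch: mark every LLM and vision-encoder parameter as non-trainable and give the optimizer only the connector's parameters. The modules below are tiny stand-ins for illustration, not the actual Molmo classes:

```python
import torch
import torch.nn as nn

# Stand-ins for the real components (toy dimensions)
llm = nn.Linear(16, 16)            # pretend LLM backbone
vision_encoder = nn.Linear(8, 8)   # pretend vision encoder
connector = nn.Sequential(         # the MLP connector: vision dim -> LLM dim
    nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 16)
)

# Freeze everything except the connector
for module in (llm, vision_encoder):
    for p in module.parameters():
        p.requires_grad = False

# The optimizer only ever sees connector parameters
optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)

# One schematic step: visual features -> connector -> (frozen) LLM
visual_feats = torch.randn(4, 8)
loss = llm(connector(vision_encoder(visual_feats))).pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
```

Gradients still flow *through* the frozen modules, but their parameters accumulate no gradient and are never updated, which is what makes connector-only training cheap.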
You can also run the training script directly:

```bash
torchrun --nproc_per_node=4 reproduce/scripts/train_connector.py \
    reproduce/configs/olmo-vit.yaml

# Dry run (parse config, init model, no actual training)
torchrun --nproc_per_node=1 reproduce/scripts/train_connector.py \
    reproduce/configs/olmo-vit.yaml --dry_run
```

After training, extract connector weights:

```bash
python scripts/extract_connector.py \
    --checkpoint checkpoints/<save_folder>/step12000-unsharded \
    --output connectors/olmo-vit.pt
```

### Step 2: Extract Contextual Embeddings

For LatentLens analysis, you need contextual text embeddings from each LLM. This is the most time-consuming step (~13h per LLM on a single GPU, processing ~3M Visual Genome phrases).
```bash
# Extract for all LLMs sequentially
./reproduce/step2_extract_contextual.sh

# Or for a specific LLM:
python reproduce/scripts/extract_embeddings.py \
    --model allenai/OLMo-7B-1024-preview \
    --layers 1 2 4 8 16 24 30 31 \
    --output-dir contextual_embeddings/olmo-7b
```

**Speed up with multiple GPUs:** The fastest approach is to run each LLM on a separate GPU in parallel, reducing wall time from ~40h to ~13h:
```bash
CUDA_VISIBLE_DEVICES=0 ./reproduce/step2_extract_contextual.sh olmo &
CUDA_VISIBLE_DEVICES=1 ./reproduce/step2_extract_contextual.sh llama &
CUDA_VISIBLE_DEVICES=2 ./reproduce/step2_extract_contextual.sh qwen &
wait
```

The script supports checkpointing — if interrupted, it resumes from the last saved progress.
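The resume behavior can be pictured as a simple progress file: record the index of the last finished batch after each step, and skip everything up to that point on restart. This is an illustrative sketch of the pattern, not the script's actual mechanism:

```python
import json
import os
import tempfile

def extract_with_resume(batches, progress_path, process):
    """Process batches in order, persisting progress after each one."""
    done = 0
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = json.load(f)["done"]  # number of batches already finished
    for i in range(done, len(batches)):
        process(batches[i])
        with open(progress_path, "w") as f:
            json.dump({"done": i + 1}, f)  # batch i is now complete

progress_path = os.path.join(tempfile.mkdtemp(), "progress.json")
processed = []
extract_with_resume([["a"], ["b"], ["c"]], progress_path, processed.append)

# Simulate a restart: already-finished batches are skipped, nothing reruns
extract_with_resume([["a"], ["b"], ["c"]], progress_path, processed.append)
```

Writing progress only after a batch's embeddings are safely on disk keeps the checkpoint consistent even if the job dies mid-batch.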
### Step 3: Run Analysis

LatentLens (contextual nearest neighbors):

```bash
CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_latentlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --contextual-dir contextual_embeddings/olmo-7b/allenai_OLMo-7B-1024-preview \
    --visual-layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/latentlens/olmo-vit
```

LogitLens:
```bash
# Single GPU
CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_logitlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --layers 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/logitlens/olmo-vit

# Multi-GPU (optional, faster)
torchrun --nproc_per_node=4 reproduce/scripts/run_logitlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --layers 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/logitlens/olmo-vit
```

EmbeddingLens:
```bash
# Single GPU
CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_embedding_lens.py \
    --ckpt-path checkpoints/olmo-vit \
    --llm_layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-base-dir results/embedding_lens/olmo-vit

# Multi-GPU (optional, faster)
torchrun --nproc_per_node=4 reproduce/scripts/run_embedding_lens.py \
    --ckpt-path checkpoints/olmo-vit \
    --llm_layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-base-dir results/embedding_lens/olmo-vit
```

### Evaluation

The paper's main results use GPT-5 to evaluate whether nearest neighbors are semantically related to image patches. This requires an OpenAI API key and costs ~$80-100 for full reproduction.
```bash
# Set API key
export OPENAI_API_KEY="your-key-here"

# Evaluate a single model (~$1 for 100 patches)
python reproduce/scripts/evaluate/evaluate_interpretability.py \
    --results-dir results/latentlens/olmo-vit \
    --images-dir /path/to/pixmo-cap/validation \
    --output-dir evaluation/latentlens/olmo-vit \
    --num-patches 100

# For SigLIP or DINOv2 models, pass --model-name so the evaluator
# uses the correct vision encoder grid size:
python reproduce/scripts/evaluate/evaluate_interpretability.py \
    --results-dir results/latentlens/olmo-siglip \
    --images-dir /path/to/pixmo-cap/validation \
    --output-dir evaluation/latentlens/olmo-siglip \
    --model-name olmo-siglip \
    --num-patches 100

# Aggregate results across models
python reproduce/scripts/evaluate/aggregate_results.py \
    --eval-dir evaluation/ \
    --output results/my_results.json
```

**Note on `--model-name`:** The evaluation script needs to know the vision encoder to determine the correct patch grid size. Pass `--model-name` matching the model you are evaluating:
- CLIP (ViT-L/14) models (`olmo-vit`, `llama-vit`, `qwen-vit`): No `--model-name` needed — CLIP's 24x24 grid is the default.
- SigLIP models (`olmo-siglip`, `llama-siglip`, `qwen-siglip`): Pass `--model-name` containing "siglip" (e.g., `--model-name olmo-siglip`).
- DINOv2 models (`olmo-dino`, `llama-dino`, `qwen-dino`): Pass `--model-name` containing "dinov2" (e.g., `--model-name olmo-dino`).
- Qwen2-VL: Pass `--model-name qwen2vl`.
| Model | LLM Layers | Vision Patches | Layers Analyzed |
|---|---|---|---|
| OLMo-7B / LLaMA3-8B | 32 | 576 (24×24) | 0, 1, 2, 4, 8, 16, 24, 30, 31 |
| Qwen2-7B / Qwen2-VL | 28 | 729 (27×27) | 0, 1, 2, 4, 8, 16, 24, 26, 27 |
Note: SigLIP uses 27×27 patches (729 total), while CLIP and DINOv2 use 24×24 (576 total).
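The grid size determines how a flat patch index maps back to an image location. A small helper sketch mirroring the table above — hypothetical code for illustration, not a function in the repo:

```python
def grid_size(model_name: str) -> int:
    """Patches per side for the vision encoders in the table above."""
    name = (model_name or "").lower()
    if "siglip" in name or "qwen2vl" in name:
        return 27          # SigLIP and Qwen2-VL setup: 27x27 = 729 patches
    return 24              # CLIP and DINOv2 default: 24x24 = 576 patches

def patch_to_rowcol(patch_idx: int, model_name: str = ""):
    """Map a flat patch index to (row, col) on the patch grid."""
    g = grid_size(model_name)
    return divmod(patch_idx, g)

print(patch_to_rowcol(100, "olmo-siglip"))  # (3, 19) on a 27x27 grid
print(patch_to_rowcol(100, "olmo-vit"))    # (4, 4) on a 24x24 grid
```

Getting this mapping wrong shifts every patch's location, which is why the evaluator needs `--model-name` for non-CLIP encoders.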
```bash
# Run all experiments
./reproduce/run_all.sh

# Or step by step:
./reproduce/step2_extract_contextual.sh   # ~40h sequential, ~13h with 3 GPUs in parallel
./reproduce/step3_run_analysis.sh         # ~13.5h (9 models × 3 methods)
```

## Repository Structure

```
├── concepts.txt                 # Bundled corpus (117k sentences from 23k concepts)
├── quickstart.py                # Try LatentLens in 5 minutes (standalone)
├── latentlens/                  # Library: build & search contextual indices
│   ├── index.py                 # ContextualIndex, Neighbor, search, save/load
│   ├── extract.py               # build_index(), corpus loading, prefix dedup
│   └── models.py                # load_model(), get_hidden_states(), MODEL_DEFAULTS
├── molmo/                       # Molmo VLM infrastructure (for reproduction)
│   ├── model.py                 # Model architecture with layer hooks
│   ├── config.py                # Configuration classes
│   ├── train.py                 # Trainer class (for connector training)
│   ├── optim.py                 # Optimizer and LR schedulers
│   ├── eval/                    # Loss evaluator (training only)
│   └── data/                    # Image preprocessing, datasets
└── reproduce/                   # Paper reproduction
    ├── scripts/                 # Analysis scripts
    │   ├── run_latentlens.py
    │   ├── run_logitlens.py
    │   ├── run_embedding_lens.py
    │   ├── extract_embeddings.py
    │   ├── train_connector.py   # Training entry point (torchrun)
    │   └── evaluate/            # LLM judge evaluation
    ├── configs/                 # Model configurations (YAML)
    ├── vg_phrases.txt           # Visual Genome phrases corpus
    ├── step0_train.sh           # Train connectors from scratch (optional)
    ├── step1_download.sh
    ├── step2_extract_contextual.sh
    └── step3_run_analysis.sh
```
## Citation

```bibtex
@article{krojer2026latentlens,
  title={LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs},
  author={Krojer, Benno and Nayak, Shravan and Ma{\~n}as, Oscar and Adlakha, Vaibhav and Elliott, Desmond and Reddy, Siva and Mosbach, Marius},
  journal={arXiv preprint arXiv:2602.00462},
  year={2026}
}
```

This project builds on the Molmo codebase by the Allen Institute for AI. We thank them for releasing their code under the Apache 2.0 license.
MIT License. See LICENSE for details.
