LatentLens


LatentLens interprets what hidden representations in LLM-based models (LLMs, VLMs, ...) encode by finding their nearest neighbors in a bank of contextual text embeddings. Unlike Logit Lens (which projects to vocabulary space), LatentLens compares against embeddings in context — yielding highly interpretable results.

Works with any HuggingFace model (LLMs, VLMs, etc.). No training required.

Getting Started

pip install latentlens

Option A: Build your own index of contextual embeddings — point to any HuggingFace model + a text corpus:

import latentlens

# Use the bundled concepts.txt (117k sentences from 23k concepts)
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="concepts.txt")
index.save("llama3_index/")

# Or use your own domain-specific corpus
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="my_texts.txt")

Option B: Load a pre-built index (we provide indices for popular models):

index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-llama3.1-8b")

Search — pass any hidden states [num_tokens, hidden_dim] and get back interpretable nearest neighbors:

results = index.search(hidden_states, top_k=5)
# results[i] = [Neighbor(token_str=' dog', similarity=0.42, contextual_layer=27), ...]

# Or search only specific contextual layers:
results = index.search(hidden_states, top_k=5, layers=[8, 27])
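Conceptually, the search is a cosine-similarity nearest-neighbor lookup over the stored bank of contextual embeddings. A minimal NumPy sketch of that lookup (toy shapes; the real index additionally tracks token strings and source layers for each bank entry):

```python
import numpy as np

def nn_search(queries, bank, top_k=5):
    """Cosine-similarity top-k search: queries [num_tokens, d], bank [bank_size, d]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)  # normalize queries
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)        # normalize the bank
    sims = q @ b.T                                                # [num_tokens, bank_size]
    idx = np.argsort(-sims, axis=1)[:, :top_k]                    # best-first indices
    return idx, np.take_along_axis(sims, idx, axis=1)

rng = np.random.default_rng(0)
bank = rng.standard_normal((1000, 64))
queries = bank[[3, 42]] + 0.01 * rng.standard_normal((2, 64))  # near-copies of bank rows
idx, sims = nn_search(queries, bank, top_k=5)
```

Each query recovers the bank row it was perturbed from as its top neighbor, with similarity close to 1.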

(Figure: LatentLens method overview)

Full Example: Interpret Hidden States

import latentlens

# Load any HuggingFace model
model, tokenizer = latentlens.load_model("Qwen/Qwen2.5-7B")

# Load a pre-built index (or build your own — see Getting Started)
index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-qwen2.5-7b")

# Get hidden states from your input
# hidden_states[0] = input embeddings
# hidden_states[1] through hidden_states[N] = transformer layer outputs (N = num_hidden_layers)
inputs = tokenizer("a photo of a dog", return_tensors="pt").to("cuda")
hidden_states = latentlens.get_hidden_states(model, inputs["input_ids"])

# Interpret layer 27 — search auto-normalizes the query
results = index.search(hidden_states[27].squeeze(0), top_k=5)

for i, neighbors in enumerate(results):
    token = tokenizer.decode(inputs["input_ids"][0, i])
    nn = neighbors[0]
    print(f"{token:>10} -> {nn.token_str!r} (sim={nn.similarity:.2f}, layer={nn.contextual_layer})")

VLM Example: Interpret Visual Tokens

For VLMs like Qwen2.5-VL, use apply_chat_template() to format the input — this inserts the image placeholder tokens that tell the model where visual features go:

import torch
import latentlens
from transformers import AutoProcessor
from PIL import Image

# Load a VLM and its processor
model, tokenizer = latentlens.load_model("Qwen/Qwen2.5-VL-7B-Instruct", dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load pre-built index for this VLM
index = latentlens.ContextualIndex.from_pretrained("McGill-NLP/contextual_embeddings-qwen2.5-vl-7b")

# Process image + text — apply_chat_template inserts the image placeholder tokens
image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image."},
]}]
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], text=[text_prompt], return_tensors="pt", padding=True).to("cuda")

# Extract hidden states — pass all processor outputs
hidden_states = latentlens.get_hidden_states(model, **inputs)

# Interpret layer 27's hidden states for all tokens (text and visual)
results = index.search(hidden_states[27].squeeze(0), top_k=5)

What You Need

| Component | What it is | How to get it |
|---|---|---|
| A model | Any HuggingFace LLM or VLM | latentlens.load_model("model_name") |
| A contextual index | Bank of text embeddings from that model | build_index(model, corpus) or from_pretrained() |
| Hidden states to interpret | Your tokens of interest | latentlens.get_hidden_states(model, input_ids) |

The index is built once and reused. See Bundled Corpus below for details on the included corpus, or provide your own domain-specific text.

Tip: If you already have a loaded model, pass it directly to avoid loading twice: build_index("meta-llama/Meta-Llama-3-8B", corpus="concepts.txt", model=model, tokenizer=tokenizer)

Bundled Corpus: concepts.txt

We include concepts.txt — a general-purpose corpus of 117k sentences covering ~23k concepts (5 sentences per concept at varying lengths). Concepts are derived by intersecting WordNet lemmas with Brown Corpus vocabulary to obtain a broad, common-usage set; an LLM then generates sentences for each concept. All pre-built indices are built from this corpus using prefix deduplication (identical prefixes produce identical embeddings in causal LMs, so we store each unique prefix only once).
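The prefix-deduplication idea can be illustrated in a few lines: in a causal LM, the representation of token i depends only on tokens up to i, so two corpus lines sharing a tokenized prefix produce identical embeddings for that prefix. A toy sketch (whitespace tokenization stands in for the model tokenizer; the real pipeline deduplicates model-tokenized prefixes):

```python
def unique_prefixes(sentences):
    """Collect each unique token prefix once; causal LMs embed equal prefixes identically."""
    seen = set()
    for s in sentences:
        tokens = s.split()                   # stand-in for the model tokenizer
        for i in range(1, len(tokens) + 1):
            seen.add(tuple(tokens[:i]))      # store each prefix only once
    return seen

corpus = ["the dog barked", "the dog slept"]
prefixes = unique_prefixes(corpus)
# "the" and "the dog" are shared, so 4 unique prefixes instead of 6
```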

You can also provide your own domain-specific corpus as a .txt file (one sentence per line), a .csv file (first column), or a Python list of strings:

# Custom corpus
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus="my_domain_texts.txt")
index = latentlens.build_index("meta-llama/Meta-Llama-3-8B", corpus=["sentence 1", "sentence 2"])
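The three accepted corpus forms could be handled by a loader along these lines (a hypothetical helper, not the library's actual implementation):

```python
import csv
from pathlib import Path

def load_corpus(corpus):
    """Accept a .txt path (one sentence per line), a .csv path (first column), or a list."""
    if isinstance(corpus, list):
        return [s for s in corpus if s.strip()]
    path = Path(corpus)
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return [row[0] for row in csv.reader(f) if row and row[0].strip()]
    # default: plain text, one sentence per line, blank lines skipped
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]
```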

Relation to the paper: The original LatentLens paper (Krojer et al., 2026) used ~3M Visual Genome phrases (reproduce/vg_phrases.txt) — a corpus tailored to interpreting visual tokens in VLMs. The library instead uses concepts.txt, which provides broad coverage of concepts humans care about, making it suitable for interpreting any LLM-based model, not just VLMs. We validate this improved corpus and extraction pipeline in a forthcoming companion paper. To reproduce the original paper results exactly, see Reproducing Paper Results.

Pre-built Indices

We provide pre-computed contextual embeddings for popular LLMs and VLMs, built from the bundled concepts.txt corpus across 8 layers each. Browse all indices in our HuggingFace Collection.

LLMs:

| Model | HuggingFace Repo | Layers | Size |
|---|---|---|---|
| Llama-3.1-8B | McGill-NLP/contextual_embeddings-llama3.1-8b | 1, 2, 4, 8, 16, 24, 30, 31 | 32 GB |
| Gemma-2-9B | McGill-NLP/contextual_embeddings-gemma2-9b | 1, 2, 4, 8, 16, 24, 40, 41 | 15 GB |
| Qwen2.5-7B | McGill-NLP/contextual_embeddings-qwen2.5-7b | 1, 2, 4, 8, 16, 24, 26, 27 | 28 GB |

VLMs (text-only forward passes through the VLM's finetuned LLM backbone):

| Model | HuggingFace Repo | Layers | Size |
|---|---|---|---|
| Qwen2.5-VL-7B | McGill-NLP/contextual_embeddings-qwen2.5-vl-7b | 1, 2, 4, 8, 16, 24, 26, 27 | 28 GB |
| Qwen3-VL-8B | McGill-NLP/contextual_embeddings-qwen3-vl-8b | 1, 2, 4, 8, 16, 24, 34, 35 | 32 GB |
| Molmo2-8B | McGill-NLP/contextual_embeddings-molmo2-8b | 1, 2, 4, 8, 16, 24, 34, 35 | 32 GB |

Load specific layers to save memory and download time:

# Load only early + late layers (~2 layers instead of 8)
index = latentlens.ContextualIndex.from_pretrained(
    "McGill-NLP/contextual_embeddings-llama3.1-8b", layers=[1, 31]
)

Quickstart Script (Qwen2-VL visual tokens)

For a self-contained demo interpreting visual tokens in Qwen2-VL (no library install needed):

python quickstart.py                            # uses bundled example.png
python quickstart.py --image path/to/image.jpg  # your own image

Pre-computed contextual embeddings are downloaded automatically from HuggingFace. Requires a GPU with >=24GB VRAM.


Reproducing Paper Results

Note: This section reproduces the original paper, which uses a different corpus (~3M Visual Genome phrases) and extraction pipeline (reservoir sampling, float8 storage) than the library above. The library's build_index() and pre-built indices use concepts.txt instead.

This section walks through reproducing our main results on visual token interpretability in VLMs.

Overview

We study how frozen LLMs process visual tokens from vision encoders. We train MLP connectors mapping visual tokens to LLM embedding space, then analyze interpretability using three methods:

| Method | What it does |
|---|---|
| EmbeddingLens | Nearest neighbors in the LLM input embedding matrix |
| LogitLens | Apply the LM head to intermediate representations |
| LatentLens (ours) | Nearest neighbors in contextual text embeddings |
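The three lenses differ only in what a hidden state is compared against. A toy NumPy illustration under assumed shapes (random matrices stand in for the real embedding matrix, LM head, and contextual bank):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, bank_size = 16, 50, 200
h = rng.standard_normal(d)                  # one intermediate hidden state
E = rng.standard_normal((vocab, d))         # input embedding matrix
W_U = rng.standard_normal((vocab, d))       # LM head (unembedding matrix)
bank = rng.standard_normal((bank_size, d))  # contextual embedding bank

def top1_cosine(x, M):
    """Index of the row of M with the highest cosine similarity to x."""
    sims = (M @ x) / (np.linalg.norm(M, axis=1) * np.linalg.norm(x))
    return int(np.argmax(sims))

embedding_lens = top1_cosine(h, E)    # nearest row of the input embedding matrix
logit_lens = int(np.argmax(W_U @ h))  # vocabulary token with the highest logit
latent_lens = top1_cosine(h, bank)    # nearest contextual embedding (LatentLens)
```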

Step 1: Install Package

git clone https://github.com/McGill-NLP/latentlens.git
cd latentlens

# Install with uv (recommended)
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Step 2: Download Models

This step downloads our trained MLP connectors from HuggingFace, then downloads and converts the base LLMs and vision encoders to Molmo's weight format (~50GB total).

9 model configurations (3 LLMs × 3 vision encoders):

| LLM / Vision | ViT-L/14-336 (CLIP) | DINOv2-L-336 | SigLIP-L |
|---|---|---|---|
| OLMo-7B | olmo-vit | olmo-dino | olmo-siglip |
| LLaMA3-8B | llama-vit | llama-dino | llama-siglip |
| Qwen2-7B | qwen-vit | qwen-dino | qwen-siglip |

We also support Qwen2-VL-7B-Instruct (off-the-shelf VLM, no connector needed).

# Download all models (connectors + base models)
./reproduce/step1_download.sh

# Or download just connector weights (~3GB)
./reproduce/step1_download.sh --connectors-only

What gets downloaded and converted:

| Component | Source | Size |
|---|---|---|
| MLP Connectors (9) | McGill-NLP/latentlens-connectors | ~350MB each |
| OLMo-7B | allenai/OLMo-7B-1024-preview | ~14GB |
| LLaMA3-8B | meta-llama/Meta-Llama-3-8B | ~16GB |
| Qwen2-7B | Qwen/Qwen2-7B | ~14GB |
| ViT-L/14-336 | openai/clip-vit-large-patch14-336 | ~1GB |
| DINOv2-L-336 | facebook/dinov2-large | ~1GB |
| SigLIP-L | google/siglip-so400m-patch14-384 | ~1GB |

Directory structure after download:

checkpoints/           # Connector weights + model configs
├── olmo-vit/
│   ├── model.pt       # Connector weights (~350MB)
│   └── config.yaml    # Model architecture config
├── olmo-dino/
│   └── ...
└── ...

pretrained/            # Converted base models (Molmo format)
├── olmo-1024-preview.pt
├── llama3-8b.pt
├── qwen2-7b.pt
├── vit-l-14-336.pt
├── dinov2-large-336.pt
└── siglip-so400m-14-384.pt

Training Connectors (Optional)

If you want to train the connectors from scratch instead of using our pretrained weights:

Prerequisites:

  • Multi-GPU setup recommended (configs default to 4 GPUs; adjustable via NPROC_PER_NODE)
  • PixMo-Cap dataset (see setup below)

Setup:

# Install training dependencies
pip install -e ".[train]"

# Set data directory (images + processed dataset will be stored here)
export MOLMO_DATA_DIR=/path/to/data

# Download the PixMo-Cap dataset (downloads images, may take a while)
python -c "from molmo.data.pixmo_datasets import PixMoCap; PixMoCap.download(n_procs=8)"

# Download base models (LLMs + vision encoders)
./reproduce/step1_download.sh

Train:

# Train all 9 models
./reproduce/step0_train.sh

# Or train a single model
./reproduce/step0_train.sh --model olmo-vit

# Or use a different GPU count
NPROC_PER_NODE=8 ./reproduce/step0_train.sh --model olmo-vit

Each model trains for 12,000 steps on PixMo-Cap with the LLM and vision encoder frozen — only the MLP connector is trained.
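The connector itself is just a small MLP that projects vision-encoder features into the LLM's embedding space. A toy NumPy forward pass under assumed dimensions (1024-dim vision features, one hidden layer, 4096-dim LLM embeddings; the activation and layer sizes are illustrative, not the paper's exact architecture):

```python
import numpy as np

class MLPConnector:
    """Two-layer MLP projecting visual tokens into LLM embedding space (toy dims)."""
    def __init__(self, d_vision=1024, d_hidden=2048, d_llm=4096, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((d_vision, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02

    def __call__(self, visual_tokens):
        hidden = np.maximum(visual_tokens @ self.w1, 0.0)  # ReLU is an assumption here
        return hidden @ self.w2                            # [num_patches, d_llm]

connector = MLPConnector()
patches = np.random.default_rng(1).standard_normal((576, 1024))  # 24x24 CLIP patch grid
llm_tokens = connector(patches)  # one LLM-space token per image patch
```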

You can also run the training script directly:

torchrun --nproc_per_node=4 reproduce/scripts/train_connector.py \
    reproduce/configs/olmo-vit.yaml

# Dry run (parse config, init model, no actual training)
torchrun --nproc_per_node=1 reproduce/scripts/train_connector.py \
    reproduce/configs/olmo-vit.yaml --dry_run

After training, extract connector weights:

python scripts/extract_connector.py \
    --checkpoint checkpoints/<save_folder>/step12000-unsharded \
    --output connectors/olmo-vit.pt

Step 3: Extract Contextual Embeddings

For LatentLens analysis, you need contextual text embeddings from each LLM. This is the most time-consuming step (~13h per LLM on a single GPU, processing ~3M Visual Genome phrases).

# Extract for all LLMs sequentially
./reproduce/step2_extract_contextual.sh

# Or for a specific LLM:
python reproduce/scripts/extract_embeddings.py \
    --model allenai/OLMo-7B-1024-preview \
    --layers 1 2 4 8 16 24 30 31 \
    --output-dir contextual_embeddings/olmo-7b

Speed up with multiple GPUs: The fastest approach is to run each LLM on a separate GPU in parallel, reducing wall time from ~40h to ~13h:

CUDA_VISIBLE_DEVICES=0 ./reproduce/step2_extract_contextual.sh olmo  &
CUDA_VISIBLE_DEVICES=1 ./reproduce/step2_extract_contextual.sh llama &
CUDA_VISIBLE_DEVICES=2 ./reproduce/step2_extract_contextual.sh qwen  &
wait

The script supports checkpointing — if interrupted, it resumes from the last saved progress.
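The resume pattern can be sketched as follows (a hypothetical progress file holding a counter; the real script persists the embeddings themselves, not just an index):

```python
import json
from pathlib import Path

def process_with_resume(items, progress_file, work):
    """Process items in order, persisting the last finished index so a rerun resumes."""
    progress = Path(progress_file)
    start = json.loads(progress.read_text())["done"] if progress.exists() else 0
    for i in range(start, len(items)):
        work(items[i])                                    # e.g. embed one batch of phrases
        progress.write_text(json.dumps({"done": i + 1}))  # atomic writes omitted in this sketch
    return start  # index we resumed from
```

If the process dies mid-run, the next invocation skips everything already marked done.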

Step 4: Run Analysis

LatentLens (contextual nearest neighbors):

CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_latentlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --contextual-dir contextual_embeddings/olmo-7b/allenai_OLMo-7B-1024-preview \
    --visual-layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/latentlens/olmo-vit

LogitLens:

# Single GPU
CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_logitlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --layers 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/logitlens/olmo-vit

# Multi-GPU (optional, faster)
torchrun --nproc_per_node=4 reproduce/scripts/run_logitlens.py \
    --ckpt-path checkpoints/olmo-vit \
    --layers 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-dir results/logitlens/olmo-vit

EmbeddingLens:

# Single GPU
CUDA_VISIBLE_DEVICES=0 python reproduce/scripts/run_embedding_lens.py \
    --ckpt-path checkpoints/olmo-vit \
    --llm_layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-base-dir results/embedding_lens/olmo-vit

# Multi-GPU (optional, faster)
torchrun --nproc_per_node=4 reproduce/scripts/run_embedding_lens.py \
    --ckpt-path checkpoints/olmo-vit \
    --llm_layer 0,1,2,4,8,16,24,30,31 \
    --num-images 300 \
    --output-base-dir results/embedding_lens/olmo-vit

Step 5: Evaluate Interpretability (Optional)

The paper's main results use GPT-5 to evaluate whether nearest neighbors are semantically related to image patches. This requires an OpenAI API key and costs ~$80-100 for full reproduction.

# Set API key
export OPENAI_API_KEY="your-key-here"

# Evaluate a single model (~$1 for 100 patches)
python reproduce/scripts/evaluate/evaluate_interpretability.py \
    --results-dir results/latentlens/olmo-vit \
    --images-dir /path/to/pixmo-cap/validation \
    --output-dir evaluation/latentlens/olmo-vit \
    --num-patches 100

# For SigLIP or DINOv2 models, pass --model-name so the evaluator
# uses the correct vision encoder grid size:
python reproduce/scripts/evaluate/evaluate_interpretability.py \
    --results-dir results/latentlens/olmo-siglip \
    --images-dir /path/to/pixmo-cap/validation \
    --output-dir evaluation/latentlens/olmo-siglip \
    --model-name olmo-siglip \
    --num-patches 100

# Aggregate results across models
python reproduce/scripts/evaluate/aggregate_results.py \
    --eval-dir evaluation/ \
    --output results/my_results.json

Note on --model-name: The evaluation script needs to know the vision encoder to determine the correct patch grid size. Pass --model-name matching the model you are evaluating:

  • CLIP (ViT-L/14) models (olmo-vit, llama-vit, qwen-vit): No --model-name needed — CLIP's 24x24 grid is the default.
  • SigLIP models (olmo-siglip, llama-siglip, qwen-siglip): Pass --model-name containing "siglip" (e.g., --model-name olmo-siglip).
  • DINOv2 models (olmo-dino, llama-dino, qwen-dino): Pass --model-name containing "dinov2" (e.g., --model-name olmo-dino).
  • Qwen2-VL: Pass --model-name qwen2vl.

Model Configurations (Paper)

| Model | LLM Layers | Vision Patches | Layers Analyzed |
|---|---|---|---|
| OLMo-7B / LLaMA3-8B | 32 | 576 (24×24) | 0, 1, 2, 4, 8, 16, 24, 30, 31 |
| Qwen2-7B / Qwen2-VL | 28 | 729 (27×27) | 0, 1, 2, 4, 8, 16, 24, 26, 27 |

Note: SigLIP uses 27×27 patches (729 total), while CLIP and DINOv2 use 24×24 (576 total).
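When mapping a flat patch index back to an image location, this grid size is what matters. A small helper sketch (hypothetical, mirroring the grid sizes noted above):

```python
def patch_coords(patch_idx, encoder):
    """Map a flat patch index to (row, col) on the encoder's patch grid."""
    grid = {"clip": 24, "dinov2": 24, "siglip": 27}[encoder]  # 24x24=576, 27x27=729
    if not 0 <= patch_idx < grid * grid:
        raise ValueError(f"patch index out of range for {grid}x{grid} grid")
    return divmod(patch_idx, grid)  # row-major layout is assumed here

# e.g. the last CLIP patch sits at the bottom-right corner: patch_coords(575, "clip")
```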

Reproduce Main Results (all 9 models)

# Run all experiments
./reproduce/run_all.sh

# Or step by step:
./reproduce/step2_extract_contextual.sh  # ~40h sequential, ~13h with 3 GPUs in parallel
./reproduce/step3_run_analysis.sh        # ~13.5h (9 models × 3 methods)

Project Structure

├── concepts.txt              # Bundled corpus (117k sentences from 23k concepts)
├── quickstart.py             # Try LatentLens in 5 minutes (standalone)
├── latentlens/               # Library: build & search contextual indices
│   ├── index.py              # ContextualIndex, Neighbor, search, save/load
│   ├── extract.py            # build_index(), corpus loading, prefix dedup
│   └── models.py             # load_model(), get_hidden_states(), MODEL_DEFAULTS
├── molmo/                    # Molmo VLM infrastructure (for reproduction)
│   ├── model.py              # Model architecture with layer hooks
│   ├── config.py             # Configuration classes
│   ├── train.py              # Trainer class (for connector training)
│   ├── optim.py              # Optimizer and LR schedulers
│   ├── eval/                 # Loss evaluator (training only)
│   └── data/                 # Image preprocessing, datasets
└── reproduce/                # Paper reproduction
    ├── scripts/              # Analysis scripts
    │   ├── run_latentlens.py
    │   ├── run_logitlens.py
    │   ├── run_embedding_lens.py
    │   ├── extract_embeddings.py
    │   ├── train_connector.py  # Training entry point (torchrun)
    │   └── evaluate/         # LLM judge evaluation
    ├── configs/              # Model configurations (YAML)
    ├── vg_phrases.txt        # Visual Genome phrases corpus
    ├── step0_train.sh        # Train connectors from scratch (optional)
    ├── step1_download.sh
    ├── step2_extract_contextual.sh
    └── step3_run_analysis.sh

Citation

@article{krojer2026latentlens,
  title={LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs},
  author={Krojer, Benno and Nayak, Shravan and Ma{\~n}as, Oscar and Adlakha, Vaibhav and Elliott, Desmond and Reddy, Siva and Mosbach, Marius},
  journal={arXiv preprint arXiv:2602.00462},
  year={2026}
}

Acknowledgments

This project builds on the Molmo codebase by the Allen Institute for AI. We thank them for releasing their code under the Apache 2.0 license.


License

MIT License. See LICENSE for details.
