#

mechanistic-interpretability

Here are 80 public repositories matching this topic...

pyvene

stanfordnlp / pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

intervention interpretability mechanistic-interpretability activation-intervention activation-patching

Updated Jun 2, 2025
Python

ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models

This repository collects all relevant resources about interpretability in LLMs

dictionary-learning sparse-autoencoder interpretability-and-explainability mechanistic-interpretability

Updated Nov 1, 2024

MadryLab / modelcomponents

Decomposing and Editing Predictions by Modeling Model Computation

attribution pytorch interpretability model-editing mechanistic-interpretability

Updated Jun 12, 2024
Jupyter Notebook

OpenMOSS / Language-Model-SAEs

For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

sparse-autoencoders interpretability sparse-dictionary mechanistic-interpretability

Updated Jun 18, 2025
Python

steering-vectors / steering-vectors

Steering vectors for transformer language models in Pytorch / Huggingface

nlp ai pytorch gpt huggingface mechanistic-interpretability representation-engineering

Updated Feb 21, 2025
Python

stanfordnlp / axbench

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

intervention interpretability large-language-models mechanistic-interpretability llm-steering

Updated Jun 5, 2025
Python

pauljblazek / deepdistilling

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms

program-synthesis knowledge-distillation inductive-logic-programming domain-adaptation explainable-ai interpretable distilling neurosymbolic model-distillation out-of-distribution-generalization mechanistic-interpretability

Updated Feb 20, 2024
Python

jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks

reinforcement-learning mechanistic-interpretability

Updated Oct 23, 2023
Jupyter Notebook

epfl-dlab / llm-latent-language

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

multilingual-nlp llm mechanistic-interpretability llama2

Updated Mar 11, 2024
Jupyter Notebook

itsqyh / Awesome-LMMs-Mechanistic-Interpretability

A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.

generative-model generative paperlist vision-models large-language-models mechanistic-interpretability large-vision-language-models large-multimodal-models vision-foundation-model

Updated Jun 18, 2025

apartresearch / interpretability-starter

🧠 Starter templates for doing interpretability research

interpretability interpretability-jam alignment-jam mechanistic-interpretability

Updated Jul 16, 2023

taufeeque9 / codebook-features

Sparse and discrete interpretability tool for neural networks

transformers features language-model interpretability codebook mechanistic-interpretability

Updated Feb 12, 2024
Python

wesg52 / sparse-probing-paper

Sparse probing paper full code.

ai-safety interpretability ai-alignment mechanistic-interpretability

Updated Dec 17, 2023
Jupyter Notebook

Trustworthy-ML-Lab / CLIP-dissect

[ICLR 23 spotlight] An automatic and efficient tool to describe functionalities of individual neurons in DNNs

deep-neural-networks computer-vision deep-learning interpretable-deep-learning explainable-ai interpretable-machine-learning mechanistic-interpretability

Updated Nov 6, 2023
Jupyter Notebook

automated-brain-explanations

microsoft / automated-brain-explanations

Generating and validating natural-language explanations for the brain.

data-science machine-learning natural-language-processing neuroscience artificial-intelligence fmri gpt explanation language-model interpretability xai fmri-data-analysis huggingface gpt4 large-language-models ai-for-science mechanistic-interpretability automated-interpretability interpretable-embeddings

Updated Jun 11, 2025
Jupyter Notebook

aryamanarora / causalgym

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

benchmark causality interpretability mechanistic-interpretability syntaxgym

Updated Nov 30, 2024
Python

yash-srivastava19 / arrakis

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

transformer garcon explainable-ai mechanistic-interpretability anthropic transformerlens

Updated Apr 22, 2025
Jupyter Notebook

wesg52 / universal-neurons

Universal Neurons in GPT2 Language Models

ai-safety interpretability llm mechanistic-interpretability

Updated May 28, 2024
Jupyter Notebook

Nix07 / finetuning

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

finetuning entity-tracking mechanistic-interpretability science-of-deep-learning

Updated Mar 21, 2024
Jupyter Notebook

tim-lawson / mlsae

Multi-Layer Sparse Autoencoders (ICLR 2025)

transformer sae sparse-autoencoder mechanistic-interpretability

Updated Feb 11, 2025
Python

Improve this page

Add a description, image, and links to the mechanistic-interpretability topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the mechanistic-interpretability topic, visit your repo's landing page and select "manage topics."