Skip to content

[WIP] [NeurIPS 2025 Spotlight] Angular Steering: Behavior Control via Rotation in Activation Space

Notifications You must be signed in to change notification settings

lone17/angular-steering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Content

The main source containing:

  • angular_steering.ipynb: analysis and visualization of activations and directions, extraction of the steering directions, construction of the steering plane.
  • generate_responsees.py: generation of steered responses (with our vLLM fork) used for evaluation.
  • evaluate_jailbreak.py: evaluation of substring_matching, LlamaGuard 3, HarmBench and LLM-as-a-judge.
  • eval_perplexity.py: evaluating the perplexity of steered generation.
  • eval.sh: script to run the evaluation of tinyBenchmarks. (with github.com/EleutherAI/lm-evaluation-harness)
  • generate_random_config.py: generation of steering configs on random planes.
  • visualization.ipynb: visualization of benchmark results used in the paper.
  • endpoint.py: an OpenAI-compatible endpoint server to play with Angular Steering.
  • steering_demo.py: a gradio chat UI that uses the endpoint server.

You will also need

Installation

  1. Create conda environment:
conda create -n angular_steering python=3.10
conda activate angular_steering
cd llm-activation-control/
pip install -r requirements.txt
conda deactivate
cd ..
  1. Install vLLM from source (following the instructions in the vLLM repo https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html):
cd ../vllm/
conda activate angular_steering
VLLM_USE_PRECOMPILED=1 pip install --editable .
conda deactivate
cd ..
  1. Create a separated conda environment for lm-evaluation-harness and install it following https://github.com/EleutherAI/lm-evaluation-harness:
cd ..
conda create -n lm_eval python=3.10
conda activate lm_eval
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
conda deactivate
cd ..

Reproducing the results

Using precomputed results

We provide the precomputed outputs and results in the output/ and benchmarks/ folders. You can directly use the visualization.ipynb notebook to visualize the results and run the endpoint.py and steering_demo.py scripts to play with Angular Steering.

Generating the results

  1. Run the angular_steering.ipynb notebook to extract the steering directions and construct the steering plane. This will generate the output/ folder with the extracted steering directions and the steering plane.
  2. Run the generate_responsees.py script to generate the steered responses.
  3. The evaluate_jailbreak.py script will evaluate the steered responses with substring_matching, LlamaGuard 3, HarmBench and LLM-as-a-judge. Some of the evaluation requires serving LLMs beforehand.
    • Serve meta-llama/Llama-Guard-3-8B using vLLM on port 8898 for LlamaGuard 3 evaluation.
    • Serve Qwen/QVQ-72B-Preview using vLLM on port 8809 for LLM-as-a-judge evaluation.
    • For HarmBench, the script will serve cais/HarmBench-Llama-2-13b-cls natively so make sure you have enough GPU memory to run it.
    • For substring_matching, it does not require serving any LLMs.
    • You should edit the methods list in the main function of the script to include/exclude the methods you want to evaluate.
    • The output will be saved in the output/ folder.
  4. Run the eval_perplexity.py script to evaluate the perplexity of the steered generation. The result will be saved in the output/ folder.
  5. Run the eval.sh script in the lm-eval environment to evaluate the steered generation on tinyBenchmarks.
    • You must first serve run endpoint.py script to serve an OpenAI-compatible endpoint server with Angular Steering supported. Change the model name in the script to the one you want to evaluate.
    • Change the MODELS and port variables in the eval.sh script to the models you want to evaluate and the port of the endpoint server.
    • The output will be saved in the benchmarks/ folder.
    • This might take a while.
  6. Run the visualization.ipynb notebook to visualize the benchmark results.
  7. Optionally, you can run the endpoint.py and steering_demo.py script to play with Angular Steering using a gradio chat UI.

Thank you for your interest in our work!

About

[WIP] [NeurIPS 2025 Spotlight] Angular Steering: Behavior Control via Rotation in Activation Space

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages