The main source code consists of:

- `angular_steering.ipynb`: analysis and visualization of activations and directions, extraction of the steering directions, and construction of the steering plane.
- `generate_responsees.py`: generation of steered responses (with our vLLM fork) used for evaluation.
- `evaluate_jailbreak.py`: evaluation with `substring_matching`, `LlamaGuard 3`, `HarmBench`, and LLM-as-a-judge.
- `eval_perplexity.py`: evaluation of the perplexity of steered generation.
- `eval.sh`: script to run the evaluation on tinyBenchmarks (with https://github.com/EleutherAI/lm-evaluation-harness).
- `generate_random_config.py`: generation of steering configs on random planes.
- `visualization.ipynb`: visualization of the benchmark results used in the paper.
- `endpoint.py`: an OpenAI-compatible endpoint server to play with Angular Steering.
- `steering_demo.py`: a Gradio chat UI that uses the endpoint server.
You will also need:
- Our fork of vLLM with the necessary modifications to support Angular Steering: https://github.com/lone17/vllm/
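  The steps below assume this repo and the vLLM fork are cloned side by side; this layout is an assumption on our part, so adjust the `cd` paths if yours differs. `<this-repo-url>` is a placeholder:

  ```bash
  # from an empty working directory, clone the two repos as siblings
  git clone <this-repo-url> llm-activation-control
  git clone https://github.com/lone17/vllm
  ```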
- Create a conda environment and install the requirements:

  ```bash
  conda create -n angular_steering python=3.10
  conda activate angular_steering
  cd llm-activation-control/
  pip install -r requirements.txt
  conda deactivate
  cd ..
  ```

- Install vLLM from source, following the instructions in the vLLM repo (https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html):
  ```bash
  cd vllm/  # the fork cloned next to llm-activation-control
  conda activate angular_steering
  VLLM_USE_PRECOMPILED=1 pip install --editable .
  conda deactivate
  cd ..
  ```

- Create a separate conda environment for lm-evaluation-harness and install it following https://github.com/EleutherAI/lm-evaluation-harness:
  ```bash
  conda create -n lm_eval python=3.10
  conda activate lm_eval
  git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
  cd lm-evaluation-harness
  pip install -e .
  conda deactivate
  cd ..
  ```

We provide the precomputed outputs and results in the `output/` and `benchmarks/` folders. You can directly use the `visualization.ipynb` notebook to visualize the results, and run the `endpoint.py` and `steering_demo.py` scripts to play with Angular Steering.
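For example, to explore the precomputed results without rerunning anything (a minimal sketch, assuming Jupyter is available in the `angular_steering` environment):

```bash
conda activate angular_steering
jupyter notebook visualization.ipynb
```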
- Run the `angular_steering.ipynb` notebook to extract the steering directions and construct the steering plane. This will generate the `output/` folder with the extracted steering directions and the steering plane.
- Run the `generate_responsees.py` script to generate the steered responses.
- The `evaluate_jailbreak.py` script will evaluate the steered responses with `substring_matching`, `LlamaGuard 3`, `HarmBench`, and LLM-as-a-judge. Some of the evaluations require serving LLMs beforehand (see the serving sketch below):
  - Serve `meta-llama/Llama-Guard-3-8B` using vLLM on port 8898 for the `LlamaGuard 3` evaluation.
  - Serve `Qwen/QVQ-72B-Preview` using vLLM on port 8809 for the LLM-as-a-judge evaluation.
  - For `HarmBench`, the script will serve `cais/HarmBench-Llama-2-13b-cls` natively, so make sure you have enough GPU memory to run it.
  - `substring_matching` does not require serving any LLMs.
  - Edit the `methods` list in the main function of the script to include/exclude the methods you want to evaluate.
  - The output will be saved in the `output/` folder.
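  A minimal serving sketch for the two judge models, using vLLM's standard `vllm serve` CLI; the `--tensor-parallel-size` value is an assumption, so set it to match your GPUs:

  ```bash
  # Terminal 1: LlamaGuard 3 judge on port 8898
  vllm serve meta-llama/Llama-Guard-3-8B --port 8898

  # Terminal 2: LLM-as-a-judge model on port 8809
  # (QVQ-72B is large and will likely need several GPUs, e.g. 4-way tensor parallelism)
  vllm serve Qwen/QVQ-72B-Preview --port 8809 --tensor-parallel-size 4
  ```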
- Run the `eval_perplexity.py` script to evaluate the perplexity of the steered generation. The result will be saved in the `output/` folder.
- Run the `eval.sh` script in the `lm_eval` environment to evaluate the steered generation on tinyBenchmarks (see the sketch below):
  - You must first run the `endpoint.py` script to serve an OpenAI-compatible endpoint server with Angular Steering support. Change the model name in the script to the one you want to evaluate.
  - Change the `MODELS` and `port` variables in the `eval.sh` script to the models you want to evaluate and the port of the endpoint server.
  - The output will be saved in the `benchmarks/` folder.
  - This might take a while.
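  A possible two-terminal workflow (a sketch; `endpoint.py` and `eval.sh` must already be edited as described above):

  ```bash
  # Terminal 1: serve the steered model behind an OpenAI-compatible API
  conda activate angular_steering
  python endpoint.py

  # Terminal 2: run the tinyBenchmarks evaluation against that endpoint
  conda activate lm_eval
  bash eval.sh
  ```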
- Run the `visualization.ipynb` notebook to visualize the benchmark results.
- Optionally, you can run the `endpoint.py` and `steering_demo.py` scripts to play with Angular Steering using a Gradio chat UI (a sketch follows below).
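A sketch of the demo workflow; the port (8000 here) and the served model name are placeholders that you should match to what `endpoint.py` actually serves:

```bash
python endpoint.py &      # the Angular Steering endpoint server
python steering_demo.py   # the Gradio chat UI on top of it

# or talk to the OpenAI-compatible API directly:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model>", "messages": [{"role": "user", "content": "Hi!"}]}'
```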
Thank you for your interest in our work!