Content

The main source containing:

angular_steering.ipynb: analysis and visualization of activations and directions, extraction of the steering directions, construction of the steering plane.
generate_responsees.py: generation of steered responses (with our vLLM fork) used for evaluation.
evaluate_jailbreak.py: evaluation of substring_matching, LlamaGuard 3, HarmBench and LLM-as-a-judge.
eval_perplexity.py: evaluating the perplexity of steered generation.
eval.sh: script to run the evaluation of tinyBenchmarks. (with github.com/EleutherAI/lm-evaluation-harness)
generate_random_config.py: generation of steering configs on random planes.
visualization.ipynb: visualization of benchmark results used in the paper.
endpoint.py: an OpenAI-compatible endpoint server to play with Angular Steering.
steering_demo.py: a gradio chat UI that uses the endpoint server.

You will also need

Our fork of vLLM with the necessary modifications to support Angular Steering: https://github.com/lone17/vllm/

Installation

Create conda environment:

conda create -n angular_steering python=3.10
conda activate angular_steering
cd llm-activation-control/
pip install -r requirements.txt
conda deactivate
cd ..

Install vLLM from source (following the instructions in the vLLM repo https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html):

cd ../vllm/
conda activate angular_steering
VLLM_USE_PRECOMPILED=1 pip install --editable .
conda deactivate
cd ..

Create a separated conda environment for lm-evaluation-harness and install it following https://github.com/EleutherAI/lm-evaluation-harness:

cd ..
conda create -n lm_eval python=3.10
conda activate lm_eval
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
conda deactivate
cd ..

Reproducing the results

Using precomputed results

We provide the precomputed outputs and results in the output/ and benchmarks/ folders. You can directly use the visualization.ipynb notebook to visualize the results and run the endpoint.py and steering_demo.py scripts to play with Angular Steering.

Generating the results

Run the angular_steering.ipynb notebook to extract the steering directions and construct the steering plane. This will generate the output/ folder with the extracted steering directions and the steering plane.
Run the generate_responsees.py script to generate the steered responses.
The evaluate_jailbreak.py script will evaluate the steered responses with substring_matching, LlamaGuard 3, HarmBench and LLM-as-a-judge. Some of the evaluation requires serving LLMs beforehand.
- Serve meta-llama/Llama-Guard-3-8B using vLLM on port 8898 for LlamaGuard 3 evaluation.
- Serve Qwen/QVQ-72B-Preview using vLLM on port 8809 for LLM-as-a-judge evaluation.
- For HarmBench, the script will serve cais/HarmBench-Llama-2-13b-cls natively so make sure you have enough GPU memory to run it.
- For substring_matching, it does not require serving any LLMs.
- You should edit the methods list in the main function of the script to include/exclude the methods you want to evaluate.
- The output will be saved in the output/ folder.
Run the eval_perplexity.py script to evaluate the perplexity of the steered generation. The result will be saved in the output/ folder.
Run the eval.sh script in the lm-eval environment to evaluate the steered generation on tinyBenchmarks.
- You must first serve run endpoint.py script to serve an OpenAI-compatible endpoint server with Angular Steering supported. Change the model name in the script to the one you want to evaluate.
- Change the MODELS and port variables in the eval.sh script to the models you want to evaluate and the port of the endpoint server.
- The output will be saved in the benchmarks/ folder.
- This might take a while.
Run the visualization.ipynb notebook to visualize the benchmark results.
Optionally, you can run the endpoint.py and steering_demo.py script to play with Angular Steering using a gradio chat UI.

Thank you for your interest in our work!

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
benchmarks		benchmarks
data		data
llm_activation_control		llm_activation_control
output		output
visualization		visualization
.gitignore		.gitignore
README.md		README.md
angular_steering.ipynb		angular_steering.ipynb
angular_steering.py		angular_steering.py
chat_ui.py		chat_ui.py
configs.py		configs.py
control_vector.ipynb		control_vector.ipynb
direction_forcing.ipynb		direction_forcing.ipynb
direction_forcing.py		direction_forcing.py
endpoint.py		endpoint.py
eval_perplexity.py		eval_perplexity.py
eval_tinybench.sh		eval_tinybench.sh
evaluate_jailbreak.py		evaluate_jailbreak.py
generate_random_config.py		generate_random_config.py
generate_responses.py		generate_responses.py
llama_guard.py		llama_guard.py
matrix_transform.ipynb		matrix_transform.ipynb
requirements.txt		requirements.txt
setup.py		setup.py
steering_demo.py		steering_demo.py
tiny_benchmarks.py		tiny_benchmarks.py
visualization.ipynb		visualization.ipynb
weights_basis.ipynb		weights_basis.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Content

Installation

Reproducing the results

Using precomputed results

Generating the results

About

Uh oh!

Releases

Packages

Languages

lone17/angular-steering

Folders and files

Latest commit

History

Repository files navigation

Content

Installation

Reproducing the results

Using precomputed results

Generating the results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages