AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Code for the paper "AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin", introducing a regularization-based method to anchor parameter updates within safety-aligned subspaces for robust LLM fine-tuning.

🎯 Method Overview

Figure 1: The "narrow safety basin" concept. Perturbations along the alignment direction d_aligned preserve safety, while perturbations along orthogonal directions d⊥ lead to rapid safety degradation.

Figure 2: The AsFT framework decomposes each parameter update into a safety-aligned component (along d_aligned) and an orthogonal component (along d⊥), suppressing harmful updates via subspace regularization.

Key Idea:
AsFT uses the alignment direction (the weight difference between the safety-aligned model and the base model) as an anchor. It decomposes each parameter update into components along and orthogonal to this direction, and a regularization term suppresses the orthogonal component. This keeps fine-tuning within the "narrow safety basin," preserving safety while maintaining downstream task performance. A minimal sketch of this regularizer is given below.
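The snippet below is a minimal sketch of such a subspace regularizer, not the authors' implementation; the function name, the flattened per-matrix projection, and the lambda_reg weighting are assumptions for illustration:

# Minimal sketch of an AsFT-style regularizer (assumption: not the authors' exact code)
import torch

def asft_regularizer(delta_w: torch.Tensor, d_aligned: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Penalize the component of a parameter update orthogonal to the alignment direction.

    delta_w:   update for one weight matrix (W - W_init)
    d_aligned: alignment direction for the same matrix (W_aligned - W_base)
    """
    d = d_aligned.flatten()
    u = d / (d.norm() + eps)          # unit vector along the alignment direction
    dw = delta_w.flatten()
    proj = torch.dot(dw, u) * u       # component along the alignment direction
    orth = dw - proj                  # orthogonal component to be suppressed
    return orth.pow(2).sum()          # squared norm as the penalty

# Hypothetical training objective:
#   loss = task_loss + lambda_reg * sum of asft_regularizer over tracked weight matrices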


🛠️ Setup

Environment Configuration

# Create conda environment
conda create -n AsFT python=3.9
conda activate AsFT
cd AsFT 

# Install dependencies
pip install -r requirements.txt

Model Preparation

# Create model storage directory (if needed)
mkdir -p ckpts/
Model             HuggingFace Link                 Notes
Llama-2-7B-Chat   TheBloke/Llama-2-7B-Chat-fp16    Safety-aligned model
Llama-2-7B-base   meta-llama/Llama-2-7b-hf         Base model
Beaver-Dam-7B     PKU-Alignment/beaver-dam-7b      Safety evaluation model

Note: Download the models listed in the table above to the ckpts/ folder.
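One way to do this is with the huggingface_hub Python package; the snippet below is a sketch (the meta-llama repo is gated and requires approved access plus an authentication token via huggingface-cli login):

# Sketch: download the models into ckpts/ (local paths match the table above)
from huggingface_hub import snapshot_download

models = {
    "TheBloke/Llama-2-7B-Chat-fp16": "ckpts/Llama-2-7B-Chat-fp16",
    "meta-llama/Llama-2-7b-hf": "ckpts/Llama-2-7b-hf",  # gated: requires approved access
    "PKU-Alignment/beaver-dam-7b": "ckpts/beaver-dam-7b",
}
for repo_id, local_dir in models.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)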

Directory Structure

AsFT/
├── ckpts/
│   ├── Llama-2-7B-Chat-fp16/
│   ├── Llama-2-7b-hf/
│   └── beaver-dam-7b/
├── configs/
├── ft_datasets/
└── ... (other project folders)

⚠️ Important Notes:

  • Llama-2 models require access approval on HuggingFace
  • All models should be placed under ckpts/
  • Use exact folder names as shown above

🚀 Training

Running Fine-tuning

Training scripts are organized by dataset under scripts/, supporting Agnews, Alpaca, GSM8K, and SST2.

Basic Training Commands

# Create the log directories first
mkdir -p finetuned_logs/{agnews,alpaca,gsm8k,SST2}

# For the Agnews dataset (default 1k_p_0.1 mode)
bash scripts/agnews/AsFT_reg1_p_0.1.sh > finetuned_logs/agnews/AsFT_reg1_p_0.1.log 2>&1 &

# Other datasets
bash scripts/alpaca/AsFT_reg1_p_0.1.sh > finetuned_logs/alpaca/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/gsm8k/AsFT_reg1_p_0.1.sh > finetuned_logs/gsm8k/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/SST2/AsFT_reg1_p_0.1.sh > finetuned_logs/SST2/AsFT_reg1_p_0.1.log 2>&1 &

Experimental Modes

Training behavior is configured via the --mode parameter. Edit --mode in the relevant .sh script to reproduce the different experimental setups described in the paper; the available modes are listed below, and a short sketch of the naming convention follows the table.

Mode         Description
1k_p_0       1k samples, 0% harmful data
1k_p_0.05    1k samples, 5% harmful data
1k_p_0.1     1k samples, 10% harmful data (default)
1k_p_0.15    1k samples, 15% harmful data
1k_p_0.2     1k samples, 20% harmful data
0.5k_p_0.1   500 samples, 10% harmful data
1.5k_p_0.1   1500 samples, 10% harmful data
2k_p_0.1     2000 samples, 10% harmful data
2.5k_p_0.1   2500 samples, 10% harmful data
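Each mode string encodes the sample count and the harmful-data fraction. The following hypothetical helper (not part of the repository) illustrates the convention:

# Mode naming convention: "<N>k_p_<r>" = N*1000 samples, fraction r harmful data
def parse_mode(mode: str) -> tuple[int, float]:
    size_part, ratio_part = mode.split("_p_")
    n_samples = int(float(size_part.rstrip("k")) * 1000)
    return n_samples, float(ratio_part)

assert parse_mode("1k_p_0.1") == (1000, 0.1)
assert parse_mode("0.5k_p_0.1") == (500, 0.1)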

📊 Evaluation

Poison Evaluation (Safety Assessment)

cd evaluation/poison_evaluation

# Run for Agnews
bash scripts/agnews/eval_agnews.sh > scripts/agnews/eval_agnews.log 2>&1 &

# Other datasets
bash scripts/alpaca/eval_alpaca.sh > scripts/alpaca/eval_alpaca.log 2>&1 &
bash scripts/gsm8k/eval_gsm8k.sh > scripts/gsm8k/eval_gsm8k.log 2>&1 &
bash scripts/SST2/eval_SST2.sh > scripts/SST2/eval_SST2.log 2>&1 &

Utility Evaluation (Task Performance)

# For Agnews
cd evaluation/utility_evaluation/agnews
bash scripts/eval.sh > scripts/eval.log 2>&1 &

# For GSM8K/SST2
cd ../gsm8k && bash scripts/eval.sh
cd ../SST2 && bash scripts/eval.sh

# Alpaca requires LLM-Judge
cd ../alpaca
# Follow instructions in the directory's README.md
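Under the hood, the evaluation scripts load a fine-tuned checkpoint and generate responses for scoring. The generic sketch below shows the idea with the transformers library; the checkpoint path is an assumption, not a path guaranteed by the repo:

# Generic sketch (not repo code): load a fine-tuned model and generate a response
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "finetuned_models/agnews/AsFT_reg1_p_0.1"  # hypothetical output path
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Classify the topic of this news article: ..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))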

📂 Project Structure

AsFT/
├── ckpts/                     # Model checkpoints
├── configs/                   # Training configurations
├── evaluation/
│   ├── poison_evaluation/     # Safety assessment scripts
│   └── utility_evaluation/    # Task performance evaluation
├── finetuned_logs/            # Training logs
├── finetuned_models/          # Fine-tuned model outputs
├── ft_datasets/               # Processed datasets
├── images/                    # Figures for documentation
├── scripts/
│   ├── agnews/                # Dataset-specific scripts
│   ├── alpaca/
│   ├── gsm8k/
│   └── SST2/
├── utils/                     # Utility functions
├── LICENSE
└── requirements.txt

🙏 Acknowledgment

This repository builds on several open-source projects, and we sincerely thank their authors: their work provided critical inspiration and technical references for this research. Special thanks to the LLM safety community for driving innovation in this field.

