AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Code for the paper "AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin", introducing a regularization-based method to anchor parameter updates within safety-aligned subspaces for robust LLM fine-tuning.

🎯 Method Overview

Figure 1: The "narrow safety basin" concept. Perturbations along the alignment direction d_aligned preserve safety, while perturbations along orthogonal directions d⊥ lead to rapid safety degradation.

Figure 2: The AsFT framework decomposes each parameter update into a safety-aligned component (along d_aligned) and an orthogonal component (along d⊥), suppressing harmful updates via subspace regularization.

Key Idea:
AsFT uses the alignment direction (the weight difference between the safety-aligned model and the base model) as an anchor. It decomposes each parameter update into components along and orthogonal to this direction, and a regularization term suppresses the orthogonal component. This keeps fine-tuning within the "narrow safety basin," preserving safety while maintaining downstream task performance. A minimal sketch of this regularizer is given below.
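The snippet below is a minimal sketch of such a subspace regularizer, not the authors' implementation; the function name, the flattened per-matrix projection, and the lambda_reg weighting are assumptions for illustration:

# Minimal sketch of an AsFT-style regularizer (assumption: not the authors' exact code)
import torch

def asft_regularizer(delta_w: torch.Tensor, d_aligned: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Penalize the component of a parameter update orthogonal to the alignment direction.

    delta_w:   update for one weight matrix (W - W_init)
    d_aligned: alignment direction for the same matrix (W_aligned - W_base)
    """
    d = d_aligned.flatten()
    u = d / (d.norm() + eps)          # unit vector along the alignment direction
    dw = delta_w.flatten()
    proj = torch.dot(dw, u) * u       # component along the alignment direction
    orth = dw - proj                  # orthogonal component to be suppressed
    return orth.pow(2).sum()          # squared norm as the penalty

# Hypothetical training objective:
#   loss = task_loss + lambda_reg * sum of asft_regularizer over tracked weight matrices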


🛠️ Setup

Environment Configuration

# Create conda environment
conda create -n AsFT python=3.9
conda activate AsFT
cd AsFT 

# Install dependencies
pip install -r requirements.txt

Model Preparation

# Create model storage directory (if needed)
mkdir -p ckpts/
Model             HuggingFace Link                 Notes
Llama-2-7B-Chat   TheBloke/Llama-2-7B-Chat-fp16    Safety-aligned model
Llama-2-7B-base   meta-llama/Llama-2-7b-hf         Base model
Beaver-Dam-7B     PKU-Alignment/beaver-dam-7b      Safety evaluation model

Note: Download the models listed in the table above to the ckpts/ folder.
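One way to do this is with the huggingface_hub Python package; the snippet below is a sketch (the meta-llama repo is gated and requires approved access plus an authentication token via huggingface-cli login):

# Sketch: download the models into ckpts/ (local paths match the table above)
from huggingface_hub import snapshot_download

models = {
    "TheBloke/Llama-2-7B-Chat-fp16": "ckpts/Llama-2-7B-Chat-fp16",
    "meta-llama/Llama-2-7b-hf": "ckpts/Llama-2-7b-hf",  # gated: requires approved access
    "PKU-Alignment/beaver-dam-7b": "ckpts/beaver-dam-7b",
}
for repo_id, local_dir in models.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)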

Directory Structure

AsFT/
├── ckpts/
│   ├── Llama-2-7B-Chat-fp16/
│   ├── Llama-2-7b-hf/
│   └── beaver-dam-7b/
├── configs/
├── ft_datasets/
└── ... (other project folders)

⚠️ Important Notes:

  • Llama-2 models require access approval on HuggingFace
  • All models should be placed under ckpts/
  • Use exact folder names as shown above

🚀 Training

Running Fine-tuning

Training scripts are organized by dataset under scripts/, supporting Agnews, Alpaca, GSM8K, and SST2.

Basic Training Commands

# Create the log directories first
mkdir -p finetuned_logs/{agnews,alpaca,gsm8k,SST2}

# For the Agnews dataset (default 1k_p_0.1 mode)
bash scripts/agnews/AsFT_reg1_p_0.1.sh > finetuned_logs/agnews/AsFT_reg1_p_0.1.log 2>&1 &

# Other datasets
bash scripts/alpaca/AsFT_reg1_p_0.1.sh > finetuned_logs/alpaca/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/gsm8k/AsFT_reg1_p_0.1.sh > finetuned_logs/gsm8k/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/SST2/AsFT_reg1_p_0.1.sh > finetuned_logs/SST2/AsFT_reg1_p_0.1.log 2>&1 &

Experimental Modes

Training behavior is configured via the --mode parameter. Edit --mode in the relevant .sh script to reproduce the different experimental setups described in the paper; the available modes are listed below, and a short sketch of the naming convention follows the table.

Mode         Description
1k_p_0       1k samples, 0% harmful data
1k_p_0.05    1k samples, 5% harmful data
1k_p_0.1     1k samples, 10% harmful data (default)
1k_p_0.15    1k samples, 15% harmful data
1k_p_0.2     1k samples, 20% harmful data
0.5k_p_0.1   500 samples, 10% harmful data
1.5k_p_0.1   1500 samples, 10% harmful data
2k_p_0.1     2000 samples, 10% harmful data
2.5k_p_0.1   2500 samples, 10% harmful data
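Each mode string encodes the sample count and the harmful-data fraction. The following hypothetical helper (not part of the repository) illustrates the convention:

# Mode naming convention: "<N>k_p_<r>" = N*1000 samples, fraction r harmful data
def parse_mode(mode: str) -> tuple[int, float]:
    size_part, ratio_part = mode.split("_p_")
    n_samples = int(float(size_part.rstrip("k")) * 1000)
    return n_samples, float(ratio_part)

assert parse_mode("1k_p_0.1") == (1000, 0.1)
assert parse_mode("0.5k_p_0.1") == (500, 0.1)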

📊 Evaluation

Poison Evaluation (Safety Assessment)

cd evaluation/poison_evaluation

# Run for Agnews
bash scripts/agnews/eval_agnews.sh > scripts/agnews/eval_agnews.log 2>&1 &

# Other datasets
bash scripts/alpaca/eval_alpaca.sh > scripts/alpaca/eval_alpaca.log 2>&1 &
bash scripts/gsm8k/eval_gsm8k.sh > scripts/gsm8k/eval_gsm8k.log 2>&1 &
bash scripts/SST2/eval_SST2.sh > scripts/SST2/eval_SST2.log 2>&1 &

Utility Evaluation (Task Performance)

# For Agnews
cd evaluation/utility_evaluation/agnews
bash scripts/eval.sh > scripts/eval.log 2>&1 &

# For GSM8K/SST2
cd ../gsm8k && bash scripts/eval.sh
cd ../SST2 && bash scripts/eval.sh

# Alpaca requires LLM-Judge
cd ../alpaca
# Follow instructions in the directory's README.md
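Under the hood, the evaluation scripts load a fine-tuned checkpoint and generate responses for scoring. The generic sketch below shows the idea with the transformers library; the checkpoint path is an assumption, not a path guaranteed by the repo:

# Generic sketch (not repo code): load a fine-tuned model and generate a response
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "finetuned_models/agnews/AsFT_reg1_p_0.1"  # hypothetical output path
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Classify the topic of this news article: ..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))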

📂 Project Structure

AsFT/
├── ckpts/                     # Model checkpoints
├── configs/                   # Training configurations
├── evaluation/
│   ├── poison_evaluation/     # Safety assessment scripts
│   └── utility_evaluation/    # Task performance evaluation
├── finetuned_logs/            # Training logs
├── finetuned_models/          # Fine-tuned model outputs
├── ft_datasets/               # Processed datasets
├── images/                    # Figures for documentation
├── scripts/
│   ├── agnews/                # Dataset-specific scripts
│   ├── alpaca/
│   ├── gsm8k/
│   └── SST2/
├── utils/                     # Utility functions
├── LICENSE
└── requirements.txt

🙏 Acknowledgment

This repository builds on several open-source projects, and we sincerely thank their authors: their work provided critical inspiration and technical references for this research. Special thanks to the LLM safety community for driving innovation in this field.

