[Paper] [Project Page] [BibTeX]
This repository contains code for the paper "Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness".
- The `CLIP_benchmark` folder contains zero-shot evaluation code for both the clean performance and the adversarial robustness of CLIP models.
- The `RobustVLM` folder contains the robustness evaluation code for VLM captioning and VQA, as well as VLM targeted attacks.
- The `Open-LLaVA-NeXT` folder contains the training code of $\Delta^2$ LLaVA.
This paper studies the robustness of vision-language models against adversarial visual perturbations, and introduces a novel double visual defense for improving it.
Rather than the lightweight adversarial fine-tuning of a pre-trained CLIP model used in previous works, we opt for adversarial vision-language pre-training on web-scale data.
We then add an extra layer of defense by introducing adversarial visual instruction tuning.
The models that result from each stage are denoted $\Delta$ CLIP and $\Delta^2$ LLaVA, respectively.
For evaluation, the released model weights can be downloaded from the links below:
| Model | Link |
|---|---|
| $\Delta$ CLIP | 🤗 HuggingFace Model |
| $\Delta^2$ LLaVA | 🤗 HuggingFace Model |
The installation process largely follows the LLaVA and Open-LLaVA-NeXT repos.
- Clone this repository and navigate to the Double_Visual_Defense folder

```bash
git clone https://github.com/zw615/Double_Visual_Defense.git
cd Double_Visual_Defense/Open-LLaVA-NeXT
```

- Install the package

```bash
conda create -n double_visual_defense python=3.10 -y
conda activate double_visual_defense
pip install -e .
```

- Install additional packages for training

```bash
pip install -e ".[train]"
yes | conda install -c conda-forge libstdcxx-ng==14.2.0
pip install flash-attn==2.6.3 --no-build-isolation
```
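Optionally, a quick sanity check (not part of the original instructions) that the package and the FlashAttention build import cleanly:

```bash
# If either import fails, revisit the steps above before launching training.
python -c "import llava, flash_attn; print(flash_attn.__version__)"
```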
The data preparation process of LLaVA-v1.5 largely follows the LLaVA repo.
The following presents a brief overview of how to train $\Delta^2$ LLaVA.
It is worth mentioning that, to train on fewer GPUs with less memory, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly.
Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
However, due to a known bug in the gradient accumulation step of earlier Transformers versions,
we also implemented a manual monkey patch specifically for Llama models; see `llama_grad_accum_monkey_patch.py`.
Upgrading to the latest Transformers version might resolve the issue, but we have not tested it.
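To make the global-batch-size bookkeeping concrete, here is a hedged example with hypothetical numbers (not the paper's configuration):

```bash
# On 8 GPUs, both flag combinations keep the global batch size at 256:
#   32 x 1 x 8 = 256   (larger per-device batches, more GPU memory)
#   16 x 2 x 8 = 256   (smaller per-device batches, more accumulation steps)
--per_device_train_batch_size 32 --gradient_accumulation_steps 1
--per_device_train_batch_size 16 --gradient_accumulation_steps 2
```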
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.
Each extra PGD forward-backward step adds roughly 1x the compute of vanilla LLaVA training, and the total training time increases accordingly. Note that pretraining takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px, and around 3.5 hours for LLaVA-v1.5-7B.
Training script with DeepSpeed ZeRO-2: adv_pretrain_template_multinode.sh.
New options to note (an illustrative launch sketch follows the list):

- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--pretrain_vision_tower /path/to/delta_clip_h14_336.pt`: $\Delta$ CLIP-H/14-336.
- `--image_processor_name_or_path ./clip_preprocess/open_clip_336/preprocessor_config.json`: preprocess config of $\Delta$ CLIP-H/14-336.
- `--epsilon`: the PGD attack radius.
- `--step_size`: the PGD attack step size.
- `--num_steps`: the number of PGD attack steps.
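As a rough orientation only, the sketch below shows how these options might be combined in a single-node launch; the entry point, attack values, and all other hyperparameters are illustrative assumptions, and `adv_pretrain_template_multinode.sh` remains the authoritative script.

```bash
# Illustrative sketch only -- every value below is an assumption, not the paper's setting.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version plain \
    --data_path ./playground/data/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/images \
    --tune_mm_mlp_adapter True \
    --mm_projector_type mlp2x_gelu \
    --pretrain_vision_tower /path/to/delta_clip_h14_336.pt \
    --image_processor_name_or_path ./clip_preprocess/open_clip_336/preprocessor_config.json \
    --epsilon 8 \
    --step_size 2 \
    --num_steps 3 \
    --bf16 True \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --output_dir ./checkpoints/delta2-llava-7b-adv-pretrain
```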
- Prepare data
Please download the annotation of the final mixture LLaVA-v1.5 instruction tuning data llava_v1_5_mix665k.json, and download the images from constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script, we save all files as `.jpg` (a conversion sketch follows this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
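In case the download script leaves OCR-VQA images in mixed formats, one possible way to normalize them to `.jpg` is sketched below; the target directory and the use of ImageMagick are assumptions, not part of this repo.

```bash
# Convert any non-JPEG OCR-VQA images to .jpg in place (requires ImageMagick).
cd ./playground/data/ocr_vqa/images
for f in *.png *.gif; do
    [ -e "$f" ] || continue                    # skip patterns that match nothing
    convert "$f[0]" "${f%.*}.jpg" && rm "$f"   # [0] keeps only the first frame of GIFs
done
```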
After downloading all of them, organize the data as follows in `./playground/data`:

```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
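As a convenience (not part of the original instructions), the empty directory skeleton can be created up front and the downloaded image folders moved or symlinked into place:

```bash
# Creates the expected layout under ./playground/data; populate each leaf
# directory with the corresponding downloaded images afterwards.
mkdir -p ./playground/data/{coco/train2017,gqa/images,ocr_vqa/images,textvqa/train_images,vg/VG_100K,vg/VG_100K_2}
```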
- Start training!
It is not recommended to use any clean pretrained LLaVA projectors as they are not adversarially robust.
Due to memory cost, we default to the Low-Rank Adaptation (LoRA) technique in the adversarial visual instruction tuning stage.
Each extra PGD forward-backward step adds roughly 1x the compute of vanilla LLaVA training, and the total training time increases accordingly. Note that visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px, and around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).
Training script with DeepSpeed ZeRO-2: adv_finetune_lora_template_multinode.sh.
New options to note (an illustrative launch sketch follows the list):

- `--unfreeze_mm_vision_tower True`: finetune the vision tower.
- `--mm_vision_tower_lr 1e-6`: learning rate of the vision tower.
- `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language data (e.g., ShareGPT) and multimodal data (e.g., LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25% without affecting the final outcome.
- `--epsilon`: the PGD attack radius.
- `--step_size`: the PGD attack step size.
- `--num_steps`: the number of PGD attack steps.
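Again as a rough orientation, the LoRA-specific and adversarial flags might be combined as sketched below; all values are illustrative assumptions, and `adv_finetune_lora_template_multinode.sh` remains the authoritative script.

```bash
# Illustrative sketch only -- values are assumptions, not the paper's settings.
# (The vision-tower and image-processor flags from the pretraining stage are
#  omitted here for brevity.)
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True --lora_r 128 --lora_alpha 256 \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --pretrain_mm_mlp_adapter ./checkpoints/delta2-llava-7b-adv-pretrain/mm_projector.bin \
    --unfreeze_mm_vision_tower True \
    --mm_vision_tower_lr 1e-6 \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --epsilon 8 --step_size 2 --num_steps 3 \
    --output_dir ./checkpoints/delta2-llava-7b-lora
```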
In our experiments, we opted for the awesome lmms-eval toolkit for efficient clean-performance evaluation. You may also use the evaluation scripts from the original LLaVA and Open-LLaVA-NeXT repos.
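For reference, a typical lmms-eval launch looks roughly like the sketch below; the checkpoint path, task choice, and model arguments are placeholders based on the lmms-eval documentation and may need adjustment for our checkpoints.

```bash
# Placeholder invocation; consult the lmms-eval docs for supported tasks and
# the exact model_args required by your checkpoint.
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained=/path/to/delta2_llava_8/weight \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```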
It should be relatively easy to adapt to those repositories, since our trained LLaVA models can be loaded with the following sample code:

```python
from llava.model.builder import load_pretrained_model

model_path = '/path/to/delta2_llava_8/weight'
model_base = 'lmsys/vicuna-7b-v1.5'
model_name = 'llava-v1.5-7b'
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, model_base, model_name)
```
This repo is built upon LLaVA, Open-LLaVA-NeXT, CLIP_benchmark, RobustVLM. Huge thanks for their great contribution to the open-source community!
The code of this site is released under the MIT license.
LLNL-Code-2002980
We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for partially supporting our computing needs. Cihang Xie is partially supported by a gift from Open Philanthropy. This work is partially based upon work supported by the National Center for Transportation Cybersecurity and Resiliency (TraCR) (a U.S. Department of Transportation National University Transportation Center) headquartered at Clemson University, Clemson, South Carolina, USA. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of TraCR, and the U.S. Government assumes no liability for the contents or use thereof.
If you find this repository useful, please consider citing our paper:
```bibtex
@article{wang2025double,
  title={Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness},
  author={Wang, Zeyu and Xie, Cihang and Bartoldson, Brian and Kailkhura, Bhavya},
  journal={arXiv preprint arXiv:2501.09446},
  year={2025}
}
```

