Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, and Limin Wang.
We present p-MoD, a series of efficient MLLMs that feature:
- ✂️ Mixture-of-Depths mechanism, upgraded with tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
- 🎢 Progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer.
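The two ideas above can be illustrated with a toy sketch. It is not the repository's implementation: the cosine shape of the schedule, the `start`/`end` ratios, the gating factor `alpha`, and the function names `retention_ratio` and `select_tokens` are all assumptions made for illustration, and STRing is not modeled. The sketch only shows how a per-layer retention ratio can shrink with depth, and how a router could keep the top-scoring tokens and assign each a tanh-gated weight.

```python
import math


def retention_ratio(layer_idx, num_layers, start=1.0, end=0.1):
    """Hypothetical schedule: cosine-shaped decay from `start` to `end` over depth."""
    if num_layers < 2:
        return start
    progress = layer_idx / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def select_tokens(router_logits, ratio, alpha=0.2):
    """Keep the top `ratio` fraction of tokens by router logit.

    Returns {token_index: weight}, where weight = alpha * tanh(logit),
    so every weight lies strictly inside (-alpha, alpha).
    """
    num_tokens = len(router_logits)
    num_keep = max(1, int(round(ratio * num_tokens)))
    ranked = sorted(range(num_tokens), key=lambda i: router_logits[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore original token order
    return {i: alpha * math.tanh(router_logits[i]) for i in kept}


if __name__ == "__main__":
    num_layers = 8
    logits = [0.5, -1.0, 2.0, 0.1, -0.3, 1.2, 0.0, -2.0]  # made-up router outputs
    for layer in range(num_layers):
        ratio = retention_ratio(layer, num_layers)
        kept = select_tokens(logits, ratio)
        print(f"layer {layer}: ratio={ratio:.2f}, kept tokens={list(kept)}")
```

In a Mixture-of-Depths layer, tokens that are not selected typically bypass the layer through the residual connection, which is where the compute and KV cache savings come from.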
p-MoD matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.
- Clone this repository and navigate to the folder
git clone https://github.com/MCG-NJU/p-MoD.git
cd p-MoD
- Install packages
conda create -n p-mod python=3.10 -y
conda activate p-mod
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e lmms-eval
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
- Log in to Hugging Face and wandb
huggingface-cli login
wandb login
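After the installation steps, a quick check can confirm that the main dependencies are importable before launching a long job. This is a minimal sketch: the helper `missing_modules` is hypothetical, and the import names in `required` are assumptions about what the installed packages expose, so adjust them if your environment differs.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    required = ["torch", "lmms_eval", "flash_attn", "wandb"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```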
Model | LLM | Epoch | Pretrain Data | SFT Data |
---|---|---|---|---|
p-MoD-LLaVA-NeXT-7B | Vicuna-7B | 1 | 558K | 779K |
p-MoD-LLaVA-v1.5-7B | Vicuna-7B | 1 | 558K | 665K |
We evaluate our models using lmms-eval. You can use our script ./scripts/lmms-eval/eval.sh, for example:
bash ./scripts/lmms-eval/eval.sh \
--ckpt MCG-NJU/p-MoD-LLaVA-NeXT-7B \
--eval_tasks ai2d,chartqa \
--project_name pmod \
--run_name pmod-llava-next-7b-ft
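If you want to launch the evaluation from Python, for instance to loop over several task sets, the command above can be assembled programmatically. This is a sketch only: `build_eval_command` is a hypothetical helper, and it uses just the four flags that appear in the example command.

```python
import shlex


def build_eval_command(ckpt, tasks, project_name, run_name,
                       script="./scripts/lmms-eval/eval.sh"):
    """Assemble the eval.sh invocation as an argument list for subprocess."""
    return [
        "bash", script,
        "--ckpt", ckpt,
        "--eval_tasks", ",".join(tasks),
        "--project_name", project_name,
        "--run_name", run_name,
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        ckpt="MCG-NJU/p-MoD-LLaVA-NeXT-7B",
        tasks=["ai2d", "chartqa"],
        project_name="pmod",
        run_name="pmod-llava-next-7b-ft",
    )
    print(shlex.join(cmd))
    # To actually run it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```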
We use the pretrained MLP projector provided by LLaVA, which can be downloaded here. Place the downloaded model weights under ./checkpoints/llava-v1.5-7b-pretrain/llava-official-checkpoint.
First, prepare the data with our Python script ./util_scripts/download_llava-next_data.py. This script downloads the 779K LLaVA-NeXT data, saves the images under ./playground/data/llava_next_images/, and writes the data JSON to ./playground/data/llava_next_data.json.
Then you can start training using ./scripts/train/finetune_eval_7b_pmod_llava_next.sh.
First, prepare the instruction tuning data following LLaVA-1.5. Download the images from the constituent datasets and the dataset annotation file llava_v1_5_mix665k.json. Save the images and the JSON file under ./playground/data.
Then, fix some broken examples in the data JSON by running the script:
python util_scripts/clean_data_json.py \
--original_json_path ./playground/data/llava_v1_5_mix665k.json \
--cleaned_json_path ./playground/data/llava_v1_5_mix665k_cleaned.json
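Before training, it can be useful to confirm that every image referenced by the cleaned JSON is actually present on disk. The sketch below is not part of the repository and does not reproduce what clean_data_json.py does. It assumes the LLaVA-1.5 annotation layout, a JSON list of records where image samples carry an "image" field holding a path relative to the image root. The helper name `find_missing_images` is hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(json_path, image_root):
    """Return the relative image paths listed in the JSON that are absent under image_root."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    root = Path(image_root)
    missing = []
    for record in records:
        image = record.get("image")  # text-only samples have no "image" field
        if image is not None and not (root / image).is_file():
            missing.append(image)
    return missing


if __name__ == "__main__":
    json_path = Path("./playground/data/llava_v1_5_mix665k_cleaned.json")
    if json_path.exists():
        missing = find_missing_images(json_path, "./playground/data")
        print(f"{len(missing)} referenced images are missing")
        for path in missing[:10]:  # show only the first few
            print("  ", path)
    else:
        print(f"{json_path} not found; run the cleaning script first.")
```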
Start training with ./scripts/train/finetune_eval_7b_pmod_llava_1_5.sh.
If you find our work helpful for your research and applications, please cite our paper:
@article{zhang2024pmod,
title={p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay},
author={Zhang, Jun and Meng, Desen and Qi, Ji and Huang, Zhenpeng and Wu, Tao and Wang, Limin},
journal={arXiv preprint arXiv:2412.04449},
year={2024}
}
- LLaVA and LLaVA-NeXT: The codebases we built upon.
- lmms-eval: We use this amazing framework for evaluation.