
ValerianFourel/MiniGPT-4

MiniGPTFace

Training of MiniGPT-4

The training of MiniGPT-4 contains two alignment stages.

1. First pretraining stage

In the first pretraining stage, the model is trained using image-text pairs from the Laion and CC datasets to align the vision and language models. To download and prepare the datasets, please check our first stage dataset preparation instruction. After the first stage, the visual features are mapped and can be understood by the language model. To launch the first stage training, run the following command. In our experiments, we use 4 A100 GPUs. You can change the save path in the config file train_configs/minigpt4_stage1_pretrain.yaml.

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
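If you prefer to launch training from a Python script, the command above can be assembled programmatically. This is a minimal sketch; the helper name `build_torchrun_command` is hypothetical and not part of the repository.

```python
import shlex


def build_torchrun_command(num_gpus, cfg_path):
    """Return the torchrun argument list for a GPU count and a config file."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nproc-per-node", str(num_gpus),
        "train.py",
        "--cfg-path", cfg_path,
    ]


# 4 GPUs mirrors the setup described above; adjust to your machine.
cmd = build_torchrun_command(4, "train_configs/minigpt4_stage1_pretrain.yaml")
print(shlex.join(cmd))
# To actually start training, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems if the config path contains spaces.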

A MiniGPT-4 checkpoint with only stage one training can be downloaded here (13B) or here (7B). Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.

2. Second finetuning stage

In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves and convert it to a conversation format to further align MiniGPT-4. To download and prepare our second stage dataset, please check our second stage dataset preparation instruction. To launch the second stage alignment, first specify the path to the checkpoint file trained in stage 1 in train_configs/minigpt4_stage2_finetune.yaml. You can also specify the output path there. Then, run the following command. In our experiments, we use 1 A100 GPU.

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml

After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly way.

Finetune of MiniGPT-4 Part 2

First, prepare the dataset by following the steps in our dataset preparation.

In the train_configs/minigptv2_finetune.yaml, you need to set up the following paths:

llama_model checkpoint path: "/path/to/llama_checkpoint"

ckpt: "/path/to/pretrained_checkpoint"

ckpt save path: "/path/to/save_checkpoint"
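Before launching a run, it can help to confirm that none of these placeholder values were left in the config. The sketch below scans the YAML text with a regular expression so that it needs no third-party parser. The key name `output_dir`, the `model:`/`run:` nesting, and the `/data/checkpoints/stage3.pth` path are assumptions made for illustration; check them against your copy of the config file.

```python
import re

# Matches lines such as:   llama_model: "/path/to/llama_checkpoint"
PLACEHOLDER = re.compile(
    r"""^[ \t]*(\w+)[ \t]*:[ \t]*["']?(/path/to/[^"'\s]*)""",
    re.MULTILINE,
)


def find_placeholder_paths(config_text):
    """Return (key, value) pairs whose value still starts with /path/to/."""
    return PLACEHOLDER.findall(config_text)


# Illustrative config fragment; the layout and the ckpt path are hypothetical.
example = '''
model:
  llama_model: "/path/to/llama_checkpoint"
  ckpt: "/data/checkpoints/stage3.pth"
run:
  output_dir: "/path/to/save_checkpoint"
'''

for key, value in find_placeholder_paths(example):
    print(f"{key} is still a placeholder: {value}")
```

In practice you would read the text with `open("train_configs/minigptv2_finetune.yaml").read()` and stop if the returned list is not empty.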

For ckpt, you may load from our pretrained model checkpoints:

| MiniGPT-v2 (after stage-2) | MiniGPT-v2 (after stage-3) | MiniGPT-v2 (online developing demo) |
|---|---|---|
| Download | Download | Download |

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigptv2_finetune.yaml

MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning

MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models

Deyao Zhu*, Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

*equal contribution

King Abdullah University of Science and Technology

Modifications made

We made the following modifications to the original repository.

Inference files for the command line:

  • demo_v2_command_line.py
  • demo_v1_command_line.py

We add dataset files inside ~/FaceGPT/Archive/MiniGPT-4/minigpt4/datasets/datasets:

  • sharegpt_dataset.py
  • chatgpt4vision_datasets.py

We add training config files inside ~/FaceGPT/Archive/MiniGPT-4/train_configs:

  • minigpt4_stage2_finetune_gpt4vision.yaml
  • minigptv2_finetune_gpt4vision.yaml

We add folders that hold the configuration files pointing to the training data paths, inside ~/FaceGPT/Archive/MiniGPT-4/minigpt4/configs/datasets/:

  • chatgpt4vision
  • sharegpt

Inside the file MiniGPT-4/minigpt4/datasets/builders/image_text_pair_builder.py:

  • we need a wrapper builder class for each of the finetuning datasets (ShareGPT_Face and FaceGPT4Vision)

We then put our data inside /fast/vfourel/FaceGPT/Data/MiniFaceGPT4Data

For each dataset:

  • We have the ShareGPT data, with images annotated by GPT4Vision and/or the ShareCaptioner from ShareGPT. Number of images in each file:

      • subset_objects_gpt4vision_100k.json: 19514 images
      • subset_objects_share-captioner_coco_lcs_sam_1246k_1107.json: 17754 images

    Number of overlapping image paths: 17759
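Counts like these can be reproduced with a short script. The sketch below assumes each annotation file is a JSON list of records with an `image` key holding the image path; that key name and the toy records are assumptions, not taken from the files themselves.

```python
import json


def load_records(path):
    """Load a JSON annotation file that is assumed to contain a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def image_paths(records, key="image"):
    """Collect the unique image paths found in a list of records."""
    return {record[key] for record in records if key in record}


def overlap_report(records_a, records_b):
    """Count unique paths found only in A, only in B, and in both."""
    a, b = image_paths(records_a), image_paths(records_b)
    return {"only_a": len(a - b), "only_b": len(b - a), "both": len(a & b)}


# Toy records standing in for the two annotation files.
gpt4v = [{"image": "coco/1.jpg"}, {"image": "coco/2.jpg"}, {"image": "sam/3.jpg"}]
captioner = [{"image": "coco/2.jpg"}, {"image": "sam/3.jpg"}, {"image": "sam/4.jpg"}]
print(overlap_report(gpt4v, captioner))
```

Because this version counts unique paths, the overlap can never exceed the number of unique paths in the smaller file. A larger figure usually means that records rather than unique paths were counted, or that paths were normalized differently.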

  • We have the GPT4VisionDataset, which we collected by sending images to OpenAI's API for annotation. Object counts per subsection in gpt4VisionCalls_Cleaned_BySection.json:

      • Pre-Prompt: 4 objects
      • Facial Expression due to contextual cues: 1497 objects
      • Facial Characteristics Long: 1511 objects
      • Factual Generalization: 1246 objects
      • Mouvement Description: 349 objects
      • Interpretative Generalization: 1346 objects
      • Intensity of the emotion: 1584 objects
      • Posture: 739 objects
      • Race: 1188 objects
      • Race and Gender: 192 objects
      • Facial Characteristics Short: 196 objects
      • Gender: 1205 objects
      • Human or not: 1182 objects
      • Feature by Feature Description: 766 objects
      • Fitzpatrick: 94 objects
      • Multi-Person Interaction: 0 objects
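A per-section tally of this kind takes only a few lines. The sketch assumes the JSON file is a dictionary mapping each section name to a list of annotation objects; that layout and the toy entries are guesses made for illustration.

```python
def count_by_section(sections):
    """Map each section name to the number of objects it contains."""
    return {name: len(items) for name, items in sections.items()}


# Toy stand-in for the parsed JSON file (section name -> list of objects).
sections = {
    "Posture": [{"image": "a.jpg"}, {"image": "b.jpg"}],
    "Gender": [{"image": "a.jpg"}],
    "Multi-Person Interaction": [],
}

counts = count_by_section(sections)
for name, n in counts.items():
    print(f"{name}: {n} objects")
print("total:", sum(counts.values()))
```

Empty sections are kept in the result, which makes gaps such as a section with 0 objects easy to spot.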

The train loop is executed inside of MiniGPT-4/minigpt4/task/base_task.py

We modify the file MiniGPT-4/train_configs/minigptv2_finetune_gpt4vision.yaml to lower the number of workers for training to 1 and to raise the number of gradient accumulation steps from 1 to 8.
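Gradient accumulation trades time for memory: with a batch size of 1 and 8 accumulation steps, the optimizer updates once every 8 samples, so each update behaves like a batch of 8 per GPU while only one sample is held in memory at a time. The sketch below shows the idea in plain Python on a one-parameter model; it is an illustration of the technique, not the repository's training loop.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_training(w, samples, lr, accum_steps):
    """Process samples one at a time, updating w once every accum_steps samples."""
    accumulated = 0.0
    for i, (x, y) in enumerate(samples, start=1):
        # Dividing by accum_steps turns the running sum into a mean gradient.
        accumulated += grad(w, x, y) / accum_steps
        if i % accum_steps == 0:
            w -= lr * accumulated
            accumulated = 0.0
    return w


samples = [(float(i), 3.0 * i) for i in range(1, 9)]  # 8 samples of y = 3x
w0, lr = 0.0, 0.01

# One update from 8 micro-batches of size 1 ...
w_accumulated = accumulated_training(w0, samples, lr, accum_steps=8)
# ... versus one update from a single batch of 8.
mean_gradient = sum(grad(w0, x, y) for x, y in samples) / len(samples)
w_full_batch = w0 - lr * mean_gradient

print(w_accumulated, w_full_batch)
```

Both updates land on the same weight, which is why raising the accumulation steps can compensate for a small per-step batch size.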

In the file MiniGPT-4/minigpt4/models/minigpt_base.py, we modify the return value to {"loss": loss, "outputs": outputs} (modifications by VF), so that outputs can be deleted to save RAM. We then delete it in MiniGPT-4/minigpt4/task/base_task.py:

  • we modify the function train_step(self, model, samples)
  • _train_inner_loop
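The pattern described above can be sketched as follows. `DummyModel` is a hypothetical stand-in for the real model, and the function is a simplified free function rather than the repository's actual `train_step` method.

```python
class DummyModel:
    """Hypothetical model returning a loss together with bulky outputs."""

    def __call__(self, samples):
        outputs = [s * 2 for s in samples]
        return {"loss": sum(outputs) / len(outputs), "outputs": outputs}


def train_step(model, samples):
    """Run one forward pass and keep only the loss."""
    result = model(samples)
    loss = result["loss"]
    # Drop the reference to the outputs; the memory is reclaimed once
    # nothing else still points to them.
    result.pop("outputs", None)
    return loss


print(train_step(DummyModel(), [1.0, 2.0, 3.0]))
```

Dropping the reference only frees memory when no other variable, such as a logging buffer or a list of past results, still holds the same object.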

As a last attempt, we also export the following environment variable: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
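The same allocator setting can be applied from Python instead of the shell, as long as it happens before CUDA is first used. A minimal sketch, assuming it is placed at the very top of the launch script:

```python
import os

# Set the allocator option before torch is imported so that it is in place
# when the CUDA caching allocator initializes. setdefault keeps any value
# that was already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The `max_split_size_mb` option stops the allocator from splitting blocks larger than the given size, which can reduce fragmentation-related out-of-memory errors at some cost in allocation flexibility.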

Dataset Configuration

This project uses three datasets for training the MiniGPT-v2 model: sharegpt_detail, gpt4visionface_detail, and RealisticEmotions_detail. Below is an overview of their configurations and sampling strategies.

Datasets

sharegpt_detail

  • Batch Size: 1
  • Vision Processor: blip2_image_train (image size: 448)
  • Text Processor: blip_caption
  • Sample Ratio: 30
  • Notes: This dataset uses a sample ratio of 30, which is considered the canonical value for training. This ratio has been established as a stable and effective choice for leveraging the dataset's content in the pretraining and fine-tuning phases.

gpt4visionface_detail

  • Batch Size: 1
  • Vision Processor: blip2_image_train (image size: 448)
  • Text Processor: blip_caption
  • Sample Ratio: 10
  • Notes: With a sample ratio of 10, this dataset also adopts a canonical sampling rate. This value balances its contribution to the training process, ensuring it complements sharegpt_detail without overwhelming the model.

RealisticEmotions_detail

  • Batch Size: 1
  • Vision Processor: blip2_image_train (image size: 448)
  • Text Processor: blip_caption
  • Sample Ratio: 20
  • Notes: This dataset is currently in a trial phase. The sample ratio of 20 is experimental and subject to adjustment as we evaluate its impact on model performance. Unlike the canonical ratios for sharegpt_detail and gpt4visionface_detail, this value is still under review.

Sampling Strategy

The sample_ratio parameter determines how frequently each dataset is sampled during training relative to a baseline rate. The canonical ratios (30 for sharegpt_detail and 10 for gpt4visionface_detail) have been tested and validated for optimal performance in our workflow. Meanwhile, RealisticEmotions_detail is being trialed with a ratio of 20 to assess its effectiveness, and further tuning may occur based on experimental results. For more details on the training configuration, refer to the main configuration file.
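To see what these ratios mean in practice, they can be normalized into per-dataset sampling probabilities. The sketch below assumes the ratios act as relative weights when choosing which dataset supplies the next batch; it illustrates the arithmetic and is not the repository's data loader.

```python
import random
from collections import Counter

ratios = {
    "sharegpt_detail": 30,
    "gpt4visionface_detail": 10,
    "RealisticEmotions_detail": 20,
}


def sampling_probabilities(ratios):
    """Normalize relative sample ratios so that they sum to 1."""
    total = sum(ratios.values())
    return {name: ratio / total for name, ratio in ratios.items()}


probs = sampling_probabilities(ratios)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")

# Simulate which dataset each of 6000 batches would be drawn from.
rng = random.Random(0)
names = list(ratios)
draws = Counter(rng.choices(names, weights=[ratios[n] for n in names], k=6000))
print(draws.most_common())
```

Under this assumption, half of the batches come from sharegpt_detail, one third from RealisticEmotions_detail, and one sixth from gpt4visionface_detail, so changing one ratio shifts the share of every dataset.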

💡 Get help - Q&A or Discord 💬

Example Community Efforts Built on Top of MiniGPT-4


  • InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4, Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun, Arxiv, 2023

  • PatFig: Generating Short and Long Captions for Patent Figures, Dana Aubakirova, Kim Gerdes, Lufei Liu, ICCVW, 2023

  • SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model, Juexiao Zhou, Xiaonan He, Liyuan Sun, Jiannan Xu, Xiuying Chen, Yuetan Chu, Longxi Zhou, Xingyu Liao, Bin Zhang, Xin Gao, Arxiv, 2023

  • ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4, Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, Kun Wang, Arxiv, 2023

News

[Oct.31 2023] We release the evaluation code of our MiniGPT-v2.

[Oct.24 2023] We release the finetuning code of our MiniGPT-v2.

[Oct.13 2023] Breaking! We release the first major update with our MiniGPT-v2

[Aug.28 2023] We now provide a llama 2 version of MiniGPT-4

Online Demo

Click the image to chat with MiniGPT-v2 around your images.

Click the image to chat with MiniGPT-4 around your images.

MiniGPT-v2 Examples

MiniGPT-v2 demos

MiniGPT-4 Examples

[Example images: find wild, write story, solve problem, write poem]

More examples can be found in the project page.

Getting Started

Installation

1. Prepare the code and the environment

Clone our repository, then create a Python environment and activate it with the following commands:

git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigptv

2. Prepare the pretrained LLM weights

MiniGPT-v2 is based on Llama2 Chat 7B. For MiniGPT-4, we have both Vicuna V0 and Llama 2 versions. Download the corresponding LLM weights from the following Hugging Face space by cloning the repository with git-lfs.

| Llama 2 Chat 7B | Vicuna V0 13B | Vicuna V0 7B |
|---|---|---|
| Download | Download | Download |

Then, set the variable llama_model in the model config file to the LLM weight path.

  • For MiniGPT-v2, set the LLM path here at Line 14.

  • For MiniGPT-4 (Llama2), set the LLM path here at Line 15.

  • For MiniGPT-4 (Vicuna), set the LLM path here at Line 18.

3. Prepare the pretrained model checkpoints

Download the pretrained model checkpoints

| MiniGPT-v2 (after stage-2) | MiniGPT-v2 (after stage-3) | MiniGPT-v2 (online developing demo) |
|---|---|---|
| Download | Download | Download |

For MiniGPT-v2, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigptv2_eval.yaml at Line 8.

| MiniGPT-4 (Vicuna 13B) | MiniGPT-4 (Vicuna 7B) | MiniGPT-4 (LLaMA-2 Chat 7B) |
|---|---|---|
| Download | Download | Download |

For MiniGPT-4, set the path to the pretrained checkpoint in the evaluation config file: eval_configs/minigpt4_eval.yaml at Line 8 for the Vicuna version, or eval_configs/minigpt4_llama2_eval.yaml for the Llama 2 version.

Launching Demo Locally

For MiniGPT-v2, run

python demo_v2.py --cfg-path eval_configs/minigptv2_eval.yaml  --gpu-id 0

For MiniGPT-4 (Vicuna version), run

python demo.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0

For MiniGPT-4 (Llama2 version), run

python demo.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml  --gpu-id 0

To save GPU memory, the LLM loads in 8-bit by default, with a beam search width of 1. This configuration requires about 23G of GPU memory for the 13B LLM and 11.5G for the 7B LLM. On more powerful GPUs, you can run the model in 16-bit by setting low_resource to False in the relevant config file.

Thanks @WangRongsheng, you can also run MiniGPT-4 on Colab

Training

For training details of MiniGPT-4, check here.

For finetuning details of MiniGPT-v2, check here.

Evaluation

For evaluation details of MiniGPT-v2, check here.

Acknowledgement

  • BLIP2 The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you are not already familiar with it!
  • Lavis This repository is built upon Lavis!
  • Vicuna The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!
  • LLaMA The strong open-sourced LLaMA 2 language model.

If you're using MiniGPT-4/MiniGPT-v2 in your research or applications, please cite using this BibTeX:

@article{chen2023minigptv2,
      title={MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning}, 
      author={Chen, Jun and Zhu, Deyao and Shen, Xiaoqian and Li, Xiang and Liu, Zechu and Zhang, Pengchuan and Krishnamoorthi, Raghuraman and Chandra, Vikas and Xiong, Yunyang and Elhoseiny, Mohamed},
      year={2023},
      journal={arXiv preprint arXiv:2310.09478},
}

@article{zhu2023minigpt,
  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2304.10592},
  year={2023}
}

License

This repository is under the BSD 3-Clause License. Much of the code is based on Lavis, which is under the BSD 3-Clause License here.
