This is a visual-text pair dataset generated synergistically by a text-to-image model and a multimodal large language model.
The accompanying paper was accepted as a full paper at the GLOW Workshop at IJCAI 2024.
This research generates data collaboratively using multimodal large language models (MLLMs), large language models (LLMs), and a text-to-image model. Through interactions among these models, we automatically generate a diverse visual-text pair dataset.
- Initially, a conventional large language model is employed to generate a primary narrative, which is then paired with an image produced by the text-to-image model.
- Subsequently, this image is described by the multimodal large language model, and this new narrative is relayed to the text-to-image model to produce a subsequent image, establishing an iterative data generation cycle.
- Through numerous cycles of collaborative generation, we have accumulated a substantial set of text-image pairs.
- Using a multimodal large language model (MLLM) or a large language model (LLM), a simple initial description is randomly generated from a fixed prompt.
- The description is then given to a text-to-image model (G) to generate a corresponding image (M).
- The image and a fixed instruction (I) are then given to the MLLM (F) to generate a corresponding variant description ($D^{variant}$). The variant description is then used by the text-to-image model to generate a new image (back to step 2).
- Steps 2-4 are repeated to generate many image-description pairs; the loop stops when the number of iterations reaches the preset maximum (see the sketch below).
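The loop below is a minimal Python sketch of steps 1-4. The three callables (`llm`, `g`, `f`) are hypothetical wrappers for the initial-description model, the text-to-image model G, and the MLLM F; they are placeholders, not part of any specific library.

```python
from typing import Any, Callable, List, Tuple

def generate_pairs(
    llm: Callable[[str], str],      # hypothetical wrapper: fixed prompt -> initial description
    g: Callable[[str], Any],        # hypothetical wrapper around the text-to-image model G
    f: Callable[[Any, str], str],   # hypothetical wrapper around the MLLM F
    fixed_prompt: str,              # fixed prompt for the initial description (step 1)
    instruction: str,               # fixed instruction I (step 3)
    num_initial: int,               # how many initial descriptions to seed
    max_iterations: int,            # stopping criterion: maximum iterations per seed
) -> List[Tuple[Any, str]]:
    """Collect (image, description) pairs via the iterative cycle above."""
    pairs: List[Tuple[Any, str]] = []
    for _ in range(num_initial):
        description = llm(fixed_prompt)          # step 1: simple initial description
        for _ in range(max_iterations):
            image = g(description)               # step 2: render image M
            pairs.append((image, description))
            description = f(image, instruction)  # step 3: D^variant seeds step 2 again
    return pairs
```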
We trained LLaVA-v1.3-vicuna-7b (a multimodal LLM with fewer parameters than the models used for generation) on subsets of our dataset ranging from 1,000 to 7,000 instances. To assess performance after training, we computed the mean BERTScore (recall), BLEU, and ROUGE-L scores of the model's descriptions of 100 images, using GPT-4's outputs as references.
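Below is a minimal sketch of how such an evaluation could be computed, assuming the `bert-score`, `sacrebleu`, and `rouge-score` packages; the paper's exact configuration may differ, and `model_outputs` / `gpt4_references` are hypothetical names for the trained model's captions and GPT-4's descriptions of the same 100 images.

```python
from typing import Dict, List

import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def evaluate(model_outputs: List[str], gpt4_references: List[str]) -> Dict[str, float]:
    # BERTScore: the recall component is what we report.
    _, recall, _ = bert_score(model_outputs, gpt4_references, lang="en")
    # Corpus-level BLEU against the GPT-4 reference descriptions.
    bleu = sacrebleu.corpus_bleu(model_outputs, [gpt4_references])
    # Mean ROUGE-L F-measure over the evaluation set.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(ref, out)["rougeL"].fmeasure
        for ref, out in zip(gpt4_references, model_outputs)
    ) / len(gpt4_references)
    return {
        "bertscore_recall_mean": recall.mean().item(),
        "bertscore_recall_std": recall.std().item(),
        "bleu": bleu.score,
        "rougeL": rouge_l,
    }
```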
The mean BERTScore, BLEU, and ROUGE-L scores all increase with dataset size, indicating a positive correlation between dataset size and the models' descriptive capability. In addition, the standard deviation of the BERTScore decreases with larger datasets, suggesting that model performance becomes more consistent.
During our experiments, we found that generating too many initial descriptions at once yields poor results, so for each topic the initial descriptions are generated in multiple batches. The final dataset size is therefore: (number of batches) × (number of initial descriptions per batch) × (number of variant-generation iterations per initial description).
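For instance, with hypothetical settings of 5 batches, 20 initial descriptions per batch, and 10 variant iterations per description, the pipeline would yield 5 × 20 × 10 = 1,000 image-description pairs.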
```bibtex
@incollection{huang2024synergistic,
author = {Mao Xun Huang and Hen-Hsen Huang},
title = {Integrating Text-to-Image and Vision Language Models for Synergistic Dataset Generation: The Creation of Synergy-General-Multimodal Pairs},
booktitle = {Generalizing from Limited Resources in the Open World},
editor = {Jinyang Guo and Yuqing Ma and Yifu Ding and Ruihao Gong and Xingyu Zheng and Changyi He and Yantao Lu and Xianglong Liu},
series = {Communications in Computer and Information Science},
volume = {2160},
pages = {147--161},
year = {2024},
publisher = {Springer},
address = {Singapore},
doi = {10.1007/978-981-97-6125-8_12},
url = {https://link.springer.com/chapter/10.1007/978-981-97-6125-8_12}
}
```