Multimodal large language models (MLLMs) have advanced rapidly and now achieve strong performance on a wide range of multimodal benchmarks. However, data contamination during training makes performance evaluation and comparison unreliable. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs because of their multiple modalities and multiple training phases. We therefore introduce MM-Detect, a multimodal data contamination detection framework. In addition, we employ a heuristic method to determine whether the contamination originates from the pre-training phase of the underlying LLMs.
```bash
git clone https://github.com/FreedomIntelligence/MM-Detect.git
cd MM-Detect
conda create -n MM-Detect python=3.11.8
conda activate MM-Detect
pip install torch==2.1.2
pip install -r requirements.txt
```
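Optionally, a quick sanity check that the environment is set up as intended (this snippet is ours, assuming the commands above ran inside the activated conda environment):

```python
# Quick environment check: confirms the pinned PyTorch version installed correctly.
import torch

print(torch.__version__)          # expected: 2.1.2, matching the pinned install above
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible
```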
Our codebase supports the following models on ScienceQA, MMStar, COCO-Caption, Nocaps and Vintage:
- White-box Models:
  - LLaVA-1.5
  - VILA1.5
  - Qwen-VL-Chat
  - idefics2
  - Phi-3-vision-instruct
  - Yi-VL
  - InternVL2
- Grey-box Models:
  - fuyu
- Black-box Models:
  - GPT-4o
  - Gemini-1.5-Pro
  - Claude-3.5-Sonnet
🔐 Important: When detecting contamination of black-box models, make sure to add your API key at Line 26 in `mm_detect/mllms/gpt.py`:

```python
api_key='your-api-key'
```
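If you prefer not to hard-code the key, here is a small sketch of loading it from an environment variable; the variable name `OPENAI_API_KEY` is our assumption, and the repository itself expects the literal assignment shown above:

```python
# Sketch: read the API key from an environment variable instead of hard-coding it.
# OPENAI_API_KEY is an assumed variable name, not something the repository defines.
import os

api_key = os.environ.get("OPENAI_API_KEY", "your-api-key")
```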
📌 To run contamination detection for MLLMs, you can use the test scripts in the scripts/tests/mllms folder. For instance, use the following command to run the Option Order Sensitivity Test on ScienceQA with GPT-4o:

```bash
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o
```
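For intuition, below is a minimal, self-contained sketch of what an option-order sensitivity check does; it is not the repository's implementation, and `ask_model` is a hypothetical placeholder for a real MLLM call. The idea: shuffle the answer options of each multiple-choice question and compare accuracy before and after; a sharp drop suggests the model may have memorized the original option order.

```python
# Illustrative sketch only -- not the repository's implementation.
import random

def ask_model(question: str, options: list[str]) -> int:
    """Hypothetical placeholder for a real MLLM call; returns the chosen option index."""
    raise NotImplementedError

def option_order_sensitivity(samples, seed=0):
    """samples: iterable of (question, options, gold_index) triples."""
    rng = random.Random(seed)
    samples = list(samples)
    correct_original = correct_shuffled = 0
    for question, options, gold_index in samples:
        # 1) Ask with the original option order.
        if ask_model(question, options) == gold_index:
            correct_original += 1
        # 2) Shuffle the options and track where the gold answer moved.
        perm = list(range(len(options)))
        rng.shuffle(perm)
        shuffled = [options[i] for i in perm]
        new_gold_index = perm.index(gold_index)
        if ask_model(question, shuffled) == new_gold_index:
            correct_shuffled += 1
    n = len(samples)
    # A large gap between the two accuracies hints at memorized option order.
    return correct_original / n, correct_shuffled / n
```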
We support the following LLMs on MMStar:

- LLMs:
  - LLaMA2
  - Qwen
  - Internlm2
  - Mistral
  - Phi-3-instruct
  - Yi
📌 For instance, use the following command to run the detection with Qwen-7B:

```bash
bash scripts/llms/detect_pretrain/test_MMStar.sh -m Qwen/Qwen-7B
```
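As a rough illustration of the idea behind this LLM pre-training check (a sketch under our own assumptions, not the paper's exact procedure; `ask_llm_text_only` is a hypothetical placeholder): if a text-only LLM answers image-dependent benchmark questions well above chance without ever seeing the images, the question and answer text may have leaked into its pre-training corpus.

```python
# Illustrative sketch only -- not the paper's exact heuristic.
def ask_llm_text_only(question: str, options: list[str]) -> int:
    """Hypothetical placeholder for a text-only LLM call (e.g. Qwen-7B);
    returns the index of the option the model picks."""
    raise NotImplementedError

def text_only_accuracy(samples) -> float:
    """samples: iterable of (question, options, gold_index) triples."""
    samples = list(samples)
    correct = sum(ask_llm_text_only(q, opts) == gold for q, opts, gold in samples)
    return correct / len(samples)

# Compare the result with random-guess accuracy (1 / number of options):
# a large gap on image-dependent questions hints at text leakage.
```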
⭐ If you find our implementation and paper helpful, please consider citing our work ⭐:
```bibtex
@misc{song2024textimagesleakedsystematic,
      title={Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination},
      author={Dingjie Song and Sicheng Lai and Shunian Chen and Lichao Sun and Benyou Wang},
      year={2024},
      eprint={2411.03823},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03823},
}
```