🕵️ MM-Detect: The First Multimodal Data Contamination Detection Framework

Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang*

🤗 Paper | 📖 arXiv

🌈 Updates

  • 🏆 Accepted to ICML 2025 Workshop DIG-BUG
    📅 Date of Acceptance: June 2025
    🎤 Oral Presentation

Overview

Multimodal large language models (MLLMs) have advanced rapidly and now achieve strong performance on a wide range of multimodal benchmarks. However, data contamination during training complicates performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs, which involve multiple modalities and multiple training phases. We therefore introduce MM-Detect, a multimodal data contamination detection framework. In addition, we employ a heuristic method to discern whether the contamination originates from the pre-training phase of the underlying LLM.

[Figure: Overview of the MM-Detect framework]

🤖 Environment Setup

git clone https://github.com/FreedomIntelligence/MM-Detect.git
cd MM-Detect
conda create -n MM-Detect python=3.10
conda activate MM-Detect
pip install torch==2.1.2
pip install -r requirements.txt
pip install googletrans==3.1.0a0
pip install httpx==0.27.2
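
To quickly confirm that the environment resolved correctly, you can check that the pinned PyTorch build imports (an optional sanity check, not part of the original instructions):

python -c "import torch; print(torch.__version__)"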

Ensure that your system has Java installed to enable the use of the Stanford POS Tagger.

sudo apt update
sudo apt install openjdk-11-jdk
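
You can then verify the Java installation with:

java -version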

🚀 Run MM-Detect

Our codebase supports the following models on ScienceQA, MMStar, COCO-Caption, Nocaps and Vintage:

  • White-box Models:

    • LLaVA-1.5
    • VILA1.5
    • Qwen-VL-Chat
    • idefics2
    • Phi-3-vision-instruct
    • Yi-VL
    • InternVL2
    • DeepSeek-VL2
  • Grey-box Models:

    • fuyu
  • Black-box Models:

    • GPT-4o
    • Gemini-1.5-Pro
    • Claude-3.5-Sonnet

🔐 Important: When detecting contamination of black-box models, be sure to add your API key at line 26 in mm_detect/mllms/gpt.py:

api_key='your-api-key'
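
If you prefer not to hard-code the key, a minimal alternative is to read it from the environment. This is an illustrative sketch only (it assumes the surrounding code accepts the key as a plain string, and the OPENAI_API_KEY variable name is just an example):

# Illustrative only: read the key from an environment variable instead of hard-coding it.
import os

api_key = os.environ.get("OPENAI_API_KEY", "your-api-key")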

🌱 To save intermediate results and enable the resume function, add your output_dir at line 77 in multimodal_methods/option_order_sensitivity_test.py and at line 104 in multimodal_methods/slot_guessing_for_perturbation_caption.py:

results_file = "output_dir/results.json"
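
For context, resume logic of this kind typically just reloads the saved JSON if it already exists so that completed samples can be skipped. A minimal sketch under that assumption (not the repository's actual implementation):

import json
import os

results_file = "output_dir/results.json"  # replace output_dir with your own path

# Reload previously saved results, if any, so completed samples can be skipped on resume.
if os.path.exists(results_file):
    with open(results_file) as f:
        results = json.load(f)
else:
    results = []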

📌 To run contamination detection for MLLMs, follow the test scripts in the scripts/tests/mllms folder. For instance, use the following command to run the Option Order Sensitivity Test on ScienceQA with GPT-4o:

bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o
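
Conceptually, the Option Order Sensitivity Test perturbs the order of the multiple-choice options and compares the model's behavior before and after the shuffle; answers that stay pinned to the original ordering are one symptom of benchmark memorization. A heavily simplified sketch of that idea (illustrative only, with a hypothetical model_answer interface; this is not the repository's implementation):

import random

def option_order_consistency(model_answer, question, options, trials=5):
    """Fraction of shuffled trials in which the model picks the same option text.

    model_answer(question, options) -> index of the chosen option (hypothetical interface).
    """
    baseline = options[model_answer(question, options)]
    consistent = 0
    for _ in range(trials):
        shuffled = options[:]
        random.shuffle(shuffled)
        choice = shuffled[model_answer(question, shuffled)]
        consistent += int(choice == baseline)
    return consistent / trials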

🔍 Discern the Source of Contamination

We support the following LLMs on MMStar:

  • LLMs:
    • LLaMA2
    • Qwen
    • Internlm2
    • Mistral
    • Phi-3-instruct
    • Yi
    • DeepSeek-MoE-Chat

📌 For instance, use the following command to run detection with Qwen-7B:

bash scripts/llms/detect_pretrain/test_MMStar.sh -m Qwen/Qwen-7B

🎉 Citation

⭐ If you find our implementation and paper helpful, please consider citing our work and starring the repository ⭐:

@misc{song2024textimagesleakedsystematic,
  title={Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination},
  author={Dingjie Song and Sicheng Lai and Shunian Chen and Lichao Sun and Benyou Wang},
  year={2024},
  eprint={2411.03823},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.03823},
}

🛠 Troubleshooting

If you encounter the following error when using googletrans:

AttributeError: module 'httpcore' has no attribute 'SyncHTTPTransport'

please refer to the solution provided on this Stack Overflow page for further guidance.
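
In practice, the pinned versions from the setup above (googletrans==3.1.0a0 together with httpx==0.27.2) are the usual way to avoid this error, so reinstalling them is a reasonable first step if it appears:

pip install googletrans==3.1.0a0 httpx==0.27.2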

❤️ Acknowledgement
