This repo contains code for the paper "Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM"


πŸ•΅οΈ MM-Detect: The First Multimodal Data Contamination Detection Framework

arXiv · Hugging Face · EMNLP 2025 · ICML 2025 · License · Python

Authors: Dingjie Song*, Sicheng Lai*, Shunian Chen, Lichao Sun, Benyou Wang

A systematic framework for detecting data contamination in Multimodal Large Language Models


πŸ” Overview

Figure: Overview of the MM-Detect framework for multimodal data contamination detection.

Multimodal Large Language Models (MLLMs) have advanced rapidly and now achieve remarkable performance across various benchmarks. However, data contamination during training poses significant challenges for fair evaluation and model comparison.

🎯 Key Challenges

  • Existing contamination detection methods for LLMs are insufficient for MLLMs
  • Multiple modalities and training phases complicate detection
  • Need for systematic analysis of contamination sources

πŸ’‘ Our Solution: MM-Detect

MM-Detect introduces the first comprehensive framework specifically designed for detecting data contamination in multimodal models:

βœ… Multi-modal Detection: Handles both text and image contamination
βœ… Multi-phase Analysis: Identifies contamination across different training stages
βœ… Heuristic Source Identification: Determines if contamination originates from LLM pre-training
βœ… Comprehensive Model Support: Works with white-box, grey-box, and black-box models


⚑ Quick Start

Prerequisites

πŸ“‹ System Requirements
  • Operating System: Linux (Ubuntu 18.04+), macOS, or Windows with WSL
  • Python: 3.10 or higher
  • Package Manager: conda + Poetry (automatically installed via environment.yml)
  • Java: OpenJDK 11+ (automatically installed via environment.yml)
  • GPU: NVIDIA GPU with CUDA support (recommended for faster inference)
  • Memory: Minimum 16GB RAM, 32GB+ recommended

Installation

πŸš€ Quick Setup (Recommended)

# 1. Clone the repository
git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect

# 2. Create and activate conda environment (includes Python 3.10, Poetry, and Java)
conda env create -f environment.yml
conda activate MM-Detect

# 3. Install Python dependencies with Poetry
poetry install --no-root

πŸ“‹ Manual Installation (Alternative)

Click to expand manual installation steps

1️⃣ Clone the Repository

git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect

2️⃣ Create Python Environment

# Using conda (recommended)
conda create -n MM-Detect python=3.10
conda activate MM-Detect

# Or using virtualenv
python -m venv MM-Detect
source MM-Detect/bin/activate  # On Windows: MM-Detect\Scripts\activate

3️⃣ Install Poetry

# Install Poetry
pip install poetry

# Install dependencies
poetry install --no-root

4️⃣ Install Java (for Stanford POS Tagger)

# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-11-jdk

# CentOS/RHEL
sudo yum install java-11-openjdk-devel

# macOS
brew install openjdk@11

# Verify installation
java -version

πŸš€ Usage Guide

Supported Models

| Model Type | Access Level | Models Supported |
|---|---|---|
| πŸ”“ White-box | Full Access | LLaVA-1.5, VILA1.5, Qwen-VL-Chat, idefics2, Phi-3-vision-instruct, Yi-VL, InternVL2, DeepSeek-VL2 |
| πŸ”’ Grey-box | Partial Access | fuyu |
| ⚫ Black-box | API Only | GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet |

Configuration

πŸ”‘ API Configuration (Black-box Models)

Step 1: Create environment file

# Copy the example environment file
cp .env.example .env

Step 2: Configure your API keys Edit the .env file with your API credentials:

# OpenAI API Configuration
OPENAI_API_KEY=your_actual_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1

# Google Gemini API Configuration  
GEMINI_API_KEY=your_actual_gemini_api_key_here

# Anthropic Claude API Configuration
ANTHROPIC_API_KEY=your_actual_anthropic_api_key_here

Step 3: Verify configuration The framework will automatically load these credentials when running detection tests.
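
The loading step can be sketched as follows. This is an illustrative helper, not the repo's actual loader (which lives in mm_detect/utils/config.py); it assumes a python-dotenv-style KEY=VALUE format:

```python
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Comment lines and blank lines are skipped; variables already set
    in the environment are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
api_key = os.environ.get("OPENAI_API_KEY")
```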

πŸ”’ Security Best Practices
  • βœ… Never commit your .env file to version control
  • βœ… Keep API keys secure and rotate them regularly
  • βœ… Use environment-specific .env files for different deployments
  • βœ… Set appropriate permissions on your .env file: chmod 600 .env
  • βœ… Monitor API usage to detect unauthorized access

πŸ’Ύ Output Configuration (Optional)

Configure output settings in your .env file:

# Output Configuration
OUTPUT_DIR=./outputs
RESULTS_FILE=./outputs/results.json
ENABLE_RESUME=true

# Model Configuration (Optional)
DEFAULT_MODEL=gpt-4o
MAX_TOKENS=4096
TEMPERATURE=0.7

The framework will automatically:

  • βœ… Create output directories
  • βœ… Save results to specified locations
  • βœ… Enable resume functionality when interrupted
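
A minimal sketch of the directory-creation and result-saving behavior, assuming JSON output and the environment variables above (the function name here is illustrative, not the repo's API):

```python
import json
import os
from pathlib import Path

def save_results(results, results_file=None):
    """Create the output directory if needed and write results as JSON."""
    path = Path(results_file or os.environ.get("RESULTS_FILE", "./outputs/results.json"))
    path.parent.mkdir(parents=True, exist_ok=True)  # auto-create output dirs
    path.write_text(json.dumps(results, indent=2))
    return path

out = save_results({"model": "gpt-4o", "method": "option-order-sensitivity-test"},
                   "outputs/results.json")
```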

Running Detection

πŸ§ͺ Basic Example: Option Order Sensitivity Test

# Test GPT-4o on ScienceQA dataset
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o

# Test LLaVA-1.5 on MMStar dataset  
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m llava-1.5

# Resume interrupted tests with -r flag
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r

πŸ”„ Resume Functionality

MM-Detect includes intelligent resume capabilities to handle interrupted runs:

Automatic Checkpointing:

  • βœ… Progress saved every 10 items processed
  • βœ… Checkpoint files stored in outputs/checkpoints/
  • βœ… Failed items tracking with error details
  • βœ… Automatic cleanup after successful completion
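
The checkpointing scheme described above can be sketched roughly as follows. This is an illustration of the idea, not the actual resume_manager.py implementation:

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("outputs/checkpoints")

def run_with_checkpoints(task_id, items, process, every=10):
    """Process items, saving progress every `every` items; resume if a checkpoint exists."""
    ckpt = CHECKPOINT_DIR / f"{task_id}.json"
    state = {"done": 0, "failed": [], "results": []}
    if ckpt.exists():  # resume from the last checkpoint
        state = json.loads(ckpt.read_text())
    for i in range(state["done"], len(items)):
        try:
            state["results"].append(process(items[i]))
        except Exception as e:
            state["failed"].append({"index": i, "error": str(e)})  # track failed items
        state["done"] = i + 1
        if state["done"] % every == 0:  # save progress every `every` items
            ckpt.parent.mkdir(parents=True, exist_ok=True)
            ckpt.write_text(json.dumps(state))
    if ckpt.exists():
        ckpt.unlink()  # clean up after successful completion
    return state
```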

Using Resume:

# Add -r flag to resume from last checkpoint
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m claude-3.5-sonnet -r

# Or use --resume directly with main.py
python main.py --method option-order-sensitivity-test --model_name gpt-4o --resume

Resume Information: When resuming, you'll see detailed progress information:

πŸ”„ Resuming from checkpoint:
   Task ID: a1b2c3d4e5f6
   Method: option_order_sensitivity_test
   Model: gpt-4o
   Dataset: derek-thomas/ScienceQA
   Progress: 450/1340 (33.6%)
   Failed items: 2
   Last saved: 2024-08-08 14:30:15

πŸ”¬ Advanced Usage Examples

πŸ“Š Batch Testing Multiple Models
# Test multiple models on the same dataset
models=("gpt-4o" "claude-3.5-sonnet" "gemini-1.5-pro")
for model in "${models[@]}"; do
    echo "Testing $model..."
    bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m "$model"
done
🎯 Custom Dataset Testing
# Modify the script to use your custom dataset
# Edit the dataset path in the corresponding test script
# Then run the detection
bash scripts/mllms/option_order_sensitivity_test/custom_test.sh -m your-model

πŸ› οΈ Development

Adding Dependencies

# Add a new runtime dependency
poetry add package-name

# Add a new development dependency
poetry add --group dev package-name

# Update dependencies
poetry update

Project Structure

MM-Detect/
β”œβ”€β”€ pyproject.toml          # Poetry configuration and dependencies
β”œβ”€β”€ environment.yml         # Conda environment specification
β”œβ”€β”€ .env.example            # Template for environment variables
β”œβ”€β”€ .env                    # Your API keys (create from .env.example)
β”œβ”€β”€ mm_detect/              # Main package directory
β”‚   β”œβ”€β”€ utils/              # Utility modules
β”‚   β”‚   β”œβ”€β”€ config.py       # Configuration and API key management
β”‚   β”‚   └── resume_manager.py # Resume functionality
β”‚   └── mllms/              # Model implementations
β”œβ”€β”€ scripts/                # Test and run scripts
β”œβ”€β”€ outputs/                # Results and checkpoints
β”‚   β”œβ”€β”€ checkpoints/        # Resume checkpoint files
β”‚   └── results.json        # Final results
└── requirements.txt        # Legacy (for compatibility)

πŸ”¬ Contamination Source Analysis

Determine whether contamination originates from the pre-training phase of base LLMs:

Supported LLMs for Source Analysis

| Model Family | Specific Models |
|---|---|
| LLaMA | LLaMA2-7B, LLaMA2-13B |
| Qwen | Qwen-7B, Qwen-14B |
| InternLM | Internlm2-7B, Internlm2-20B |
| Mistral | Mistral-7B-v0.1 |
| Phi | Phi-3-instruct |
| Yi | Yi-6B, Yi-34B |
| DeepSeek | DeepSeek-MoE-Chat |

Running Source Analysis

# Analyze Qwen-7B on MMStar dataset
bash scripts/llms/detect_pretrain/test_MMStar.sh -m Qwen/Qwen-7B

# Analyze LLaMA2-7B
bash scripts/llms/detect_pretrain/test_MMStar.sh -m meta-llama/Llama-2-7b-hf

# Batch analysis for multiple models
for model in "Qwen/Qwen-7B" "meta-llama/Llama-2-7b-hf" "mistralai/Mistral-7B-v0.1"; do
    echo "Analyzing contamination source for $model..."
    bash scripts/llms/detect_pretrain/test_MMStar.sh -m "$model"
done

πŸ“Š Datasets

MM-Detect supports comprehensive evaluation across multiple benchmark datasets:

| Dataset | Type | Domain | Size | Description |
|---|---|---|---|---|
| ScienceQA | VQA | Science | 21K | Science question answering with diagrams |
| MMStar | VQA | General | 1.5K | Multi-domain visual question answering |
| COCO-Caption | Captioning | General | 123K | Image captioning benchmark |
| NoCaps | Captioning | General | 166K | Novel object captioning |
| Vintage | Captioning | Art | 60K | Vintage artwork descriptions |

πŸ”§ Troubleshooting

πŸ› Common Issues and Solutions

Issue 1: googletrans AttributeError

AttributeError: module 'httpcore' has no attribute 'SyncHTTPTransport'

Solution:

  • Install compatible versions: pip install httpx==0.27.2 googletrans==3.1.0a0
  • Alternative: Use the solution from Stack Overflow

Issue 2: Java Not Found

Exception: Java not found. Please install Java to use Stanford POS Tagger.

Solution:

# Check Java installation
java -version

# If not installed, install OpenJDK 11
sudo apt install openjdk-11-jdk  # Ubuntu/Debian
brew install openjdk@11         # macOS

Issue 3: CUDA Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:

  • Reduce batch size in configuration files
  • Use gradient checkpointing: --gradient_checkpointing
  • Switch to CPU inference: --device cpu

Issue 4: API Rate Limiting

openai.error.RateLimitError: Rate limit exceeded

Solution:

  • Add delays between API calls
  • Use multiple API keys
  • Consider using local models instead
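
A simple way to add delays between calls is exponential backoff around the API request. This generic sketch assumes only that the wrapped call raises an exception when rate limited:

```python
import time

def with_backoff(call, retries=5, base_delay=1.0):
    """Retry `call` with exponentially growing delays between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrap your client call, for example `with_backoff(lambda: client.chat.completions.create(...))`.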

πŸ“§ Getting Help

If you encounter issues not covered above:

  1. Check Issues: Search existing issues
  2. Create Issue: Open a new issue with:
    • Error message
    • System information
    • Steps to reproduce
  3. Community Support: Join our discussions for community help

πŸ“ Citation

If you find MM-Detect helpful in your research, please consider citing our work:

@misc{song2025textimagesleakedsystematic,
      title={Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM}, 
      author={Dingjie Song and Sicheng Lai and Mingxuan Wang and Shunian Chen and Lichao Sun and Benyou Wang},
      year={2025},
      eprint={2411.03823},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03823}, 
}

πŸ™ Acknowledgements

We extend our gratitude to the following projects and contributors:

| Project | Contribution | Link |
|---|---|---|
| LLaVA | Multimodal architecture inspiration | GitHub |
| LLMSanitize | Contamination detection methodologies | GitHub |
| Stanford POS Tagger | Natural language processing tools | Official |

Special Thanks

  • The research community for valuable feedback and suggestions
  • Contributors who helped improve the framework
  • Beta testers who provided essential bug reports
