Authors: Dingjie Song*, Sicheng Lai*, Shunian Chen, Lichao Sun, Benyou Wang
A systematic framework for detecting data contamination in Multimodal Large Language Models
- Latest Updates
- Overview
- Quick Start
- Usage Guide
- Contamination Source Analysis
- Datasets
- Troubleshooting
- Citation
- Acknowledgements
- [August 2025] Paper accepted to the EMNLP 2025 Findings track
- [June 2025] Paper accepted to the ICML 2025 DIG-BUG Workshop as an oral presentation
Multimodal Large Language Models (MLLMs) have advanced rapidly and now achieve remarkable performance across various benchmarks. However, data contamination during training poses significant challenges for fair evaluation and model comparison.
- Existing contamination detection methods for LLMs are insufficient for MLLMs
- Multiple modalities and training phases complicate detection
- Need for systematic analysis of contamination sources
MM-Detect introduces the first comprehensive framework specifically designed for detecting data contamination in multimodal models:
- Multi-modal Detection: Handles both text and image contamination
- Multi-phase Analysis: Identifies contamination across different training stages
- Heuristic Source Identification: Determines whether contamination originates from LLM pre-training
- Comprehensive Model Support: Works with white-box, grey-box, and black-box models
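To make the core idea concrete, the sketch below illustrates the intuition behind the option-order sensitivity test invoked by the scripts in this repository: a model that has memorized a benchmark item tends to answer correctly with the canonical option order but loses accuracy once the options are shuffled. The function names and the scoring are illustrative assumptions, not the exact implementation in mm_detect.

```python
import random

def option_order_sensitivity(model_answer_fn, question, options, correct_idx,
                             n_shuffles=5, seed=0):
    """Compare accuracy on the original option order vs. shuffled orders.
    A large drop after shuffling can indicate that the item was memorized.
    `model_answer_fn(question, options) -> int` is an assumed interface that
    returns the index of the option the model picks."""
    rng = random.Random(seed)

    # Accuracy with the canonical option order.
    original_correct = int(model_answer_fn(question, options) == correct_idx)

    # Accuracy averaged over several random shuffles of the options.
    shuffled_correct = 0
    for _ in range(n_shuffles):
        perm = list(range(len(options)))
        rng.shuffle(perm)
        shuffled_options = [options[i] for i in perm]
        new_correct_idx = perm.index(correct_idx)  # where the right answer moved
        shuffled_correct += int(model_answer_fn(question, shuffled_options) == new_correct_idx)

    # Positive values mean the model is sensitive to option order.
    return original_correct - shuffled_correct / n_shuffles
```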
System Requirements
- Operating System: Linux (Ubuntu 18.04+), macOS, or Windows with WSL
- Python: 3.10 or higher
- Package Manager: conda + Poetry (automatically installed via environment.yml)
- Java: OpenJDK 11+ (automatically installed via environment.yml)
- GPU: NVIDIA GPU with CUDA support (recommended for faster inference)
- Memory: Minimum 16GB RAM, 32GB+ recommended
# 1. Clone the repository
git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect
# 2. Create and activate conda environment (includes Python 3.10, Poetry, and Java)
conda env create -f environment.yml
conda activate MM-Detect
# 3. Install Python dependencies with Poetry
poetry install --no-root

Manual installation steps
git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect

# Using conda (recommended)
conda create -n MM-Detect python=3.10
conda activate MM-Detect
# Or using virtualenv
python -m venv MM-Detect
source MM-Detect/bin/activate  # On Windows: MM-Detect\Scripts\activate

# Install Poetry
pip install poetry
# Install dependencies
poetry install --no-root

# Install Java (required for the Stanford POS Tagger)
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-11-jdk
# CentOS/RHEL
sudo yum install java-11-openjdk-devel
# macOS
brew install openjdk@11
# Verify installation
java -version

Supported Models
| Model Type | Access Level | Models Supported |
|---|---|---|
| π White-box | Full Access | LLaVA-1.5, VILA1.5, Qwen-VL-Chat, idefics2, Phi-3-vision-instruct, Yi-VL, InternVL2, DeepSeek-VL2 | 
| π Grey-box | Partial Access | fuyu | 
| β« Black-box | API Only | GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet | 
Step 1: Create environment file
# Copy the example environment file
cp .env.example .env

Step 2: Configure your API keys
Edit the .env file with your API credentials:
# OpenAI API Configuration
OPENAI_API_KEY=your_actual_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
# Google Gemini API Configuration  
GEMINI_API_KEY=your_actual_gemini_api_key_here
# Anthropic Claude API Configuration
ANTHROPIC_API_KEY=your_actual_anthropic_api_key_here

Step 3: Verify configuration
The framework will automatically load these credentials when running detection tests.
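To sanity-check the keys before launching a full run, a minimal sketch like the following can be used. It assumes the python-dotenv package and only inspects environment variables; it is not the framework's own loading logic in mm_detect/utils/config.py.

```python
import os
from dotenv import load_dotenv  # assumes the python-dotenv package is available

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY"):
    value = os.getenv(key)
    # Treat placeholder values copied from .env.example as missing.
    status = "set" if value and not value.startswith("your_") else "MISSING"
    print(f"{key}: {status}")
```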
Security Best Practices
- Never commit your .env file to version control
- Keep API keys secure and rotate them regularly
- Use environment-specific .env files for different deployments
- Set appropriate permissions on your .env file: chmod 600 .env
- Monitor API usage to detect unauthorized access
Configure output settings in your .env file:
# Output Configuration
OUTPUT_DIR=./outputs
RESULTS_FILE=./outputs/results.json
ENABLE_RESUME=true
# Model Configuration (Optional)
DEFAULT_MODEL=gpt-4o
MAX_TOKENS=4096
TEMPERATURE=0.7

The framework will automatically:
- Create output directories
- Save results to specified locations
- Enable resume functionality when interrupted
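For reference, the automatic behavior listed above corresponds roughly to the following; this is a simplified sketch of how the settings could be consumed, not the framework's actual code.

```python
import json
import os
from pathlib import Path

# Read the output settings, falling back to the documented defaults.
output_dir = Path(os.getenv("OUTPUT_DIR", "./outputs"))
results_file = Path(os.getenv("RESULTS_FILE", str(output_dir / "results.json")))

# Create output directories if they do not exist yet.
output_dir.mkdir(parents=True, exist_ok=True)
results_file.parent.mkdir(parents=True, exist_ok=True)

# Save results to the configured location (illustrative payload).
results = {"method": "option_order_sensitivity_test", "items": []}
results_file.write_text(json.dumps(results, indent=2))
```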
# Test GPT-4o on ScienceQA dataset
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o
# Test LLaVA-1.5 on MMStar dataset  
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m llava-1.5
# Resume interrupted tests with -r flag
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r

MM-Detect includes intelligent resume capabilities to handle interrupted runs:
Automatic Checkpointing:
- Progress saved every 10 items processed
- Checkpoint files stored in outputs/checkpoints/
- Failed items tracked with error details
- Automatic cleanup after successful completion
Using Resume:
# Add -r flag to resume from last checkpoint
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m claude-3.5-sonnet -r
# Or use --resume directly with main.py
python main.py --method option-order-sensitivity-test --model_name gpt-4o --resume

Resume Information: When resuming, you'll see detailed progress information:
Resuming from checkpoint:
   Task ID: a1b2c3d4e5f6
   Method: option_order_sensitivity_test
   Model: gpt-4o
   Dataset: derek-thomas/ScienceQA
   Progress: 450/1340 (33.6%)
   Failed items: 2
   Last saved: 2024-08-08 14:30:15
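Based on the fields shown above, a checkpoint can be pictured as a small JSON file under outputs/checkpoints/, written periodically as items complete. The schema and helper below are an illustrative assumption; the exact format is defined by mm_detect/utils/resume_manager.py.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("outputs/checkpoints")
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

# Hypothetical checkpoint contents, mirroring the resume summary printed above.
checkpoint = {
    "task_id": "a1b2c3d4e5f6",
    "method": "option_order_sensitivity_test",
    "model": "gpt-4o",
    "dataset": "derek-thomas/ScienceQA",
    "completed": 450,
    "total": 1340,
    "failed_items": [{"index": 17, "error": "request timeout"}],
    "last_saved": "2024-08-08 14:30:15",
}

def save_checkpoint(ckpt: dict) -> None:
    """Persist progress so an interrupted run can resume from the last save."""
    path = CHECKPOINT_DIR / f"{ckpt['task_id']}.json"
    path.write_text(json.dumps(ckpt, indent=2))

save_checkpoint(checkpoint)
```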
Batch Testing Multiple Models
# Test multiple models on the same dataset
models=("gpt-4o" "claude-3.5-sonnet" "gemini-1.5-pro")
for model in "${models[@]}"; do
    echo "Testing $model..."
    bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m "$model"
done

Custom Dataset Testing
# Modify the script to use your custom dataset
# Edit the dataset path in the corresponding test script
# Then run the detection
bash scripts/mllms/option_order_sensitivity_test/custom_test.sh -m your-model

Managing Dependencies with Poetry
# Add a new runtime dependency
poetry add package-name
# Add a new development dependency
poetry add --group dev package-name
# Update dependencies
poetry update

Project Structure
MM-Detect/
├── pyproject.toml          # Poetry configuration and dependencies
├── environment.yml         # Conda environment specification
├── .env.example            # Template for environment variables
├── .env                    # Your API keys (create from .env.example)
├── mm_detect/              # Main package directory
│   ├── utils/              # Utility modules
│   │   ├── config.py       # Configuration and API key management
│   │   └── resume_manager.py # Resume functionality
│   └── mllms/              # Model implementations
├── scripts/                # Test and run scripts
├── outputs/                # Results and checkpoints
│   ├── checkpoints/        # Resume checkpoint files
│   └── results.json        # Final results
└── requirements.txt        # Legacy (for compatibility)
Contamination Source Analysis
Determine whether contamination originates from the pre-training phase of base LLMs:
| Model Family | Specific Models | 
|---|---|
| LLaMA | LLaMA2-7B, LLaMA2-13B | 
| Qwen | Qwen-7B, Qwen-14B | 
| InternLM | Internlm2-7B, Internlm2-20B | 
| Mistral | Mistral-7B-v0.1 | 
| Phi | Phi-3-instruct | 
| Yi | Yi-6B, Yi-34B | 
| DeepSeek | DeepSeek-MoE-Chat | 
# Analyze Qwen-7B on MMStar dataset
bash scripts/llms/detect_pretrain/test_MMStar.sh -m Qwen/Qwen-7B
# Analyze LLaMA2-7B
bash scripts/llms/detect_pretrain/test_MMStar.sh -m meta-llama/Llama-2-7b-hf
# Batch analysis for multiple models
for model in "Qwen/Qwen-7B" "meta-llama/Llama-2-7b-hf" "mistralai/Mistral-7B-v0.1"; do
    echo "Analyzing contamination source for $model..."
    bash scripts/llms/detect_pretrain/test_MMStar.sh -m "$model"
done

MM-Detect supports comprehensive evaluation across multiple benchmark datasets:
| Dataset | Type | Domain | Size | Description | 
|---|---|---|---|---|
| ScienceQA | VQA | Science | 21K | Science question answering with diagrams | 
| MMStar | VQA | General | 1.5K | Multi-domain visual question answering | 
| COCO-Caption | Captioning | General | 123K | Image captioning benchmark | 
| NoCaps | Captioning | General | 166K | Novel object captioning | 
| Vintage | Captioning | Art | 60K | Vintage artwork descriptions | 
Common Issues and Solutions
AttributeError: module 'httpcore' has no attribute 'SyncHTTPTransport'

Solution:
- Install compatible versions: pip install httpx==0.27.2 googletrans==3.1.0a0
- Alternative: Use the solution from Stack Overflow
Exception: Java not found. Please install Java to use Stanford POS Tagger.

Solution:
# Check Java installation
java -version
# If not installed, install OpenJDK 11
sudo apt install openjdk-11-jdk  # Ubuntu/Debian
brew install openjdk@11          # macOS

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:
- Reduce batch size in configuration files
- Use gradient checkpointing: --gradient_checkpointing
- Switch to CPU inference: --device cpu
openai.error.RateLimitError: Rate limit exceeded

Solution:
- Add delays between API calls (see the backoff sketch below)
- Use multiple API keys
- Consider using local models instead
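If you stay with API models, client-side backoff along the following lines is usually enough to stay under provider rate limits. This is a generic sketch, not part of MM-Detect; call_api stands in for whatever request you are issuing, and in practice you would catch the provider's specific RateLimitError.

```python
import random
import time

def call_with_backoff(call_api, max_retries=5, base_delay=2.0):
    """Retry an API call with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call_api()
        except Exception as exc:  # narrow this to the provider's RateLimitError in real use
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```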
Getting Help
If you encounter issues not covered above:
- Check Issues: Search existing issues
- Create Issue: Open a new issue with:
  - Error message
  - System information
  - Steps to reproduce
- Community Support: Join our discussions for community help
If you find MM-Detect helpful in your research, please consider citing our work:
@misc{song2025textimagesleakedsystematic,
      title={Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM}, 
      author={Dingjie Song and Sicheng Lai and Mingxuan Wang and Shunian Chen and Lichao Sun and Benyou Wang},
      year={2025},
      eprint={2411.03823},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03823}, 
}

Acknowledgements
We extend our gratitude to the following projects and contributors:
| Project | Contribution | Link | 
|---|---|---|
| LLaVA | Multimodal architecture inspiration | GitHub | 
| LLMSanitize | Contamination detection methodologies | GitHub | 
| Stanford POS Tagger | Natural language processing tools | Official | 
- The research community for valuable feedback and suggestions
- Contributors who helped improve the framework
- Beta testers who provided essential bug reports
