This repo contains code for the paper "Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM"


πŸ•΅οΈ MM-Detect: The First Multimodal Data Contamination Detection Framework

arXiv · Hugging Face · EMNLP 2025 · ICML 2025 · License · Python

Authors: Dingjie Song*, Sicheng Lai*, Shunian Chen, Lichao Sun, Benyou Wang

A systematic framework for detecting data contamination in Multimodal Large Language Models


πŸ” Overview

Figure: Overview of the MM-Detect framework for multimodal data contamination detection.

Multimodal Large Language Models (MLLMs) have advanced rapidly and now achieve remarkable performance across various benchmarks. However, data contamination during training poses significant challenges for fair evaluation and model comparison.

🎯 Key Challenges

  • Existing contamination detection methods for LLMs are insufficient for MLLMs
  • Multiple modalities and training phases complicate detection
  • Need for systematic analysis of contamination sources

πŸ’‘ Our Solution: MM-Detect

MM-Detect introduces the first comprehensive framework specifically designed for detecting data contamination in multimodal models:

βœ… Multi-modal Detection: Handles both text and image contamination
βœ… Multi-phase Analysis: Identifies contamination across different training stages
βœ… Heuristic Source Identification: Determines if contamination originates from LLM pre-training
βœ… Comprehensive Model Support: Works with white-box, grey-box, and black-box models


⚑ Quick Start

Prerequisites

πŸ“‹ System Requirements
  • Operating System: Linux (Ubuntu 18.04+), macOS, or Windows with WSL
  • Python: 3.10 or higher
  • Package Manager: conda + Poetry (automatically installed via environment.yml)
  • Java: OpenJDK 11+ (automatically installed via environment.yml)
  • GPU: NVIDIA GPU with CUDA support (recommended for faster inference)
  • Memory: Minimum 16GB RAM, 32GB+ recommended

Installation

πŸš€ Quick Setup (Recommended)

# 1. Clone the repository
git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect

# 2. Create and activate conda environment (includes Python 3.10, Poetry, and Java)
conda env create -f environment.yml
conda activate MM-Detect

# 3. Install Python dependencies with Poetry
poetry install --no-root

πŸ“‹ Manual Installation (Alternative)

Click to expand manual installation steps

1️⃣ Clone the Repository

git clone https://github.com/MLLM-Data-Contamination/MM-Detect.git
cd MM-Detect

2️⃣ Create Python Environment

# Using conda (recommended)
conda create -n MM-Detect python=3.10
conda activate MM-Detect

# Or using virtualenv
python -m venv MM-Detect
source MM-Detect/bin/activate  # On Windows: MM-Detect\Scripts\activate

3️⃣ Install Poetry

# Install Poetry
pip install poetry

# Install dependencies
poetry install --no-root

4️⃣ Install Java (for Stanford POS Tagger)

# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-11-jdk

# CentOS/RHEL
sudo yum install java-11-openjdk-devel

# macOS
brew install openjdk@11

# Verify installation
java -version

πŸš€ Usage Guide

Supported Models

| Model Type | Access Level | Models Supported |
|---|---|---|
| πŸ”“ White-box | Full Access | LLaVA-1.5, VILA1.5, Qwen-VL-Chat, idefics2, Phi-3-vision-instruct, Yi-VL, InternVL2, DeepSeek-VL2 |
| πŸ”’ Grey-box | Partial Access | fuyu |
| ⚫ Black-box | API Only | GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet |

Configuration

πŸ”‘ API Configuration (Black-box Models)

Step 1: Create environment file

# Copy the example environment file
cp .env.example .env

Step 2: Configure your API keys Edit the .env file with your API credentials:

# OpenAI API Configuration
OPENAI_API_KEY=your_actual_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1

# Google Gemini API Configuration  
GEMINI_API_KEY=your_actual_gemini_api_key_here

# Anthropic Claude API Configuration
ANTHROPIC_API_KEY=your_actual_anthropic_api_key_here

Step 3: Verify configuration The framework will automatically load these credentials when running detection tests.
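
The loading step can be sketched as follows. This is an illustrative helper, not the repo's actual loader (which lives in mm_detect/utils/config.py); it assumes a python-dotenv-style KEY=VALUE format:

```python
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Comment lines and blank lines are skipped; variables already set
    in the environment are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
api_key = os.environ.get("OPENAI_API_KEY")
```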

πŸ”’ Security Best Practices
  • βœ… Never commit your .env file to version control
  • βœ… Keep API keys secure and rotate them regularly
  • βœ… Use environment-specific .env files for different deployments
  • βœ… Set appropriate permissions on your .env file: chmod 600 .env
  • βœ… Monitor API usage to detect unauthorized access

πŸ’Ύ Output Configuration (Optional)

Configure output settings in your .env file:

# Output Configuration
OUTPUT_DIR=./outputs
RESULTS_FILE=./outputs/results.json
ENABLE_RESUME=true

# Model Configuration (Optional)
DEFAULT_MODEL=gpt-4o
MAX_TOKENS=4096
TEMPERATURE=0.7

The framework will automatically:

  • βœ… Create output directories
  • βœ… Save results to specified locations
  • βœ… Enable resume functionality when interrupted
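
A minimal sketch of the directory-creation and result-saving behavior, assuming JSON output and the environment variables above (the function name here is illustrative, not the repo's API):

```python
import json
import os
from pathlib import Path

def save_results(results, results_file=None):
    """Create the output directory if needed and write results as JSON."""
    path = Path(results_file or os.environ.get("RESULTS_FILE", "./outputs/results.json"))
    path.parent.mkdir(parents=True, exist_ok=True)  # auto-create output dirs
    path.write_text(json.dumps(results, indent=2))
    return path

out = save_results({"model": "gpt-4o", "method": "option-order-sensitivity-test"},
                   "outputs/results.json")
```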

Running Detection

πŸ§ͺ Basic Example: Option Order Sensitivity Test

# Test GPT-4o on ScienceQA dataset
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o

# Test LLaVA-1.5 on MMStar dataset  
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m llava-1.5

# Resume interrupted tests with -r flag
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r

πŸ”„ Resume Functionality

MM-Detect includes intelligent resume capabilities to handle interrupted runs:

Automatic Checkpointing:

  • βœ… Progress saved every 10 items processed
  • βœ… Checkpoint files stored in outputs/checkpoints/
  • βœ… Failed items tracking with error details
  • βœ… Automatic cleanup after successful completion
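
The checkpointing scheme described above can be sketched roughly as follows. This is an illustration of the idea, not the actual resume_manager.py implementation:

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("outputs/checkpoints")

def run_with_checkpoints(task_id, items, process, every=10):
    """Process items, saving progress every `every` items; resume if a checkpoint exists."""
    ckpt = CHECKPOINT_DIR / f"{task_id}.json"
    state = {"done": 0, "failed": [], "results": []}
    if ckpt.exists():  # resume from the last checkpoint
        state = json.loads(ckpt.read_text())
    for i in range(state["done"], len(items)):
        try:
            state["results"].append(process(items[i]))
        except Exception as e:
            state["failed"].append({"index": i, "error": str(e)})  # track failed items
        state["done"] = i + 1
        if state["done"] % every == 0:  # save progress every `every` items
            ckpt.parent.mkdir(parents=True, exist_ok=True)
            ckpt.write_text(json.dumps(state))
    if ckpt.exists():
        ckpt.unlink()  # clean up after successful completion
    return state
```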

Using Resume:

# Add -r flag to resume from last checkpoint
bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m gpt-4o -r
bash scripts/mllms/option_order_sensitivity_test/test_MMStar.sh -m claude-3.5-sonnet -r

# Or use --resume directly with main.py
python main.py --method option-order-sensitivity-test --model_name gpt-4o --resume

Resume Information: When resuming, you'll see detailed progress information:

πŸ”„ Resuming from checkpoint:
   Task ID: a1b2c3d4e5f6
   Method: option_order_sensitivity_test
   Model: gpt-4o
   Dataset: derek-thomas/ScienceQA
   Progress: 450/1340 (33.6%)
   Failed items: 2
   Last saved: 2024-08-08 14:30:15

πŸ”¬ Advanced Usage Examples

πŸ“Š Batch Testing Multiple Models
# Test multiple models on the same dataset
models=("gpt-4o" "claude-3.5-sonnet" "gemini-1.5-pro")
for model in "${models[@]}"; do
    echo "Testing $model..."
    bash scripts/mllms/option_order_sensitivity_test/test_ScienceQA.sh -m "$model"
done
🎯 Custom Dataset Testing
# Modify the script to use your custom dataset
# Edit the dataset path in the corresponding test script
# Then run the detection
bash scripts/mllms/option_order_sensitivity_test/custom_test.sh -m your-model

πŸ› οΈ Development

Adding Dependencies

# Add a new runtime dependency
poetry add package-name

# Add a new development dependency
poetry add --group dev package-name

# Update dependencies
poetry update

Project Structure

MM-Detect/
β”œβ”€β”€ pyproject.toml          # Poetry configuration and dependencies
β”œβ”€β”€ environment.yml         # Conda environment specification
β”œβ”€β”€ .env.example            # Template for environment variables
β”œβ”€β”€ .env                    # Your API keys (create from .env.example)
β”œβ”€β”€ mm_detect/              # Main package directory
β”‚   β”œβ”€β”€ utils/              # Utility modules
β”‚   β”‚   β”œβ”€β”€ config.py       # Configuration and API key management
β”‚   β”‚   └── resume_manager.py # Resume functionality
β”‚   └── mllms/              # Model implementations
β”œβ”€β”€ scripts/                # Test and run scripts
β”œβ”€β”€ outputs/                # Results and checkpoints
β”‚   β”œβ”€β”€ checkpoints/        # Resume checkpoint files
β”‚   └── results.json        # Final results
└── requirements.txt        # Legacy (for compatibility)

πŸ”¬ Contamination Source Analysis

Determine whether contamination originates from the pre-training phase of base LLMs:

Supported LLMs for Source Analysis

| Model Family | Specific Models |
|---|---|
| LLaMA | LLaMA2-7B, LLaMA2-13B |
| Qwen | Qwen-7B, Qwen-14B |
| InternLM | Internlm2-7B, Internlm2-20B |
| Mistral | Mistral-7B-v0.1 |
| Phi | Phi-3-instruct |
| Yi | Yi-6B, Yi-34B |
| DeepSeek | DeepSeek-MoE-Chat |

Running Source Analysis

# Analyze Qwen-7B on MMStar dataset
bash scripts/llms/detect_pretrain/test_MMStar.sh -m Qwen/Qwen-7B

# Analyze LLaMA2-7B
bash scripts/llms/detect_pretrain/test_MMStar.sh -m meta-llama/Llama-2-7b-hf

# Batch analysis for multiple models
for model in "Qwen/Qwen-7B" "meta-llama/Llama-2-7b-hf" "mistralai/Mistral-7B-v0.1"; do
    echo "Analyzing contamination source for $model..."
    bash scripts/llms/detect_pretrain/test_MMStar.sh -m "$model"
done

πŸ“Š Datasets

MM-Detect supports comprehensive evaluation across multiple benchmark datasets:

| Dataset | Type | Domain | Size | Description |
|---|---|---|---|---|
| ScienceQA | VQA | Science | 21K | Science question answering with diagrams |
| MMStar | VQA | General | 1.5K | Multi-domain visual question answering |
| COCO-Caption | Captioning | General | 123K | Image captioning benchmark |
| NoCaps | Captioning | General | 166K | Novel object captioning |
| Vintage | Captioning | Art | 60K | Vintage artwork descriptions |

πŸ”§ Troubleshooting

πŸ› Common Issues and Solutions

Issue 1: googletrans AttributeError

AttributeError: module 'httpcore' has no attribute 'SyncHTTPTransport'

Solution:

  • Install compatible versions: pip install httpx==0.27.2 googletrans==3.1.0a0
  • Alternative: Use the solution from Stack Overflow

Issue 2: Java Not Found

Exception: Java not found. Please install Java to use Stanford POS Tagger.

Solution:

# Check Java installation
java -version

# If not installed, install OpenJDK 11
sudo apt install openjdk-11-jdk  # Ubuntu/Debian
brew install openjdk@11         # macOS

Issue 3: CUDA Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:

  • Reduce batch size in configuration files
  • Use gradient checkpointing: --gradient_checkpointing
  • Switch to CPU inference: --device cpu

Issue 4: API Rate Limiting

openai.error.RateLimitError: Rate limit exceeded

Solution:

  • Add delays between API calls
  • Use multiple API keys
  • Consider using local models instead
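
A simple way to add delays between calls is exponential backoff around the API request. This generic sketch assumes only that the wrapped call raises an exception when rate limited:

```python
import time

def with_backoff(call, retries=5, base_delay=1.0):
    """Retry `call` with exponentially growing delays between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrap your client call, for example `with_backoff(lambda: client.chat.completions.create(...))`.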

πŸ“§ Getting Help

If you encounter issues not covered above:

  1. Check Issues: Search existing issues
  2. Create Issue: Open a new issue with:
    • Error message
    • System information
    • Steps to reproduce
  3. Community Support: Join our discussions for community help

πŸ“ Citation

If you find MM-Detect helpful in your research, please consider citing our work:

@misc{song2025textimagesleakedsystematic,
      title={Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM}, 
      author={Dingjie Song and Sicheng Lai and Mingxuan Wang and Shunian Chen and Lichao Sun and Benyou Wang},
      year={2025},
      eprint={2411.03823},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03823}, 
}

πŸ™ Acknowledgements

We extend our gratitude to the following projects and contributors:

| Project | Contribution | Link |
|---|---|---|
| LLaVA | Multimodal architecture inspiration | GitHub |
| LLMSanitize | Contamination detection methodologies | GitHub |
| Stanford POS Tagger | Natural language processing tools | Official |

Special Thanks

  • The research community for valuable feedback and suggestions
  • Contributors who helped improve the framework
  • Beta testers who provided essential bug reports
