
Commit aa8b5da

Fix preprocessing documentation with verified implementations
Update PREPROCESSING.md files with correct information based on actual code.

- DeepSeek-R1: Use apply_chat_template, 32K context
- Llama 3.1-8B: Use instruction template for summarization
- Add general preprocessing guide and examples
1 parent c6aa29a commit aa8b5da

4 files changed: +388, −173 lines changed

language/PREPROCESSING_GUIDE.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# MLCommons Inference - General Preprocessing Guide

## Overview

This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:

1. Model architecture
2. Backend choice (PyTorch, vLLM, SGLang)
3. Task type (summarization, Q&A, etc.)

## Common Tokenizer Setup Pattern

Most models follow this pattern:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"  # Critical for generation
tokenizer.pad_token = tokenizer.eos_token
```

## Backend Dependencies

Different backends have different preprocessing requirements:

| Backend | Input Type | Chat Template Support | Use Case |
|---------|------------|-----------------------|----------|
| PyTorch | Tokenized  | Varies by model       | Distributed inference |
| vLLM    | Text       | Varies by model       | High-throughput serving |
| SGLang  | Text       | Usually disabled      | Optimized serving |

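To make the table concrete, here is a minimal sketch of how such backend differences can be looked up in code. The dictionary layout and the `uses_chat_template` values (shown for DeepSeek-R1) are assumptions for illustration; the actual registry lives in `utils/backend_registry.py` and may be organized differently.

```python
# Illustrative sketch only -- the real utils/backend_registry.py may differ.
# uses_chat_template values here are the ones listed for DeepSeek-R1;
# other models vary.
BACKEND_REGISTRY = {
    "pytorch": {"input_type": "tokenized", "uses_chat_template": True},
    "vllm": {"input_type": "text", "uses_chat_template": True},
    "sglang": {"input_type": "text", "uses_chat_template": False},
}


def backend_expects_tokens(backend: str) -> bool:
    """Return True if the backend consumes pre-tokenized input."""
    return BACKEND_REGISTRY[backend]["input_type"] == "tokenized"
```
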
## Dataset Format

All models expect datasets with these common fields:

```python
{
    'text_input': str,       # Raw prompt text (required)
    'tok_input': List[int],  # Pre-tokenized input (optional)
    'output': str,           # Expected output for evaluation
}
```

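As a quick sanity check that an input file matches this layout, something along these lines can be used. The pandas-pickle assumption and the `data.pkl` name are taken from the `--input-file data.pkl` usage shown later, not from a documented format.

```python
import pandas as pd

# Assumed layout: a pickled pandas DataFrame with the columns listed above.
df = pd.read_pickle("data.pkl")  # hypothetical input file

missing = {"text_input"} - set(df.columns)
if missing:
    raise ValueError(f"Input file is missing required columns: {missing}")

print(df["text_input"].head())
```
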
## Model-Specific Preprocessing

### Models Using Chat Templates
- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
- **Potential others**: Check `uses_chat_template` in the backend registry

### Models Using Simple Templates
- **Llama 3.1-8B**: Instruction format for summarization
- **Llama 2-70B**: Custom format with `[INST]` markers (see the sketch below)
- **Mixtral-8x7B**: Simple instruction format

### Models Using Raw Prompts
- **GPT-J**: Completion-style, no special formatting

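For the simple-template models, the formatting amounts to wrapping the raw text in a fixed prompt string, roughly as sketched below. The `[INST]` wrapper follows the well-known Llama 2 chat convention; the summarization prompt is a placeholder, since the exact strings used by the benchmark live in each model's own directory.

```python
def wrap_llama2_instruction(text: str) -> str:
    # Llama 2 chat-style user turn; the benchmark's template may also
    # include a system prompt.
    return f"[INST] {text} [/INST]"


def wrap_summarization_instruction(article: str) -> str:
    # Placeholder instruction-style prompt for summarization tasks.
    return f"Summarize the following text:\n\n{article}\n\nSummary:"
```
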
## Preprocessing Steps

1. **Load the tokenizer** with appropriate configuration
2. **Apply model-specific formatting** (chat template or instruction format)
3. **Tokenize** with proper truncation and max length
4. **Handle padding** (left-side for generation models)

## Example: Generic Preprocessing Function

```python
from transformers import AutoTokenizer


def preprocess_for_model(text, model_name, backend="pytorch"):
    """Generic preprocessing based on model and backend."""

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token

    # Check if a chat template should be used.
    # (should_use_chat_template, apply_model_template and get_max_length
    #  are helper hooks; see the sketches below.)
    if should_use_chat_template(model_name, backend):
        tokens = tokenizer.apply_chat_template(
            [{"role": "user", "content": text}],
            add_generation_prompt=True,
            truncation=True,
            max_length=get_max_length(model_name),
        )
    else:
        # Apply a model-specific template or use the raw text
        formatted_text = apply_model_template(text, model_name)
        tokens = tokenizer.encode(
            formatted_text,
            truncation=True,
            max_length=get_max_length(model_name),
        )

    return tokens
```
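
The example above leaves `should_use_chat_template`, `apply_model_template`, and `get_max_length` abstract. A minimal sketch of the first and last, filled in only from the tables in this guide, could look like the following; `apply_model_template` would dispatch to per-model wrappers like the ones sketched earlier. The model identifiers are illustrative, not taken from the repository.

```python
# Sketch only -- values come from the tables in this guide, and the
# model identifiers are illustrative Hugging Face-style names.
MAX_LENGTHS = {
    "deepseek-ai/DeepSeek-R1": 32768,
    "meta-llama/Llama-3.1-8B-Instruct": 8000,
    "meta-llama/Llama-2-70b-chat-hf": 1024,
    "mistralai/Mixtral-8x7B-Instruct-v0.1": 1024,
    "EleutherAI/gpt-j-6b": 2048,
}

CHAT_TEMPLATE_MODELS = {"deepseek-ai/DeepSeek-R1"}


def get_max_length(model_name: str) -> int:
    """Maximum tokenized prompt length for a model (fallback: 2048)."""
    return MAX_LENGTHS.get(model_name, 2048)


def should_use_chat_template(model_name: str, backend: str) -> bool:
    """Chat templates apply only to chat-template models on PyTorch/vLLM."""
    return model_name in CHAT_TEMPLATE_MODELS and backend in ("pytorch", "vllm")
```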

## Max Context Lengths

| Model | Max Length (tokens) | Notes |
|-------|---------------------|-------|
| DeepSeek-R1 | 32,768 | 32K context |
| Llama 3.1-8B | 8,000 | For preprocessing |
| Llama 2-70B | 1,024 | Limited context |
| Mixtral-8x7B | 1,024 | From dataset.py |
| GPT-J | ~2,048 | Standard GPT-J limit |

## Running Inference

```bash
# Set backend
export MLPERF_BACKEND=pytorch  # or vllm, sglang

# PyTorch backend (distributed)
torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl

# vLLM/SGLang backends
python run_eval.py --input-file data.pkl
```

## Common Issues

1. **Wrong padding side**: Always use `padding_side="left"` for generation
2. **Missing pad token**: Set `pad_token = eos_token`
3. **Backend mismatch**: Ensure preprocessing matches backend requirements
4. **Context overflow**: Respect the model's maximum context length

## Validation

To ensure correct preprocessing:

1. Check tokenized length doesn't exceed the model maximum
2. Verify special tokens are properly placed
3. Test with a few examples before the full dataset
4. Compare against reference outputs

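A small spot-check covering steps 1-3 might look like this; the model, the 32K limit, and the `data.pkl` file name are just the DeepSeek-R1 values from this commit, used as an example.

```python
import pandas as pd
from transformers import AutoTokenizer

MAX_LEN = 32768  # example: DeepSeek-R1's 32K context limit

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
df = pd.read_pickle("data.pkl")  # hypothetical input file

for prompt in df["text_input"].head(5):
    tokens = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        truncation=True,
        max_length=MAX_LEN,
    )
    assert len(tokens) <= MAX_LEN, "tokenized prompt exceeds the context limit"
    print(len(tokens), tokenizer.decode(tokens)[:80])  # eyeball special tokens
```
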
## References

- Model-specific guides in each model's directory
- Backend configuration in `utils/backend_registry.py`
- Tokenization utilities in `utils/tokenization.py`
Lines changed: 48 additions & 105 deletions
````diff
@@ -1,113 +1,56 @@
-# Dataset Preprocessing Documentation - DeepSeek-R1
-
-## Model: DeepSeek-R1
-**Dataset:** Multi-domain Evaluation Ensemble
-**Evaluation Task:** Multi-domain Reasoning and Code Generation
-
-## Data Source
-- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
-- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
-- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
-- **Licenses:**
-  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
-  - MATH500: [MIT](https://opensource.org/license/mit)
-  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
-  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
-  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
-
-## Current Implementation
-
-### Files Available
-- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
-- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
-- **Format:** Preprocessed pickle files ready for evaluation
-
-### Download Process
-```bash
-# Install Rclone
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-
-# Configure access
-rclone config create mlc-inference s3 provider=Cloudflare \
-    access_key_id=f65ba5eef400db161ea49967de89f47b \
-    secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
-    endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-
-# Download datasets
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+# DeepSeek-R1 Preprocessing
+
+## Model Configuration
+- **Model**: `deepseek-ai/DeepSeek-R1`
+- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
+- **Max Length**: 32,768 tokens (32K)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From utils/tokenization.py
+tokenizer = AutoTokenizer.from_pretrained(
+    "deepseek-ai/DeepSeek-R1",
+    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+)
 ```
 
-## Missing Documentation (Addresses Issue #2245)
-
-The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:
-
-### 1. Original Data Sources
-- **Raw Dataset Locations:** Where each component dataset was obtained
-- **Version Information:** Specific versions/commits of source datasets
-- **Access Methods:** How to obtain raw data independently
-
-### 2. Preprocessing Pipeline
-- **Tokenization Method:** Which tokenizer was used and configuration
-- **Input Formatting:** How different dataset formats were standardized
-- **Quality Filtering:** Criteria for sample inclusion/exclusion
-- **Ensemble Strategy:** How multiple datasets were combined
-
-### 3. Dataset Statistics
-- **Sample Counts:** Number of samples from each component dataset
-- **Distribution:** How samples are balanced across domains
-- **Difficulty Levels:** Complexity distribution of included problems
+## Preprocessing Method
 
-### 4. Validation Process
-- **Quality Control:** How preprocessing quality was verified
-- **Consistency Checks:** Validation of format standardization
-- **Error Handling:** How malformed samples were addressed
+The preprocessing varies by backend:
 
-## Adaptation Challenges
-
-**For Different Tokenizers:**
-- Cannot modify tokenization without access to raw data
-- No documentation of original tokenization parameters
-- Unable to test preprocessing consistency
-
-**For Different Models:**
-- Cannot adapt input formatting without preprocessing scripts
-- No guidance on prompt template modifications
-- Unable to reproduce dataset with different filtering criteria
-
-## Recommended Improvements
-
-To fully address issue #2245 and improve reproducibility:
-
-### 1. Raw Data Access
-- Provide scripts to download original datasets
-- Document exact versions and sources used
-- Include data licenses and attribution
-
-### 2. Preprocessing Scripts
-- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)
-- Document tokenization and formatting steps
-- Include quality filtering logic
-
-### 3. Documentation
-- Add detailed preprocessing methodology
-- Include dataset statistics and composition
-- Provide adaptation guidelines
+### PyTorch/vLLM Backends (Chat Template Enabled)
+```python
+# From utils/tokenization.py
+tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    add_generation_prompt=True,
+    max_length=32768,
+    truncation=True
+)
+```
 
-### 4. Validation
-- Include preprocessing verification scripts
-- Document expected outputs and checksums
-- Provide quality metrics
+### SGLang Backend (No Chat Template)
+```python
+tokens = tokenizer.encode(
+    prompt,
+    truncation=True,
+    max_length=32768
+)
+```
 
-## Temporary Workaround
+## Backend Configuration
+| Backend | uses_chat_template | input_type |
+|---------|--------------------|------------|
+| PyTorch | True               | tokenized  |
+| vLLM    | True               | text       |
+| SGLang  | False              | text       |
 
-Until full preprocessing documentation is available:
-1. Use provided preprocessed datasets for standard evaluation
-2. Contact maintainers for specific adaptation requirements
-3. Reference `llama2-70b/processorca.py` for preprocessing patterns
-4. Consider contributing preprocessing scripts based on reverse engineering
+## Dataset Format
+Input data should have a `text_input` column containing the prompts.
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
-- Repository issue #2245 - Discussion of preprocessing documentation gaps
+## Accuracy Target
+```
+"mean-accuracy": 81.3582
+```
````
