
Commit aa8b5da

Fix preprocessing documentation with verified implementations
Update PREPROCESSING.md files with correct information based on actual code.

- DeepSeek-R1: Use apply_chat_template, 32K context
- Llama 3.1-8B: Use instruction template for summarization
- Add general preprocessing guide and examples
1 parent c6aa29a commit aa8b5da

4 files changed: +388, −173 lines changed

language/PREPROCESSING_GUIDE.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# MLCommons Inference - General Preprocessing Guide

## Overview

This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:

1. Model architecture
2. Backend choice (PyTorch, vLLM, SGLang)
3. Task type (summarization, Q&A, etc.)

## Common Tokenizer Setup Pattern

Most models follow this pattern:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"  # Critical for generation
tokenizer.pad_token = tokenizer.eos_token
```

## Backend Dependencies

Different backends have different preprocessing requirements:

| Backend | Input Type | Chat Template Support | Use Case |
|---------|------------|-----------------------|----------|
| PyTorch | Tokenized  | Varies by model       | Distributed inference |
| vLLM    | Text       | Varies by model       | High-throughput serving |
| SGLang  | Text       | Usually disabled      | Optimized serving |

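To make the table concrete, here is a minimal sketch of how such backend differences can be looked up in code. The dictionary layout and the `uses_chat_template` values (shown for DeepSeek-R1) are assumptions for illustration; the actual registry lives in `utils/backend_registry.py` and may be organized differently.

```python
# Illustrative sketch only -- the real utils/backend_registry.py may differ.
# uses_chat_template values here are the ones listed for DeepSeek-R1;
# other models vary.
BACKEND_REGISTRY = {
    "pytorch": {"input_type": "tokenized", "uses_chat_template": True},
    "vllm": {"input_type": "text", "uses_chat_template": True},
    "sglang": {"input_type": "text", "uses_chat_template": False},
}


def backend_expects_tokens(backend: str) -> bool:
    """Return True if the backend consumes pre-tokenized input."""
    return BACKEND_REGISTRY[backend]["input_type"] == "tokenized"
```
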
## Dataset Format

All models expect datasets with these common fields:

```python
{
    'text_input': str,       # Raw prompt text (required)
    'tok_input': List[int],  # Pre-tokenized input (optional)
    'output': str,           # Expected output for evaluation
}
```

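As a quick sanity check that an input file matches this layout, something along these lines can be used. The pandas-pickle assumption and the `data.pkl` name are taken from the `--input-file data.pkl` usage shown later, not from a documented format.

```python
import pandas as pd

# Assumed layout: a pickled pandas DataFrame with the columns listed above.
df = pd.read_pickle("data.pkl")  # hypothetical input file

missing = {"text_input"} - set(df.columns)
if missing:
    raise ValueError(f"Input file is missing required columns: {missing}")

print(df["text_input"].head())
```
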
## Model-Specific Preprocessing

### Models Using Chat Templates
- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
- **Potential others**: Check `uses_chat_template` in the backend registry

### Models Using Simple Templates
- **Llama 3.1-8B**: Instruction format for summarization
- **Llama 2-70B**: Custom format with `[INST]` markers (see the sketch below)
- **Mixtral-8x7B**: Simple instruction format

### Models Using Raw Prompts
- **GPT-J**: Completion-style, no special formatting

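For the simple-template models, the formatting amounts to wrapping the raw text in a fixed prompt string, roughly as sketched below. The `[INST]` wrapper follows the well-known Llama 2 chat convention; the summarization prompt is a placeholder, since the exact strings used by the benchmark live in each model's own directory.

```python
def wrap_llama2_instruction(text: str) -> str:
    # Llama 2 chat-style user turn; the benchmark's template may also
    # include a system prompt.
    return f"[INST] {text} [/INST]"


def wrap_summarization_instruction(article: str) -> str:
    # Placeholder instruction-style prompt for summarization tasks.
    return f"Summarize the following text:\n\n{article}\n\nSummary:"
```
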
## Preprocessing Steps

1. **Load the tokenizer** with appropriate configuration
2. **Apply model-specific formatting** (chat template or instruction format)
3. **Tokenize** with proper truncation and max length
4. **Handle padding** (left-side for generation models)

## Example: Generic Preprocessing Function

```python
from transformers import AutoTokenizer


def preprocess_for_model(text, model_name, backend="pytorch"):
    """Generic preprocessing based on model and backend."""

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token

    # Check if a chat template should be used.
    # (should_use_chat_template, apply_model_template and get_max_length
    #  are helper hooks; see the sketches below.)
    if should_use_chat_template(model_name, backend):
        tokens = tokenizer.apply_chat_template(
            [{"role": "user", "content": text}],
            add_generation_prompt=True,
            truncation=True,
            max_length=get_max_length(model_name),
        )
    else:
        # Apply a model-specific template or use the raw text
        formatted_text = apply_model_template(text, model_name)
        tokens = tokenizer.encode(
            formatted_text,
            truncation=True,
            max_length=get_max_length(model_name),
        )

    return tokens
```
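
The example above leaves `should_use_chat_template`, `apply_model_template`, and `get_max_length` abstract. A minimal sketch of the first and last, filled in only from the tables in this guide, could look like the following; `apply_model_template` would dispatch to per-model wrappers like the ones sketched earlier. The model identifiers are illustrative, not taken from the repository.

```python
# Sketch only -- values come from the tables in this guide, and the
# model identifiers are illustrative Hugging Face-style names.
MAX_LENGTHS = {
    "deepseek-ai/DeepSeek-R1": 32768,
    "meta-llama/Llama-3.1-8B-Instruct": 8000,
    "meta-llama/Llama-2-70b-chat-hf": 1024,
    "mistralai/Mixtral-8x7B-Instruct-v0.1": 1024,
    "EleutherAI/gpt-j-6b": 2048,
}

CHAT_TEMPLATE_MODELS = {"deepseek-ai/DeepSeek-R1"}


def get_max_length(model_name: str) -> int:
    """Maximum tokenized prompt length for a model (fallback: 2048)."""
    return MAX_LENGTHS.get(model_name, 2048)


def should_use_chat_template(model_name: str, backend: str) -> bool:
    """Chat templates apply only to chat-template models on PyTorch/vLLM."""
    return model_name in CHAT_TEMPLATE_MODELS and backend in ("pytorch", "vllm")
```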

## Max Context Lengths

| Model | Max Length (tokens) | Notes |
|-------|---------------------|-------|
| DeepSeek-R1 | 32,768 | 32K context |
| Llama 3.1-8B | 8,000 | For preprocessing |
| Llama 2-70B | 1,024 | Limited context |
| Mixtral-8x7B | 1,024 | From dataset.py |
| GPT-J | ~2,048 | Standard GPT-J limit |

## Running Inference

```bash
# Set backend
export MLPERF_BACKEND=pytorch  # or vllm, sglang

# PyTorch backend (distributed)
torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl

# vLLM/SGLang backends
python run_eval.py --input-file data.pkl
```

## Common Issues

1. **Wrong padding side**: Always use `padding_side="left"` for generation
2. **Missing pad token**: Set `pad_token = eos_token`
3. **Backend mismatch**: Ensure preprocessing matches backend requirements
4. **Context overflow**: Respect the model's maximum context length

## Validation

To ensure correct preprocessing:

1. Check tokenized length doesn't exceed the model maximum
2. Verify special tokens are properly placed
3. Test with a few examples before the full dataset
4. Compare against reference outputs

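A small spot-check covering steps 1-3 might look like this; the model, the 32K limit, and the `data.pkl` file name are just the DeepSeek-R1 values from this commit, used as an example.

```python
import pandas as pd
from transformers import AutoTokenizer

MAX_LEN = 32768  # example: DeepSeek-R1's 32K context limit

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
df = pd.read_pickle("data.pkl")  # hypothetical input file

for prompt in df["text_input"].head(5):
    tokens = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        truncation=True,
        max_length=MAX_LEN,
    )
    assert len(tokens) <= MAX_LEN, "tokenized prompt exceeds the context limit"
    print(len(tokens), tokenizer.decode(tokens)[:80])  # eyeball special tokens
```
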
## References

- Model-specific guides in each model's directory
- Backend configuration in `utils/backend_registry.py`
- Tokenization utilities in `utils/tokenization.py`
Lines changed: 48 additions & 105 deletions
````diff
@@ -1,113 +1,56 @@
-# Dataset Preprocessing Documentation - DeepSeek-R1
-
-## Model: DeepSeek-R1
-**Dataset:** Multi-domain Evaluation Ensemble
-**Evaluation Task:** Multi-domain Reasoning and Code Generation
-
-## Data Source
-- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
-- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
-- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
-- **Licenses:**
-  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
-  - MATH500: [MIT](https://opensource.org/license/mit)
-  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
-  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
-  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
-
-## Current Implementation
-
-### Files Available
-- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
-- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
-- **Format:** Preprocessed pickle files ready for evaluation
-
-### Download Process
-```bash
-# Install Rclone
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-
-# Configure access
-rclone config create mlc-inference s3 provider=Cloudflare \
-    access_key_id=f65ba5eef400db161ea49967de89f47b \
-    secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
-    endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-
-# Download datasets
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+# DeepSeek-R1 Preprocessing
+
+## Model Configuration
+- **Model**: `deepseek-ai/DeepSeek-R1`
+- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
+- **Max Length**: 32,768 tokens (32K)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From utils/tokenization.py
+tokenizer = AutoTokenizer.from_pretrained(
+    "deepseek-ai/DeepSeek-R1",
+    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+)
 ```
 
-## Missing Documentation (Addresses Issue #2245)
-
-The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:
-
-### 1. Original Data Sources
-- **Raw Dataset Locations:** Where each component dataset was obtained
-- **Version Information:** Specific versions/commits of source datasets
-- **Access Methods:** How to obtain raw data independently
-
-### 2. Preprocessing Pipeline
-- **Tokenization Method:** Which tokenizer was used and configuration
-- **Input Formatting:** How different dataset formats were standardized
-- **Quality Filtering:** Criteria for sample inclusion/exclusion
-- **Ensemble Strategy:** How multiple datasets were combined
-
-### 3. Dataset Statistics
-- **Sample Counts:** Number of samples from each component dataset
-- **Distribution:** How samples are balanced across domains
-- **Difficulty Levels:** Complexity distribution of included problems
+## Preprocessing Method
 
-### 4. Validation Process
-- **Quality Control:** How preprocessing quality was verified
-- **Consistency Checks:** Validation of format standardization
-- **Error Handling:** How malformed samples were addressed
+The preprocessing varies by backend:
 
-## Adaptation Challenges
-
-**For Different Tokenizers:**
-- Cannot modify tokenization without access to raw data
-- No documentation of original tokenization parameters
-- Unable to test preprocessing consistency
-
-**For Different Models:**
-- Cannot adapt input formatting without preprocessing scripts
-- No guidance on prompt template modifications
-- Unable to reproduce dataset with different filtering criteria
-
-## Recommended Improvements
-
-To fully address issue #2245 and improve reproducibility:
-
-### 1. Raw Data Access
-- Provide scripts to download original datasets
-- Document exact versions and sources used
-- Include data licenses and attribution
-
-### 2. Preprocessing Scripts
-- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)
-- Document tokenization and formatting steps
-- Include quality filtering logic
-
-### 3. Documentation
-- Add detailed preprocessing methodology
-- Include dataset statistics and composition
-- Provide adaptation guidelines
+### PyTorch/vLLM Backends (Chat Template Enabled)
+```python
+# From utils/tokenization.py
+tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    add_generation_prompt=True,
+    max_length=32768,
+    truncation=True
+)
+```
 
-### 4. Validation
-- Include preprocessing verification scripts
-- Document expected outputs and checksums
-- Provide quality metrics
+### SGLang Backend (No Chat Template)
+```python
+tokens = tokenizer.encode(
+    prompt,
+    truncation=True,
+    max_length=32768
+)
+```
 
-## Temporary Workaround
+## Backend Configuration
+| Backend | uses_chat_template | input_type |
+|---------|--------------------|------------|
+| PyTorch | True               | tokenized  |
+| vLLM    | True               | text       |
+| SGLang  | False              | text       |
 
-Until full preprocessing documentation is available:
-1. Use provided preprocessed datasets for standard evaluation
-2. Contact maintainers for specific adaptation requirements
-3. Reference `llama2-70b/processorca.py` for preprocessing patterns
-4. Consider contributing preprocessing scripts based on reverse engineering
+## Dataset Format
+Input data should have a `text_input` column containing the prompts.
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
-- Repository issue #2245 - Discussion of preprocessing documentation gaps
+## Accuracy Target
+```
+"mean-accuracy": 81.3582
+```
````
