
Commit d02b7cb

Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b
- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses #2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing a template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
1 parent 50de991 commit d02b7cb

3 files changed (+322, -0 lines changed)


PREPROCESSING-TEMPLATE.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
# Dataset Preprocessing Documentation Template

## Purpose
This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.

## Template Structure

### Model: [MODEL_NAME]
**Dataset:** [DATASET_NAME]
**Evaluation Task:** [TASK_DESCRIPTION]

#### Data Source
- **Raw Dataset:** [SOURCE_AND_FORMAT]
- **Download Method:** [HOW_TO_OBTAIN]
- **License:** [LICENSE_INFO]

#### Preprocessing Pipeline

##### 1. Tokenization
```python
# Example based on llama2-70b/processorca.py pattern
from transformers import [TOKENIZER_CLASS]
tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
tokens = tokenizer(text)["input_ids"]
```

##### 2. Filtering Steps
- **Language Filter:** [DESCRIPTION]
- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
- **Quality Filter:** [QUALITY_CRITERIA]
- **Content Filter:** [CONTENT_RESTRICTIONS]

##### 3. Formatting
- **Input Format:** [INPUT_TEMPLATE]
- **Output Format:** [OUTPUT_TEMPLATE]
- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]

##### 4. Sampling Strategy
- **Total Samples:** [NUMBER]
- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
- **Validation Split:** [IF_APPLICABLE]

#### Adaptation Guide
**For Different Tokenizers:**
- Modify tokenizer initialization
- Adjust sequence length limits
- Update special token handling

**For Different Models:**
- Update input/output templates
- Adjust filtering criteria
- Modify prompt formatting

#### Files Generated
- **Main Dataset:** [FILENAME_AND_FORMAT]
- **Calibration Set:** [FILENAME_AND_FORMAT]
- **Metadata:** [FILENAME_AND_FORMAT]

#### Verification
- **Expected Sample Count:** [NUMBER]
- **Checksum/Hash:** [IF_AVAILABLE]
- **Quality Metrics:** [ROUGE/BLEU/OTHER]

---

## Example Applications

### Llama3.1-8b (CNN/DailyMail)
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

#### Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
- **License:** Apache 2.0

#### Preprocessing Pipeline
##### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

##### 2. Formatting
- **Input Template:**
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```

##### 3. Current Gaps
- ❌ No documented filtering steps
- ❌ No sampling strategy explanation
- ❌ No quality control measures
- ❌ No reproducible preprocessing script

### DeepSeek-r1 (Multi-domain Evaluation)
**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
**Evaluation Task:** Multi-domain Reasoning

#### Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **License:** Various (CC0, MIT, CC BY 4.0)

#### Current Gaps
- ❌ No documented preprocessing steps
- ❌ No tokenization details
- ❌ No filtering or sampling explanation
- ❌ No adaptation guide for other models
- ❌ Cannot reproduce from raw sources

---

## Implementation Recommendation

1. **For each model directory**, add `PREPROCESSING.md` following this template
2. **For models with preprocessing scripts**, document the steps in the README
3. **For models using preprocessed data**, provide the original preprocessing methodology
4. **Create common utilities** for preprocessing patterns that can be shared across models
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Dataset Preprocessing Documentation - DeepSeek-R1

## Model: DeepSeek-R1
**Dataset:** Multi-domain Evaluation Ensemble
**Evaluation Task:** Multi-domain Reasoning and Code Generation

## Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
- **Licenses:**
  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
  - MATH500: [MIT](https://opensource.org/license/mit)
  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)

## Current Implementation

### Files Available
- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
- **Format:** Preprocessed pickle files ready for evaluation

### Download Process
```bash
# Install Rclone
sudo -v ; curl https://rclone.org/install.sh | sudo bash

# Configure access
rclone config create mlc-inference s3 provider=Cloudflare \
    access_key_id=f65ba5eef400db161ea49967de89f47b \
    secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
    endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

# Download datasets
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
```
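
Because the internal schema of these pickles is not documented, the snippet below is only an inspection sketch; it assumes the files are standard pandas-readable pickles, which is not confirmed anywhere in the repository docs:

```python
# Inspection sketch: peek inside the downloaded pickle file.
# Assumes a pandas-readable pickle; the actual schema is undocumented (see below).
import pandas as pd

data = pd.read_pickle("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl")
print(type(data))              # confirm the container type
print(len(data))               # expected to match the advertised 4388 samples
if hasattr(data, "columns"):
    print(list(data.columns))  # discover available fields (prompts, references, ...)
```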

## Missing Documentation (Addresses Issue #2245)

The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:

### 1. Original Data Sources
- **Raw Dataset Locations:** Where each component dataset was obtained
- **Version Information:** Specific versions/commits of source datasets
- **Access Methods:** How to obtain raw data independently

### 2. Preprocessing Pipeline
- **Tokenization Method:** Which tokenizer was used and its configuration
- **Input Formatting:** How different dataset formats were standardized
- **Quality Filtering:** Criteria for sample inclusion/exclusion
- **Ensemble Strategy:** How multiple datasets were combined

### 3. Dataset Statistics
- **Sample Counts:** Number of samples from each component dataset
- **Distribution:** How samples are balanced across domains
- **Difficulty Levels:** Complexity distribution of included problems

### 4. Validation Process
- **Quality Control:** How preprocessing quality was verified
- **Consistency Checks:** Validation of format standardization
- **Error Handling:** How malformed samples were addressed

## Adaptation Challenges

**For Different Tokenizers:**
- Cannot modify tokenization without access to raw data
- No documentation of original tokenization parameters
- Unable to test preprocessing consistency

**For Different Models:**
- Cannot adapt input formatting without preprocessing scripts
- No guidance on prompt template modifications
- Unable to reproduce the dataset with different filtering criteria

## Recommended Improvements

To fully address issue #2245 and improve reproducibility:

### 1. Raw Data Access
- Provide scripts to download the original datasets
- Document exact versions and sources used
- Include data licenses and attribution

### 2. Preprocessing Scripts
- Create a preprocessing pipeline (similar to `llama2-70b/processorca.py`)
- Document tokenization and formatting steps
- Include quality filtering logic

### 3. Documentation
- Add detailed preprocessing methodology
- Include dataset statistics and composition
- Provide adaptation guidelines

### 4. Validation
- Include preprocessing verification scripts
- Document expected outputs and checksums
- Provide quality metrics
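
As one concrete form such a verification script could take (the reference values below are placeholders that the maintainers would need to publish; this is a hypothetical sketch, not existing tooling):

```python
# Hypothetical verification sketch: recompute checksum and sample count for the
# published pickle. The expected values are placeholders, not official numbers.
import hashlib
import pickle

PATH = "mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl"
EXPECTED_SHA256 = "<published reference hash>"  # placeholder
EXPECTED_SAMPLES = 4388

with open(PATH, "rb") as f:
    raw = f.read()

print("sha256 matches:", hashlib.sha256(raw).hexdigest() == EXPECTED_SHA256)
print("sample count matches:", len(pickle.loads(raw)) == EXPECTED_SAMPLES)
```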

## Temporary Workaround

Until full preprocessing documentation is available:
1. Use provided preprocessed datasets for standard evaluation
2. Contact maintainers for specific adaptation requirements
3. Reference `llama2-70b/processorca.py` for preprocessing patterns
4. Consider contributing preprocessing scripts based on reverse engineering

## See Also
- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
- Repository issue #2245 - Discussion of preprocessing documentation gaps
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Dataset Preprocessing Documentation - Llama3.1-8B

## Model: Llama3.1-8B
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

## Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")`
- **License:** Apache 2.0
- **Download Script:** `download_cnndm.py`
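
For reference, the download method above can be reproduced with a short standalone snippet. This is a sketch of what `download_cnndm.py` does; the script itself may add caching or other arguments not shown here:

```python
# Minimal sketch: load the raw CNN/DailyMail data with Hugging Face `datasets`.
# download_cnndm.py may differ in details (cache paths, extra splits, etc.).
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
print(dataset)                      # features: article, highlights, id
print(dataset[0]["article"][:200])  # peek at one raw article
```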

## Preprocessing Pipeline

### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

### 2. Input Template
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```
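
Putting the two pieces above together, an illustrative (not official) preprocessing step for a single sample might look like the following; the actual benchmark preprocessing script is not published, which is the gap documented below, and details such as the truncation policy are assumptions:

```python
# Illustrative sketch only: combine the documented tokenizer settings and
# input template for one article. Truncation behavior is an assumption.
from transformers import AutoTokenizer

PROMPT = (
    "Summarize the following news article in 128 tokens. "
    "Please output the summary only, without any other text.\n\n"
    "Article:\n{article}\n\nSummary:"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000

def build_input_ids(article: str) -> list[int]:
    """Format one article with the prompt template and tokenize it."""
    text = PROMPT.format(article=article)
    # Truncating to model_max_length is an assumption, not a documented choice.
    return tokenizer(text, truncation=True)["input_ids"]
```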

### 3. Current Implementation
- **Download:** `download_cnndm.py` loads the CNN/DailyMail dataset
- **Calibration:** `prepare-calibration.py` creates the calibration subset
- **Evaluation:** Uses `evaluation.py` for accuracy assessment

## Missing Documentation (Addresses Issue #2245)

The following preprocessing steps are **not currently documented** but would be needed for full reproducibility:

### 4. Filtering Steps (Recommended)
Based on `llama2-70b/processorca.py` patterns (see the sketch below):
- **Language Filter:** English-only content validation
- **Length Filter:** Input/output sequence length limits
- **Quality Filter:** Remove very short summaries
- **Content Filter:** Handle special characters and formatting
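
A minimal sketch of such filters, with illustrative thresholds (none of these values are official benchmark settings), could look like this:

```python
# Illustrative filter helpers patterned on llama2-70b/processorca.py.
# All thresholds are example values, not settings used by any published benchmark.

def is_english(text: str) -> bool:
    """Crude language check: require the text to be pure ASCII."""
    return text.isascii()

def within_length_limit(input_ids: list[int], max_input_tokens: int = 8000) -> bool:
    """Drop samples whose tokenized prompt exceeds the model context budget."""
    return len(input_ids) <= max_input_tokens

def has_usable_summary(summary: str, min_words: int = 10) -> bool:
    """Drop samples with very short reference summaries."""
    return len(summary.split()) >= min_words

def keep_sample(article: str, summary: str, input_ids: list[int]) -> bool:
    return (
        is_english(article)
        and is_english(summary)
        and within_length_limit(input_ids)
        and has_usable_summary(summary)
    )
```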

### 5. Sampling Strategy (Recommended)
- **Dataset Size:** Specify number of evaluation samples
- **Selection Method:** Random vs stratified sampling
- **Validation:** How to verify preprocessing consistency
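
For example, a seeded random subset keeps the selection reproducible; the sample count and seed below are placeholders, not documented values:

```python
# Illustrative sampling sketch: draw a reproducible evaluation subset.
# N_SAMPLES and the seed are placeholders; the official sample count is undocumented.
from datasets import load_dataset

N_SAMPLES = 1000  # placeholder

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
eval_subset = dataset.shuffle(seed=42).select(range(N_SAMPLES))
print(len(eval_subset))
```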

## Adaptation Guide

**For Different Tokenizers:**
1. Update `model-id` parameter in scripts
2. Adjust `model_max_length` based on tokenizer capabilities
3. Verify special token handling (pad_token, eos_token)

**For Different Models:**
1. Modify input template format
2. Adjust summary length requirements (currently 128 tokens)
3. Update evaluation criteria as needed

## Files Generated
- **Main Dataset:** Downloaded via `download_cnndm.py`
- **Calibration Set:** Generated via `prepare-calibration.py`
- **Format:** Standard CNN/DailyMail format from Hugging Face

## Next Steps for Full Reproducibility

To fully address issue #2245, consider adding:
1. Complete preprocessing script (similar to `llama2-70b/processorca.py`)
2. Documentation of filtering criteria
3. Sampling methodology
4. Quality validation steps

## See Also
- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
