| 
1 |  | -# Dataset Preprocessing Documentation - DeepSeek-R1  | 
2 |  | - | 
3 |  | -## Model: DeepSeek-R1  | 
4 |  | -**Dataset:** Multi-domain Evaluation Ensemble    | 
5 |  | -**Evaluation Task:** Multi-domain Reasoning and Code Generation  | 
6 |  | - | 
7 |  | -## Data Source  | 
8 |  | -- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket  | 
9 |  | -- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`  | 
10 |  | -- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)  | 
11 |  | -- **Licenses:**   | 
12 |  | -  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)  | 
13 |  | -  - MATH500: [MIT](https://opensource.org/license/mit)  | 
14 |  | -  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  | 
15 |  | -  - MMLU-Pro: [MIT](https://opensource.org/license/mit)  | 
16 |  | -  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)  | 
17 |  | - | 
18 |  | -## Current Implementation  | 
19 |  | - | 
20 |  | -### Files Available  | 
21 |  | -- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`  | 
22 |  | -- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`  | 
23 |  | -- **Format:** Preprocessed pickle files ready for evaluation  | 
24 |  | - | 
25 |  | -### Download Process  | 
26 |  | -```bash  | 
27 |  | -# Install Rclone  | 
28 |  | -sudo -v ; curl https://rclone.org/install.sh | sudo bash  | 
29 |  | - | 
30 |  | -# Configure access  | 
31 |  | -rclone config create mlc-inference s3 provider=Cloudflare \  | 
32 |  | -  access_key_id=f65ba5eef400db161ea49967de89f47b \  | 
33 |  | -  secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \  | 
34 |  | -  endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com  | 
35 |  | - | 
36 |  | -# Download datasets  | 
37 |  | -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P  | 
38 |  | -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P  | 
 | 1 | +# DeepSeek-R1 Preprocessing  | 
 | 2 | + | 
 | 3 | +## Model Configuration  | 
 | 4 | +- **Model**: `deepseek-ai/DeepSeek-R1`  | 
 | 5 | +- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`  | 
 | 6 | +- **Max Length**: 32,768 tokens (32K)  | 
 | 7 | + | 
 | 8 | +## Tokenization  | 
 | 9 | +```python  | 
 | 10 | +from transformers import AutoTokenizer  | 
 | 11 | + | 
 | 12 | +# From utils/tokenization.py  | 
 | 13 | +tokenizer = AutoTokenizer.from_pretrained(  | 
 | 14 | +    "deepseek-ai/DeepSeek-R1",  | 
 | 15 | +    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"  | 
 | 16 | +)  | 
39 | 17 | ```  | 
40 | 18 | 
41 |  | -## Missing Documentation (Addresses Issue #2245)  | 
42 |  | - | 
43 |  | -The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:  | 
44 |  | - | 
45 |  | -### 1. Original Data Sources  | 
46 |  | -- **Raw Dataset Locations:** Where each component dataset was obtained  | 
47 |  | -- **Version Information:** Specific versions/commits of source datasets  | 
48 |  | -- **Access Methods:** How to obtain raw data independently  | 
49 |  | - | 
50 |  | -### 2. Preprocessing Pipeline  | 
51 |  | -- **Tokenization Method:** Which tokenizer was used and configuration  | 
52 |  | -- **Input Formatting:** How different dataset formats were standardized  | 
53 |  | -- **Quality Filtering:** Criteria for sample inclusion/exclusion  | 
54 |  | -- **Ensemble Strategy:** How multiple datasets were combined  | 
55 |  | - | 
56 |  | -### 3. Dataset Statistics  | 
57 |  | -- **Sample Counts:** Number of samples from each component dataset  | 
58 |  | -- **Distribution:** How samples are balanced across domains  | 
59 |  | -- **Difficulty Levels:** Complexity distribution of included problems  | 
 | 19 | +## Preprocessing Method  | 
60 | 20 | 
61 |  | -### 4. Validation Process  | 
62 |  | -- **Quality Control:** How preprocessing quality was verified  | 
63 |  | -- **Consistency Checks:** Validation of format standardization  | 
64 |  | -- **Error Handling:** How malformed samples were addressed  | 
 | 21 | +The preprocessing varies by backend:  | 
65 | 22 | 
66 |  | -## Adaptation Challenges  | 
67 |  | - | 
68 |  | -**For Different Tokenizers:**  | 
69 |  | -- Cannot modify tokenization without access to raw data  | 
70 |  | -- No documentation of original tokenization parameters  | 
71 |  | -- Unable to test preprocessing consistency  | 
72 |  | - | 
73 |  | -**For Different Models:**  | 
74 |  | -- Cannot adapt input formatting without preprocessing scripts  | 
75 |  | -- No guidance on prompt template modifications  | 
76 |  | -- Unable to reproduce dataset with different filtering criteria  | 
77 |  | - | 
78 |  | -## Recommended Improvements  | 
79 |  | - | 
80 |  | -To fully address issue #2245 and improve reproducibility:  | 
81 |  | - | 
82 |  | -### 1. Raw Data Access  | 
83 |  | -- Provide scripts to download original datasets  | 
84 |  | -- Document exact versions and sources used  | 
85 |  | -- Include data licenses and attribution  | 
86 |  | - | 
87 |  | -### 2. Preprocessing Scripts  | 
88 |  | -- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)  | 
89 |  | -- Document tokenization and formatting steps  | 
90 |  | -- Include quality filtering logic  | 
91 |  | - | 
92 |  | -### 3. Documentation  | 
93 |  | -- Add detailed preprocessing methodology  | 
94 |  | -- Include dataset statistics and composition  | 
95 |  | -- Provide adaptation guidelines  | 
 | 23 | +### PyTorch/vLLM Backends (Chat Template Enabled)  | 
 | 24 | +```python  | 
 | 25 | +# From utils/tokenization.py  | 
 | 26 | +tokens = tokenizer.apply_chat_template(  | 
 | 27 | +    [{"role": "user", "content": prompt}],  | 
 | 28 | +    add_generation_prompt=True,  | 
 | 29 | +    max_length=32768,  | 
 | 30 | +    truncation=True  | 
 | 31 | +)  | 
 | 32 | +```  | 
96 | 33 | 
97 |  | -### 4. Validation  | 
98 |  | -- Include preprocessing verification scripts  | 
99 |  | -- Document expected outputs and checksums  | 
100 |  | -- Provide quality metrics  | 
 | 34 | +### SGLang Backend (No Chat Template)  | 
 | 35 | +```python  | 
 | 36 | +tokens = tokenizer.encode(  | 
 | 37 | +    prompt,  | 
 | 38 | +    truncation=True,  | 
 | 39 | +    max_length=32768  | 
 | 40 | +)  | 
 | 41 | +```  | 
101 | 42 | 
102 |  | -## Temporary Workaround  | 
 | 43 | +## Backend Configuration  | 
 | 44 | +| Backend | uses_chat_template | input_type |  | 
 | 45 | +|---------|-------------------|------------|  | 
 | 46 | +| PyTorch | True | tokenized |  | 
 | 47 | +| vLLM | True | text |  | 
 | 48 | +| SGLang | False | text |  | 
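The two tokenization paths above, keyed off the `uses_chat_template` column of this table, can be sketched as a single dispatch helper. This is an illustrative sketch under stated assumptions, not the repository's code: the function name `tokenize_prompt`, the lowercase backend keys, and the stand-alone mapping are all inventions for the example; only the two tokenizer calls come from the snippets above.

```python
# Hypothetical dispatch helper combining the two tokenization paths shown above.
# The backend keys mirror the table; names and structure are assumptions.
USES_CHAT_TEMPLATE = {"pytorch": True, "vllm": True, "sglang": False}
MAX_LENGTH = 32768  # 32K token limit from the model configuration

def tokenize_prompt(tokenizer, prompt: str, backend: str) -> list:
    if USES_CHAT_TEMPLATE[backend]:
        # PyTorch/vLLM path: wrap the prompt in a single user chat turn.
        return tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            max_length=MAX_LENGTH,
            truncation=True,
        )
    # SGLang path: plain encoding, no chat template.
    return tokenizer.encode(prompt, truncation=True, max_length=MAX_LENGTH)
```

A helper like this keeps the backend differences in one place instead of duplicating the truncation settings per backend.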
103 | 49 | 
104 |  | -Until full preprocessing documentation is available:  | 
105 |  | -1. Use provided preprocessed datasets for standard evaluation  | 
106 |  | -2. Contact maintainers for specific adaptation requirements  | 
107 |  | -3. Reference `llama2-70b/processorca.py` for preprocessing patterns  | 
108 |  | -4. Consider contributing preprocessing scripts based on reverse engineering  | 
 | 50 | +## Dataset Format  | 
 | 51 | +Input data should have a `text_input` column containing the prompts.  | 
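As a minimal sketch, the preprocessed pickle can be loaded and checked for that column. This assumes the pickle deserializes to a pandas DataFrame; the helper name `load_prompts` is an invention for the example, and only the `text_input` column name comes from this document.

```python
import pandas as pd

def load_prompts(path: str) -> list:
    """Load prompts from a preprocessed pickle, assumed to hold a DataFrame."""
    df = pd.read_pickle(path)
    if "text_input" not in df.columns:
        raise ValueError("dataset must provide a text_input column")
    return df["text_input"].tolist()
```

Usage would look like `load_prompts("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl")`, with the filename taken from the download section above.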
109 | 52 | 
110 |  | -## See Also  | 
111 |  | -- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing  | 
112 |  | -- `PREPROCESSING-TEMPLATE.md` - Standard template for future models  | 
113 |  | -- Repository issue #2245 - Discussion of preprocessing documentation gaps  | 
 | 53 | +## Accuracy Target  | 
 | 54 | +```  | 
 | 55 | +"mean-accuracy": 81.3582  | 
 | 56 | +```  | 
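A measured result can be compared against this target with a one-line check. The sketch below is an assumption, not part of this file: the 99%-of-reference pass threshold is a common MLPerf convention, and the actual benchmark configuration may use a different rule.

```python
# Reference mean accuracy from the accuracy target above.
TARGET_MEAN_ACCURACY = 81.3582

def passes_accuracy(measured: float,
                    target: float = TARGET_MEAN_ACCURACY,
                    threshold: float = 0.99) -> bool:
    """Assumed pass rule: measured mean accuracy >= 99% of the reference."""
    return measured >= threshold * target

# e.g. passes_accuracy(81.0) -> True, passes_accuracy(75.0) -> False
```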