# Dataset Preprocessing Documentation - DeepSeek-R1

## Model: DeepSeek-R1
**Dataset:** Multi-domain Evaluation Ensemble
**Evaluation Task:** Multi-domain Reasoning and Code Generation

## Data Source
- **Preprocessed Dataset:** Available via Rclone from a Cloudflare R2 bucket
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
- **Licenses:**
  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
  - MATH500: [MIT](https://opensource.org/license/mit)
  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
## Current Implementation

### Files Available
- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
- **Format:** Preprocessed pickle files ready for evaluation
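Because the pickle files' internal structure is not documented, a practical first step is to unpickle one and inspect what comes out. The sketch below assumes nothing beyond the standard library; `load_eval_dataset` is a hypothetical helper for orientation, not part of the benchmark code:

```python
import pickle

def load_eval_dataset(path):
    """Unpickle a preprocessed dataset file and report what it contains.

    The internal structure of the MLPerf pickles is not documented, so this
    helper only inspects the top-level object rather than assuming a schema.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    size = len(data) if hasattr(data, "__len__") else "unknown"
    print(f"Loaded {type(data).__name__} with {size} entries")
    return data
```

For example, `load_eval_dataset("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl")` after downloading.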

### Download Process
```bash
# Install Rclone
sudo -v ; curl https://rclone.org/install.sh | sudo bash

# Configure access
rclone config create mlc-inference s3 provider=Cloudflare \
  access_key_id=f65ba5eef400db161ea49967de89f47b \
  secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
  endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

# Download datasets
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
```
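No official checksums are published for these files, but once they are, downloads can be verified with a streaming SHA-256 pass that never loads a multi-gigabyte pickle into memory. The expected digest below is deliberately a placeholder; fill it in only from an authoritative source:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder: no official digest has been published for the dataset yet.
# EXPECTED = "<official sha256 digest>"
# assert sha256sum("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl") == EXPECTED
```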

## Missing Documentation (Addresses Issue #2245)

The following preprocessing information is **not currently available**, which makes the dataset difficult to reproduce or adapt:

### 1. Original Data Sources
- **Raw Dataset Locations:** Where each component dataset was obtained
- **Version Information:** Specific versions/commits of the source datasets
- **Access Methods:** How to obtain the raw data independently

### 2. Preprocessing Pipeline
- **Tokenization Method:** Which tokenizer was used, and with what configuration
- **Input Formatting:** How the different dataset formats were standardized
- **Quality Filtering:** Criteria for sample inclusion/exclusion
- **Ensemble Strategy:** How the component datasets were combined
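To illustrate the kind of standardization step that is missing from the documentation, the sketch below maps heterogeneous source records onto one schema. Every field name here (`problem`, `question`, `answer`, `solution`) and the output schema are assumptions; the actual ensemble format used to build the pickle is unknown:

```python
# Hypothetical field names -- the real schema of the MLPerf ensemble
# is undocumented.
PROMPT_FIELDS = ("problem", "question", "prompt")
ANSWER_FIELDS = ("answer", "solution")

def standardize_record(record, source):
    """Map a raw record from one component dataset onto a common schema."""
    prompt = next((record[k] for k in PROMPT_FIELDS if record.get(k)), None)
    if prompt is None:
        raise ValueError(f"malformed {source} record: no prompt field")
    answer = next((record[k] for k in ANSWER_FIELDS if record.get(k)), None)
    return {"dataset": source, "prompt": prompt.strip(), "ground_truth": answer}
```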

### 3. Dataset Statistics
- **Sample Counts:** Number of samples drawn from each component dataset
- **Distribution:** How samples are balanced across domains
- **Difficulty Levels:** Complexity distribution of the included problems
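If the missing per-domain counts were recoverable from a source tag on each sample (an assumption about the schema), the distribution could be summarized like this:

```python
from collections import Counter

def domain_distribution(samples, key="dataset"):
    """Tally samples per component dataset and the share each represents.

    Assumes every sample carries a field naming its source dataset; that
    field is hypothetical, since the pickle schema is undocumented.
    """
    counts = Counter(s[key] for s in samples)
    total = sum(counts.values())
    return {name: (n, round(n / total, 3)) for name, n in counts.items()}
```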

### 4. Validation Process
- **Quality Control:** How preprocessing quality was verified
- **Consistency Checks:** Validation of the format standardization
- **Error Handling:** How malformed samples were addressed
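One plausible shape for the undocumented error-handling step is a filter that separates usable samples from malformed ones, so exclusions can be audited rather than silently discarded. The required field names below are assumptions:

```python
def split_malformed(samples, required=("prompt", "ground_truth")):
    """Partition samples into (kept, dropped) by presence of required fields.

    The field names are hypothetical; returning the dropped records lets a
    pipeline log exactly what was excluded and why.
    """
    kept, dropped = [], []
    for sample in samples:
        (kept if all(sample.get(k) for k in required) else dropped).append(sample)
    return kept, dropped
```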

## Adaptation Challenges

**For Different Tokenizers:**
- Tokenization cannot be changed without access to the raw data
- The original tokenization parameters are undocumented
- Preprocessing consistency cannot be tested

**For Different Models:**
- Input formatting cannot be adapted without the preprocessing scripts
- There is no guidance on prompt template modifications
- The dataset cannot be rebuilt with different filtering criteria

## Recommended Improvements

To fully address issue #2245 and improve reproducibility:

### 1. Raw Data Access
- Provide scripts to download the original datasets
- Document the exact versions and sources used
- Include data licenses and attribution

### 2. Preprocessing Scripts
- Create a preprocessing pipeline (similar to `llama2-70b/processorca.py`)
- Document the tokenization and formatting steps
- Include the quality-filtering logic

### 3. Documentation
- Add a detailed preprocessing methodology
- Include dataset statistics and composition
- Provide adaptation guidelines

### 4. Validation
- Include preprocessing verification scripts
- Document expected outputs and checksums
- Provide quality metrics

## Temporary Workaround

Until full preprocessing documentation is available:
1. Use the provided preprocessed datasets for standard evaluation
2. Contact the maintainers for specific adaptation requirements
3. Reference `llama2-70b/processorca.py` for preprocessing patterns
4. Consider contributing preprocessing scripts based on reverse engineering

## See Also
- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
- Repository issue #2245 - Discussion of preprocessing documentation gaps