
Commit d02b7cb

Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b
- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses #2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing a template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
1 parent 50de991 commit d02b7cb

3 files changed (+322, -0 lines changed)


PREPROCESSING-TEMPLATE.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
# Dataset Preprocessing Documentation Template

## Purpose
This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.

## Template Structure

### Model: [MODEL_NAME]
**Dataset:** [DATASET_NAME]
**Evaluation Task:** [TASK_DESCRIPTION]

#### Data Source
- **Raw Dataset:** [SOURCE_AND_FORMAT]
- **Download Method:** [HOW_TO_OBTAIN]
- **License:** [LICENSE_INFO]

#### Preprocessing Pipeline

##### 1. Tokenization
```python
# Example based on llama2-70b/processorca.py pattern
from transformers import [TOKENIZER_CLASS]
tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
tokens = tokenizer(text)["input_ids"]
```

##### 2. Filtering Steps
- **Language Filter:** [DESCRIPTION]
- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
- **Quality Filter:** [QUALITY_CRITERIA]
- **Content Filter:** [CONTENT_RESTRICTIONS]

##### 3. Formatting
- **Input Format:** [INPUT_TEMPLATE]
- **Output Format:** [OUTPUT_TEMPLATE]
- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]

##### 4. Sampling Strategy
- **Total Samples:** [NUMBER]
- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
- **Validation Split:** [IF_APPLICABLE]

#### Adaptation Guide
**For Different Tokenizers:**
- Modify tokenizer initialization
- Adjust sequence length limits
- Update special token handling

**For Different Models:**
- Update input/output templates
- Adjust filtering criteria
- Modify prompt formatting

#### Files Generated
- **Main Dataset:** [FILENAME_AND_FORMAT]
- **Calibration Set:** [FILENAME_AND_FORMAT]
- **Metadata:** [FILENAME_AND_FORMAT]

#### Verification
- **Expected Sample Count:** [NUMBER]
- **Checksum/Hash:** [IF_AVAILABLE]
- **Quality Metrics:** [ROUGE/BLEU/OTHER]

---

## Example Applications

### Llama3.1-8b (CNN/DailyMail)
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

#### Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
- **License:** Apache 2.0

#### Preprocessing Pipeline
##### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

##### 2. Formatting
- **Input Template:**
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```

##### 3. Current Gaps
- ❌ No documented filtering steps
- ❌ No sampling strategy explanation
- ❌ No quality control measures
- ❌ No reproducible preprocessing script

### DeepSeek-r1 (Multi-domain Evaluation)
**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
**Evaluation Task:** Multi-domain Reasoning

#### Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **License:** Various (CC0, MIT, CC BY 4.0)

#### Current Gaps
- ❌ No documented preprocessing steps
- ❌ No tokenization details
- ❌ No filtering or sampling explanation
- ❌ No adaptation guide for other models
- ❌ Cannot reproduce from raw sources

---

## Implementation Recommendation

1. **For each model directory**, add `PREPROCESSING.md` following this template
2. **For models with preprocessing scripts**, document the steps in the README
3. **For models using preprocessed data**, provide the original preprocessing methodology
4. **Create common utilities** for preprocessing patterns that can be shared across models
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Dataset Preprocessing Documentation - DeepSeek-R1

## Model: DeepSeek-R1
**Dataset:** Multi-domain Evaluation Ensemble
**Evaluation Task:** Multi-domain Reasoning and Code Generation

## Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
- **Licenses:**
  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
  - MATH500: [MIT](https://opensource.org/license/mit)
  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)

## Current Implementation

### Files Available
- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
- **Format:** Preprocessed pickle files ready for evaluation

### Download Process
```bash
# Install Rclone
sudo -v ; curl https://rclone.org/install.sh | sudo bash

# Configure access
rclone config create mlc-inference s3 provider=Cloudflare \
    access_key_id=f65ba5eef400db161ea49967de89f47b \
    secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
    endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

# Download datasets
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
```
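
Because the internal schema of these pickles is not documented, the snippet below is only an inspection sketch; it assumes the files are standard pandas-readable pickles, which is not confirmed anywhere in the repository docs:

```python
# Inspection sketch: peek inside the downloaded pickle file.
# Assumes a pandas-readable pickle; the actual schema is undocumented (see below).
import pandas as pd

data = pd.read_pickle("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl")
print(type(data))              # confirm the container type
print(len(data))               # expected to match the advertised 4388 samples
if hasattr(data, "columns"):
    print(list(data.columns))  # discover available fields (prompts, references, ...)
```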

## Missing Documentation (Addresses Issue #2245)

The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:

### 1. Original Data Sources
- **Raw Dataset Locations:** Where each component dataset was obtained
- **Version Information:** Specific versions/commits of source datasets
- **Access Methods:** How to obtain raw data independently

### 2. Preprocessing Pipeline
- **Tokenization Method:** Which tokenizer was used and its configuration
- **Input Formatting:** How different dataset formats were standardized
- **Quality Filtering:** Criteria for sample inclusion/exclusion
- **Ensemble Strategy:** How multiple datasets were combined

### 3. Dataset Statistics
- **Sample Counts:** Number of samples from each component dataset
- **Distribution:** How samples are balanced across domains
- **Difficulty Levels:** Complexity distribution of included problems

### 4. Validation Process
- **Quality Control:** How preprocessing quality was verified
- **Consistency Checks:** Validation of format standardization
- **Error Handling:** How malformed samples were addressed

## Adaptation Challenges

**For Different Tokenizers:**
- Cannot modify tokenization without access to raw data
- No documentation of original tokenization parameters
- Unable to test preprocessing consistency

**For Different Models:**
- Cannot adapt input formatting without preprocessing scripts
- No guidance on prompt template modifications
- Unable to reproduce the dataset with different filtering criteria

## Recommended Improvements

To fully address issue #2245 and improve reproducibility:

### 1. Raw Data Access
- Provide scripts to download the original datasets
- Document exact versions and sources used
- Include data licenses and attribution

### 2. Preprocessing Scripts
- Create a preprocessing pipeline (similar to `llama2-70b/processorca.py`)
- Document tokenization and formatting steps
- Include quality filtering logic

### 3. Documentation
- Add detailed preprocessing methodology
- Include dataset statistics and composition
- Provide adaptation guidelines

### 4. Validation
- Include preprocessing verification scripts
- Document expected outputs and checksums
- Provide quality metrics
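
As one concrete form such a verification script could take (the reference values below are placeholders that the maintainers would need to publish; this is a hypothetical sketch, not existing tooling):

```python
# Hypothetical verification sketch: recompute checksum and sample count for the
# published pickle. The expected values are placeholders, not official numbers.
import hashlib
import pickle

PATH = "mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl"
EXPECTED_SHA256 = "<published reference hash>"  # placeholder
EXPECTED_SAMPLES = 4388

with open(PATH, "rb") as f:
    raw = f.read()

print("sha256 matches:", hashlib.sha256(raw).hexdigest() == EXPECTED_SHA256)
print("sample count matches:", len(pickle.loads(raw)) == EXPECTED_SAMPLES)
```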

## Temporary Workaround

Until full preprocessing documentation is available:
1. Use provided preprocessed datasets for standard evaluation
2. Contact maintainers for specific adaptation requirements
3. Reference `llama2-70b/processorca.py` for preprocessing patterns
4. Consider contributing preprocessing scripts based on reverse engineering

## See Also
- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
- Repository issue #2245 - Discussion of preprocessing documentation gaps
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Dataset Preprocessing Documentation - Llama3.1-8B

## Model: Llama3.1-8B
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

## Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")`
- **License:** Apache 2.0
- **Download Script:** `download_cnndm.py`
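
For reference, the download method above can be reproduced with a short standalone snippet. This is a sketch of what `download_cnndm.py` does; the script itself may add caching or other arguments not shown here:

```python
# Minimal sketch: load the raw CNN/DailyMail data with Hugging Face `datasets`.
# download_cnndm.py may differ in details (cache paths, extra splits, etc.).
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
print(dataset)                      # features: article, highlights, id
print(dataset[0]["article"][:200])  # peek at one raw article
```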

## Preprocessing Pipeline

### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

### 2. Input Template
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```
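
Putting the two pieces above together, an illustrative (not official) preprocessing step for a single sample might look like the following; the actual benchmark preprocessing script is not published, which is the gap documented below, and details such as the truncation policy are assumptions:

```python
# Illustrative sketch only: combine the documented tokenizer settings and
# input template for one article. Truncation behavior is an assumption.
from transformers import AutoTokenizer

PROMPT = (
    "Summarize the following news article in 128 tokens. "
    "Please output the summary only, without any other text.\n\n"
    "Article:\n{article}\n\nSummary:"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000

def build_input_ids(article: str) -> list[int]:
    """Format one article with the prompt template and tokenize it."""
    text = PROMPT.format(article=article)
    # Truncating to model_max_length is an assumption, not a documented choice.
    return tokenizer(text, truncation=True)["input_ids"]
```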

### 3. Current Implementation
- **Download:** `download_cnndm.py` loads the CNN/DailyMail dataset
- **Calibration:** `prepare-calibration.py` creates the calibration subset
- **Evaluation:** Uses `evaluation.py` for accuracy assessment

## Missing Documentation (Addresses Issue #2245)

The following preprocessing steps are **not currently documented** but would be needed for full reproducibility:

### 4. Filtering Steps (Recommended)
Based on `llama2-70b/processorca.py` patterns (see the sketch below):
- **Language Filter:** English-only content validation
- **Length Filter:** Input/output sequence length limits
- **Quality Filter:** Remove very short summaries
- **Content Filter:** Handle special characters and formatting
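
A minimal sketch of such filters, with illustrative thresholds (none of these values are official benchmark settings), could look like this:

```python
# Illustrative filter helpers patterned on llama2-70b/processorca.py.
# All thresholds are example values, not settings used by any published benchmark.

def is_english(text: str) -> bool:
    """Crude language check: require the text to be pure ASCII."""
    return text.isascii()

def within_length_limit(input_ids: list[int], max_input_tokens: int = 8000) -> bool:
    """Drop samples whose tokenized prompt exceeds the model context budget."""
    return len(input_ids) <= max_input_tokens

def has_usable_summary(summary: str, min_words: int = 10) -> bool:
    """Drop samples with very short reference summaries."""
    return len(summary.split()) >= min_words

def keep_sample(article: str, summary: str, input_ids: list[int]) -> bool:
    return (
        is_english(article)
        and is_english(summary)
        and within_length_limit(input_ids)
        and has_usable_summary(summary)
    )
```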

### 5. Sampling Strategy (Recommended)
- **Dataset Size:** Specify number of evaluation samples
- **Selection Method:** Random vs stratified sampling
- **Validation:** How to verify preprocessing consistency
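
For example, a seeded random subset keeps the selection reproducible; the sample count and seed below are placeholders, not documented values:

```python
# Illustrative sampling sketch: draw a reproducible evaluation subset.
# N_SAMPLES and the seed are placeholders; the official sample count is undocumented.
from datasets import load_dataset

N_SAMPLES = 1000  # placeholder

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
eval_subset = dataset.shuffle(seed=42).select(range(N_SAMPLES))
print(len(eval_subset))
```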

## Adaptation Guide

**For Different Tokenizers:**
1. Update `model-id` parameter in scripts
2. Adjust `model_max_length` based on tokenizer capabilities
3. Verify special token handling (pad_token, eos_token)

**For Different Models:**
1. Modify input template format
2. Adjust summary length requirements (currently 128 tokens)
3. Update evaluation criteria as needed

## Files Generated
- **Main Dataset:** Downloaded via `download_cnndm.py`
- **Calibration Set:** Generated via `prepare-calibration.py`
- **Format:** Standard CNN/DailyMail format from Hugging Face

## Next Steps for Full Reproducibility

To fully address issue #2245, consider adding:
1. Complete preprocessing script (similar to `llama2-70b/processorca.py`)
2. Documentation of filtering criteria
3. Sampling methodology
4. Quality validation steps

## See Also
- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
