# Train and evaluate Qwen 3.5-4B + Gemma 4 E4B on Colab #101
Status: Open
Labels: `area:ai` (AI/ML, NLQ features), `fine-tuning: student-explainability` (Fine-tune Qwen 3.5 for SHAP narrator, summarizer, and explainer tasks), `type:spike` (Research, investigation)
## Summary
Run the Colab training notebook for both Qwen 3.5-4B and Gemma 4 E4B. Evaluate both against ship criteria and select the winner.
## Depends On
- Build Colab training notebook (Unsloth + LoRA) #98 (Colab notebook)
- Distill training pairs for summarizer and explainer #99 (summarizer + explainer training pairs)
- Distill training pairs for SHAP narrator #100 (SHAP narrator training pairs)
## Tasks
- Upload training data to the notebook environment (or clone it from the repo)
- Run the notebook with both models configured, in a single Run All execution
- Review the comparison metrics table:
  - json_valid_rate per task (target >= 95%)
  - schema_valid_rate per task (target >= 90%)
  - shap_grounding_rate for the narrator (target >= 80%)
  - inference latency (p50, p95)
- Download GGUF artifacts (6 files: 3 tasks × 2 models)
- Select the winning model based on the metrics
- If neither model passes the ship criteria: diagnose, adjust hyperparameters, and re-run
- Document results in the experiment log
- Track compute cost (expected: $8-20)
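The metrics in the review step can be computed directly from the raw model outputs. A minimal sketch follows; the notebook's actual implementation may differ, and `REQUIRED_KEYS` is a hypothetical stand-in for the real per-task JSON schemas:

```python
import json
import statistics

# Hypothetical stand-in for the real per-task schema check.
REQUIRED_KEYS = {"summary", "evidence"}

def validity_rates(raw_outputs):
    """Return (json_valid_rate, schema_valid_rate) over raw model outputs."""
    parsed = []
    for text in raw_outputs:
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            parsed.append(None)
    json_valid = [p for p in parsed if p is not None]
    schema_valid = [p for p in json_valid
                    if isinstance(p, dict) and REQUIRED_KEYS <= p.keys()]
    n = len(raw_outputs)
    return len(json_valid) / n, len(schema_valid) / n

def latency_percentiles(latencies_ms):
    """p50 and p95 over per-request latencies, as in the metrics table."""
    qs = statistics.quantiles(latencies_ms, n=100)
    return qs[49], qs[94]  # 50th and 95th percentile cut points
```

Both rates use the full output count as the denominator, so a response that fails to parse counts against schema validity as well.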
## Model Candidates
| Model | Strengths | Risks |
|---|---|---|
| Qwen 3.5-4B | D4BL proven (98.77% schema validity), smaller GGUF | QLoRA discouraged, no native JSON |
| Gemma 4 E4B | Native JSON output, 128K context, full Ollama support | Unproven for fine-tuning, larger GGUF |
## Ship Criteria
| Metric | Target | Blocking? |
|---|---|---|
| json_valid_rate (all tasks) | >= 95% | Yes |
| schema_valid_rate (all tasks) | >= 90% | Yes |
| shap_grounding_rate (narrator) | >= 80% | Yes |
| action_specificity (narrator) | LLM-judged | No |
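The issue sets the shap_grounding_rate target (>= 80%) but not the exact metric, so the sketch below is an assumption: a narration counts as grounded only if every feature it cites appears in that example's top SHAP features.

```python
def shap_grounding_rate(cited_features, shap_top_features):
    """Fraction of narrator outputs whose cited features are all present
    in the corresponding example's top SHAP features. Both arguments are
    parallel lists of feature-name lists (a hypothetical data layout)."""
    grounded = sum(
        1 for cited, top in zip(cited_features, shap_top_features)
        if set(cited) <= set(top)
    )
    return grounded / len(cited_features)
```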
## Acceptance Criteria
- At least one model passes all blocking ship criteria for all 3 tasks
- GGUF files downloaded and verified loadable in Ollama
- Experiment results documented with head-to-head metrics comparison
- Winner selected with documented rationale
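The blocking gate from the Ship Criteria table can be expressed as a single check per model. This is a sketch; the metric names match the table, but the per-task dict layout is an assumption about how the notebook reports results:

```python
# Blocking thresholds from the Ship Criteria table.
FLOORS = {"json_valid_rate": 0.95, "schema_valid_rate": 0.90}
GROUNDING_FLOOR = 0.80  # narrator only

def passes_blocking(metrics_by_task):
    """metrics_by_task: e.g. {"narrator": {"json_valid_rate": 0.97, ...}, ...}"""
    for task, metrics in metrics_by_task.items():
        for name, floor in FLOORS.items():
            if metrics.get(name, 0.0) < floor:
                return False
        if task == "narrator" and metrics.get("shap_grounding_rate", 0.0) < GROUNDING_FLOOR:
            return False
    return True
```

A model wins only if `passes_blocking` holds for all three tasks; action_specificity is non-blocking, so it informs the documented rationale rather than the gate.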