Train and evaluate Qwen 3.5-4B + Gemma 4 E4B on Colab #101

@William-Hill

Description

Summary

Run the Colab training notebook for both Qwen 3.5-4B and Gemma 4 E4B. Evaluate both against ship criteria and select the winner.

Depends On

Tasks

  • Upload training data to notebook environment (or clone from repo)
  • Run notebook with both models configured — single Run All execution
  • Review comparison metrics table:
    • json_valid_rate per task (target >= 95%)
    • schema_valid_rate per task (target >= 90%)
    • shap_grounding_rate for narrator (target >= 80%)
    • Inference latency (p50, p95)
  • Download GGUF artifacts (6 files: 3 tasks × 2 models)
  • Select winner model based on metrics
  • If neither passes ship criteria: diagnose, adjust hyperparams, re-run
  • Document results in experiment log
  • Track compute cost (expected: $8-20)
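The validity-rate rows in the metrics table can be computed per task with a small helper. The sketch below is an assumption about how outputs are collected (a list of raw output strings per task), and it approximates schema validation with a required-keys check rather than a full JSON Schema validator:

```python
import json


def json_valid_rate(outputs):
    """Fraction of raw model outputs that parse as JSON at all."""
    return sum(1 for o in outputs if _parses(o)) / len(outputs)


def _parses(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def schema_valid_rate(outputs, required_keys):
    """Fraction of outputs that parse AND contain all required top-level keys.
    A stand-in for full JSON Schema validation (e.g. via the jsonschema package)."""
    ok = 0
    for o in outputs:
        try:
            obj = json.loads(o)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and required_keys.issubset(obj):
            ok += 1
    return ok / len(outputs)


# Hypothetical narrator outputs: 2 of 3 parse, 1 of 3 has both required keys.
outputs = ['{"summary": "ok"}', '{"summary": 1, "evidence": []}', 'not json']
print(json_valid_rate(outputs))
print(schema_valid_rate(outputs, {"summary", "evidence"}))
```

Comparing these per-task rates against the targets above gives the pass/fail column of the comparison table.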

Model Candidates

| Model | Strengths | Risks |
| --- | --- | --- |
| Qwen 3.5-4B | D4BL proven (98.77% schema validity), smaller GGUF | QLoRA discouraged, no native JSON |
| Gemma 4 E4B | Native JSON output, 128K context, full Ollama support | Unproven for fine-tuning, larger GGUF |

Ship Criteria

| Metric | Target | Blocking? |
| --- | --- | --- |
| json_valid_rate (all tasks) | >= 95% | Yes |
| schema_valid_rate (all tasks) | >= 90% | Yes |
| shap_grounding_rate (narrator) | >= 80% | Yes |
| action_specificity (narrator) | LLM-judged | No |
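The blocking rows of this table can be turned into a mechanical pass/fail check per model. A minimal sketch, assuming metrics are gathered as `{task: {metric_name: rate}}` (the task names and sample numbers below are illustrative, not real results):

```python
# Blocking targets from the ship-criteria table (metric name -> minimum rate).
BLOCKING = {
    "json_valid_rate": 0.95,
    "schema_valid_rate": 0.90,
    "shap_grounding_rate": 0.80,  # narrator task only
}


def passes_ship_criteria(metrics):
    """Return True if every task clears every blocking target.
    shap_grounding_rate is only enforced on the narrator task."""
    for task, vals in metrics.items():
        for name, target in BLOCKING.items():
            if name == "shap_grounding_rate" and task != "narrator":
                continue
            if vals.get(name, 0.0) < target:
                return False
    return True


# Hypothetical results for one candidate model across the 3 tasks.
candidate = {
    "narrator":   {"json_valid_rate": 0.99, "schema_valid_rate": 0.97,
                   "shap_grounding_rate": 0.85},
    "summarizer": {"json_valid_rate": 0.98, "schema_valid_rate": 0.96},
    "explainer":  {"json_valid_rate": 0.96, "schema_valid_rate": 0.91},
}
print(passes_ship_criteria(candidate))  # True
```

Running this on both models' metric tables makes the winner selection (and the "neither passes" re-run branch) unambiguous.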

Acceptance Criteria

  • At least one model passes all blocking ship criteria for all 3 tasks
  • GGUF files downloaded and verified loadable in Ollama
  • Experiment results documented with head-to-head metrics comparison
  • Winner selected with documented rationale
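The "verified loadable in Ollama" criterion can be smoke-tested per GGUF file. Ollama imports local GGUF weights via a Modelfile `FROM` line; the helper below is a sketch (the function name and prompt are made up), and it returns False rather than raising when Ollama is missing or the import fails:

```python
import shutil
import subprocess
from pathlib import Path


def gguf_loads_in_ollama(gguf_path, model_name="candidate"):
    """Smoke-test that a downloaded GGUF registers and runs via the Ollama CLI.
    Returns False on any failure, including Ollama not being installed."""
    # Ollama imports local GGUF weights through a Modelfile FROM line.
    Path("Modelfile").write_text(f"FROM {gguf_path}\n")
    if shutil.which("ollama") is None:
        return False
    create = subprocess.run(
        ["ollama", "create", model_name, "-f", "Modelfile"],
        capture_output=True,
    )
    if create.returncode != 0:
        return False
    run = subprocess.run(
        ["ollama", "run", model_name, "Say OK"],
        capture_output=True, text=True,
    )
    return run.returncode == 0 and bool(run.stdout.strip())
```

Looping this over the 6 downloaded artifacts (3 tasks × 2 models) covers the second acceptance bullet.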


Labels

  • area:ai (AI/ML, NLQ features)
  • fine-tuning:student-explainability (Fine-tune Qwen 3.5 for SHAP narrator, summarizer, and explainer tasks)
  • type:spike (Research, investigation)
