Train and evaluate Qwen 3.5-4B + Gemma 4 E4B on Colab #101

@William-Hill

Description

Summary

Run the Colab training notebook for both Qwen 3.5-4B and Gemma 4 E4B. Evaluate both against ship criteria and select the winner.

Depends On

Tasks

  • Upload training data to notebook environment (or clone from repo)
  • Run notebook with both models configured — single Run All execution
  • Review comparison metrics table:
    • json_valid_rate per task (target >= 95%)
    • schema_valid_rate per task (target >= 90%)
    • shap_grounding_rate for narrator (target >= 80%)
    • Inference latency (p50, p95)
  • Download GGUF artifacts (6 files: 3 tasks × 2 models)
  • Select winner model based on metrics
  • If neither passes ship criteria: diagnose, adjust hyperparams, re-run
  • Document results in experiment log
  • Track compute cost (expected: $8-20)
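The validity-rate rows in the metrics table can be computed per task with a small helper. The sketch below is an assumption about how outputs are collected (a list of raw output strings per task), and it approximates schema validation with a required-keys check rather than a full JSON Schema validator:

```python
import json


def json_valid_rate(outputs):
    """Fraction of raw model outputs that parse as JSON at all."""
    return sum(1 for o in outputs if _parses(o)) / len(outputs)


def _parses(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def schema_valid_rate(outputs, required_keys):
    """Fraction of outputs that parse AND contain all required top-level keys.
    A stand-in for full JSON Schema validation (e.g. via the jsonschema package)."""
    ok = 0
    for o in outputs:
        try:
            obj = json.loads(o)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and required_keys.issubset(obj):
            ok += 1
    return ok / len(outputs)


# Hypothetical narrator outputs: 2 of 3 parse, 1 of 3 has both required keys.
outputs = ['{"summary": "ok"}', '{"summary": 1, "evidence": []}', 'not json']
print(json_valid_rate(outputs))
print(schema_valid_rate(outputs, {"summary", "evidence"}))
```

Comparing these per-task rates against the targets above gives the pass/fail column of the comparison table.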

Model Candidates

| Model | Strengths | Risks |
| --- | --- | --- |
| Qwen 3.5-4B | D4BL proven (98.77% schema validity), smaller GGUF | QLoRA discouraged, no native JSON |
| Gemma 4 E4B | Native JSON output, 128K context, full Ollama support | Unproven for fine-tuning, larger GGUF |

Ship Criteria

| Metric | Target | Blocking? |
| --- | --- | --- |
| json_valid_rate (all tasks) | >= 95% | Yes |
| schema_valid_rate (all tasks) | >= 90% | Yes |
| shap_grounding_rate (narrator) | >= 80% | Yes |
| action_specificity (narrator) | LLM-judged | No |
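The blocking rows of this table can be turned into a mechanical pass/fail check per model. A minimal sketch, assuming metrics are gathered as `{task: {metric_name: rate}}` (the task names and sample numbers below are illustrative, not real results):

```python
# Blocking targets from the ship-criteria table (metric name -> minimum rate).
BLOCKING = {
    "json_valid_rate": 0.95,
    "schema_valid_rate": 0.90,
    "shap_grounding_rate": 0.80,  # narrator task only
}


def passes_ship_criteria(metrics):
    """Return True if every task clears every blocking target.
    shap_grounding_rate is only enforced on the narrator task."""
    for task, vals in metrics.items():
        for name, target in BLOCKING.items():
            if name == "shap_grounding_rate" and task != "narrator":
                continue
            if vals.get(name, 0.0) < target:
                return False
    return True


# Hypothetical results for one candidate model across the 3 tasks.
candidate = {
    "narrator":   {"json_valid_rate": 0.99, "schema_valid_rate": 0.97,
                   "shap_grounding_rate": 0.85},
    "summarizer": {"json_valid_rate": 0.98, "schema_valid_rate": 0.96},
    "explainer":  {"json_valid_rate": 0.96, "schema_valid_rate": 0.91},
}
print(passes_ship_criteria(candidate))  # True
```

Running this on both models' metric tables makes the winner selection (and the "neither passes" re-run branch) unambiguous.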

Acceptance Criteria

  • At least one model passes all blocking ship criteria for all 3 tasks
  • GGUF files downloaded and verified loadable in Ollama
  • Experiment results documented with head-to-head metrics comparison
  • Winner selected with documented rationale
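The "verified loadable in Ollama" criterion can be smoke-tested per GGUF file. Ollama imports local GGUF weights via a Modelfile `FROM` line; the helper below is a sketch (the function name and prompt are made up), and it returns False rather than raising when Ollama is missing or the import fails:

```python
import shutil
import subprocess
from pathlib import Path


def gguf_loads_in_ollama(gguf_path, model_name="candidate"):
    """Smoke-test that a downloaded GGUF registers and runs via the Ollama CLI.
    Returns False on any failure, including Ollama not being installed."""
    # Ollama imports local GGUF weights through a Modelfile FROM line.
    Path("Modelfile").write_text(f"FROM {gguf_path}\n")
    if shutil.which("ollama") is None:
        return False
    create = subprocess.run(
        ["ollama", "create", model_name, "-f", "Modelfile"],
        capture_output=True,
    )
    if create.returncode != 0:
        return False
    run = subprocess.run(
        ["ollama", "run", model_name, "Say OK"],
        capture_output=True, text=True,
    )
    return run.returncode == 0 and bool(run.stdout.strip())
```

Looping this over the 6 downloaded artifacts (3 tasks × 2 models) covers the second acceptance bullet.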


Labels

  • area:ai (AI/ML, NLQ features)
  • fine-tuning:student-explainability (Fine-tune Qwen 3.5 for SHAP narrator, summarizer, and explainer tasks)
  • type:spike (Research, investigation)
