This roadmap details how the Thinker framework evolves from the current validation-first Hugging Face loop into the orchestration layer for both local and Tinker-backed experiments. It is written for reviewers who need to audit the full workflow—commands, configs, tests, and documentation are all referenced explicitly.
- Thinker CLI - Full orchestration for validation, training, evaluation, antagonist analysis, and data setup
- Tinker Backend - ✅ COMPLETE: Production-ready integration with citation validation, manifest generation, telemetry
- Antagonist MVP - ✅ COMPLETE: 92% flagging rate, 4 issue types, 22 tests, comprehensive documentation
- 4-Stage Semantic Validation - ✅ OPERATIONAL: Citation → Entailment → Similarity → Paraphrase
- Topology Instrumentation - ✅ WORKING: β₁ (Betti numbers), chirality, Fisher-Rao distance
- Dashboard & Telemetry - ✅ COMPLETE: Multi-run visualization, training/eval/antagonist charts
- Datasets:
- SciFact: Fully automated (download + convert + validation)
- FEVER: Helper pulls from Zenodo mirrors, conversion script supports JSONL wiki shards
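The β₁ instrumentation above counts independent cycles. For a graph, the first Betti number is β₁ = E − V + C (edges minus vertices plus connected components), which a minimal, dependency-free sketch can compute; the graph representation here is illustrative, not the framework's actual data structure:

```python
def betti_1(num_vertices: int, edges: list[tuple[int, int]]) -> int:
    """First Betti number of an undirected graph: beta_1 = E - V + C."""
    parent = list(range(num_vertices))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # union the two components

    components = len({find(v) for v in range(num_vertices)})
    return len(edges) - num_vertices + components

# A triangle has one independent cycle; a tree has none.
print(betti_1(3, [(0, 1), (1, 2), (2, 0)]))  # 1
print(betti_1(3, [(0, 1), (1, 2)]))          # 0
```

β₁ = 0 across all evaluation samples (as reported below) would mean every extracted claim graph is a forest.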
- Citation Hallucination Fix - Training with `citation_validity_weight=5.0` to eliminate HIGH severity CITATION_INVALID cases
  - Status: Code committed (commit `e500bb2`), training run pending
  - Previous attempt (weight=2.0) FAILED to eliminate hallucinations
  - Success criteria: Eliminate 2 HIGH severity flags, mean entailment ≥0.50, overall pass ≥45%
- Synthesizer Agent - Blocked until Proposer reaches ≥60% semantic quality (currently 34-38%)
- Blocking issues: Citation hallucinations, weak entailment (0.395-0.448)
- Unblocking criteria: Mean entailment ≥0.60, HIGH severity flags eliminated
Schema Compliance: 100% ✅
Citation Accuracy: 96% ✅
Mean Entailment: 0.448 ⚠️ (target ≥0.75)
Overall Semantic Pass: 38% ⚠️ (target ≥60%)
Antagonist Flags: 46/50 (92%), 2 HIGH severity
β₁ (cycles): 0 across all samples
Mean Chirality: 0.561
- Delete corrupted raw files and rerun `python -m thinker.cli data setup --dataset fever --skip-validation` to confirm a clean processed JSONL.
- Add FEVER fixtures/tests:
  - Sample FEVER claims + wiki lines under `cns-support-models/tests/fixtures/fever_*`.
  - Tests similar to SciFact (CLAIM parsing, converter CLI).
- Create `thinker/configs/pipeline_fever.yaml` pointing at FEVER paths (data validation, training, evaluation).
- Update docs (README, DATA_PIPELINE) to mention the FEVER config and commands.
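A hypothetical shape for `thinker/configs/pipeline_fever.yaml`, assuming it mirrors the existing SciFact pipeline config — all keys and paths below are illustrative, so consult the SciFact config for the actual schema:

```yaml
# Illustrative only — key names assume the SciFact config's structure.
dataset:
  name: fever
  raw_dir: data/raw/fever
  processed_path: data/processed/fever/claims.jsonl
validation:
  mode: embedding
  similarity_threshold: 0.7
training:
  backend: hf_peft
evaluation:
  num_samples: 50
```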
- Extend `DatasetValidationConfig` with:
  - Regex checks, numeric bounds, Hypothesis-driven validators.
  - Per-dataset defaults (SciFact vs FEVER).
- Add CLI options for the dataset validator script to select dataset-specific schemas.
- ✅ Tinker backend functional via shim to `cns-support-models/scripts/train_claim_extractor.py`
- ✅ Citation validation integrated with configurable penalty weights
- ✅ Manifest generation (`runs/latest_tinker_adapter.json`)
- ✅ Provenance logging to `runs/train_*.json`
- ✅ Telemetry: loss, citation_invalid_rate, timestamps at each step
- Future enhancement (P2): Native `TinkerTrainingBackend` using `tinker.ServiceClient` directly
- ✅ Tinker sampling client integrated in `thinker/evaluation.py`
- ✅ Loads tokenizer via API, samples from the adapter in the manifest
- ✅ Logs job ID, sample prompts, completions, metrics
- ✅ Per-sample topology/chirality instrumentation
- ✅ Live progress logging (`sample N/50 | entailment | β₁ | chirality`)
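The live progress line could be produced by a formatter along these lines; this is a sketch, and the actual field names in `thinker/evaluation.py` may differ:

```python
def progress_line(i: int, total: int, entailment: float,
                  beta1: int, chirality: float) -> str:
    """Render one per-sample progress line for the evaluation loop."""
    return (f"sample {i}/{total} | entailment={entailment:.3f} "
            f"| β₁={beta1} | chirality={chirality:.3f}")

print(progress_line(3, 50, 0.448, 0, 0.561))
# sample 3/50 | entailment=0.448 | β₁=0 | chirality=0.561
```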
- ✅ 22 tests for Antagonist
- ✅ Citation validation: 29 tests
- ✅ Integration tested via real training runs
- ⏳ Future: Mock Tinker ServiceClient for unit tests (P2)
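The planned P2 mock could follow the standard `unittest.mock` pattern; the `sample` method and return shape on the fake client below are placeholders, since the real `tinker.ServiceClient` surface is not specified in this document:

```python
from unittest.mock import MagicMock

# Stand-in for tinker.ServiceClient; method name and payload are hypothetical.
mock_client = MagicMock()
mock_client.sample.return_value = {"completion": "CLAIM: ...", "job_id": "job-123"}

def evaluate_one(client, prompt: str) -> str:
    """Toy evaluation step that would normally call the live service."""
    result = client.sample(prompt=prompt)
    return result["completion"]

output = evaluate_one(mock_client, "Extract claims from: ...")
assert output.startswith("CLAIM:")
mock_client.sample.assert_called_once_with(prompt="Extract claims from: ...")
print("mock-backed unit test passed")
```

This lets the evaluation path be unit-tested offline, with no API key or network access.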
```bash
python -m thinker.cli data setup --dataset scifact --validation-mode embedding --similarity-threshold 0.7
python -m thinker.cli validate
python -m thinker.cli train --backend hf_peft
python -m thinker.cli eval
```
- Log metrics + configs in `cns-support-models/notes/claim_extractor.md`
- Same flow but using the FEVER config; evaluate the impact of the larger dataset and NEI cases.
- Run identical config on HF and Tinker backends; compare metrics, runtime, resource usage.
- Document differences in notes + run metadata.
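Comparing the two backends' runs could be as simple as diffing the metric dictionaries from each run's JSON metadata; the metric keys below are assumptions about what those files contain:

```python
import json
import os
import tempfile

def compare_runs(path_a: str, path_b: str,
                 keys=("loss", "citation_invalid_rate")) -> dict:
    """Load two run-metadata JSON files and report per-metric deltas (b - a)."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {k: b[k] - a[k] for k in keys if k in a and k in b}

# Demo with stand-in run files (real runs would live under runs/).
paths = []
for run in ({"loss": 1.2, "citation_invalid_rate": 0.10},
            {"loss": 0.9, "citation_invalid_rate": 0.04}):
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump(run, f)
    f.close()
    paths.append(f.name)

delta = compare_runs(paths[0], paths[1])
print(delta)  # both metrics decreased on the second run
for p in paths:
    os.unlink(p)
```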
- Add plugin system for custom validators (e.g., relation semantics, critic-specific checks).
- Support Hypothesis property tests triggered via Thinker config.
- Integrate logic/grounding critics once they exist, ensuring Thinker enforces the same validation gate before critic training.
- Provide CLI flags to run critic training/evaluation.
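The validator plugin system could use a simple decorator-based registry, so pipeline configs reference validators by name; everything here is an illustrative sketch, not existing Thinker code:

```python
from typing import Callable

VALIDATORS: dict[str, Callable[[dict], bool]] = {}

def register_validator(name: str):
    """Decorator registering a record-level validator under a config-referencable name."""
    def wrap(fn: Callable[[dict], bool]):
        VALIDATORS[name] = fn
        return fn
    return wrap

@register_validator("relation_semantics")
def relation_semantics(record: dict) -> bool:
    # Placeholder check: a relation must name both endpoints.
    return bool(record.get("subject")) and bool(record.get("object"))

def run_validators(record: dict, enabled: list[str]) -> dict[str, bool]:
    """Run only the validators a pipeline config enables."""
    return {name: VALIDATORS[name](record) for name in enabled}

print(run_validators({"subject": "aspirin", "object": "inflammation"},
                     ["relation_semantics"]))  # {'relation_semantics': True}
```

Critic-specific checks would register the same way, keeping the validation gate uniform across Proposer and critic training.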
- README – now references Thinker CLI; keep updated when new commands/configs land.
- docs/thinker/THINKER_SPEC.md – update whenever CLI/validation contracts change.
- docs/thinker/DATA_PIPELINE.md – the canonical guide to data setup (SciFact/FEVER, caching, troubleshooting).
- RUN METADATA – ensure each Thinker run writes JSON metadata (config paths, dataset hashes, metrics) in `runs/thinker/`.
- CONTINUATION_PROMPT.md – short task list for the next engineer; update after each major milestone.
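The run-metadata requirement can be sketched as a small writer that hashes the dataset file and dumps config paths plus metrics; the JSON field names are assumptions, not an existing Thinker schema:

```python
import hashlib
import json
import time
from pathlib import Path

def write_run_metadata(run_dir: str, config_path: str,
                       dataset_path: str, metrics: dict) -> Path:
    """Write one JSON metadata record per run: config path, dataset hash, metrics."""
    digest = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    record = {
        "config_path": config_path,
        "dataset_sha256": digest,
        "metrics": metrics,
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out = Path(run_dir) / f"run_{int(time.time())}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out
```

Hashing the dataset file makes runs comparable later: two runs with the same `dataset_sha256` trained on byte-identical data.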
- ✅ DONE: Documentation updates - README.md, AGENTS.md, ROADMAP.md updated with latest status
- 🔬 IN PROGRESS: Training with weight=5.0 - Eliminate citation hallucinations
  - Run: `python -m thinker.cli train --backend tinker`
  - Expected duration: ~17 minutes (320 steps, 5 epochs)
  - Success criteria: 2 HIGH severity flags → 0, mean entailment ≥0.50
- ⏳ NEXT: Evaluate training results - Run full evaluation + antagonist analysis
  - Run: `python -m thinker.cli eval` then `python -m thinker.cli antagonist`
  - Compare metrics to baseline and the weight=2.0 iteration
- ⏳ DECISION POINT: Weight=5.0 outcome
- If SUCCESS: Document results, proceed to P1 priorities
- If FAILURE: Escalate to weight=10.0 or implement negative example training
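The decision point can be encoded as a small gate over the evaluation metrics, using the thresholds stated above; the shape of the metrics dict is an assumption for illustration:

```python
def weight5_outcome(metrics: dict) -> str:
    """SUCCESS iff HIGH severity flags are eliminated and entailment/pass targets hold."""
    ok = (metrics["high_severity_flags"] == 0
          and metrics["mean_entailment"] >= 0.50
          and metrics["overall_pass_rate"] >= 0.45)
    return "SUCCESS" if ok else "FAILURE"

# The current snapshot (2 HIGH flags, entailment 0.448, pass 38%) fails the gate.
print(weight5_outcome({"high_severity_flags": 2,
                       "mean_entailment": 0.448,
                       "overall_pass_rate": 0.38}))  # FAILURE
```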
- Antagonist enhancements:
- Embedding anti-neighbor retrieval for counter-evidence
- DeBERTa contradiction scoring for POLARITY_CONTRADICTION
- 200-pair synthetic contradiction test suite (precision/recall)
- Proposer semantic grounding:
- Contrastive loss integration (if weight=5.0 insufficient)
- Scale to 1000+ training examples
- Consider LoRA rank increase (16 → 32)
- FEVER dataset: Add fixtures, tests, pipeline config
- Tinker backend native implementation: Replace shim with `TinkerTrainingBackend` using `ServiceClient` directly
- Enhanced validation options: Regex checks, numeric bounds, per-dataset defaults
- Synthesizer prep (once Proposer unblocks): Critic interfaces, SNO manifest schema
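For the planned 200-pair synthetic contradiction suite, precision and recall reduce to counting the Antagonist's contradiction flags against gold labels; the boolean label format is an assumption:

```python
def precision_recall(predicted: list[bool], gold: list[bool]) -> tuple[float, float]:
    """Precision/recall for binary contradiction flags over a labeled pair suite."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy suite of 5 labeled pairs: 2 true positives, 1 false positive, 1 false negative.
print(precision_recall([True, True, False, True, False],
                       [True, False, False, True, True]))  # (0.666..., 0.666...)
```

Reporting both numbers matters here: a flag rate like 92% says nothing about how many flags are spurious.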