Date: 2025-11-11
Component: CNS 3.0 Proposer Agent Evaluation
Status: ✅ COMPLETE
Successfully implemented 4-stage semantic validation to replace exact-match evaluation for the CNS Proposer agent. The semantic validation aligns with AGENTS.md Section 1.0-1.1 and Section 4.1 specifications.
Key Finding: Semantic validation revealed that the model has deeper training issues (schema compliance 0%, citation accuracy 3.3%) that exact-match evaluation completely hid. The evaluation pipeline now provides actionable diagnostic insights.
- New instrumentation: `thinker/logic/betti.py` (β₁ / cycle detection) and `thinker/metrics/chirality.py` (Fisher-Rao-based chirality) now run on every evaluation sample. Results are embedded in `runs/thinker_eval/*.jsonl`. `thinker/evaluation.py` streams per-sample logs (`[eval] sample N/50 …`) so long jobs have visible progress and report entailment, β₁, and chirality for each SNO.
- Dataset utilities now emit a cleaned SciFact JSONL (`scifact_claim_extractor_clean.jsonl`), and training scripts default to `CNS_CLAIM_EVIDENCE_WEIGHT=2.0` to prioritize evidence copying before reasoning.
- Latest evaluation (SciFact dev, adapter `claim-extractor-scifact-20251118T173307`, Tinker backend):
  - Schema 100%, citation 96%, mean entailment 0.448 (38% ≥0.75), mean similarity 0.25 (20% ≥0.70), overall semantic pass 38%.
  - β₁ = 0 for all 50 samples, indicating reasoning graphs are acyclic before Antagonist processing.
  - Mean chirality score 0.561 and mean Fisher-Rao distance 16.75 quantify thesis/antithesis tension for downstream critics.
  - Full artifact: `runs/thinker_eval/scifact_dev_eval.jsonl`.
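For reference, β₁ counts the independent cycles in a reasoning graph, so β₁ = 0 means acyclic. A minimal sketch of the standard computation is below, assuming an undirected `networkx` graph; the actual implementation lives in `thinker/logic/betti.py` and may differ in detail.

```python
# Sketch: first Betti number of an undirected reasoning graph.
# beta_1 = E - V + C (edges minus vertices plus connected components);
# beta_1 == 0 means the graph is acyclic, as observed for all 50 samples.
import networkx as nx

def betti_1(graph: nx.Graph) -> int:
    """Number of independent cycles in an undirected graph."""
    return (
        graph.number_of_edges()
        - graph.number_of_nodes()
        + nx.number_connected_components(graph)
    )

# A chain of reasoning steps with no cycles:
g = nx.path_graph(4)      # 4 nodes, 3 edges, 1 component
assert betti_1(g) == 0    # 3 - 4 + 1 = 0

g.add_edge(3, 0)          # closing the chain creates one cycle
assert betti_1(g) == 1
```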
File: thinker/evaluation.py
Status: Complete
Lines Changed: ~155 lines (complete rewrite of evaluation logic)
- Integrated SemanticValidator for 4-stage validation
- Updated metrics collection to track all AGENTS.md Section 1.1 metrics:
- Schema compliance rate
- Citation accuracy rate
- Mean entailment score
- Entailment pass rate
- Mean semantic similarity
- Semantic similarity rate
- Paraphrase acceptance rate
- Overall pass rate
- Retained legacy exact-match metrics (labeled `_LEGACY`) for comparison
- Updated console output to show detailed metrics breakdown
```
================================================================================
4-STAGE SEMANTIC VALIDATION METRICS (AGENTS.md Section 1.1)
================================================================================
Total examples: 30
Schema Compliance: 0.0% (target: ≥95%)
Citation Accuracy: 3.3% (hard gate)
Mean Entailment Score: 0.000 (threshold: ≥0.75)
Entailment Pass Rate: 0.0%
Mean Similarity Score: 0.000 (threshold: ≥0.70)
Similarity Pass Rate: 0.0% (target: ≥60%)
Paraphrase Accepted: 0.0%
🎯 OVERALL PASS RATE: 0.0%
--------------------------------------------------------------------------------
LEGACY EXACT-MATCH METRICS (for comparison only, DO NOT optimize):
C1 Exact Match: 0.0%
Evidence Exact Match: 0.0%
================================================================================
```
File: thinker/semantic_validation.py
Status: Complete
Lines: 368 lines (new file)
- SemanticValidator class with 4-stage validation pipeline
- ValidationResult dataclass capturing all validation scores
- Models used:
- DeBERTa-v3-large for NLI entailment (Stage 2)
- all-MiniLM-L6-v2 for semantic similarity (Stage 3)
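A minimal sketch of how these two models can be loaded and queried is shown below. The helper names are illustrative, not the `SemanticValidator` API, and `cross-encoder/nli-deberta-v3-large` is one publicly available NLI fine-tune of DeBERTa-v3-large assumed here; the repo's actual checkpoint may differ.

```python
# Sketch only: model setup for Stages 2-3 (names assumed, not the repo's API).
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # device auto-detection

# Stage 2: NLI model. An NLI fine-tune of DeBERTa-v3-large is assumed.
NLI_CKPT = "cross-encoder/nli-deberta-v3-large"
nli_tokenizer = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT).to(device)

# Stage 3: sentence embedding model for cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

def nli_entailment(premise: str, hypothesis: str) -> float:
    """P(entailment) for premise -> hypothesis; label order varies by checkpoint."""
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt",
                           truncation=True).to(device)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    label_ids = {v.lower(): k for k, v in nli_model.config.id2label.items()}
    return probs[label_ids["entailment"]].item()

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between sentence embeddings of a and b."""
    emb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```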
```
Stage 1: Citation Accuracy (Hard Gate)
├─ Extract document IDs from generated output
├─ Validate IDs exist in evidence corpus
└─ Short-circuit if failed (citation_valid = False)

Stage 2: Entailment Score
├─ Gather evidence text from cited documents
├─ Compute P(evidence → claim) using DeBERTa-v3
└─ Threshold: ≥0.75 (per AGENTS.md Section 1.1)

Stage 3: Semantic Similarity
├─ Encode generated and gold claims
├─ Compute cosine similarity
└─ Threshold: ≥0.70 (target: ≥60% pass rate)

Stage 4: Paraphrase Tolerance
└─ Accept valid rephrasings if Stages 1-2 pass
```

- Short-circuit logic (fail fast at each stage)
- Schema compliance checking (CLAIM[c*] format)
- Multiple citation pattern recognition
- Device auto-detection (CUDA/CPU)
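Putting the stages together, a condensed sketch of the short-circuit flow is below. The result fields, citation regex, and corpus shape are illustrative assumptions; see `thinker/semantic_validation.py` for the real `ValidationResult` and the full pattern set.

```python
# Condensed sketch of the 4-stage flow; uses nli_entailment / cosine_similarity
# from the snippet above. Regexes and result fields are illustrative.
import re
from dataclasses import dataclass

SCHEMA_RE = re.compile(r"CLAIM\[c\d+\]")    # CLAIM[c*] schema tag
CITATION_RE = re.compile(r"\[(\d{4,})\]")   # one of several citation patterns

@dataclass
class ValidationResult:
    schema_valid: bool
    citation_valid: bool
    entailment: float = 0.0
    similarity: float = 0.0
    paraphrase_accepted: bool = False
    passed: bool = False

def validate(generated: str, gold: str, corpus: dict[str, str]) -> ValidationResult:
    schema_ok = bool(SCHEMA_RE.search(generated))

    # Stage 1: citation accuracy (hard gate) - every cited doc must exist.
    cited = CITATION_RE.findall(generated)
    if not cited or any(doc_id not in corpus for doc_id in cited):
        return ValidationResult(schema_ok, False)   # short-circuit: fail fast

    # Stage 2: entailment P(evidence -> claim), threshold >= 0.75.
    evidence = " ".join(corpus[doc_id] for doc_id in cited)
    ent = nli_entailment(premise=evidence, hypothesis=generated)
    if ent < 0.75:
        return ValidationResult(schema_ok, True, ent)

    # Stage 3: cosine similarity to the gold claim, threshold >= 0.70.
    sim = cosine_similarity(generated, gold)

    # Stage 4: paraphrase tolerance - Stages 1-2 passed, so a low-similarity
    # rephrasing is still accepted.
    return ValidationResult(schema_ok, True, ent, sim,
                            paraphrase_accepted=sim < 0.70, passed=True)
```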
File: generate_comparison_report.py
Status: Complete
Output: runs/comparison_report.txt
| Metric | Old (Exact-Match) | New (Semantic) | Delta |
|---|---|---|---|
| C1 Match | 0.0% | N/A | N/A |
| Schema | N/A | 0.0% | -95.0pp vs. ≥95% target |
| Citation | N/A | 3.3% | +3.3pp |
| Entailment | N/A | 0.000 | N/A |
| Similarity | N/A | 0.000 | N/A |
| Overall | 0.0% | 0.0% | 0.0pp |
The report shows 30 examples with:
- Claim ID
- Gold claim (truncated)
- Generated claim (truncated)
- Old exact-match result (PASS/FAIL)
- New semantic validation result (PASS/FAIL)
- Detailed scores (entailment, similarity, citation, schema)
No examples passed the new validation, revealing that the 0% exact-match score was hiding deeper issues:
- Model not learning CLAIM[c*] schema (0% compliance)
- Model not citing evidence properly (3.3% accuracy)
- Training prompts need fixing, not just evaluation metrics
File: README.md
Section Added: "Evaluation: 4-Stage Semantic Validation (2025-11-11 Update)"
Content:
- Explanation of why semantic validation replaced exact-match
- Description of 4-stage validation pipeline
- New metrics list with targets
- Implementation files reference
- Current status and next steps
- Dependencies added
File: ISSUE_semantic_validation_emergency_fix.md
Content:
- Problem summary and root cause analysis
- Solution implementation details
- Comparison results
- Critical findings (schema and citation failures)
- Recommended next steps
- Files changed summary
- Verification instructions
- Alignment with AGENTS.md
- Lessons learned
File: requirements.txt
Change: Added transformers library

```diff
 sentence-transformers
+transformers
 tinker
```

Why: Required for DeBERTa-v3-large NLI model (Stage 2 entailment).

Installation:

```bash
source .venv/bin/activate
pip install transformers
```

Files Created:
- `thinker/semantic_validation.py` (368 lines)
- `generate_comparison_report.py` (264 lines)
- `ISSUE_semantic_validation_emergency_fix.md` (documentation)
- `DELIVERABLES_semantic_validation.md` (this file)

Files Modified:
- `thinker/evaluation.py` (~155 lines changed)
- `requirements.txt` (added 1 dependency)
- `README.md` (added 68-line section)

Files Generated:
- `runs/comparison_report.txt` (comparison analysis output)
```bash
cd /home/home/p/g/North-Shore-AI/tinkerer
source .venv/bin/activate
pip install transformers  # If not already installed
python -m thinker.cli eval
```

Expected: New metrics output showing 4-stage validation scores.

```bash
python generate_comparison_report.py
```

Expected: Creates `runs/comparison_report.txt` with old vs. new metrics.

```bash
python -c "from thinker.semantic_validation import SemanticValidator; print('✅ Import successful')"
```

Expected: No import errors, models load successfully.
Problem: Model outputs do NOT produce CLAIM[c*] format.
Evidence: 0/30 examples had valid schema.
Root Cause: Training prompts don't enforce CLAIM[c*] structure.
Fix Required: Update training prompts with explicit CLAIM[c*] examples.

Problem: Model does NOT properly cite evidence documents.
Evidence: Only 1/30 examples had valid citations.
Root Cause: Training data doesn't teach evidence citation.
Fix Required: Add explicit citation training examples.
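As one illustration of the required fix, a few-shot block like the following would demonstrate both the CLAIM[c*] schema and an explicit citation. The wording and doc ID are invented for illustration and are not taken from the repo's actual prompt templates.

```python
# Hypothetical training-prompt fragment; the evidence text and doc ID are
# invented, not drawn from the actual SciFact prompts.
FEW_SHOT_EXAMPLE = """\
Evidence [4983210]: Aspirin irreversibly acetylates cyclooxygenase-1,
blocking prostaglandin synthesis.

Output: CLAIM[c1] Aspirin irreversibly inhibits COX-1. [4983210]
"""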
Key Insight: Semantic validation provides actionable diagnostics that exact-match hid.
| Evaluation Type | Result | Diagnostic Value |
|---|---|---|
| Exact-match | 0% | "Model failed" (no insight) |
| Semantic validation | Schema 0%, Citation 3.3% | "Fix prompts for schema and citations" (actionable) |
- Fix training prompts to enforce CLAIM[c*] schema
- Add citation examples to training data
- Re-train model with updated prompts
- Re-evaluate with semantic validation
- Monitor semantic metrics trends
- Tune similarity threshold based on empirical results
- Add unit tests for semantic validation
- Document training prompt templates
- Integrate semantic validation into CI/CD
- Add semantic validation to other CNS agents (Antagonist, Synthesizer)
- Publish semantic validation methodology
Requirement: "Exact-match is incompatible with CNS goals"
Implementation: Exact-match retired to _LEGACY status
Requirement: Track schema compliance, citation accuracy, entailment, similarity
Implementation: All metrics tracked and reported

Requirement: 4-stage validation pipeline
Implementation: Implemented exactly as specified:
- Citation accuracy (hard gate)
- Entailment (DeBERTa-v3, threshold ≥0.75)
- Semantic similarity (sentence-transformers, threshold ≥0.70)
- Paraphrase tolerance
- DeBERTa-v3-large: ~1.5GB RAM, ~3s load time (CPU)
- all-MiniLM-L6-v2: ~80MB RAM, ~1s load time (CPU)
- Entailment (Stage 2): ~500ms per example
- Similarity (Stage 3): ~100ms per example
- Total: ~600ms per example (30 examples = ~18 seconds)
Both models support CUDA acceleration (10-20x speedup on GPU).
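A rough way to reproduce the per-example figure, assuming the `validate` sketch above plus an `examples` list with `generated`/`gold` attributes and a `corpus` dict (all assumptions, not the repo's harness):

```python
# Illustrative timing harness for the ~600ms/example figure quoted above.
import time

start = time.perf_counter()
for ex in examples[:30]:                     # `examples` is assumed to exist
    validate(ex.generated, ex.gold, corpus)  # `corpus` from the sketch above
elapsed = time.perf_counter() - start
print(f"{elapsed / 30 * 1000:.0f} ms per example, {elapsed:.1f} s total")
```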
1. Match evaluation to model architecture
   - LoRA pattern-learning requires semantic evaluation
   - Exact-match is incompatible with low-rank adaptation
2. Semantic validation provides better diagnostics
   - Identified specific failure modes (schema, citations)
   - Enabled actionable next steps
3. Follow specifications early
   - AGENTS.md Section 1.0 explicitly warned against exact-match
   - Should have implemented semantic validation from the start
4. Test assumptions with diverse metrics
   - Multiple validation stages reveal different failure modes
   - Single metric (exact-match) hides important information
- AGENTS.md Section 1.0: Evaluation Philosophy (Exact-Match Exit)
- AGENTS.md Section 1.1: Proposer Health Metrics
- AGENTS.md Section 4.1: Semantic Grounding: Operational Definition
- Implementation: `thinker/semantic_validation.py`
- Evaluation: `thinker/evaluation.py`
- Comparison: `runs/comparison_report.txt`
- Issue Tracking: `ISSUE_semantic_validation_emergency_fix.md`
Per the emergency redesign prompt, success criteria were:
- ✅ Implement semantic validation - Complete (`semantic_validation.py`)
- ✅ Update evaluation.py - Complete (4-stage validation integrated)
- ✅ Create comparison report - Complete (`generate_comparison_report.py`, `runs/comparison_report.txt`)
- ✅ Update documentation - Complete (README.md, issue doc)
- ✅ File issue - Complete (`ISSUE_semantic_validation_emergency_fix.md`)
Additional Achievement:
- Identified root cause of 0% scores: schema and citation training issues
- Provided actionable next steps for fixing training prompts
For questions or issues with semantic validation:
- See `ISSUE_semantic_validation_emergency_fix.md` for detailed technical analysis
- Check the README.md "Evaluation: 4-Stage Semantic Validation" section
- Review `thinker/semantic_validation.py` docstrings for API documentation
Delivery Date: 2025-11-11
Status: ✅ COMPLETE
Next Action: Fix training prompts for schema compliance and citation accuracy