Emergency Experiment Redesign - Deliverables Summary

Date: 2025-11-11
Component: CNS 3.0 Proposer Agent Evaluation
Status: ✅ COMPLETE


Executive Summary

Successfully implemented 4-stage semantic validation to replace exact-match evaluation for the CNS Proposer agent. The new pipeline aligns with the specifications in AGENTS.md Sections 1.0-1.1 and 4.1.

Key Finding: Semantic validation revealed that the model has deeper training issues (schema compliance 0%, citation accuracy 3.3%) that exact-match evaluation completely hid. The evaluation pipeline now provides actionable diagnostic insights.


2025-11-18 Update — Topology Instrumentation + Fresh Eval

  • New instrumentation:
    • thinker/logic/betti.py (β₁ / cycle detection) and thinker/metrics/chirality.py (Fisher-Rao-based chirality) now run on every evaluation sample. Results are embedded in runs/thinker_eval/*.jsonl.
    • thinker/evaluation.py streams per-sample logs ([eval] sample N/50 …), giving long jobs visible progress and reporting entailment, β₁, and chirality for each SNO.
    • Dataset utilities now emit a cleaned SciFact JSONL (scifact_claim_extractor_clean.jsonl) and training scripts default to CNS_CLAIM_EVIDENCE_WEIGHT=2.0 to prioritize evidence copying before reasoning.
  • Latest evaluation (SciFact dev, adapter claim-extractor-scifact-20251118T173307, Tinker backend):
    • Schema 100%, citation 96%, mean entailment 0.448 (38% ≥0.75), mean similarity 0.25 (20% ≥0.70), overall semantic pass 38%.
    • β₁ = 0 for all 50 samples, indicating reasoning graphs are acyclic before Antagonist processing (see the β₁ sketch after this list).
    • Mean chirality score 0.561 and mean Fisher-Rao distance 16.75 quantify thesis/antithesis tension for downstream critics.
    • Full artifact: runs/thinker_eval/scifact_dev_eval.jsonl.
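
For reference, β₁ of a reasoning graph is its cycle rank: β₁ = E - V + C, where E is the edge count, V the node count, and C the number of connected components of the underlying undirected graph. Below is a minimal sketch of that computation with networkx; the helper name and edge-list input are illustrative only, not the actual thinker/logic/betti.py interface.

# Sketch: cycle rank (first Betti number) of a reasoning graph.
# The edge-list input and helper name are assumptions, not the real
# thinker/logic/betti.py API.
import networkx as nx

def betti_1(edges):
    """Return beta_1 = E - V + C for the underlying undirected graph."""
    g = nx.Graph()
    g.add_edges_from(edges)
    if g.number_of_nodes() == 0:
        return 0
    components = nx.number_connected_components(g)
    return g.number_of_edges() - g.number_of_nodes() + components

# A claim citing two evidence nodes has no cycle: beta_1 == 0.
print(betti_1([("c1", "e1"), ("c1", "e2")]))               # 0
# Closing a loop between the evidence nodes gives beta_1 == 1.
print(betti_1([("c1", "e1"), ("c1", "e2"), ("e1", "e2")]))  # 1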

Deliverable 1: Updated evaluation.py ✅

File: thinker/evaluation.py
Status: Complete
Lines Changed: ~155 lines (complete rewrite of evaluation logic)

Changes:

  1. Integrated SemanticValidator for 4-stage validation
  2. Updated metrics collection to track all AGENTS.md Section 1.1 metrics:
    • Schema compliance rate
    • Citation accuracy rate
    • Mean entailment score
    • Entailment pass rate
    • Mean semantic similarity
    • Semantic similarity rate
    • Paraphrase acceptance rate
    • Overall pass rate
  3. Retained legacy exact-match metrics (labeled _LEGACY) for comparison
  4. Updated console output to show detailed metrics breakdown

Sample Output:

================================================================================
4-STAGE SEMANTIC VALIDATION METRICS (AGENTS.md Section 1.1)
================================================================================
Total examples: 30

Schema Compliance:     0.0% (target: ≥95%)
Citation Accuracy:     3.3% (hard gate)
Mean Entailment Score: 0.000 (threshold: ≥0.75)
Entailment Pass Rate:  0.0%
Mean Similarity Score: 0.000 (threshold: ≥0.70)
Similarity Pass Rate:  0.0% (target: ≥60%)
Paraphrase Accepted:   0.0%

🎯 OVERALL PASS RATE:   0.0%

--------------------------------------------------------------------------------
LEGACY EXACT-MATCH METRICS (for comparison only, DO NOT optimize):
C1 Exact Match:        0.0%
Evidence Exact Match:  0.0%
================================================================================
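
The percentages and means in the output above are straightforward aggregates over per-example validation results. The sketch below shows roughly how they could be computed; the record field names and the overall-pass rule (citation gate, then entailment, then similarity or accepted paraphrase) are assumptions, not the exact thinker/evaluation.py logic.

# Sketch of the metrics aggregation. Field names and the overall-pass
# rule are assumptions, not the exact evaluation.py implementation.
def aggregate_metrics(results):
    n = len(results)
    def pct(flag):
        return 100.0 * sum(1 for r in results if flag(r)) / n
    def mean(score):
        return sum(score(r) for r in results) / n
    return {
        "schema_compliance": pct(lambda r: r["schema_valid"]),
        "citation_accuracy": pct(lambda r: r["citation_valid"]),
        "mean_entailment": mean(lambda r: r["entailment"]),
        "entailment_pass_rate": pct(lambda r: r["entailment"] >= 0.75),
        "mean_similarity": mean(lambda r: r["similarity"]),
        "similarity_pass_rate": pct(lambda r: r["similarity"] >= 0.70),
        "paraphrase_accepted": pct(lambda r: r["paraphrase_accepted"]),
        "overall_pass_rate": pct(lambda r: r["citation_valid"]
                                 and r["entailment"] >= 0.75
                                 and (r["similarity"] >= 0.70
                                      or r["paraphrase_accepted"])),
    }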

Deliverable 2: Semantic Validation Module ✅

File: thinker/semantic_validation.py
Status: Complete
Lines: 368 lines (new file)

Implementation:

  • SemanticValidator class with 4-stage validation pipeline
  • ValidationResult dataclass capturing all validation scores
  • Models used:
    • DeBERTa-v3-large for NLI entailment (Stage 2)
    • all-MiniLM-L6-v2 for semantic similarity (Stage 3)

Validation Stages:

Stage 1: Citation Accuracy (Hard Gate)
├─ Extract document IDs from generated output
├─ Validate IDs exist in evidence corpus
└─ Short-circuit if failed (citation_valid = False)

Stage 2: Entailment Score
├─ Gather evidence text from cited documents
├─ Compute P(evidence ⊨ claim) using DeBERTa-v3
└─ Threshold: ≥0.75 (per AGENTS.md Section 1.1)

Stage 3: Semantic Similarity
├─ Encode generated and gold claims
├─ Compute cosine similarity
└─ Threshold: ≥0.70 (target: ≥60% pass rate)

Stage 4: Paraphrase Tolerance
└─ Accept valid rephrasings if Stage 1-2 pass
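
A minimal sketch of the Stage 2 and Stage 3 scoring follows. The NLI checkpoint name is an assumption (SemanticValidator may load a different DeBERTa-v3 NLI fine-tune); the sentence encoder and thresholds match the values above.

# Sketch of Stage 2 (entailment) and Stage 3 (similarity) scoring.
# The NLI checkpoint below is an assumed public DeBERTa-v3 NLI fine-tune,
# not necessarily the one SemanticValidator loads.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
NLI = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"  # assumed
nli_tok = AutoTokenizer.from_pretrained(NLI)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI).to(device)
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

def entailment_score(evidence: str, claim: str) -> float:
    """P(entailment) with the evidence as premise and the claim as hypothesis."""
    inputs = nli_tok(evidence, claim, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    return probs[nli_model.config.label2id.get("entailment", 0)].item()

def similarity_score(generated: str, gold: str) -> float:
    """Cosine similarity between generated and gold claims."""
    emb = encoder.encode([generated, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Thresholds per AGENTS.md Section 1.1.
entail_pass = entailment_score("evidence text ...", "generated claim ...") >= 0.75
similar_pass = similarity_score("generated claim ...", "gold claim ...") >= 0.70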

Key Features:

  • Short-circuit logic (fail fast at each stage)
  • Schema compliance checking (CLAIM[c*] format)
  • Multiple citation pattern recognition
  • Device auto-detection (CUDA/CPU)
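
Stage 1 and the schema check reduce to pattern matching over the generated text. A hedged sketch is below; the exact regexes and citation surface forms recognized by SemanticValidator live in thinker/semantic_validation.py, so treat these patterns as illustrative.

# Sketch of the Stage 1 citation gate and the CLAIM[c*] schema check.
# The patterns are illustrative; the real validator recognizes multiple
# citation formats.
import re

def schema_compliant(text: str) -> bool:
    """True if the output contains at least one CLAIM[c<n>] block."""
    return re.search(r"CLAIM\[c\d+\]", text) is not None

def citation_valid(text: str, corpus_ids: set) -> bool:
    """Hard gate: every cited document ID must exist in the evidence corpus."""
    cited = set(re.findall(r"\[(\d+)\]", text))  # assumed citation form: [<doc_id>]
    return bool(cited) and cited <= corpus_ids

out = "CLAIM[c1]: The claim text, supported by [4414547]."
print(schema_compliant(out), citation_valid(out, {"4414547"}))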

Deliverable 3: Comparison Report ✅

File: generate_comparison_report.py
Status: Complete
Output: runs/comparison_report.txt

Report Contents:

Metrics Comparison

Metric       Old (Exact-Match)   New (Semantic)   Delta
C1 Match     0.0%                N/A              N/A
Schema       N/A                 0.0%             -95.0pp (target)
Citation     N/A                 3.3%             +3.3pp
Entailment   N/A                 0.000            N/A
Similarity   N/A                 0.000            N/A
Overall      0.0%                0.0%             0.0pp

Sample Detailed Examples

The report shows 30 examples, each with the following fields (see the formatting sketch after this list):

  • Claim ID
  • Gold claim (truncated)
  • Generated claim (truncated)
  • Old exact-match result (PASS/FAIL)
  • New semantic validation result (PASS/FAIL)
  • Detailed scores (entailment, similarity, citation, schema)
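
A rough sketch of how one such entry can be rendered from a per-example record (the record keys are assumptions, not the actual generate_comparison_report.py data model):

# Sketch of formatting a single report entry. Record keys are assumed.
def format_entry(r, width=80):
    def cut(s):
        return s if len(s) <= width else s[: width - 1] + "…"
    return "\n".join([
        f"Claim ID:  {r['claim_id']}",
        f"Gold:      {cut(r['gold_claim'])}",
        f"Generated: {cut(r['generated_claim'])}",
        f"Old (exact-match): {'PASS' if r['exact_match'] else 'FAIL'}",
        f"New (semantic):    {'PASS' if r['semantic_pass'] else 'FAIL'}",
        f"Scores: entailment={r['entailment']:.3f} "
        f"similarity={r['similarity']:.3f} "
        f"citation={r['citation_valid']} schema={r['schema_valid']}",
    ])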

Key Insight

No examples passed the new validation, revealing that the 0% exact-match score was hiding deeper issues:

  • Model not learning CLAIM[c*] schema (0% compliance)
  • Model not citing evidence properly (3.3% accuracy)
  • Training prompts need fixing, not just evaluation metrics

Deliverable 4: Updated Documentation ✅

README.md Updates

Section Added: "Evaluation: 4-Stage Semantic Validation (2025-11-11 Update)"
Content:

  • Explanation of why semantic validation replaced exact-match
  • Description of 4-stage validation pipeline
  • New metrics list with targets
  • Implementation files reference
  • Current status and next steps
  • Dependencies added

Issue Documentation

File: ISSUE_semantic_validation_emergency_fix.md
Content:

  • Problem summary and root cause analysis
  • Solution implementation details
  • Comparison results
  • Critical findings (schema and citation failures)
  • Recommended next steps
  • Files changed summary
  • Verification instructions
  • Alignment with AGENTS.md
  • Lessons learned

Deliverable 5: Dependencies Updated ✅

File: requirements.txt
Change: Added transformers library

  sentence-transformers
+ transformers
  tinker

Why: Required for DeBERTa-v3-large NLI model (Stage 2 entailment)

Installation:

source .venv/bin/activate
pip install transformers

File Manifest

Created Files

  1. thinker/semantic_validation.py (368 lines)
  2. generate_comparison_report.py (264 lines)
  3. ISSUE_semantic_validation_emergency_fix.md (documentation)
  4. DELIVERABLES_semantic_validation.md (this file)

Modified Files

  1. thinker/evaluation.py (~155 lines changed)
  2. requirements.txt (added 1 dependency)
  3. README.md (added 68-line section)

Generated Files

  1. runs/comparison_report.txt (comparison analysis output)

Verification Steps

1. Run Updated Evaluation

cd /home/home/p/g/North-Shore-AI/tinkerer
source .venv/bin/activate
pip install transformers  # If not already installed
python -m thinker.cli eval

Expected: New metrics output showing 4-stage validation scores.

2. Generate Comparison Report

python generate_comparison_report.py

Expected: Creates runs/comparison_report.txt with old vs new metrics.

3. Verify Dependencies

python -c "from thinker.semantic_validation import SemanticValidator; print('✅ Import successful')"

Expected: No import errors, models load successfully.


Critical Findings

Finding 1: Schema Compliance Failure (0%)

Problem: Model outputs do NOT produce CLAIM[c*] format.
Evidence: 0/30 examples had valid schema.
Root Cause: Training prompts don't enforce CLAIM[c*] structure.
Fix Required: Update training prompts with explicit CLAIM[c*] examples.

Finding 2: Citation Accuracy Failure (3.3%)

Problem: Model does NOT properly cite evidence documents.
Evidence: Only 1/30 examples had valid citations.
Root Cause: Training data doesn't teach evidence citation.
Fix Required: Add explicit citation training examples.

Finding 3: Diagnostic Value

Key Insight: Semantic validation provides actionable diagnostics that exact-match hid.

Evaluation Type        Result                      Diagnostic Value
Exact-match            0%                          "Model failed" (no insight)
Semantic validation    Schema 0%, Citation 3.3%    "Fix prompts for schema and citations" (actionable)

Recommended Next Steps

Immediate (URGENT)

  1. Fix training prompts to enforce CLAIM[c*] schema
  2. Add citation examples to training data
  3. Re-train model with updated prompts
  4. Re-evaluate with semantic validation

Short-term

  1. Monitor semantic metrics trends
  2. Tune similarity threshold based on empirical results
  3. Add unit tests for semantic validation
  4. Document training prompt templates

Long-term

  1. Integrate semantic validation into CI/CD
  2. Add semantic validation to other CNS agents (Antagonist, Synthesizer)
  3. Publish semantic validation methodology

Alignment with AGENTS.md

✅ Section 1.0: Evaluation Philosophy (Exact-Match Exit)

Requirement: "Exact-match is incompatible with CNS goals" Implementation: Exact-match retired to _LEGACY status

✅ Section 1.1: Proposer Health Metrics

Requirement: Track schema compliance, citation accuracy, entailment, similarity
Implementation: All metrics tracked and reported

✅ Section 4.1: Semantic Grounding: Operational Definition

Requirement: 4-stage validation pipeline
Implementation: Implemented exactly as specified:

  1. Citation accuracy (hard gate)
  2. Entailment (DeBERTa-v3, threshold ≥0.75)
  3. Semantic similarity (sentence-transformers, threshold ≥0.70)
  4. Paraphrase tolerance

Performance Characteristics

Model Loading

  • DeBERTa-v3-large: ~1.5GB RAM, ~3s load time (CPU)
  • all-MiniLM-L6-v2: ~80MB RAM, ~1s load time (CPU)

Inference Speed (CPU)

  • Entailment (Stage 2): ~500ms per example
  • Similarity (Stage 3): ~100ms per example
  • Total: ~600ms per example (30 examples = ~18 seconds)

GPU Acceleration

Both models support CUDA acceleration (10-20x speedup on GPU).
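
One way to exploit that speedup is to batch the Stage 3 encodes; a small sketch, with an arbitrary batch size and placeholder inputs:

# Sketch: batched Stage 3 scoring on GPU. Batch size and inputs are
# placeholders; falls back to CPU if CUDA is unavailable.
import torch
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

generated = ["generated claim a ...", "generated claim b ..."]
gold = ["gold claim a ...", "gold claim b ..."]

gen_emb = encoder.encode(generated, batch_size=64, convert_to_tensor=True)
gold_emb = encoder.encode(gold, batch_size=64, convert_to_tensor=True)
per_example_similarity = util.cos_sim(gen_emb, gold_emb).diagonal()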


Lessons Learned

  1. Match evaluation to model architecture

    • LoRA pattern-learning requires semantic evaluation
    • Exact-match is incompatible with low-rank adaptation
  2. Semantic validation provides better diagnostics

    • Identified specific failure modes (schema, citations)
    • Enabled actionable next steps
  3. Follow specifications early

    • AGENTS.md Section 1.0 explicitly warned against exact-match
    • Should have implemented semantic validation from the start
  4. Test assumptions with diverse metrics

    • Multiple validation stages reveal different failure modes
    • Single metric (exact-match) hides important information

References

  • AGENTS.md Section 1.0: Evaluation Philosophy (Exact-Match Exit)
  • AGENTS.md Section 1.1: Proposer Health Metrics
  • AGENTS.md Section 4.1: Semantic Grounding: Operational Definition
  • Implementation: thinker/semantic_validation.py
  • Evaluation: thinker/evaluation.py
  • Comparison: runs/comparison_report.txt
  • Issue Tracking: ISSUE_semantic_validation_emergency_fix.md

Success Criteria: ✅ ACHIEVED

Per the emergency redesign prompt, success criteria were:

  1. Implement semantic validation - Complete (semantic_validation.py)
  2. Update evaluation.py - Complete (4-stage validation integrated)
  3. Create comparison report - Complete (generate_comparison_report.py, runs/comparison_report.txt)
  4. Update documentation - Complete (README.md, issue doc)
  5. File issue - Complete (ISSUE_semantic_validation_emergency_fix.md)

Additional Achievements:

  • Identified root cause of 0% scores: schema and citation training issues
  • Provided actionable next steps for fixing training prompts

Contact & Support

For questions or issues with semantic validation:

  1. See ISSUE_semantic_validation_emergency_fix.md for detailed technical analysis
  2. Check README.md "Evaluation: 4-Stage Semantic Validation" section
  3. Review thinker/semantic_validation.py docstrings for API documentation

Delivery Date: 2025-11-11
Status: ✅ COMPLETE
Next Action: Fix training prompts for schema compliance and citation accuracy