vinash85 · kushalviit · Apr 25, 2026
diff --git a/.gitignore b/.gitignore
@@ -77,6 +77,11 @@ finetune/eval/results_full/*/ft_chemqa.jsonl
 finetune/eval/results_full/*/*.log
 finetune/eval/results_full/*/*.jsonl
 
+# Phase 4 grounding-audit outputs (sample, judged claims, summaries,
+# logs). Reproducible from the orchestrator + a fresh OpenRouter run;
+# never pushed.
+phase4_grounding/outputs/
+
 # OS
 .DS_Store
 Thumbs.db
diff --git a/DATASHEET.md b/DATASHEET.md
@@ -73,7 +73,12 @@ No pre-defined split. Recommended:
 - Held-out human-annotated: scheduled for Round 2.
 
 **Are there any errors, sources of noise, redundancies?**
-Yes, many — see `LIMITATIONS.md` in full.
+Yes, many — see `LIMITATIONS.md` in full. The Phase 4 grounding audit
+(`phase4_grounding/RESULTS.md`) measures **55.20% of claims as
+UNSUPPORTED** by the cited evidence (95% CI 53.4–57.0%, n=3,076 claims
+across 300 Q&As, keep-structural view) — i.e. the gold subset contains
+substantial training-recall content. Engineering, ADME, and metabolism
+topics carry the worst grounding.
 
 **Does the dataset rely on external resources?**
 Pipeline inputs are PubMed baseline XML, PMC open-access full text, and

diff --git a/LIMITATIONS.md b/LIMITATIONS.md
@@ -35,15 +35,42 @@ including the three models in this pipeline. Consequences:
 See `CONTAMINATION.md` for the proposed canary-based validation
 methodology.
 
-## 3. The "soft rule" permits training-recall
+## 3. The "soft rule" permits training-recall — measured
 
 Phase 1 and Phase 2 system prompts explicitly allow functional claims
 "supported by the evidence ... used silently as background knowledge".
 This phrasing admits recall from pretraining. There is no mechanism at
 generation time to distinguish a claim supported by a specific evidence
-sentence from one the model would have made anyway. A Phase 4 grounding
-check (LLM-based claim-to-evidence alignment scoring) is scheduled for
-Round 2 but is not yet implemented.
+sentence from one the model would have made anyway.
+
+A Phase 4 grounding audit decomposes Phase-2 answers claim-by-claim and
+labels each claim as STATED, IMPLIED, STRUCTURAL (derivable from SMILES
+alone), or UNSUPPORTED (training-recall candidate). Headline result on
+a 300-Q&A stratified sample (3,076 claims):
+
+| View | UNSUPPORTED | 95% Wilson CI |
+|---|---|---|
+| keep-structural (clean training-recall proxy) | **55.20%** | 53.43–56.97% |
+| drop-structural (PLAN-spec view) | **44.70%** | 43.11–46.30% |
+
+Both views are well above the 20% threshold that would have permitted a
+narrow grounding claim. **The paper's grounding language is therefore
+narrowed, and training-recall risk is flagged in `RESPONSIBLE_AI.md`.**
+
+Engineering / design / metabolism topics carry the worst grounding (75% /
+67% / 69% UNSUPPORTED, respectively). Q&As *with* evidence attached show a
+*higher* UNSUPPORTED rate than those without, consistent with the model
+elaborating beyond what the evidence sentence states.
+
+Cross-check validation: an independent judge (`google/gemini-2.5-pro`)
+re-judged 30 of the 300 Q&As; its macro UNSUPPORTED rate differs from the
+primary judge's by only +3.68pp, and 26/30 per-row rates agree to within
+20pp. On this subsample, the headline is robust to judge choice.
+
+Full results, per-topic / per-split breakdowns, methodology, and a note
+on dual-use refusals during judging are in
+`phase4_grounding/RESULTS.md`. Code and orchestrator are in
+`phase4_grounding/`.
 
 ## 4. Compound coverage is biased toward well-studied drugs
 

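Review note on the LIMITATIONS.md hunk above: the Wilson intervals in its table can be sanity-checked with a few lines of Python. This is a minimal sketch, not the audit's own code. The function name `wilson_interval` is hypothetical, the count of 1,698 UNSUPPORTED claims is inferred by rounding 55.20% of 3,076, and the plain Wilson score formula (no continuity correction) is an assumption about how `phase4_grounding/` computes its intervals.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: the exact count is not in this diff; it is inferred from 55.20% of 3,076.
N_CLAIMS = 3076
N_UNSUPPORTED = round(0.5520 * N_CLAIMS)

low, high = wilson_interval(N_UNSUPPORTED, N_CLAIMS)
print(f"UNSUPPORTED: {N_UNSUPPORTED / N_CLAIMS:.2%} (95% Wilson CI {low:.2%} to {high:.2%})")
```

Under these assumptions the result agrees with the quoted keep-structural interval to one decimal place (53.4% to 57.0%). The second decimal can differ slightly depending on the exact counts and on whether a continuity correction was applied, so `phase4_grounding/RESULTS.md` remains the authoritative source.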
diff --git a/RESPONSIBLE_AI.md b/RESPONSIBLE_AI.md
@@ -99,16 +99,29 @@ model as clinically validated.**
   (`scripts/audit_redaction.py`).
 - Coverage-analysis script quantifies therapeutic-area and
   molecular-property skew (`scripts/analyze_coverage.py`).
+- **Phase 4 grounding audit** measures the rate at which Phase-2
+  answers contain claims not traceable to the cited evidence
+  (`phase4_grounding/RESULTS.md`). Headline: **55.20% UNSUPPORTED** in
+  the keep-structural view (95% CI 53.4–57.0%) on a 300-Q&A sample;
+  an independent judge cross-checked 30 Q&As and differed by +3.7pp. This
+  empirically substantiates Misuse Risk #2 (fabricated mechanisms) and
+  is the basis for narrowing the paper's grounding claim. Engineering /
+  design / metabolism Q&As carry the highest training-recall risk.
+- **Dual-use refusal protocol** — the Phase 4 audit revealed that
+  `claude-sonnet-4.6` refuses to judge ~3% of dual-use chemistry Q&As
+  (toxin engineering, pesticide modifications, controlled-substance
+  analog reasoning). Falling back to `gemini-2.5-pro` recovers all of
+  them. Audits run on a single judge model will systematically miss these
+  topics; reproducers should use a heterogeneous-judge protocol.
 
 ### Deferred to Round 2
 
 - Human-evaluated accuracy on a safety-critical-claim sub-sample
   (scheduled).
-- Phase 4 grounding check that verifies each functional claim is
-  traceable to an evidence sentence (proposed; requires additional LLM
-  compute).
 - RAI review of the engineering-question category for synthesis-uplift
-  risk (proposed).
+  risk (proposed). The Phase 4 audit measured engineering Q&As at 74.6%
+  UNSUPPORTED, the worst of any topic — a strong prior for prioritizing
+  this review.
 - Dataset card fields per the Croissant RAI schema (skeleton provided in
   `croissant.json`; full population after full-run execution).
 

diff --git a/croissant.json b/croissant.json
@@ -51,13 +51,14 @@
   "rai:dataImputationProtocol": "Missing molecular_formula / molecular_weight from CID-Mass.gz are left null; compounds with zero matching evidence sentences after redaction are dropped (30.2% of premium-tier compounds).",
   "rai:dataPreprocessingProtocol": "Compound-name redaction to [COMPOUND] using longest-match-first synonym regex. Sentence-level dedup by redacted text. Random sampling (per-CID-seeded) to cap at 500 sentences per compound.",
   "rai:dataManipulationProtocol": "Four-phase LLM pipeline: Phase 1 generation, Phase 2 blind re-answer, Phase 3 agreement judge. All prompts versioned in the source tree at chem2textqa/qa_pipeline/phase_*/ .",
-  "rai:dataSocialImpact": "Drug-related training data with public-health implications. A model fine-tuned on this data could produce plausible-but-incorrect clinical claims. See RESPONSIBLE_AI.md for intended use, misuse risks, and mitigations.",
+  "rai:dataSocialImpact": "Drug-related training data with public-health implications. A Phase 4 grounding audit on a 300-Q&A stratified sample measured 55.20% of claims as UNSUPPORTED by the cited evidence (95% Wilson CI 53.4–57.0%, n=3076 claims; cross-validated by an independent judge to within +3.7pp). A model fine-tuned on this data could produce plausible-but-incorrect clinical claims, and the published gold subset contains substantial training-recall content. See RESPONSIBLE_AI.md and phase4_grounding/RESULTS.md for intended use, misuse risks, mitigations, and the full audit.",
   "rai:dataBiases": [
     "Therapeutic-area bias: oncology / CV / CNS over-represented",
     "Approval-status bias: FDA-approved drugs vs research chemicals",
     "Publication bias: English-language biomedical literature",
     "Model-consensus bias: 'gold' labels reflect what two LLMs agree on, not ground truth",
-    "Training-data overlap: evidence sentences are likely in LLM pretraining corpora"
+    "Training-data overlap: evidence sentences are likely in LLM pretraining corpora",
+    "Training-recall content (measured): 55.20% of Phase-2 answer claims are not traceable to the cited evidence (Phase 4 audit, n=3076 claims). Engineering / ADME / metabolism Q&As are worst-grounded (>67% UNSUPPORTED); mechanism / therapeutic-use / toxicity are better-grounded (<41%). See phase4_grounding/RESULTS.md."
   ],
   "rai:dataUseCases": [
     "Intended: instruction tuning for medicinal-chemistry LLM research.",