5 changes: 5 additions & 0 deletions .gitignore
@@ -77,6 +77,11 @@ finetune/eval/results_full/*/ft_chemqa.jsonl
finetune/eval/results_full/*/*.log
finetune/eval/results_full/*/*.jsonl

# Phase 4 grounding-audit outputs (sample, judged claims, summaries,
# logs). Reproducible from the orchestrator + a fresh OpenRouter run;
# never pushed.
phase4_grounding/outputs/

# OS
.DS_Store
Thumbs.db
7 changes: 6 additions & 1 deletion DATASHEET.md
@@ -73,7 +73,12 @@ No pre-defined split. Recommended:
- Held-out human-annotated: scheduled for Round 2.

**Are there any errors, sources of noise, redundancies?**
Yes, many — see `LIMITATIONS.md` in full.
Yes, many — see `LIMITATIONS.md` in full. The Phase 4 grounding audit
(`phase4_grounding/RESULTS.md`) measures **55.20% of claims as
UNSUPPORTED** by the cited evidence (95% CI 53.4–57.0%, n=3,076 claims
across 300 Q&As, keep-structural view) — i.e. the gold subset contains
substantial training-recall content. Engineering, ADME, and metabolism
topics carry the worst grounding.

**Does the dataset rely on external resources?**
Pipeline inputs are PubMed baseline XML, PMC open-access full text, and
35 changes: 31 additions & 4 deletions LIMITATIONS.md
@@ -35,15 +35,42 @@ including the three models in this pipeline. Consequences:
See `CONTAMINATION.md` for the proposed canary-based validation
methodology.

## 3. The "soft rule" permits training-recall
## 3. The "soft rule" permits training-recall — measured

Phase 1 and Phase 2 system prompts explicitly allow functional claims
"supported by the evidence ... used silently as background knowledge".
This phrasing admits recall from pretraining. There is no mechanism at
generation time to distinguish a claim supported by a specific evidence
sentence from one the model would have made anyway. A Phase 4 grounding
check (LLM-based claim-to-evidence alignment scoring) is scheduled for
Round 2 but is not yet implemented.
sentence from one the model would have made anyway.

A Phase 4 grounding audit decomposes Phase-2 answers claim-by-claim and
labels each claim as STATED, IMPLIED, STRUCTURAL (derivable from SMILES
alone), or UNSUPPORTED (training-recall candidate). Headline result on
a 300-Q&A stratified sample (3,076 claims):

| View | UNSUPPORTED | 95% Wilson CI |
|---|---|---|
| keep-structural (clean training-recall proxy) | **55.20%** | 53.43–56.97% |
| drop-structural (PLAN-spec view) | **44.70%** | 43.11–46.30% |

Both views are well above the 20% threshold that would have permitted a
narrow grounding claim. **The paper's grounding language is therefore
narrowed, and training-recall risk is flagged in `RESPONSIBLE_AI.md`.**
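
The Wilson intervals in the table above can be recomputed from raw counts. The following is a minimal sketch of the standard Wilson score interval, not the audit's own code; the function name `wilson_interval` and the example counts are hypothetical placeholders and are not the audit's data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Placeholder counts for illustration only (NOT taken from the audit).
unsupported, total = 550, 1000
low, high = wilson_interval(unsupported, total)
print(f"{unsupported / total:.2%} UNSUPPORTED, 95% CI {low:.2%} to {high:.2%}")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small per-topic subsamples, which is presumably why it is reported here.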

Engineering / design / metabolism topics carry the worst grounding (75 /
67 / 69% UNSUPPORTED). Q&As *with* evidence attached show a *higher*
UNSUPPORTED rate than those without, consistent with the model elaborating
beyond what the evidence sentence states.

Cross-check validation: an independent judge (`google/gemini-2.5-pro`)
re-judged 30 of the 300 Q&As; macro UNSUPPORTED rates differ by only
+3.68pp from the primary judge, with 26/30 per-row rates agreeing to
within 20pp. The headline is robust to judge choice.
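
The cross-check statistics above could be computed along these lines. This is a sketch under stated assumptions: "macro" is taken to mean the unweighted mean of per-Q&A UNSUPPORTED rates, the difference is reported as secondary minus primary, and the function name, dictionary layout, and example rates are all hypothetical.

```python
from statistics import mean


def compare_judges(
    primary: dict[str, float],
    secondary: dict[str, float],
    tolerance_pp: float = 20.0,
) -> dict[str, float]:
    """Compare per-Q&A UNSUPPORTED rates (in percent) from two judges.

    Only Q&A ids present in both inputs are compared.
    """
    shared = sorted(primary.keys() & secondary.keys())
    if not shared:
        raise ValueError("no Q&A ids in common")
    macro_primary = mean(primary[q] for q in shared)
    macro_secondary = mean(secondary[q] for q in shared)
    within = sum(abs(primary[q] - secondary[q]) <= tolerance_pp for q in shared)
    return {
        "macro_diff_pp": macro_secondary - macro_primary,  # assumed sign convention
        "within_tolerance": within,
        "n": len(shared),
    }


# Made-up rates for three Q&As, for illustration only.
primary_rates = {"q1": 50.0, "q2": 60.0, "q3": 10.0}
secondary_rates = {"q1": 55.0, "q2": 90.0, "q3": 20.0}
report = compare_judges(primary_rates, secondary_rates)
print(report)
```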

Full results, per-topic / per-split breakdowns, methodology, and a note
on dual-use refusals during judging are in
`phase4_grounding/RESULTS.md`. Code and orchestrator are in
`phase4_grounding/`.

## 4. Compound coverage is biased toward well-studied drugs

21 changes: 17 additions & 4 deletions RESPONSIBLE_AI.md
@@ -99,16 +99,29 @@ model as clinically validated.**
(`scripts/audit_redaction.py`).
- Coverage-analysis script quantifies therapeutic-area and
molecular-property skew (`scripts/analyze_coverage.py`).
- **Phase 4 grounding audit** measures the rate at which Phase-2
answers contain claims not traceable to the cited evidence
(`phase4_grounding/RESULTS.md`). Headline: **55.20% UNSUPPORTED** in
the keep-structural view (95% CI 53.4–57.0%) on a 300-Q&A sample;
cross-validated by an independent judge to within +3.7pp. This
empirically substantiates Misuse Risk #2 (fabricated mechanisms) and
is the basis for narrowing the paper's grounding claim. Engineering /
design / metabolism Q&As carry the highest training-recall risk.
- **Dual-use refusal protocol** — the Phase 4 audit revealed that
`claude-sonnet-4.6` refuses to judge ~3% of dual-use chemistry Q&As
(toxin engineering, pesticide modifications, controlled-substance
analog reasoning). Falling back to `gemini-2.5-pro` recovers all of
them. Audits run on a single model will systematically miss these
topics; reproducers should use a heterogeneous-judge protocol.
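
A heterogeneous-judge protocol of the kind described in the last bullet might look like the sketch below. It is illustrative only: the `Judge` type, the convention that a refusal is signalled by `None`, and the stub judges are assumptions, not the orchestrator's actual interface.

```python
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: returns a label, or None when the model refuses.
Judge = Callable[[str], Optional[str]]


def judge_with_fallback(
    item: str, judges: Sequence[tuple[str, Judge]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each judge in order; return (judge_name, label) from the first
    one that does not refuse, or (None, None) if every judge refuses."""
    for name, judge in judges:
        verdict = judge(item)
        if verdict is not None:
            return name, verdict
    return None, None


# Stub judges standing in for real model calls (illustration only).
def primary_stub(text: str) -> Optional[str]:
    return None if "toxin" in text else "STATED"


def fallback_stub(text: str) -> Optional[str]:
    return "UNSUPPORTED"


judges = [("primary", primary_stub), ("fallback", fallback_stub)]
print(judge_with_fallback("claim about solubility", judges))
print(judge_with_fallback("claim about toxin engineering", judges))
```

Recording which judge produced each label keeps the fallback rate auditable, so the refusal share per topic can be reported alongside the grounding rates.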

### Deferred to Round 2

- Human-evaluated accuracy on a safety-critical-claim sub-sample
(scheduled).
- Phase 4 grounding check that verifies each functional claim is
traceable to an evidence sentence (proposed; requires additional LLM
compute).
- RAI review of the engineering-question category for synthesis-uplift
risk (proposed).
risk (proposed). The Phase 4 audit measured engineering Q&As at 74.6%
UNSUPPORTED, the worst of any topic — a strong prior for prioritizing
this review.
- Dataset card fields per the Croissant RAI schema (skeleton provided in
`croissant.json`; full population after full-run execution).

5 changes: 3 additions & 2 deletions croissant.json
@@ -51,13 +51,14 @@
"rai:dataImputationProtocol": "Missing molecular_formula / molecular_weight from CID-Mass.gz are left null; compounds with zero matching evidence sentences after redaction are dropped (30.2% of premium-tier compounds).",
"rai:dataPreprocessingProtocol": "Compound-name redaction to [COMPOUND] using longest-match-first synonym regex. Sentence-level dedup by redacted text. Random sampling (per-CID-seeded) to cap at 500 sentences per compound.",
"rai:dataManipulationProtocol": "Four-phase LLM pipeline: Phase 1 generation, Phase 2 blind re-answer, Phase 3 agreement judge. All prompts versioned in the source tree at chem2textqa/qa_pipeline/phase_*/ .",
"rai:dataSocialImpact": "Drug-related training data with public-health implications. A model fine-tuned on this data could produce plausible-but-incorrect clinical claims. See RESPONSIBLE_AI.md for intended use, misuse risks, and mitigations.",
"rai:dataSocialImpact": "Drug-related training data with public-health implications. A Phase 4 grounding audit on a 300-Q&A stratified sample measured 55.20% of claims as UNSUPPORTED by the cited evidence (95% Wilson CI 53.4–57.0%, n=3076 claims; cross-validated by an independent judge to within +3.7pp). A model fine-tuned on this data could produce plausible-but-incorrect clinical claims, and the published gold subset contains substantial training-recall content. See RESPONSIBLE_AI.md and phase4_grounding/RESULTS.md for intended use, misuse risks, mitigations, and the full audit.",
"rai:dataBiases": [
"Therapeutic-area bias: oncology / CV / CNS over-represented",
"Approval-status bias: FDA-approved drugs vs research chemicals",
"Publication bias: English-language biomedical literature",
"Model-consensus bias: 'gold' labels reflect what two LLMs agree on, not ground truth",
"Training-data overlap: evidence sentences are likely in LLM pretraining corpora"
"Training-data overlap: evidence sentences are likely in LLM pretraining corpora",
"Training-recall content (measured): 55.20% of Phase-2 answer claims are not traceable to the cited evidence (Phase 4 audit, n=3076 claims). Engineering / ADME / metabolism Q&As are worst-grounded (>67% UNSUPPORTED); mechanism / therapeutic-use / toxicity are better-grounded (<41%). See phase4_grounding/RESULTS.md."
],
"rai:dataUseCases": [
"Intended: instruction tuning for medicinal-chemistry LLM research.",