task5- Phase 4 grounding by kushalviit · Pull Request #10 · vinash85/Chem2TextQA

kushalviit · 2026-04-25T05:11:41Z

Summary

- Lands the `phase4_grounding/` package: sample → judge → aggregate pipeline with async OpenRouter client, claim parser, reporter, orchestrator, and 99 tests.
- Runs the production audit on a 300-Q&A stratified sample of `dataset_gold.jsonl`. Headline: **55.20% UNSUPPORTED** in the keep-structural view (95% Wilson CI 53.4–57.0%, n=3,076 claims). Cross-check with `gemini-2.5-pro` on 30 Q&As agrees within +3.7pp; the headline is judge-robust.
- Wires the measured numbers into `DATASHEET.md`, `RESPONSIBLE_AI.md`, `LIMITATIONS.md`, and `croissant.json` RAI fields. Phase 4 moves from "Round 2 deferred" to "Round 1 implemented." Full writeup at `phase4_grounding/RESULTS.md`.
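As a sanity check on the headline interval, the Wilson score interval can be recomputed from the reported rate and claim count. This is a minimal sketch, not the code in `grounding/aggregator.py`: `wilson_interval` is a hypothetical helper, z = 1.96 is assumed for the 95% level, and the unsupported count is reconstructed as `round(0.5520 * 3076)` because the exact count is not stated in this PR.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


n_claims = 3076
# Assumption: the count is reconstructed from the reported 55.20% rate.
n_unsupported = round(0.5520 * n_claims)
low, high = wilson_interval(n_unsupported, n_claims)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the result should agree with the reported 53.4–57.0% to one decimal place.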
### Methodology notes worth flagging in review

- **Dual-use refusal protocol**: `claude-sonnet-4.6` (primary) refused on 8/300 Q&As (toxin engineering, controlled-substance analogs, pesticide modifications). All 8 recovered via `gemini-2.5-pro` fallback (`scripts/rejudge_errors.py`). Single-judge audits would systematically miss these.
- **Where recall risk concentrates**: engineering 74.6%, ADME 69.5%, metabolism 68.8% UNSUPPORTED. Mechanism / therapeutic_use / toxicity all ≤41%.
- **Counterintuitive finding**: Q&As *with* evidence attached have a *higher* UNSUPPORTED rate (62.9% vs 46.5%): the model elaborates beyond what the evidence sentence states.
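The refusal-then-fallback idea can be illustrated with a small sketch. The PR handles refusals as a separate pass over the errored rows (`scripts/rejudge_errors.py`); the inline version below is a simplification, and `judge_with_fallback`, the `Judge` type, and both stand-in judges are hypothetical names that do not come from the PR.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Assumption: a judge takes a prompt and returns raw text, or None on refusal.
Judge = Callable[[str], Awaitable[Optional[str]]]


async def judge_with_fallback(
    prompt: str, primary: Judge, fallback: Judge
) -> tuple[str, Optional[str]]:
    """Return (judge_used, raw_response); call the fallback only on refusal."""
    raw = await primary(prompt)
    if raw is not None:
        return ("primary", raw)
    return ("fallback", await fallback(prompt))


# Stand-in judges for illustration only; real judges would call the API.
async def refusing_primary(prompt: str) -> Optional[str]:
    return None


async def answering_fallback(prompt: str) -> Optional[str]:
    return '{"claims": []}'


result = asyncio.run(
    judge_with_fallback("Q&A text", refusing_primary, answering_fallback)
)
print(result)
```

Recording which judge produced each row keeps the fallback rows identifiable in later analysis.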
### Costs

Total real spend: $10.16 OpenRouter (vs. $100 budget cap). Reproducible from `phase4_grounding/run_phase4_grounding.sh --n 300 --max-usd 100`.
## Test plan

- [ ] `conda run -n chem2text-phase4 python -m pytest phase4_grounding/tests/ -q`: expect 99 passes
- [ ] `python -c "import json; json.load(open('croissant.json'))"`: confirms RAI block is valid JSON
- [ ] Skim `phase4_grounding/RESULTS.md` for the headline / methodology / refusal-protocol narrative
- [ ] Optional: re-run the full audit end-to-end with an OpenRouter key (the orchestrator is resumable and budget-bounded)

Lands the full phase4_grounding/ pipeline (sample → judge → aggregate) with the async OpenRouter client, claim parser, reporter, tests, and orchestrator script. Parser accepts null rationale (STRUCTURAL/UNSUPPORTED claims) and strips markdown code fences so gemini-2.5-pro outputs parse alongside sonnet-4.6's. 99 tests pass.

OpenRouter occasionally returns choices[0].message.content = null (refusal, truncation, or empty tool-call response). The fence-stripper assumed a string and crashed on .strip(), killing the whole asyncio.gather mid-run. Parser now short-circuits non-string raw responses to a clean ParseResult error so the judge's retry/error-file path takes over.
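The guard described above might look roughly like the following. This is a sketch under stated assumptions, not the code in `grounding/parser.py`: the `ParseResult` fields, the `parse_raw` name, and the fence regex are all hypothetical, and the real parser also validates the schema, which is omitted here.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical result shape: exactly one of data / error is set."""
    data: Optional[Any]
    error: Optional[str]


# The fence marker is built at runtime so this example nests cleanly in Markdown.
FENCE = "`" * 3
_FENCED = re.compile(rf"^{FENCE}[\w-]*[ \t]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_raw(raw: object) -> ParseResult:
    # Short-circuit before any str method is called: content may be None.
    if not isinstance(raw, str):
        return ParseResult(None, f"non-string response: {type(raw).__name__}")
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)  # keep only the body inside the code fence
    try:
        return ParseResult(json.loads(text), None)
    except json.JSONDecodeError as exc:
        return ParseResult(None, f"invalid JSON: {exc.msg}")


print(parse_raw(None))
print(parse_raw('{"claims": []}'))
```

Returning an error value instead of raising means a single bad response cannot abort the surrounding `asyncio.gather`.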

Adds phase4_grounding/RESULTS.md with the headline UNSUPPORTED rate (55.20% keep-structural, 95% CI 53.4–57.0%, n=3076 claims / 300 Q&As) and a methodology / refusal note on the dual-use-chemistry sonnet refusals (8/300, all recovered via gemini fallback).

Updates:
- LIMITATIONS.md §3: replaces "scheduled for Round 2" with the measured rates and Wilson CIs.
- RESPONSIBLE_AI.md: moves the Phase 4 grounding check from Round 2 to Round 1, adds the dual-use-refusal protocol finding.
- DATASHEET.md: cites the headline rate in the "errors / noise" answer.

Adds two analysis scripts used to land the final dataset:
- rejudge_errors.py: gemini-2.5-pro fallback for sonnet-refused rows.
- analyze_cross_check_agreement.py: macro and per-row primary↔gemini agreement on the 30 dual-judged Q&As (macro Δ = +3.68pp keep view).
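For intuition, the macro comparison can be sketched as a claim-weighted average of per-row UNSUPPORTED rates, one per judge. The key names (`p_unsupported`, `p_n`, `g_unsupported`, `g_n`) follow the diff excerpt shown in this PR; `pooled_rate`, the sample rows, and the sign convention (cross-check minus primary) are assumptions made for illustration.

```python
def pooled_rate(rows: list[dict], rate_key: str, n_key: str) -> float:
    """Claim-weighted mean of per-row rates (hypothetical helper)."""
    total = sum(r[n_key] for r in rows)
    if total == 0:
        raise ValueError("no claims to aggregate")
    return sum(r[rate_key] * r[n_key] for r in rows) / total


# Made-up rows for illustration; these are not values from the audit.
rows = [
    {"p_unsupported": 0.50, "p_n": 10, "g_unsupported": 0.60, "g_n": 10},
    {"p_unsupported": 0.25, "p_n": 4, "g_unsupported": 0.50, "g_n": 6},
]

p_macro = pooled_rate(rows, "p_unsupported", "p_n")
g_macro = pooled_rate(rows, "g_unsupported", "g_n")
# Assumption: the delta is reported as cross-check minus primary, in points.
delta_pp = (g_macro - p_macro) * 100
print(f"primary={p_macro:.4f} cross-check={g_macro:.4f} delta={delta_pp:+.2f}pp")
```

Weighting by claim count keeps a Q&A with two claims from counting as much as one with twenty.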

Adds the measured 55.20% UNSUPPORTED rate (95% CI 53.4–57.0%, n=3076 claims) and the by-topic breakdown to rai:dataBiases, and references the audit in rai:dataSocialImpact. Both fields now point at phase4_grounding/RESULTS.md for the full report.

Copilot

Pull request overview

Adds the Phase 4 “grounding audit” subsystem (phase4_grounding/) to sample functional Q&As, judge claim grounding via OpenRouter asynchronously with retries/budgeting, and aggregate results into keep/drop-structural Markdown summaries; wires the measured headline metrics into dataset RAI/docs.
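A retry-with-backoff loop combined with a spend cap, as described in the overview above, could be sketched as follows. This is illustrative only: `BudgetedCaller`, `BudgetExceeded`, the fixed per-call cost, and the choice of `ConnectionError` as the retryable error are assumptions, not the interface of `grounding/openrouter_client.py`.

```python
import asyncio
import random


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the cap (hypothetical)."""


class BudgetedCaller:
    """Hypothetical wrapper: retries transient failures and tracks spend."""

    def __init__(self, max_usd: float, max_retries: int = 3, base_delay: float = 0.01):
        self.max_usd = max_usd
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.spent_usd = 0.0

    async def call(self, fn, cost_usd: float):
        # Refuse up front if this call would exceed the budget cap.
        if self.spent_usd + cost_usd > self.max_usd:
            raise BudgetExceeded(f"cap of ${self.max_usd:.2f} would be exceeded")
        for attempt in range(self.max_retries + 1):
            try:
                result = await fn()
            except ConnectionError:
                if attempt == self.max_retries:
                    raise
                # Exponential backoff with a little jitter between attempts.
                delay = self.base_delay * 2 ** attempt
                await asyncio.sleep(delay + random.uniform(0, self.base_delay))
            else:
                self.spent_usd += cost_usd  # charge only successful calls
                return result


attempts = {"n": 0}


async def flaky():
    # Stand-in for an API call that fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


caller = BudgetedCaller(max_usd=0.05)
outcome = asyncio.run(caller.call(flaky, cost_usd=0.03))
print(outcome, attempts["n"], caller.spent_usd)
```

Checking the cap before the request is what makes a run budget-bounded; a real client would more likely price each call from token usage than from a fixed cost.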

Changes:

- Introduces the phase4_grounding library modules (sampling, evidence attach, prompt rendering, parsing, judging, aggregation, reporting) plus reproducible scripts/orchestrator.
- Adds a comprehensive pytest suite with fixtures and integration tests (offline via fake OpenRouter client).
- Updates repository documentation/RAI artifacts to include Phase 4 audit results and methodology.

Reviewed changes

Copilot reviewed 37 out of 42 changed files in this pull request and generated 3 comments.

File	Description
phase4_grounding/tests/test_sampling.py	Unit tests for stratified sampling + determinism and exhausted strata behavior
phase4_grounding/tests/test_reporter.py	Reporter output/decision banner tests
phase4_grounding/tests/test_prompt.py	Prompt rendering tests + snapshot checks
phase4_grounding/tests/test_parser.py	Strict JSON/schema parser tests incl. markdown-fence stripping + None handling
phase4_grounding/tests/test_openrouter_client.py	OpenRouterClient retry/backoff/budget/pricing unit tests
phase4_grounding/tests/test_models.py	Dataclass smoke tests + tiny dataset shape checks
phase4_grounding/tests/test_judge.py	End-to-end judge retry/error behavior with fake client
phase4_grounding/tests/test_integration.py	Script-level integration coverage for sample→judge→aggregate
phase4_grounding/tests/test_evidence.py	Evidence attachment + renumbering tests
phase4_grounding/tests/test_aggregator.py	Aggregation correctness (keep/drop views, Wilson CI, breakdowns)
phase4_grounding/tests/data/tiny_dataset.jsonl	Deterministic mini dataset fixture for tests
phase4_grounding/tests/conftest.py	Shared fixtures incl. fake async OpenRouter client
phase4_grounding/tests/__init__.py	Test package marker
phase4_grounding/scripts/sample_qa.py	CLI: sample rows and attach evidence; write `sample.jsonl`
phase4_grounding/scripts/rejudge_errors.py	One-off rejudge tool for errored rows (fallback model/max_tokens)
phase4_grounding/scripts/judge_claims.py	CLI + core runner: judge rows (resumable), write success/error JSONL, cross-check pass
phase4_grounding/scripts/analyze_cross_check_agreement.py	Computes primary vs cross-check agreement and writes Markdown report
phase4_grounding/scripts/aggregate.py	CLI: load judged outputs and write dual summary markdown files
phase4_grounding/scripts/__init__.py	Scripts package marker
phase4_grounding/run_phase4_grounding.sh	Resumable 3-step orchestrator script
phase4_grounding/prompts/claim_decomp.txt	Judge prompt template (STRICT JSON schema + label definitions)
phase4_grounding/grounding/topic_bucket.py	Re-exports repo `bucket_topic` without requiring `scripts/` as importable pkg
phase4_grounding/grounding/sampling.py	Stratified functional QA sampler w/ topic weights + evidence-branch split
phase4_grounding/grounding/reporter.py	Renders keep/drop-structural Markdown summaries + tables
phase4_grounding/grounding/prompt.py	Template-based prompt builder with caching
phase4_grounding/grounding/parser.py	Strict response parser + schema validation + fence stripping
phase4_grounding/grounding/openrouter_client.py	Async OpenRouter client (retries/backoff/budget/cost tracking)
phase4_grounding/grounding/models.py	Dataclass contracts for pipeline objects/metrics
phase4_grounding/grounding/judge.py	Orchestrates prompt→chat→parse with single retry; emits JudgedQA or JudgeError
phase4_grounding/grounding/evidence.py	Evidence selection + renumbering into `[E#]` display IDs
phase4_grounding/grounding/aggregator.py	Computes keep/drop view metrics, breakdowns, Wilson CI
phase4_grounding/grounding/__init__.py	Grounding package marker
phase4_grounding/environment.yml	Conda environment definition for Phase 4 runs/tests
phase4_grounding/USAGE.md	Reproduction and CLI usage guide
phase4_grounding/RESULTS.md	Audit results writeup and methodology (headline metrics, refusals, cross-check)
phase4_grounding/PLAN.md	Design/spec for Phase 4 implementation
croissant.json	Adds measured grounding-audit findings to RAI fields
RESPONSIBLE_AI.md	Documents Phase 4 grounding audit + refusal protocol implications
LIMITATIONS.md	Updates “soft rule permits training recall” section with measured results
DATASHEET.md	Adds Phase 4 grounding headline as measured noise/training-recall signal
.gitignore	Ignores `phase4_grounding/outputs/` as reproducible artifacts

+
+@dataclass(frozen=True)
+class ChatResult:
+    text: str


+def _index_records(dataset_path: Path) -> dict[int, dict]:
+    idx: dict[int, dict] = {}
+    with dataset_path.open() as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            rec = json.loads(line)
+            idx[int(rec["cid"])] = rec
+    return idx


+            sum(r["p_unsupported"] * r["p_n"] for r in rows) / sum(r["p_n"] for r in rows)
+        )
+        g_macro = (
+            sum(r["g_unsupported"] * r["g_n"] for r in rows) / sum(r["g_n"] for r in rows)


kushalviit added 4 commits April 24, 2026 22:28

kushalviit requested review from avi-lab and luistafoi April 25, 2026 05:13

Macaulay001 requested a review from Copilot April 30, 2026 15:40

Copilot started reviewing on behalf of Macaulay001 April 30, 2026 15:41

Copilot AI reviewed Apr 30, 2026

kushalviit wants to merge 4 commits into main from Phase-4-grounding

2 participants
