task5- Phase 4 grounding #10

Open
kushalviit wants to merge 4 commits into main from Phase-4-grounding

@kushalviit kushalviit commented Apr 25, 2026

## Summary

- Lands the `phase4_grounding/` package: sample → judge → aggregate pipeline with async OpenRouter client, claim parser, reporter, orchestrator, and 99 tests.
- Runs the production audit on a 300-Q&A stratified sample of `dataset_gold.jsonl`. Headline: **55.20% UNSUPPORTED** in the keep-structural view (95% Wilson CI 53.4–57.0%, n=3,076 claims). Cross-check with `gemini-2.5-pro` on 30 Q&As agrees within +3.7pp, so the headline is judge-robust.
- Wires the measured numbers into `DATASHEET.md`, `RESPONSIBLE_AI.md`, `LIMITATIONS.md`, and `croissant.json` RAI fields. Phase 4 moves from "Round 2 deferred" to "Round 1 implemented." Full writeup at `phase4_grounding/RESULTS.md`.
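
The Wilson interval quoted in the headline can be checked by hand. The sketch below is not the PR's `aggregator.py`; it is a minimal standalone version of the standard Wilson score formula. The count of 1,698 UNSUPPORTED claims is an assumption, back-calculated from 55.20% of 3,076, since the summary reports only the rate.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 1698 of 3076 claims labelled UNSUPPORTED (about 55.20%).
lo, hi = wilson_interval(1698, 3076)
print(f"{100 * lo:.1f}% to {100 * hi:.1f}%")  # prints "53.4% to 57.0%"
```

Under that assumed count, the bounds agree with the reported 53.4–57.0% to one decimal place.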

### Methodology notes worth flagging in review

- **Dual-use refusal protocol**: `claude-sonnet-4.6` (primary) refused on 8/300 Q&As (toxin engineering, controlled-substance analogs, pesticide modifications). All 8 recovered via `gemini-2.5-pro` fallback (`scripts/rejudge_errors.py`). Single-judge audits would systematically miss these.
- **Where recall risk concentrates**: engineering 74.6%, ADME 69.5%, metabolism 68.8% UNSUPPORTED. Mechanism / therapeutic_use / toxicity all ≤41%.
- **Counterintuitive finding**: Q&As *with* evidence attached have a *higher* UNSUPPORTED rate (62.9% vs 46.5%): the model elaborates beyond what the evidence sentence states.
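
The two-pass refusal recovery described in the first note can be illustrated as follows. This is a hedged sketch, not the code in `scripts/rejudge_errors.py`: `fake_chat`, `judge_all`, and the model labels `"primary"` and `"fallback"` are hypothetical stand-ins, and a refusal is modelled simply as a `None` reply.

```python
import asyncio
from typing import Optional


async def fake_chat(model: str, question: str) -> Optional[str]:
    """Hypothetical stand-in for a judge call; the primary 'refuses' toxin prompts."""
    if model == "primary" and "toxin" in question:
        return None  # models a refusal / empty content
    return f"judged by {model}"


async def judge_all(model: str, questions: list[str]) -> tuple[dict[str, str], list[str]]:
    """Judge every question with one model; return (successes, questions that errored)."""
    replies = await asyncio.gather(*(fake_chat(model, q) for q in questions))
    judged = {q: r for q, r in zip(questions, replies) if r is not None}
    errors = [q for q, r in zip(questions, replies) if r is None]
    return judged, errors


async def main() -> tuple[dict[str, str], list[str]]:
    questions = ["mechanism question", "toxin engineering question", "ADME question"]
    judged, errors = await judge_all("primary", questions)
    # Second pass: only the errored rows go to the fallback judge.
    recovered, still_failed = await judge_all("fallback", errors)
    judged.update(recovered)
    return judged, still_failed


judged, still_failed = asyncio.run(main())
print(len(judged), still_failed)
```

Keeping the second pass separate from the first means the primary judge's error file doubles as a record of which rows were refused.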

### Costs

Total real spend: $10.16 on OpenRouter (vs. the $100 budget cap). Reproducible with `phase4_grounding/run_phase4_grounding.sh --n 300 --max-usd 100`.

## Test plan

- [ ] `conda run -n chem2text-phase4 python -m pytest phase4_grounding/tests/ -q` (expect 99 passes)
- [ ] `python -c "import json; json.load(open('croissant.json'))"` (confirms RAI block is valid JSON)
- [ ] Skim `phase4_grounding/RESULTS.md` for the headline / methodology / refusal-protocol narrative
- [ ] Optional: re-run the full audit end-to-end with an OpenRouter key; the orchestrator is resumable and budget-bounded

Lands the full phase4_grounding/ pipeline (sample → judge → aggregate)
with the async OpenRouter client, claim parser, reporter, tests, and
orchestrator script. Parser accepts null rationale (STRUCTURAL/UNSUPPORTED
claims) and strips markdown code fences so gemini-2.5-pro outputs parse
alongside sonnet-4.6's. 99 tests pass.

OpenRouter occasionally returns choices[0].message.content = null
(refusal, truncation, or empty tool-call response). The fence-stripper
assumed a string and crashed on .strip(), killing the whole asyncio.gather
mid-run. Parser now short-circuits non-string raw responses to a clean
ParseResult error so the judge's retry/error-file path takes over.
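
A minimal sketch of the guard described in this commit message, under stated assumptions: the `ParseResult` fields, `parse_judge_response`, and `_strip_fence` below are illustrative names and shapes, not the actual contents of `grounding/parser.py`. The point is that the type check runs before any string method is called.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

FENCE = "`" * 3  # a markdown code-fence marker


@dataclass(frozen=True)
class ParseResult:
    data: Optional[dict]
    error: Optional[str]


def _strip_fence(text: str) -> str:
    """Remove a wrapping markdown fence (with optional language tag), if present."""
    text = text.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[3:-3]
        _first_line, sep, rest = body.partition("\n")
        return rest.strip() if sep else body.strip()
    return text


def parse_judge_response(raw: Any) -> ParseResult:
    # Short-circuit before touching .strip(): content may be None on refusals.
    if not isinstance(raw, str):
        return ParseResult(None, f"non-string response: {type(raw).__name__}")
    try:
        data = json.loads(_strip_fence(raw))
    except json.JSONDecodeError as exc:
        return ParseResult(None, f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return ParseResult(None, "top-level JSON value is not an object")
    return ParseResult(data, None)


print(parse_judge_response(None).error)
print(parse_judge_response(FENCE + 'json\n{"claims": []}\n' + FENCE).data)
```

Returning an error value instead of raising keeps one bad row from cancelling the rest of a concurrent batch.
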
Adds phase4_grounding/RESULTS.md with the headline UNSUPPORTED rate
(55.20% keep-structural, 95% CI 53.4–57.0%, n=3076 claims / 300 Q&As)
and a methodology / refusal note on the dual-use-chemistry sonnet
refusals (8/300, all recovered via gemini fallback).

Updates:
- LIMITATIONS.md §3: replaces "scheduled for Round 2" with the measured
  rates and Wilson CIs.
- RESPONSIBLE_AI.md: moves the Phase 4 grounding check from Round 2 to
  Round 1, adds the dual-use-refusal protocol finding.
- DATASHEET.md: cites the headline rate in the "errors / noise" answer.

Adds two analysis scripts used to land the final dataset:
- rejudge_errors.py: gemini-2.5-pro fallback for sonnet-refused rows.
- analyze_cross_check_agreement.py: macro and per-row primary↔gemini
  agreement on the 30 dual-judged Q&As (macro Δ = +3.68pp keep view).

Adds the measured 55.20% UNSUPPORTED rate (95% CI 53.4–57.0%, n=3076
claims) and the by-topic breakdown to rai:dataBiases, and references
the audit in rai:dataSocialImpact. Both fields now point at
phase4_grounding/RESULTS.md for the full report.

Copilot AI left a comment

Pull request overview

Adds the Phase 4 “grounding audit” subsystem (phase4_grounding/) to sample functional Q&As, judge claim grounding via OpenRouter asynchronously with retries/budgeting, and aggregate results into keep/drop-structural Markdown summaries; wires the measured headline metrics into dataset RAI/docs.

Changes:

  • Introduces the phase4_grounding library modules (sampling, evidence attach, prompt rendering, parsing, judging, aggregation, reporting) plus reproducible scripts/orchestrator.
  • Adds a comprehensive pytest suite with fixtures and integration tests (offline via fake OpenRouter client).
  • Updates repository documentation/RAI artifacts to include Phase 4 audit results and methodology.

Reviewed changes

Copilot reviewed 37 out of 42 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `phase4_grounding/tests/test_sampling.py` | Unit tests for stratified sampling + determinism and exhausted strata behavior |
| `phase4_grounding/tests/test_reporter.py` | Reporter output/decision banner tests |
| `phase4_grounding/tests/test_prompt.py` | Prompt rendering tests + snapshot checks |
| `phase4_grounding/tests/test_parser.py` | Strict JSON/schema parser tests incl. markdown-fence stripping + None handling |
| `phase4_grounding/tests/test_openrouter_client.py` | OpenRouterClient retry/backoff/budget/pricing unit tests |
| `phase4_grounding/tests/test_models.py` | Dataclass smoke tests + tiny dataset shape checks |
| `phase4_grounding/tests/test_judge.py` | End-to-end judge retry/error behavior with fake client |
| `phase4_grounding/tests/test_integration.py` | Script-level integration coverage for sample→judge→aggregate |
| `phase4_grounding/tests/test_evidence.py` | Evidence attachment + renumbering tests |
| `phase4_grounding/tests/test_aggregator.py` | Aggregation correctness (keep/drop views, Wilson CI, breakdowns) |
| `phase4_grounding/tests/data/tiny_dataset.jsonl` | Deterministic mini dataset fixture for tests |
| `phase4_grounding/tests/conftest.py` | Shared fixtures incl. fake async OpenRouter client |
| `phase4_grounding/tests/__init__.py` | Test package marker |
| `phase4_grounding/scripts/sample_qa.py` | CLI: sample rows and attach evidence; write sample.jsonl |
| `phase4_grounding/scripts/rejudge_errors.py` | One-off rejudge tool for errored rows (fallback model/max_tokens) |
| `phase4_grounding/scripts/judge_claims.py` | CLI + core runner: judge rows (resumable), write success/error JSONL, cross-check pass |
| `phase4_grounding/scripts/analyze_cross_check_agreement.py` | Computes primary vs cross-check agreement and writes Markdown report |
| `phase4_grounding/scripts/aggregate.py` | CLI: load judged outputs and write dual summary markdown files |
| `phase4_grounding/scripts/__init__.py` | Scripts package marker |
| `phase4_grounding/run_phase4_grounding.sh` | Resumable 3-step orchestrator script |
| `phase4_grounding/prompts/claim_decomp.txt` | Judge prompt template (STRICT JSON schema + label definitions) |
| `phase4_grounding/grounding/topic_bucket.py` | Re-exports repo bucket_topic without requiring scripts/ as importable pkg |
| `phase4_grounding/grounding/sampling.py` | Stratified functional QA sampler w/ topic weights + evidence-branch split |
| `phase4_grounding/grounding/reporter.py` | Renders keep/drop-structural Markdown summaries + tables |
| `phase4_grounding/grounding/prompt.py` | Template-based prompt builder with caching |
| `phase4_grounding/grounding/parser.py` | Strict response parser + schema validation + fence stripping |
| `phase4_grounding/grounding/openrouter_client.py` | Async OpenRouter client (retries/backoff/budget/cost tracking) |
| `phase4_grounding/grounding/models.py` | Dataclass contracts for pipeline objects/metrics |
| `phase4_grounding/grounding/judge.py` | Orchestrates prompt→chat→parse with single retry; emits JudgedQA or JudgeError |
| `phase4_grounding/grounding/evidence.py` | Evidence selection + renumbering into [E#] display IDs |
| `phase4_grounding/grounding/aggregator.py` | Computes keep/drop view metrics, breakdowns, Wilson CI |
| `phase4_grounding/grounding/__init__.py` | Grounding package marker |
| `phase4_grounding/environment.yml` | Conda environment definition for Phase 4 runs/tests |
| `phase4_grounding/USAGE.md` | Reproduction and CLI usage guide |
| `phase4_grounding/RESULTS.md` | Audit results writeup and methodology (headline metrics, refusals, cross-check) |
| `phase4_grounding/PLAN.md` | Design/spec for Phase 4 implementation |
| `croissant.json` | Adds measured grounding-audit findings to RAI fields |
| `RESPONSIBLE_AI.md` | Documents Phase 4 grounding audit + refusal protocol implications |
| `LIMITATIONS.md` | Updates “soft rule permits training recall” section with measured results |
| `DATASHEET.md` | Adds Phase 4 grounding headline as measured noise/training-recall signal |
| `.gitignore` | Ignores phase4_grounding/outputs/ as reproducible artifacts |

@dataclass(frozen=True)
class ChatResult:
    text: str
Comment on lines +38 to +47
def _index_records(dataset_path: Path) -> dict[int, dict]:
    idx: dict[int, dict] = {}
    with dataset_path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            idx[int(rec["cid"])] = rec
    return idx
Comment on lines +95 to +98
    sum(r["p_unsupported"] * r["p_n"] for r in rows) / sum(r["p_n"] for r in rows)
)
g_macro = (
    sum(r["g_unsupported"] * r["g_n"] for r in rows) / sum(r["g_n"] for r in rows)
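
The excerpt divides a sum of rate × claim count by the total claim count. Assuming `p_unsupported` is a per-Q&A rate and `p_n` a per-Q&A claim count, that is a claim-weighted (pooled) rate rather than an unweighted mean over Q&As, even though the variable is named `g_macro`. Whether that is the intended statistic is not visible in the excerpt. The sketch below contrasts the two on made-up rows; the helper names are hypothetical.

```python
def claim_weighted_rate(rows: list[dict], rate_key: str, n_key: str) -> float:
    """Pooled rate: each row's rate weighted by its claim count (as in the excerpt)."""
    total = sum(r[n_key] for r in rows)
    if total == 0:
        raise ValueError("no claims to aggregate")
    return sum(r[rate_key] * r[n_key] for r in rows) / total


def unweighted_mean_rate(rows: list[dict], rate_key: str) -> float:
    """Plain mean of per-row rates: every Q&A counts equally."""
    if not rows:
        raise ValueError("no rows to aggregate")
    return sum(r[rate_key] for r in rows) / len(rows)


# Illustrative values only; not taken from the audit.
rows = [
    {"p_unsupported": 0.50, "p_n": 10},
    {"p_unsupported": 0.80, "p_n": 30},
]
weighted = claim_weighted_rate(rows, "p_unsupported", "p_n")
unweighted = unweighted_mean_rate(rows, "p_unsupported")
print(weighted, unweighted)
```

The two agree only when every Q&A has the same number of claims, so the report should state which one a "macro Δ" figure refers to.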