task5- Phase 4 grounding #10
Open
kushalviit wants to merge 4 commits into
Lands the full phase4_grounding/ pipeline (sample → judge → aggregate) with the async OpenRouter client, claim parser, reporter, tests, and orchestrator script. Parser accepts null rationale (STRUCTURAL/UNSUPPORTED claims) and strips markdown code fences so gemini-2.5-pro outputs parse alongside sonnet-4.6's. 99 tests pass.
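A minimal sketch of the fence-stripping idea, assuming the judge reply is a JSON object that may or may not be wrapped in a Markdown code fence. The helper name `strip_code_fence`, the regular expression, and the `claims` / `label` keys are illustrative assumptions; the actual code in phase4_grounding/grounding/parser.py is not shown on this page.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Hypothetical pattern: opening fence with optional language tag, payload, closing fence.
_FENCE_RE = re.compile(r"^`{3}[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?`{3}\s*$", re.DOTALL)


def strip_code_fence(raw: str) -> str:
    """Return the payload inside a surrounding Markdown code fence, if there is one."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text


# A bare reply and a fenced reply should parse to the same object.
bare = '{"claims": [{"label": "UNSUPPORTED", "rationale": null}]}'
fenced = FENCE + "json\n" + bare + "\n" + FENCE

parsed_bare = json.loads(strip_code_fence(bare))
parsed_fenced = json.loads(strip_code_fence(fenced))
```

JSON `null` becomes Python `None`, so a schema check that accepts a null rationale has to allow `None` alongside `str` for that field.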
OpenRouter occasionally returns choices[0].message.content = null (refusal, truncation, or empty tool-call response). The fence-stripper assumed a string and crashed on .strip(), killing the whole asyncio.gather mid-run. Parser now short-circuits non-string raw responses to a clean ParseResult error so the judge's retry/error-file path takes over.
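The guard described above could look roughly like the following sketch. `ParseResult` is named in the commit message, but its fields here (`ok`, `payload`, `error`) and the function name `parse_response` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ParseResult:
    # Hypothetical shape; the real ParseResult fields are not shown in this PR page.
    ok: bool
    payload: Optional[dict] = None
    error: Optional[str] = None


def parse_response(raw: Any) -> ParseResult:
    """Return an error result for non-string content instead of raising."""
    if not isinstance(raw, str):
        # Covers content = None (refusal, truncation, empty tool-call response).
        return ParseResult(ok=False, error=f"non-string response: {type(raw).__name__}")
    text = raw.strip()
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return ParseResult(ok=False, error=f"invalid JSON: {exc.msg}")
    if not isinstance(payload, dict):
        return ParseResult(ok=False, error="top-level JSON value is not an object")
    return ParseResult(ok=True, payload=payload)


null_result = parse_response(None)
good_result = parse_response('{"claims": []}')
```

Because the parser returns a value instead of raising, one bad row no longer propagates an exception out of `asyncio.gather`; the caller can route the error result to its retry or error-file path.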
Adds phase4_grounding/RESULTS.md with the headline UNSUPPORTED rate (55.20% keep-structural, 95% CI 53.4–57.0%, n=3076 claims / 300 Q&As) and a methodology / refusal note on the dual-use-chemistry sonnet refusals (8/300, all recovered via gemini fallback).

Updates:
- LIMITATIONS.md §3: replaces "scheduled for Round 2" with the measured rates and Wilson CIs.
- RESPONSIBLE_AI.md: moves the Phase 4 grounding check from Round 2 to Round 1, adds the dual-use-refusal protocol finding.
- DATASHEET.md: cites the headline rate in the "errors / noise" answer.

Adds two analysis scripts used to land the final dataset:
- rejudge_errors.py: gemini-2.5-pro fallback for sonnet-refused rows.
- analyze_cross_check_agreement.py: macro and per-row primary↔gemini agreement on the 30 dual-judged Q&As (macro Δ = +3.68pp keep view).
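For readers who want to sanity-check the reported interval, here is a short sketch of the standard Wilson score interval. The function name is illustrative and is not taken from phase4_grounding/grounding/aggregator.py. The count of 1698 UNSUPPORTED claims is an assumption, back-computed as 55.20% of 3076; the commit message states only the rate and n.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: round(0.5520 * 3076) = 1698 (not stated in the PR text).
low, high = wilson_interval(1698, 3076)
print(f"95% Wilson CI: {low:.4f} to {high:.4f}")
```

Under that assumed count the interval comes out close to the reported 53.4–57.0%.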
Adds the measured 55.20% UNSUPPORTED rate (95% CI 53.4–57.0%, n=3076 claims) and the by-topic breakdown to rai:dataBiases, and references the audit in rai:dataSocialImpact. Both fields now point at phase4_grounding/RESULTS.md for the full report.
Pull request overview
Adds the Phase 4 “grounding audit” subsystem (phase4_grounding/) to sample functional Q&As, judge claim grounding via OpenRouter asynchronously with retries/budgeting, and aggregate results into keep/drop-structural Markdown summaries; wires the measured headline metrics into dataset RAI/docs.
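The combination of async calls, retries, and a spend cap can be sketched as below. This is a toy model under stated assumptions: `Budget`, `TransientError`, `call_with_retries`, and the fake chat function are all hypothetical names, and the real OpenRouterClient's retry policy and pricing logic are not shown on this page.

```python
import asyncio


class TransientError(RuntimeError):
    """Stand-in for a retryable failure such as a rate limit or timeout."""


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the cap."""


class Budget:
    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        if self.spent + usd > self.max_usd:
            raise BudgetExceeded(f"cap of ${self.max_usd:.2f} reached")
        self.spent += usd


async def call_with_retries(call, *, max_attempts: int = 3, base_delay: float = 0.01):
    """Retry a coroutine factory with exponential backoff on TransientError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


def make_fake_chat(budget: Budget, fail_times: int, cost_usd: float):
    """Fake chat call that fails `fail_times` times, then charges and succeeds."""
    remaining = {"failures": fail_times}

    async def chat() -> str:
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise TransientError("simulated transient failure")
        budget.charge(cost_usd)
        return "ok"

    return chat


async def demo() -> tuple[list, float]:
    budget = Budget(max_usd=1.00)
    calls = [make_fake_chat(budget, fail_times=k, cost_usd=0.25) for k in (0, 1, 2)]
    results = await asyncio.gather(*(call_with_retries(c) for c in calls))
    return list(results), budget.spent


results, spent = asyncio.run(demo())
```

Charging only on success keeps failed attempts from consuming the cap in this sketch; whether the real client accounts for cost that way is not stated in the PR.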
Changes:
- Introduces the phase4_grounding library modules (sampling, evidence attach, prompt rendering, parsing, judging, aggregation, reporting) plus reproducible scripts/orchestrator.
- Adds a comprehensive pytest suite with fixtures and integration tests (offline via fake OpenRouter client).
- Updates repository documentation/RAI artifacts to include Phase 4 audit results and methodology.
Reviewed changes
Copilot reviewed 37 out of 42 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| phase4_grounding/tests/test_sampling.py | Unit tests for stratified sampling + determinism and exhausted strata behavior |
| phase4_grounding/tests/test_reporter.py | Reporter output/decision banner tests |
| phase4_grounding/tests/test_prompt.py | Prompt rendering tests + snapshot checks |
| phase4_grounding/tests/test_parser.py | Strict JSON/schema parser tests incl. markdown-fence stripping + None handling |
| phase4_grounding/tests/test_openrouter_client.py | OpenRouterClient retry/backoff/budget/pricing unit tests |
| phase4_grounding/tests/test_models.py | Dataclass smoke tests + tiny dataset shape checks |
| phase4_grounding/tests/test_judge.py | End-to-end judge retry/error behavior with fake client |
| phase4_grounding/tests/test_integration.py | Script-level integration coverage for sample→judge→aggregate |
| phase4_grounding/tests/test_evidence.py | Evidence attachment + renumbering tests |
| phase4_grounding/tests/test_aggregator.py | Aggregation correctness (keep/drop views, Wilson CI, breakdowns) |
| phase4_grounding/tests/data/tiny_dataset.jsonl | Deterministic mini dataset fixture for tests |
| phase4_grounding/tests/conftest.py | Shared fixtures incl. fake async OpenRouter client |
| phase4_grounding/tests/__init__.py | Test package marker |
| phase4_grounding/scripts/sample_qa.py | CLI: sample rows and attach evidence; write sample.jsonl |
| phase4_grounding/scripts/rejudge_errors.py | One-off rejudge tool for errored rows (fallback model/max_tokens) |
| phase4_grounding/scripts/judge_claims.py | CLI + core runner: judge rows (resumable), write success/error JSONL, cross-check pass |
| phase4_grounding/scripts/analyze_cross_check_agreement.py | Computes primary vs cross-check agreement and writes Markdown report |
| phase4_grounding/scripts/aggregate.py | CLI: load judged outputs and write dual summary markdown files |
| phase4_grounding/scripts/__init__.py | Scripts package marker |
| phase4_grounding/run_phase4_grounding.sh | Resumable 3-step orchestrator script |
| phase4_grounding/prompts/claim_decomp.txt | Judge prompt template (STRICT JSON schema + label definitions) |
| phase4_grounding/grounding/topic_bucket.py | Re-exports repo bucket_topic without requiring scripts/ as importable pkg |
| phase4_grounding/grounding/sampling.py | Stratified functional QA sampler w/ topic weights + evidence-branch split |
| phase4_grounding/grounding/reporter.py | Renders keep/drop-structural Markdown summaries + tables |
| phase4_grounding/grounding/prompt.py | Template-based prompt builder with caching |
| phase4_grounding/grounding/parser.py | Strict response parser + schema validation + fence stripping |
| phase4_grounding/grounding/openrouter_client.py | Async OpenRouter client (retries/backoff/budget/cost tracking) |
| phase4_grounding/grounding/models.py | Dataclass contracts for pipeline objects/metrics |
| phase4_grounding/grounding/judge.py | Orchestrates prompt→chat→parse with single retry; emits JudgedQA or JudgeError |
| phase4_grounding/grounding/evidence.py | Evidence selection + renumbering into [E#] display IDs |
| phase4_grounding/grounding/aggregator.py | Computes keep/drop view metrics, breakdowns, Wilson CI |
| phase4_grounding/grounding/__init__.py | Grounding package marker |
| phase4_grounding/environment.yml | Conda environment definition for Phase 4 runs/tests |
| phase4_grounding/USAGE.md | Reproduction and CLI usage guide |
| phase4_grounding/RESULTS.md | Audit results writeup and methodology (headline metrics, refusals, cross-check) |
| phase4_grounding/PLAN.md | Design/spec for Phase 4 implementation |
| croissant.json | Adds measured grounding-audit findings to RAI fields |
| RESPONSIBLE_AI.md | Documents Phase 4 grounding audit + refusal protocol implications |
| LIMITATIONS.md | Updates “soft rule permits training recall” section with measured results |
| DATASHEET.md | Adds Phase 4 grounding headline as measured noise/training-recall signal |
| .gitignore | Ignores phase4_grounding/outputs/ as reproducible artifacts |

    @dataclass(frozen=True)
    class ChatResult:
        text: str

Comment on lines +38 to +47

    def _index_records(dataset_path: Path) -> dict[int, dict]:
        idx: dict[int, dict] = {}
        with dataset_path.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                rec = json.loads(line)
                idx[int(rec["cid"])] = rec
        return idx

Comment on lines +95 to +98

        sum(r["p_unsupported"] * r["p_n"] for r in rows) / sum(r["p_n"] for r in rows)
    )
    g_macro = (
        sum(r["g_unsupported"] * r["g_n"] for r in rows) / sum(r["g_n"] for r in rows)

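The expression in the excerpt weights each row's UNSUPPORTED rate by its claim count, which yields a pooled, claim-weighted rate. The sketch below contrasts that with an unweighted per-row mean. The key names follow the excerpt; the two rows are made-up numbers for illustration only and are not audit results.

```python
# Illustrative rows only (hypothetical values, not from the audit).
rows = [
    {"p_unsupported": 0.50, "p_n": 10},
    {"p_unsupported": 1.00, "p_n": 30},
]

# Same form as the excerpt: rate weighted by number of claims per row.
pooled = sum(r["p_unsupported"] * r["p_n"] for r in rows) / sum(r["p_n"] for r in rows)

# Unweighted alternative: every row counts equally regardless of claim count.
per_row_mean = sum(r["p_unsupported"] for r in rows) / len(rows)

print(pooled, per_row_mean)
```

The two values differ whenever rows have unequal claim counts, so it helps to state which one a reported agreement figure uses.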
## Summary
Total real spend: $10.16 OpenRouter (vs. $100 budget cap). Reproducible from `phase4_grounding/run_phase4_grounding.sh --n 300 --max-usd 100`.

## Test plan
- [ ] `conda run -n chem2text-phase4 python -m pytest phase4_grounding/tests/ -q`: expect 99 passes
- [ ] `python -c "import json; json.load(open('croissant.json'))"`: confirms RAI block is valid JSON
- [ ] Skim `phase4_grounding/RESULTS.md` for the headline / methodology / refusal-protocol narrative
- [ ] Optional: re-run the full audit end-to-end with an OpenRouter key (the orchestrator is resumable and budget-bounded)