A benchmark for measuring whether frontier LLMs flatten contested scientific questions into false consensus, or calibrate their responses to the actual structure of scientific disagreement.
OpenQuestion targets questions where multiple incompatible frameworks claim partial validity and ground truth is unavailable in principle. It complements benchmarks like BioMysteryBench, which targets questions with verifiable data-derived answers, by covering the contested-question regime that BioMysteryBench explicitly does not address.
Figure 1. OpenQuestion tests whether LLMs preserve the structure of scientific disagreement rather than collapsing contested questions into false consensus or unsupported synthesis.
The repository contains two versions: V1 (frozen) and V2 (active). V2 is the main artifact; V1 is preserved for reproducibility and as the methodological starting point V2 builds on.
V2 introduces a spectrum framing for scientific disagreement and tests it on three chromatin-biophysics questions across three Anthropic models with four prompt framings each.
Most "competing frameworks" in mechanistic biology are not flat contradictions. They sit on a spectrum:
- genuine_adversarial: positions make different empirical predictions; no reconciling framework currently exists in the literature. (V2-002: cohesin loop extrusion limits.)
- borderline: partial reconciliation exists but residual puzzles remain. (V2-001: TAD organization in single cells.)
- pseudo_disagreement: apparent positions are scale- or state-dependent regimes of one underlying picture. (V2-003: chromatin material state.)
The correct response shape depends on where the question sits on this
spectrum. A model that produces a confident synthesis on a
genuine_adversarial question is committing false_synthesis. A
model that presents pseudo_disagreement as adversarial is committing
false_adversarial. The benchmark scores both response shape (D1-D3)
and calibration to disagreement type (D4).
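The two miscalibration rules stated above can be written as a small lookup. This is only an illustrative sketch: the response-shape names (`confident_synthesis`, `adversarial`) and the function name are assumptions made for the example, not identifiers from the rubric in scoring_rubric_v2.md, and the table covers only the two cases this README names.

```python
from typing import Optional

# Keys are (disagreement_type, response_shape). The response-shape strings
# are hypothetical names chosen for this sketch.
MISCALIBRATION = {
    ("genuine_adversarial", "confident_synthesis"): "false_synthesis",
    ("pseudo_disagreement", "adversarial"): "false_adversarial",
}


def miscalibration_label(disagreement_type: str, response_shape: str) -> Optional[str]:
    """Return the failure-mode label for a mismatch, or None if no rule applies."""
    return MISCALIBRATION.get((disagreement_type, response_shape))


print(miscalibration_label("genuine_adversarial", "confident_synthesis"))
print(miscalibration_label("pseudo_disagreement", "adversarial"))
```

Combinations not listed in the table return `None`; how the rubric treats them (for example on borderline questions) is not specified here.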
Each question is asked under four framings: naive, neutral,
competing_models_probe (asks for the competing models without
naming them), and synthesis_probe (asks for the underlying
reconciling picture). The two probes test different aspects of
calibration.
V2 tests Haiku 4.5, Sonnet 4.6, and Opus 4.7 via the Anthropic Messages API directly. Cross-vendor confounds from V1 are removed; capability scaling within a single training pipeline becomes the central test.
3 questions × 4 framings × 3 models × 3 trials = 108 responses. Each response is scored on a 4-dimension rubric (D1-D4, each 0-2, total 0-8) plus a categorical failure_mode label by the author and by an LLM co-rater (Opus 4.7 via API, three independent runs).
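As a quick check on the design arithmetic, the grid can be enumerated directly. The sketch below is not taken from collect_responses.py; the model labels are illustrative strings rather than API model IDs, and `total_score` is a hypothetical helper that only enforces the 0-2 range per dimension.

```python
from itertools import product

QUESTIONS = ["V2-001", "V2-002", "V2-003"]
FRAMINGS = ["naive", "neutral", "competing_models_probe", "synthesis_probe"]
MODELS = ["haiku-4.5", "sonnet-4.6", "opus-4.7"]  # illustrative labels only
TRIALS = [1, 2, 3]

# One tuple per response: (question, framing, model, trial).
grid = list(product(QUESTIONS, FRAMINGS, MODELS, TRIALS))
print(len(grid))  # 108


def total_score(d1: int, d2: int, d3: int, d4: int) -> int:
    """Sum the four rubric dimensions, each of which must be 0, 1, or 2."""
    dims = (d1, d2, d3, d4)
    if any(d not in (0, 1, 2) for d in dims):
        raise ValueError(f"each dimension must be 0, 1, or 2; got {dims}")
    return sum(dims)
```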
Total cost: ~$10.40 in Anthropic API spend.
The full writeup is in v2/findings_writeup.md. Headline results:
- Naive prompts produce canonical-view collapse across all three models. All 27 naive-framing cells score ≤3/8, with false_consensus as the dominant failure mode. Prompt structure is required for any disagreement-aware response.
- Capability scaling does not improve calibration. Mean Spearman ρ between model size rank and total score = −0.476 across 12 cells. Opus scores lower than Haiku and Sonnet on average. A more capable model is not automatically a more calibrated model on contested questions.
- Probe framing reshapes the failure mode predictably on genuine_adversarial questions. The synthesis_probe on V2-002 lowers scores by 1.89 points on average, predominantly through fabricated reconciling frameworks (false_synthesis). The probe reliably induces the predicted failure mode.
- Over_reconciliation is the dominant failure mode on pseudo_disagreement questions even under direct prompting. The textbook hedge ("viscoelastic with both passive and active components") wins even when the prompt asks for the underlying reconciling picture.
- No response in the pilot scored 8/8. The response shape that would score well (canonical-view-with-empirically-anchored-caveats) does not appear in any of the 108 responses.
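For readers who want to see what a per-cell Spearman ρ against model size rank looks like, here is a minimal pure-Python sketch. It is not the code in analyze_v2.py. It assumes a cell is one (question, framing) pair, that each model contributes one score per cell (for example a mean over trials), that ties get average ranks, and that cells where ρ is undefined (all scores equal) are skipped. The example scores are made up and are not pilot data.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; None if either side has no variance."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


SIZE_RANK = [1, 2, 3]  # Haiku, Sonnet, Opus in increasing size


def mean_rho_across_cells(cells):
    """cells maps (question, framing) to [haiku, sonnet, opus] scores."""
    rhos = [spearman_rho(SIZE_RANK, scores) for scores in cells.values()]
    defined = [r for r in rhos if r is not None]
    return mean(defined) if defined else None


# Made-up scores for illustration only.
example_cells = {
    ("V2-001", "naive"): [3.0, 2.0, 1.0],    # score falls with size
    ("V2-002", "neutral"): [4.0, 5.0, 6.0],  # score rises with size
}
print(mean_rho_across_cells(example_cells))
```

With only three models per cell, each ρ can take just a handful of values, which is one reason the result is best read descriptively.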
v2/
├── README.md # V2-specific docs
├── questions_v2.yaml # 3 pilot questions
├── scoring_rubric_v2.md # 4-dimension rubric
├── collect_responses.py # Anthropic API runner
├── blind_responses_v2.py # per-(question, framing) blinding
├── score_response_v2.py # interactive author scoring
├── cor_score_v2.py # automated Opus co-rater
├── analyze_v2.py # agreement + findings analysis
├── responses/ # raw API responses
├── blind_responses/ # blinded text files
├── author_scores_v2.csv # author scores
├── cor_scores_v2_run{1,2,3}.csv # three co-rater runs
├── agreement_report.md # author-vs-co-rater + self-consistency
├── findings_report.md # substantive results
└── findings_writeup.md # full writeup of pilot findings
cd v2
python collect_responses.py --trials 3
python blind_responses_v2.py --run-dir responses/run_<timestamp>
python cor_score_v2.py --output cor_scores_v2_run1.csv
python cor_score_v2.py --output cor_scores_v2_run2.csv
python cor_score_v2.py --output cor_scores_v2_run3.csv
python score_response_v2.py # interactive; takes ~3.5 hours
python analyze_v2.py all
Total: ~25 minutes for response collection, ~30-45 minutes per co-rater run, ~3.5 hours of interactive author scoring, ~10 seconds for analysis.
V1 was the first iteration of OpenQuestion (2026-Q1), testing whether frontier LLMs flatten 11 contested chromatin-biophysics questions into false consensus. V1 tested three models from three vendors (Claude, Codex/GPT, Gemini) via their respective CLIs.
V1's central finding: models systematically flatten when asked about the mechanism explaining a well-established empirical correlation ("how does X affect Y" where X-Y are robustly correlated but the mechanistic explanation is contested). The failure mode is structural flattening of contested relationships rather than item-level flattening of contested items — models name multiple mechanisms (D2 = 0.88) but rarely frame them as contested (D3 = 0.39, D4 = 0.55).
V1's methodological lessons that informed V2:
- Rubric dimensions cascaded in V1 (D1 → D2 → D4 followed mechanically). V2 redesigns the rubric to be independently variable.
- V1 questions were author-constructed from KB-derived disagreements rather than sourced from documented adversarial literature. V2 requires every position to have ≥2 papers explicitly defending it.
- V1 tested cross-vendor models, which confounds training pipeline, RLHF, and inference parameters. V2 tests within-family for cleaner capability-scaling signal.
V1 is preserved in v1/ for reproducibility. The V1 README documents
its own methodology and findings.
v1/
├── README.md
├── flattening_test_questions.yaml # 11 naive-framing questions
├── scoring_rubric.md # V1 rubric (4 dimensions, 0-2)
├── blind_responses_destyled/ # blinded responses, structurally normalized
├── scores.csv # author scores
├── cor_scores_v{1,2,3}.csv # co-rater scores (V1 protocol)
└── analysis.md # V1 findings
It is:
- A working artifact for measuring whether LLMs flatten contested scientific questions.
- A small-scale demonstration of methodology that scales to larger question banks.
- Complementary to verifiable-ground-truth benchmarks like BioMysteryBench.
It isn't:
- Statistically powered. The pilot is descriptive, not inferential.
- A claim of generality outside chromatin biophysics. All questions in V1 and V2 are from this single subfield.
- A complete benchmark for "AI for contested science." It targets one specific failure mode (structural flattening of disagreement) and one specific question shape (mechanistic disagreement in bench-life-sciences).
A few aspects of OpenQuestion may be useful beyond the specific findings:
Disagreement spectrum framing. Treating contested questions as a spectrum from genuine adversarial through borderline to pseudo-disagreement, and scoring calibration to that spectrum explicitly, is a more nuanced framing than "is there disagreement or not." Most real scientific contestation lives in the middle of this spectrum, and benchmarks that don't acknowledge the middle risk treating soft reconciliation as either correct (it sometimes is) or incorrect (it sometimes is) without distinguishing.
Two probe framings per question. Asking the same question under
both competing_models_probe and synthesis_probe framings
separates calibration from prompt-following. A calibrated model
scores similarly on both; an asymmetric pattern suggests the model is
following the prompt structure rather than the question's
disagreement structure.
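One simple way to quantify the asymmetry described above is the difference in mean total score between the two probe framings for a given question and model. The sketch below assumes score records are dicts with `framing` and `total` keys; those key names and the function name are hypothetical, and the example values are made up.

```python
from statistics import mean


def probe_asymmetry(records):
    """Mean total under competing_models_probe minus mean under synthesis_probe.

    A value near 0 means the two probes score similarly; a large absolute
    value means the score tracks the prompt framing.
    """
    competing = [r["total"] for r in records if r["framing"] == "competing_models_probe"]
    synthesis = [r["total"] for r in records if r["framing"] == "synthesis_probe"]
    if not competing or not synthesis:
        raise ValueError("need at least one record for each probe framing")
    return mean(competing) - mean(synthesis)


# Made-up scores for one (question, model) pair, three trials per framing.
example = [
    {"framing": "competing_models_probe", "total": 6},
    {"framing": "competing_models_probe", "total": 6},
    {"framing": "competing_models_probe", "total": 6},
    {"framing": "synthesis_probe", "total": 4},
    {"framing": "synthesis_probe", "total": 4},
    {"framing": "synthesis_probe", "total": 4},
]
print(probe_asymmetry(example))
```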
Field-whitelist contamination protocol. Rather than ensuring the co-rater doesn't see sensitive fields through prose policy, the implementation explicitly constructs the prompt from a whitelist of allowed fields. Adding new sensitive fields to the question schema cannot accidentally leak them, because the new field has to be explicitly added to the whitelist to appear. This is structurally stronger than blacklist-based approaches and stronger than process-based ("the rater agreed not to look") protocols.
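The whitelist idea can be sketched in a few lines. This is not the code in cor_score_v2.py: the field names (`question_text`, `response_text`, `disagreement_type`, `model`) and the function name are assumptions chosen for the example. The point it illustrates is that the prompt is built by iterating over the allowed fields, so anything else in the record is never read.

```python
# Hypothetical whitelist; only these fields can ever reach the co-rater prompt.
ALLOWED_FIELDS = ("question_text", "response_text")


def build_rater_prompt(record: dict) -> str:
    """Build the co-rater prompt from whitelisted fields only."""
    missing = [f for f in ALLOWED_FIELDS if f not in record]
    if missing:
        raise KeyError(f"record is missing required fields: {missing}")
    # Iterate over the whitelist, not over the record, so new record
    # fields are ignored unless someone adds them to ALLOWED_FIELDS.
    return "\n".join(f"{field}: {record[field]}" for field in ALLOWED_FIELDS)


record = {
    "question_text": "What limits loop extrusion?",
    "response_text": "Several factors have been proposed.",
    "disagreement_type": "genuine_adversarial",  # sensitive, must not leak
    "model": "opus",                             # sensitive, must not leak
}
prompt = build_rater_prompt(record)
print(prompt)
```

A blacklist version would instead iterate over the record and drop known-sensitive keys, which fails open when a new sensitive field is added.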
Failure-mode taxonomy. Categorical labels (false_consensus,
false_adversarial, false_synthesis, over_reconciliation,
selective_endorsement, none) are scored alongside the dimensional
rubric. This gives cleaner per-failure-mode statistics than free-text
notes (V1's approach) at the cost of forcing borderline cases into
labels.
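A closed label set is straightforward to enforce and tally. The following sketch uses the six labels listed above; the class and function names are hypothetical and are not taken from the repository's scripts.

```python
from collections import Counter
from enum import Enum


class FailureMode(str, Enum):
    FALSE_CONSENSUS = "false_consensus"
    FALSE_ADVERSARIAL = "false_adversarial"
    FALSE_SYNTHESIS = "false_synthesis"
    OVER_RECONCILIATION = "over_reconciliation"
    SELECTIVE_ENDORSEMENT = "selective_endorsement"
    NONE = "none"


def failure_mode_counts(labels):
    """Count labels per failure mode; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)


counts = failure_mode_counts(["false_consensus", "none", "false_consensus"])
print(counts[FailureMode.FALSE_CONSENSUS])  # 2
```

Rejecting unknown labels at parse time is what makes the per-failure-mode statistics clean, and it is also where the cost noted above shows up: a borderline response still has to be assigned exactly one label.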
OpenQuestion was built by Suho Lee (NYU postdoctoral biophysicist) as part of an exploratory project on AI-for-science evaluation methodology. Both V1 and V2 are reproducible end-to-end from the included seeds, scripts, and SDK versions recorded in run metadata.
The work is connected to but independent of any institutional affiliation. Feedback welcome at [contact info pending].
MIT.
