OpenQuestion

A benchmark for measuring whether frontier LLMs flatten contested scientific questions into false consensus, or calibrate their responses to the actual structure of scientific disagreement.

OpenQuestion targets questions where multiple incompatible frameworks claim partial validity and ground truth is unavailable in principle. This complements benchmarks like BioMysteryBench, which targets questions with verifiable data-derived answers; OpenQuestion targets the contested-question regime that BioMysteryBench explicitly does not address.

Figure 1. OpenQuestion tests whether LLMs preserve the structure of scientific disagreement rather than collapsing contested questions into false consensus or unsupported synthesis.

The repository contains two versions: V1 (frozen) and V2 (active). V2 is the main artifact; V1 is preserved for reproducibility and as the methodological starting point V2 builds on.

V2 (active)

V2 introduces a spectrum framing for scientific disagreement and tests it on three chromatin-biophysics questions across three Anthropic models with four prompt framings each.

Spectrum framing

Most "competing frameworks" in mechanistic biology are not flat contradictions. They sit on a spectrum:

genuine_adversarial — positions make different empirical predictions; no reconciling framework currently exists in the literature. (V2-002: cohesin loop extrusion limits.)
borderline — partial reconciliation exists but residual puzzles remain. (V2-001: TAD organization in single cells.)
pseudo_disagreement — apparent positions are scale- or state-dependent regimes of one underlying picture. (V2-003: chromatin material state.)

The correct response shape depends on where the question sits on this spectrum. A model that produces a confident synthesis on a genuine_adversarial question is committing false_synthesis. A model that presents pseudo_disagreement as adversarial is committing false_adversarial. The benchmark scores both response shape (D1-D3) and calibration to disagreement type (D4).
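The two miscalibration cases named above can be sketched as a small lookup. This is an illustration only, not the scoring code in v2/: the response-shape labels `confident_synthesis` and `adversarial` are hypothetical names introduced here, and only the two cases the text states are encoded.

```python
from typing import Optional

# Spectrum categories as named in this README.
DISAGREEMENT_TYPES = {"genuine_adversarial", "borderline", "pseudo_disagreement"}

# (disagreement type, response shape) -> failure mode.
# The response-shape strings are hypothetical labels for this sketch.
MISCALIBRATION = {
    ("genuine_adversarial", "confident_synthesis"): "false_synthesis",
    ("pseudo_disagreement", "adversarial"): "false_adversarial",
}


def calibration_failure(disagreement_type: str, response_shape: str) -> Optional[str]:
    """Return the failure mode for a mismatched response, or None if not listed."""
    if disagreement_type not in DISAGREEMENT_TYPES:
        raise ValueError(f"unknown disagreement type: {disagreement_type!r}")
    return MISCALIBRATION.get((disagreement_type, response_shape))


print(calibration_failure("genuine_adversarial", "confident_synthesis"))
```

Combinations outside the two stated cases return `None` here; the full rubric in scoring_rubric_v2.md covers more than this lookup does.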

Two probe framings per question

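Each question is asked under four framings: naive, neutral, competing_models_probe (asks for the competing models without naming them), and synthesis_probe (asks for the underlying reconciling picture). The two probes pull in opposite directions: one invites disagreement, the other invites reconciliation. Comparing them shows whether a model tracks the question's disagreement type or simply follows the prompt.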

Within-family model comparison

V2 tests Haiku 4.5, Sonnet 4.6, and Opus 4.7 via the Anthropic Messages API directly. Cross-vendor confounds from V1 are removed; capability scaling within a single training pipeline becomes the central test.

V2 pilot scope

3 questions × 4 framings × 3 models × 3 trials = 108 responses. Each response is scored on a 4-dimension rubric (D1-D4, each 0-2, total 0-8) plus a categorical failure_mode label by the author and by an LLM co-rater (Opus 4.7 via API, three independent runs).
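The design arithmetic can be checked by enumerating the grid. This is a minimal sketch: the question IDs and framing names come from this README, while the model strings are display names rather than API model identifiers.

```python
from itertools import product

QUESTIONS = ["V2-001", "V2-002", "V2-003"]
FRAMINGS = ["naive", "neutral", "competing_models_probe", "synthesis_probe"]
MODELS = ["Haiku 4.5", "Sonnet 4.6", "Opus 4.7"]  # display names, not API IDs
TRIALS = [1, 2, 3]

# One tuple per collected response.
grid = list(product(QUESTIONS, FRAMINGS, MODELS, TRIALS))

# Responses collected under the naive framing only.
naive_responses = [g for g in grid if g[1] == "naive"]

# A "cell" here means one (question, framing) pair, pooled over models and trials.
cells = {(question, framing) for question, framing, _, _ in grid}

# Rubric ceiling: four dimensions (D1-D4), each scored 0-2.
max_total = 4 * 2

print(len(grid), len(naive_responses), len(cells), max_total)
```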

Total cost: ~$10.40 in Anthropic API spend.

V2 pilot findings (summary)

The full writeup is in v2/findings_writeup.md. Headline results:

Naive prompts produce canonical-view collapse across all three models. All 27 naive-framing responses (3 questions × 3 models × 3 trials) score ≤3/8, with false_consensus as the dominant failure mode. Prompt structure is required for any disagreement-aware response.
Capability scaling does not improve calibration. The mean Spearman ρ between model size rank and total score is −0.476 across the 12 question × framing cells. Opus scores lower than Haiku and Sonnet on average. A more capable model is not automatically a more calibrated model on contested questions.
Probe framing reshapes the failure mode predictably on genuine_adversarial questions. The synthesis_probe on V2-002 lowers scores by 1.89 points on average, predominantly through fabricated reconciling frameworks (false_synthesis). The probe reliably induces the predicted failure mode.
over_reconciliation is the dominant failure mode on pseudo_disagreement questions even under direct prompting. The textbook hedge ("viscoelastic with both passive and active components") wins even when the prompt asks for the underlying reconciling picture.
No response in the pilot scored 8/8. The response shape that would score well — canonical-view-with-empirically-anchored-caveats — does not appear in any of the 108 responses.
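The capability-scaling statistic above can be sketched in pure Python. This assumes ρ is computed per (question, framing) cell from each model's mean total score and then averaged over cells; the README does not spell out the exact procedure, and analyze_v2.py is the authority. The scores below are made-up illustrative values, not pilot data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


SIZE_RANK = [1, 2, 3]  # Haiku, Sonnet, Opus (smallest to largest)

# Illustrative per-cell mean totals in SIZE_RANK order; NOT pilot results.
per_cell = {
    ("V2-002", "synthesis_probe"): [5.0, 4.0, 3.0],
    ("V2-001", "neutral"): [3.0, 5.0, 4.0],
}

rhos = {cell: spearman(SIZE_RANK, scores) for cell, scores in per_cell.items()}
mean_rho = mean(rhos.values())
print(rhos, mean_rho)
```

With only three models per cell, each per-cell ρ can take very few distinct values, which is one reason the pilot is best read as descriptive.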

Layout

v2/
├── README.md                       # V2-specific docs
├── questions_v2.yaml               # 3 pilot questions
├── scoring_rubric_v2.md            # 4-dimension rubric
├── collect_responses.py            # Anthropic API runner
├── blind_responses_v2.py           # per-(question, framing) blinding
├── score_response_v2.py            # interactive author scoring
├── cor_score_v2.py                 # automated Opus co-rater
├── analyze_v2.py                   # agreement + findings analysis
├── responses/                      # raw API responses
├── blind_responses/                # blinded text files
├── author_scores_v2.csv            # author scores
├── cor_scores_v2_run{1,2,3}.csv    # three co-rater runs
├── agreement_report.md             # author-vs-co-rater + self-consistency
├── findings_report.md              # substantive results
└── findings_writeup.md             # full writeup of pilot findings

Run V2 end-to-end

cd v2
python collect_responses.py --trials 3
python blind_responses_v2.py --run-dir responses/run_<timestamp>
python cor_score_v2.py --output cor_scores_v2_run1.csv
python cor_score_v2.py --output cor_scores_v2_run2.csv
python cor_score_v2.py --output cor_scores_v2_run3.csv
python score_response_v2.py        # interactive; takes ~3.5 hours
python analyze_v2.py all

Approximate time per step: ~25 minutes for response collection, ~30-45 minutes per co-rater run, ~3.5 hours of interactive author scoring, ~10 seconds for analysis.

V1 (frozen)

V1 was the first iteration of OpenQuestion (2026-Q1), testing whether frontier LLMs flatten 11 contested chromatin-biophysics questions into false consensus. V1 tested three models from three vendors (Claude, Codex/GPT, Gemini) via their respective CLIs.

V1's central finding: models systematically flatten when asked about the mechanism explaining a well-established empirical correlation ("how does X affect Y" where X-Y are robustly correlated but the mechanistic explanation is contested). The failure mode is structural flattening of contested relationships rather than item-level flattening of contested items — models name multiple mechanisms (D2 = 0.88) but rarely frame them as contested (D3 = 0.39, D4 = 0.55).

V1's methodological lessons that informed V2:

Rubric dimensions cascaded in V1 (D1 → D2 → D4 followed mechanically). V2 redesigns the rubric so that each dimension can vary independently of the others.
V1 questions were author-constructed from KB-derived disagreements rather than sourced from documented adversarial literature. V2 requires every position to have ≥2 papers explicitly defending it.
V1 tested cross-vendor models, which confounds training pipeline, RLHF, and inference parameters. V2 tests within-family for cleaner capability-scaling signal.

V1 is preserved in v1/ for reproducibility. The V1 README documents its own methodology and findings.

v1/
├── README.md
├── flattening_test_questions.yaml  # 11 naive-framing questions
├── scoring_rubric.md               # V1 rubric (4 dimensions, 0-2)
├── blind_responses_destyled/       # blinded responses, structurally normalized
├── scores.csv                      # author scores
├── cor_scores_v{1,2,3}.csv         # co-rater scores (V1 protocol)
└── analysis.md                     # V1 findings

What this benchmark is and isn't

It is:

A working artifact for measuring whether LLMs flatten contested scientific questions.
A small-scale demonstration of methodology that scales to larger question banks.
Complementary to verifiable-ground-truth benchmarks like BioMysteryBench.

It isn't:

Statistically powered. The pilot is descriptive, not inferential.
A claim of generality outside chromatin biophysics. All questions in V1 and V2 are from this single subfield.
A complete benchmark for "AI for contested science." It targets one specific failure mode (structural flattening of disagreement) and one specific question shape (mechanistic disagreement in bench-life-sciences).

Methodological contributions

A few aspects of OpenQuestion may be useful beyond the specific findings:

Disagreement spectrum framing. Treating contested questions as a spectrum from genuine adversarial through borderline to pseudo-disagreement, and scoring calibration to that spectrum explicitly, captures more than a binary "is there disagreement or not." Most real scientific contestation lives in the middle of this spectrum. Soft reconciliation is sometimes the correct response and sometimes not, and a benchmark that ignores the middle cannot tell the two cases apart.

Two probe framings per question. Asking the same question under both competing_models_probe and synthesis_probe framings separates calibration from prompt-following. A calibrated model scores similarly on both; an asymmetric pattern suggests the model is following the prompt structure rather than the question's disagreement structure.
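One way to quantify the asymmetry described above is the difference in mean total score between the two probes for a given model and question. This is a sketch under that assumption; the function name and the example scores are hypothetical, not taken from analyze_v2.py or the pilot data.

```python
from statistics import mean


def probe_asymmetry(scores_by_framing):
    """Mean competing_models_probe total minus mean synthesis_probe total.

    A value near 0 is what a calibrated model would show; a large gap
    suggests the response tracks the prompt rather than the question.
    """
    competing = mean(scores_by_framing["competing_models_probe"])
    synthesis = mean(scores_by_framing["synthesis_probe"])
    return competing - synthesis


# Illustrative totals (0-8) for three trials per framing; NOT pilot results.
example = {
    "competing_models_probe": [6, 5, 7],
    "synthesis_probe": [4, 3, 5],
}

gap = probe_asymmetry(example)
print(gap)
```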

Field-whitelist contamination protocol. Rather than ensuring the co-rater doesn't see sensitive fields through prose policy, the implementation explicitly constructs the prompt from a whitelist of allowed fields. Adding new sensitive fields to the question schema cannot accidentally leak them, because the new field has to be explicitly added to the whitelist to appear. This is structurally stronger than blacklist-based approaches and stronger than process-based ("the rater agreed not to look") protocols.
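A minimal sketch of the whitelist idea follows. The field names, the record layout, and `build_corater_prompt` are hypothetical and chosen for illustration; the actual schema is in questions_v2.yaml and the actual prompt construction is in cor_score_v2.py.

```python
# Only fields listed here can ever reach the co-rater prompt.
# Field names are hypothetical, for illustration.
ALLOWED_FIELDS = ("question_id", "question_text", "framing")


def build_corater_prompt(record: dict, response_text: str) -> str:
    """Build the prompt from whitelisted fields only; everything else is ignored."""
    lines = [f"{name}: {record[name]}" for name in ALLOWED_FIELDS if name in record]
    lines.append("response:")
    lines.append(response_text)
    return "\n".join(lines)


record = {
    "question_id": "V2-002",
    "question_text": "Placeholder question text, not the pilot wording.",
    "framing": "naive",
    "model": "opus",      # sensitive: would unblind the co-rater
    "author_score": 3,    # sensitive: would anchor the co-rater
}

prompt = build_corater_prompt(record, "Placeholder response text.")
print(prompt)
```

Because the prompt is assembled by iterating over `ALLOWED_FIELDS` rather than over the record, a field added to the schema later stays out of the prompt until someone deliberately adds it to the tuple.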

Failure-mode taxonomy. Categorical labels (false_consensus, false_adversarial, false_synthesis, over_reconciliation, selective_endorsement, none) are scored alongside the dimensional rubric. This gives cleaner per-failure-mode statistics than free-text notes (V1's approach) at the cost of forcing borderline cases into labels.
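The per-failure-mode statistics mentioned above become simple once labels are restricted to a closed set. The sketch below assumes labels arrive as plain strings; the `FailureMode` enum and `tally` helper are illustrative names, not the repository's code.

```python
from collections import Counter
from enum import Enum


class FailureMode(str, Enum):
    """The closed label set listed in this README."""
    FALSE_CONSENSUS = "false_consensus"
    FALSE_ADVERSARIAL = "false_adversarial"
    FALSE_SYNTHESIS = "false_synthesis"
    OVER_RECONCILIATION = "over_reconciliation"
    SELECTIVE_ENDORSEMENT = "selective_endorsement"
    NONE = "none"


def tally(labels):
    """Count labels per failure mode; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)


# Illustrative labels; NOT pilot results.
counts = tally(["false_consensus", "none", "false_consensus", "false_synthesis"])
print(counts.most_common(1))
```

Rejecting unknown strings at parse time is the trade-off the text describes: statistics stay clean, but a borderline response still has to be given exactly one label.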

Provenance and reproducibility

OpenQuestion was built by Suho Lee (NYU postdoctoral biophysicist) as part of an exploratory project on AI-for-science evaluation methodology. Both V1 and V2 are reproducible end-to-end from the included seeds, scripts, and SDK versions recorded in run metadata.

The work was carried out independently of the author's institutional affiliation. Feedback welcome at [contact info pending].

License

MIT.
