Companion repository for:
Vibe Science: How Adversarial Agent Loops Turn Vibe Researching into Verifiable Science

Carmine Russo, Elisa Bertelli

VibeX 2026 (co-located with EASE 2026), Glasgow, June 9–12, 2026
This repository contains the complete research artifacts from a 21-sprint investigation into CRISPR-Cas9 off-target prediction, conducted using the Vibe Science adversarial agent loop (Claude Code as Researcher-Agent, ChatGPT as external Reviewer-Agent). The investigation ran from January to February 2026.
The VibeX 2026 paper focuses on the agent architecture (adversarial loops, claim ledgers, serendipity engines), not on the biology. This repository provides the raw evidence: every sprint report, every analysis script, every claim with its lifecycle, and every reviewer intervention — so that the process claims in the paper can be independently verified.
| Phase | Sprints | What Happened |
|---|---|---|
| Failure | 1–2 | The original hypothesis, that Unbalanced Optimal Transport (UOT) models the chromatin filter, scored AUROC 0.375, worse than random guessing (0.500). |
| Serendipity | 3 | The Serendipity Engine flagged structured residual patterns in the failure data (score 13/15), triggering a formal pivot. |
| Pivot | 4–5 | Shifted to an "Affinity-First" framework. Discovered that 87% of cleaved sites fall in the top binding-affinity quartile. |
| Stress-test | 6–8 | R2 (ChatGPT) demanded a hierarchical bootstrap. The "Regime Switch" claim collapsed because the confidence intervals overlapped. Pivoted again to positional mismatch effects. |
| Discovery | 9–12 | Found that transitions are tolerated better than transversions (p = 2.35e-69), with a unique cytosine exception. Built a 5-feature Macro model that outperforms 20-feature Fine models. |
| Falsification | 13–14 | Cross-cell-line validation: macro patterns generalize, fine position rankings do not. Permutation tests passed. |
| Deep dive | 15–16 | Discovered the C exception (cytosine violates the Trans > Transv rule). Found a suspicious OR = 2.30 for consecutive mismatches. |
| Paper-saver | 17 | R2 demanded propensity matching. Exact stratified matching on total-mismatch count (57 strata) reversed the coefficient from −0.379 to +0.022. The entire consecutive-mismatch effect was an artifact of confounding by total mismatch count. Caught before any draft was written. |
| Cross-assay | 18 | 4 findings validated on an independent GUIDE-seq dataset (1,380,770 sites, 0.088% positive rate). |
| Anti-leakage | 19 | All 78 guides confirmed as dissimilar (minimum Hamming distance = 7). No data leakage. |
| Final stress | 20 | All surviving claims re-tested with block bootstrap. Macro model: AUPRC 0.365 (14.5× lift over random). |
| Mechanism | 21 | 6/6 structural predictions from Cas9 biology confirmed. Two-stage model (R-loop formation + conformational checkpoint) is consistent with known biophysics. |
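
The Sprint 17 episode turns on exact stratified matching, so a toy illustration may help. The sketch below is not the repository's analysis script: it swaps the reported regression coefficient for a simpler stratified rate difference, and the helper names (`crude_difference`, `stratified_difference`, `make_rows`) and all data are hypothetical. It only shows how an apparent effect can vanish once sites are compared within strata of equal total-mismatch count.

```python
from collections import defaultdict


def rate(outcomes):
    """Fraction of positive (cleaved) outcomes in a list of 0/1 labels."""
    return sum(outcomes) / len(outcomes)


def crude_difference(rows):
    """Cleavage-rate difference (exposed minus unexposed), ignoring strata."""
    exposed = [y for _, x, y in rows if x]
    unexposed = [y for _, x, y in rows if not x]
    return rate(exposed) - rate(unexposed)


def stratified_difference(rows):
    """Size-weighted average of within-stratum cleavage-rate differences.

    Strata that lack either group are skipped, so every comparison is
    made between sites with exactly the same stratum value.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for stratum, exposed, outcome in rows:
        strata[stratum][bool(exposed)].append(outcome)

    weighted_sum, total_weight = 0.0, 0
    for groups in strata.values():
        if not groups[True] or not groups[False]:
            continue  # no exact match available in this stratum
        weight = len(groups[True]) + len(groups[False])
        weighted_sum += weight * (rate(groups[True]) - rate(groups[False]))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no stratum contains both groups")
    return weighted_sum / total_weight


def make_rows(n_mm, exposed, n, n_cleaved):
    """Hypothetical helper: n rows in one stratum, n_cleaved of them positive."""
    return [(n_mm, exposed, 1 if i < n_cleaved else 0) for i in range(n)]


# Toy data (invented): rows are (total mismatches, has consecutive mismatches, cleaved).
# Consecutive mismatches are more common at high mismatch counts, where
# cleavage is rare for every site regardless of mismatch arrangement.
rows = (
    make_rows(2, True, 2, 1) + make_rows(2, False, 8, 4)
    + make_rows(5, True, 20, 2) + make_rows(5, False, 10, 1)
)

crude = crude_difference(rows)
matched = stratified_difference(rows)
print(f"crude: {crude:+.3f}, stratified: {matched:+.3f}")
```

In this toy data the crude comparison shows a clearly negative effect, while the within-stratum differences are all zero: the apparent effect comes entirely from how exposure is distributed across mismatch counts.
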
| Model | AUPRC | vs. Random | vs. MIT Score | vs. CFD Score |
|---|---|---|---|---|
| Macro (5 features) | 0.365 | 14.5× | +350% | +166% |
| Fine (20 features) | 0.352 | 14.0× | — | — |
| CFD (baseline) | 0.137 | 5.4× | — | — |
| MIT (baseline) | 0.081 | 3.2× | — | — |
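
The "vs. Random" column can be checked by hand. The expected AUPRC of a random classifier equals the positive rate, so the sketch below assumes the column divides each AUPRC by the 2.52% positive rate of the primary dataset; that reading is an inference from the numbers, not something the table states. The function names are illustrative.

```python
# Assumption: the random baseline is the primary dataset's 2.52% positive rate.
POSITIVE_RATE = 0.0252

# AUPRC values copied from the table above.
AUPRC = {"Macro": 0.365, "Fine": 0.352, "CFD": 0.137, "MIT": 0.081}


def lift_over_random(score, base_rate=POSITIVE_RATE):
    """How many times better than a random classifier's expected AUPRC."""
    return score / base_rate


def relative_gain_percent(score, baseline):
    """Percentage improvement of score over a baseline score."""
    return 100.0 * (score / baseline - 1.0)


for name, score in AUPRC.items():
    print(f"{name}: {lift_over_random(score):.1f}x over random")

print(f"Macro vs. MIT: {relative_gain_percent(AUPRC['Macro'], AUPRC['MIT']):+.1f}%")
print(f"Macro vs. CFD: {relative_gain_percent(AUPRC['Macro'], AUPRC['CFD']):+.1f}%")
```

Under that assumption the lifts match the table to one decimal place, and the relative gains over MIT and CFD agree with the reported +350% and +166% to within rounding.
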
| Status | Count | Description |
|---|---|---|
| Validated | 7 | Survived cross-assay replication + block bootstrap |
| Qualified | 4 | Signal present but requires caveats |
| Downgraded | 5 | Signal partial or unreliable as originally stated |
| Killed | 6 | Fully refuted (including UOT and consecutive-mismatch effect) |
| Exploratory | 12 | Discussed but never formally promoted |
| Total | 34 | 50% retraction rate among promoted claims |
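
The 50% figure can be reproduced from the counts above under one reading, which is an assumption here rather than a definition taken from the ledger: "promoted" means every claim that is not Exploratory, and "retracted" means Downgraded plus Killed.

```python
# Counts copied from the claim-status table.
counts = {
    "Validated": 7,
    "Qualified": 4,
    "Downgraded": 5,
    "Killed": 6,
    "Exploratory": 12,
}

total = sum(counts.values())

# Assumed definitions (not stated explicitly in the table):
promoted = total - counts["Exploratory"]           # claims formally promoted
retracted = counts["Downgraded"] + counts["Killed"]  # claims later weakened or refuted

retraction_rate = retracted / promoted
print(f"{retracted} of {promoted} promoted claims retracted ({retraction_rate:.0%})")
```

Other readings, such as counting only Killed claims or dividing by all 34 claims, do not give 50%.
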
- Position-dependent mismatch tolerance: Mismatches near the PAM-proximal end are disproportionately damaging to Cas9 cleavage.
- Transition > Transversion tolerance: Transitions are tolerated ~2× better than transversions (p < 1e-6 in both assays), with a unique C exception where C>A (transversion) is tolerated better than C>T (transition).
- Macro features > Fine features: A 5-feature model (log_change, n_mm, seed_mm, trans_ratio, is_ngg) outperforms 20 fine-grained positional features.
- Mismatch-burden threshold: Sharp conformational checkpoint at 4–5 total mismatches; 91.2% of high-affinity sites are blocked in vivo.
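
To make the macro feature names concrete, here is a minimal sketch of how four of them might be derived from a guide and a candidate site. It is not the repository's feature code. It assumes DNA-alphabet sequences of equal length written 5' to 3' with the PAM-proximal end last, and a seed region of the 10 PAM-proximal positions; the `seed_len` parameter and the function name are hypothetical. `log_change` is omitted because its definition is not given here.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def macro_features(guide, target, pam, seed_len=10):
    """Compute n_mm, seed_mm, trans_ratio and is_ngg for one candidate site.

    seed_len is an assumed seed size, not a value taken from the study.
    """
    guide, target, pam = guide.upper(), target.upper(), pam.upper()
    if len(guide) != len(target):
        raise ValueError("guide and target must have the same length")

    # (position, guide base, target base) for every mismatched position.
    mismatches = [
        (i, g, t) for i, (g, t) in enumerate(zip(guide, target)) if g != t
    ]
    n_mm = len(mismatches)

    # Seed = the last seed_len positions, i.e. those closest to the PAM.
    seed_start = len(guide) - seed_len
    seed_mm = sum(1 for i, _, _ in mismatches if i >= seed_start)

    n_transitions = sum(1 for _, g, t in mismatches if (g, t) in TRANSITIONS)
    trans_ratio = n_transitions / n_mm if n_mm else 0.0

    is_ngg = int(len(pam) == 3 and pam[1:] == "GG")

    return {
        "n_mm": n_mm,
        "seed_mm": seed_mm,
        "trans_ratio": trans_ratio,
        "is_ngg": is_ngg,
    }


# Toy sequences: one transition at position 0 (G>A, outside the seed)
# and one transversion at position 18 (A>C, inside the seed).
feats = macro_features(
    guide="GAGTCCGAGCAGAAGAAGAA",
    target="AAGTCCGAGCAGAAGAAGCA",
    pam="TGG",
)
print(feats)
```
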
crispr-offtarget-serendipity/
│
├── README.md # This file
│
├── sprint-reports/ # Markdown reports per sprint
│ ├── SPRINT5_RESULTS.md
│ ├── SPRINT8_CRITICAL_FINDINGS.md
│ ├── SPRINT8_FINAL_FINDINGS.md
│ ├── SPRINT9_FINDINGS.md
│ ├── SPRINT10-12_FINDINGS.md
│ ├── SPRINT14_FALSIFICATION_REPORT.md
│ ├── SPRINT15_DEEP_ANALYSIS_REPORT.md
│ ├── SPRINT16_COMPREHENSIVE_REPORT.md
│ ├── SPRINT17_CRITICAL_REPORT.md # The "paper-saver" episode
│ ├── SPRINT18_FINAL_REPORT.md # Cross-assay validation
│ ├── SPRINT19_FALSIFICATION_REPORT.md
│ ├── SPRINT20_STRESS_TEST_REPORT.md
│ └── SPRINT21_MECHANISM_REPORT.md
│
├── research-journey/ # High-level summaries and overviews
│ ├── ENDOCRISPR_RESEARCH_JOURNEY_v3.md
│ ├── ENDOCRISPR_UNIFIED_REPORT.md
│ ├── MASTER_INDEX.md
│ └── CROSS_ASSAY_FINAL_RESULTS.md
│
├── claim-ledger/ # Formal claim tracking
│ ├── CLAIM_LEDGER_SPRINT1-21.md # 34 claims with lifecycle
│ └── MANIFEST_SPRINT1-21.md # Sprint-by-sprint artifact registry
│
├── scripts/ # Analysis scripts (Python)
│ ├── endocrispr_uot_analysis.py # Sprint 1-2: UOT (failed)
│ ├── sprint3_prepare_data.py
│ ├── sprint3_prepare_data_v2.py
│ ├── sprint3_ablation_suite.py
│ ├── sprint4_hypothesis_test.py
│ ├── sprint4_affinity_stratified.py
│ ├── sprint4_master_analysis.py
│ ├── sprint5_competition_analysis.py
│ └── ... (63 scripts total)
│
├── notebooks/ # Colab / Jupyter notebooks
│ ├── EndoCRISPR_Sprint2_Colab.ipynb
│ ├── EndoCRISPR_Sprint3_Colab.ipynb
│ ├── EndoCRISPR_Sprint3_FINAL.ipynb
│ ├── Sprint4_ATAC_Extraction_Colab.ipynb
│ ├── Sprint6_H3K27ac_Extraction_Colab.ipynb
│ ├── colab_GSE149363_analysis.ipynb
│ ├── COLAB_cross_assay_validation.ipynb
│ ├── COLAB_dl_benchmark.ipynb
│ ├── COLAB_cross_assay_model_transfer.ipynb
│ └── EndoCRISPR_Sprint3_Colab_LITE.ipynb
│
├── results/ # CSV outputs from analysis scripts
│ └── ... (64 result CSVs)
│
└── r2-interventions/ # ChatGPT (external R2) review logs
  └── CHATGPT_FEEDBACK_LOG.md
Primary dataset: Lazzarotto et al., 2020, Nature Biotechnology 38:1317–1327. DOI: 10.1038/s41587-020-0555-7. GEO accession: GSE149363.
- 80,306 candidate off-target sites across 78 guide RNA sequences
- Primary human CD4+/CD8+ T-cells
- CHANGE-seq (in vitro) vs. GUIDE-seq (in cellula) cross-assay design
- 2.52% positive rate (extreme class imbalance)
Cross-assay validation dataset: 1,380,770 GUIDE-seq sites (0.088% positive rate) from independent experimental conditions.
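
Sprint 19's anti-leakage audit reports a minimum pairwise Hamming distance of 7 across the 78 guides. A check of that kind could look like the sketch below, which assumes equal-length guide sequences and uses invented toy sequences rather than the real guides; the function names are illustrative.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(guides):
    """Smallest Hamming distance over all unordered pairs of guides."""
    return min(hamming(a, b) for a, b in combinations(guides, 2))


# Toy guides (invented, 4 nt for readability).
toy_guides = ["AAAA", "AATT", "TTTT"]
closest = min_pairwise_hamming(toy_guides)
print(f"minimum pairwise Hamming distance: {closest}")
```

A large minimum distance means no two guides are near-duplicates, so splitting data by guide does not place almost identical sequences on both sides of a split.
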
This research was conducted using the Vibe Science v3.5 protocol:
- R1 (Researcher-Agent): Claude Code (Opus 4.5, Anthropic) — executed all analyses, generated scripts, produced sprint reports.
- R2 (External Reviewer-Agent): ChatGPT (GPT-5.2, OpenAI) — reviewed each sprint with no access to code or data. Its only instruction: "Demolish everything demolishable. Trust only the data."
- Human operators: Carmine Russo and Elisa Bertelli — made pivot decisions, transferred sprint summaries between agents, and served as final arbiters.
| Sprint | R2 Demanded | Impact |
|---|---|---|
| 5 | Covariate isolation for sponge effect | Confounding exposed |
| 8 | Hierarchical bootstrap CIs | "Regime Switch" claim killed (d = 0.07) |
| 11 | Challenged "bidirectional" terminology | Corrected to "differential tolerance" |
| 14 | Permutation tests for Trans > Transv | Finding survived |
| 16 | Flagged OR = 2.30 as suspicious | Queued for propensity matching |
| 17 | Demanded propensity matching | Consecutive-mismatch claim killed (paper-saver) |
| 19 | Anti-leakage audit | Data integrity confirmed |
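
The Sprint 8 intervention relied on hierarchical bootstrap confidence intervals. The sketch below shows one common two-level form: resample guides with replacement, then resample sites within each drawn guide. Treating guides as the top level is an assumption, since the table does not specify the hierarchy, and the function names and data are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def hierarchical_bootstrap_ci(groups, statistic=mean, n_boot=2000,
                              alpha=0.05, seed=0):
    """Percentile CI from a two-level bootstrap.

    groups maps a group id (assumed: one guide) to its list of observations.
    """
    rng = random.Random(seed)
    ids = list(groups)
    estimates = []
    for _ in range(n_boot):
        sample = []
        # Level 1: resample whole groups with replacement.
        for gid in rng.choices(ids, k=len(ids)):
            obs = groups[gid]
            # Level 2: resample observations within the drawn group.
            sample.extend(rng.choices(obs, k=len(obs)))
        estimates.append(statistic(sample))
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy cleavage labels per guide for two hypothetical regimes (invented data).
regime_a = {"g1": [1, 0, 0, 1], "g2": [0, 0, 1, 0], "g3": [1, 1, 0, 0]}
regime_b = {"g1": [1, 0, 1, 0], "g2": [0, 1, 0, 0], "g3": [0, 1, 1, 0]}

ci_a = hierarchical_bootstrap_ci(regime_a)
ci_b = hierarchical_bootstrap_ci(regime_b)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Resampling at the guide level keeps sites from the same guide together, which generally widens the interval compared with resampling sites as if they were independent.
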
- Python 3.10+
- scikit-learn, scipy, numpy, pandas
- pyBigWig (for ATAC-seq extraction)
- Google Colab (for notebooks) or local Jupyter
# Table 2 (Claim Evolution): derived from claim-ledger/CLAIM_LEDGER_SPRINT1-21.md
# Table 3 (Feature Traceability): derived from claim-ledger/MANIFEST_SPRINT1-21.md
# Table 4 (R2 Interventions): derived from r2-interventions/CHATGPT_FEEDBACK_LOG.md

The claim ledger and manifest are structured markdown files that map directly to the paper's tables.
- Vibe Science plugin: th3vib3coder/vibe-science
- VibeX 2026 workshop: conf.researchr.org/home/ease-2026/vibex-2026
If you use these artifacts, please cite:
@inproceedings{russo2026vibescience,
author = {Russo, Carmine and Bertelli, Elisa},
title = {Vibe Science: How Adversarial Agent Loops Turn Vibe Researching into Verifiable Science},
booktitle = {Proceedings of the 1st International Workshop on Vibe Coding and Vibe Researching (VibeX), co-located with EASE 2026},
year = {2026},
location = {Glasgow, Scotland, United Kingdom}
}

Apache 2.0. See LICENSE.