This repository contains the analysis code and data for METR's time horizon methodology, as described in "Measuring AI Ability to Complete Long Tasks".
An interactive dashboard for exploring AI capability horizons with rigorous statistical analysis:
- Weighted Least Squares Regression — Models weighted by inverse variance from confidence intervals
- Bootstrap Confidence Intervals — 2000 resamples for uncertainty quantification on doubling times
- Proper Prediction Intervals — Combines trend uncertainty with residual variance
- Model Comparison — Exponential vs linear fit comparison via AIC
- Out-of-Sample Validation — Tests predictions on held-out recent models
- Custom Scenarios — Input your own doubling time assumptions to compare projections
- Milestone Calculator — Estimate when AI will reach target capability horizons
- Doubling Time Calculator — Compute implied doubling times between any two models or dates
The dashboard pulls data from METR's benchmark results (v1.1) and updates daily via GitHub Actions.
The time horizon methodology measures AI agent capabilities by:
- Collecting tasks with known human completion times
- Running AI agents on these tasks and recording success/failure
- Fitting a logistic curve modeling P(success) as a function of log2(human_minutes)
- Extracting the "time horizon" - the task duration where the model hits a success threshold
Key finding: Under the v1.1 methodology, AI agent time horizons have been doubling approximately every 4.3 months (131 days) since 2023.
.
├── src/horizon/ # Analysis code (installable Python package)
│ ├── utils/ # Core utilities (logistic regression, plots)
│ ├── wrangle/ # Data wrangling (bootstrap, logistic fitting)
│ └── plot/ # Plot generation modules
├── data/
│ └── external/
│ └── release_dates.yaml # Model release dates
└── reports/
├── time-horizon-1-0/ # Time Horizon v1.0
│ ├── dvc.yaml # DVC pipeline definition
│ ├── params.yaml # Report parameters
│ └── data/raw/
│ └── runs.jsonl # Run data
└── time-horizon-1-1/ # Time Horizon v1.1
├── dvc.yaml
├── params.yaml
└── data/raw/
└── runs.jsonl
# Clone the repository
git clone https://github.com/METR/eval-analysis-public.git
cd eval-analysis-public
# Install the horizon package in editable mode
pip install -e .
# Or with dev dependencies
pip install -e ".[dev]"Each report has its own DVC pipeline. To run a report:
# Run the time-horizon-1-0 report
cd reports/time-horizon-1-0
dvc repro
# Run the time-horizon-1-1 report
cd reports/time-horizon-1-1
dvc reproThe pipelines will:
- Run bootstrap sampling for confidence intervals
- Fit logistic regression models
- Generate plots and metrics
The runs.jsonl files contain one JSON object per line with the following key fields:
| Field | Description |
|---|---|
task_id |
Unique task identifier |
task_family |
Group of related tasks |
alias |
Public model name |
score_binarized |
0 (failure) or 1 (success) |
score_cont |
Continuous score 0-1 |
human_minutes |
How long a qualified human expert takes |
invsqrt_task_weight |
Diversity-adjusted weight for this run |
After running dvc repro, you'll find:
data/wrangled/bootstrap/*.csv- Bootstrap samples for confidence intervalsdata/wrangled/logistic_fits/*.csv- Logistic regression fits with p50/p80 horizonsplots/- Generated visualizationsmetrics/- YAML files with key metrics
import pandas as pd
# Load logistic fits
fits = pd.read_csv("reports/time-horizon-1-0/data/wrangled/logistic_fits/headline.csv")
# See horizons by agent
print(fits[["agent", "p50", "p50q0.025", "p50q0.975"]].sort_values("p50", ascending=False))
# Load raw runs
runs = pd.read_json("reports/time-horizon-1-0/data/raw/runs.jsonl", lines=True)
# Success rate by agent
print(runs.groupby("alias")["score_binarized"].mean().sort_values(ascending=False))The main model report with comprehensive analysis of 48+ models using the original metr-task-standard evaluation framework, including:
- Time horizon trends (p50, p80)
- Bootstrap confidence intervals
- Token usage analysis
- Comparison overlays with time-horizon-1-1 results
The current primary report, using the updated v1.1 task suite (228 tasks, up from 170 in v1.0) evaluated on the Inspect framework. Includes stages for comparing doubling times with time-horizon-1-0 (compare_doubling_times_vs_th_1_0).
If you use this code or data, please cite:
@article{metr2025horizon,
title={Measuring AI Ability to Complete Long Tasks},
author={METR},
journal={arXiv preprint arXiv:2503.14499},
year={2025}
}See LICENSE file for details.