METR Time Horizon Analysis

This repository contains the analysis code and data for METR's time horizon methodology, as described in "Measuring AI Ability to Complete Long Tasks".

Interactive Dashboard

View the live dashboard →

An interactive dashboard for exploring AI capability horizons with rigorous statistical analysis:

Weighted Least Squares Regression — Models weighted by inverse variance from confidence intervals
Bootstrap Confidence Intervals — 2000 resamples for uncertainty quantification on doubling times
Proper Prediction Intervals — Combines trend uncertainty with residual variance
Model Comparison — Exponential vs linear fit comparison via AIC
Out-of-Sample Validation — Tests predictions on held-out recent models
Custom Scenarios — Input your own doubling time assumptions to compare projections
Milestone Calculator — Estimate when AI will reach target capability horizons
Doubling Time Calculator — Compute implied doubling times between any two models or dates
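The trend fit, doubling-time, and milestone features listed above can be illustrated with a small standard-library sketch. It is not the dashboard's code: the data points are invented, the helper names (`wls_line`, `milestone_date`) are hypothetical, and it assumes each confidence interval is a 95% interval that is roughly symmetric in log2 space.

```python
import math
from datetime import date, timedelta

# Hypothetical points: (release date, p50 horizon in minutes, CI low, CI high).
# These numbers are invented for illustration and are not METR results.
points = [
    (date(2023, 3, 1), 5.0, 2.5, 10.0),
    (date(2024, 3, 1), 30.0, 15.0, 60.0),
    (date(2025, 3, 1), 200.0, 120.0, 330.0),
]


def wls_line(xs, ys, ws):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


origin = points[0][0]
xs = [(d - origin).days for d, *_ in points]      # days since first model
ys = [math.log2(h) for _, h, _, _ in points]      # log2 of horizon
# Assumption: a 95% CI spans about 2 * 1.96 standard errors in log2 space.
ses = [(math.log2(hi) - math.log2(lo)) / (2 * 1.96) for _, _, lo, hi in points]
ws = [1.0 / se ** 2 for se in ses]                # inverse-variance weights

intercept, slope = wls_line(xs, ys, ws)
doubling_days = 1.0 / slope  # slope is measured in doublings per day


def milestone_date(target_minutes):
    """Date at which the fitted trend reaches target_minutes."""
    days = (math.log2(target_minutes) - intercept) / slope
    return origin + timedelta(days=days)


print(f"doubling time: {doubling_days:.0f} days")
print("8-hour horizon reached:", milestone_date(480))
```

Fitting in log2 space turns an exponential trend into a straight line, so the doubling time is the reciprocal of the slope.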

The dashboard pulls data from METR's benchmark results (v1.1) and updates daily via GitHub Actions.

Overview

The time horizon methodology measures AI agent capabilities by:

Collecting tasks with known human completion times
Running AI agents on these tasks and recording success/failure
Fitting a logistic curve modeling P(success) as a function of log2(human_minutes)
Extracting the "time horizon": the task duration at which the fitted success probability falls to a chosen threshold (for example, 50% for p50 or 80% for p80)
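The last two steps can be sketched in plain Python. This is an unweighted illustration on invented runs, not the package's implementation: `fit_logistic` and `horizon_minutes` are hypothetical names, and the fit uses simple gradient ascent rather than whatever solver the repository uses.

```python
import math


def fit_logistic(log2_minutes, successes, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent."""
    a, b = 0.0, 0.0
    n = len(successes)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p            # d(log-likelihood)/da
            grad_b += (y - p) * x      # d(log-likelihood)/db
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def horizon_minutes(a, b, threshold=0.5):
    """Task length in minutes at which the fitted P(success) equals threshold."""
    logit = math.log(threshold / (1.0 - threshold))
    return 2.0 ** ((logit - a) / b)


# Hypothetical runs as (human_minutes, success); invented for illustration.
runs = [(1, 1), (2, 1), (4, 1), (8, 1), (8, 0), (16, 1), (16, 0),
        (32, 0), (32, 1), (64, 0), (128, 0), (256, 0)]
xs = [math.log2(m) for m, _ in runs]
ys = [s for _, s in runs]

a, b = fit_logistic(xs, ys)
p50 = horizon_minutes(a, b, 0.5)
p80 = horizon_minutes(a, b, 0.8)
print(f"p50 ~ {p50:.1f} min, p80 ~ {p80:.1f} min")
```

Because success probability falls as tasks get longer, the fitted slope `b` is negative, and the p80 horizon is always shorter than the p50 horizon.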

Key finding: Under the v1.1 methodology, AI agent time horizons have been doubling approximately every 4.3 months (131 days) since 2023.
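To make the doubling-time figure concrete, the arithmetic below converts 131 days into months and into an implied yearly growth factor. The `projected_horizon` helper is a hypothetical name, and the projection simply assumes the exponential trend continues unchanged.

```python
DAYS_PER_YEAR = 365.25
doubling_days = 131  # from the key finding above

doubling_months = doubling_days / (DAYS_PER_YEAR / 12)
doublings_per_year = DAYS_PER_YEAR / doubling_days
growth_per_year = 2 ** doublings_per_year


def projected_horizon(h0_minutes, days_ahead, doubling_days=doubling_days):
    """Horizon after days_ahead if the exponential trend simply continued."""
    return h0_minutes * 2 ** (days_ahead / doubling_days)


print(f"{doubling_months:.1f} months per doubling")
print(f"{growth_per_year:.1f}x growth per year")
print(f"{projected_horizon(60, 365):.0f} minutes after one year, starting from 60")
```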

Repository Structure

.
├── src/horizon/           # Analysis code (installable Python package)
│   ├── utils/             # Core utilities (logistic regression, plots)
│   ├── wrangle/           # Data wrangling (bootstrap, logistic fitting)
│   └── plot/              # Plot generation modules
├── data/
│   └── external/
│       └── release_dates.yaml  # Model release dates
└── reports/
    ├── time-horizon-1-0/  # Time Horizon v1.0
    │   ├── dvc.yaml       # DVC pipeline definition
    │   ├── params.yaml    # Report parameters
    │   └── data/raw/
    │       └── runs.jsonl # Run data
    └── time-horizon-1-1/  # Time Horizon v1.1
        ├── dvc.yaml
        ├── params.yaml
        └── data/raw/
            └── runs.jsonl

Installation

# Clone the repository
git clone https://github.com/METR/eval-analysis-public.git
cd eval-analysis-public

# Install the horizon package in editable mode
pip install -e .

# Or with dev dependencies
pip install -e ".[dev]"

Running the Reports

Each report has its own DVC pipeline. To run a report:

# Run the time-horizon-1-0 report (from the repository root)
(cd reports/time-horizon-1-0 && dvc repro)

# Run the time-horizon-1-1 report (from the repository root)
(cd reports/time-horizon-1-1 && dvc repro)

The pipelines will:

Run bootstrap sampling for confidence intervals
Fit logistic regression models
Generate plots and metrics
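As a rough illustration of the bootstrap step, the sketch below computes a percentile confidence interval for a success rate by resampling runs with replacement. The repository's actual resampling scheme may differ (for example, it may resample at the task or task-family level). The `bootstrap_ci` helper and the scores are hypothetical.

```python
import random


def bootstrap_ci(values, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lo_idx], estimates[hi_idx]


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical binarized scores for one agent (invented for illustration).
scores = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
low, high = bootstrap_ci(scores, mean)
print(f"success rate {mean(scores):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The same pattern extends to horizons: refit the logistic curve on each resample and take quantiles of the resulting p50 or p80 values.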

Data Format

The runs.jsonl files contain one JSON object per line with the following key fields:

Field	Description
`task_id`	Unique task identifier
`task_family`	Group of related tasks
`alias`	Public model name
`score_binarized`	0 (failure) or 1 (success)
`score_cont`	Continuous score between 0 and 1
`human_minutes`	Time, in minutes, that a qualified human expert takes to complete the task
`invsqrt_task_weight`	Diversity-adjusted weight for this run
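A minimal sketch of reading records in this layout with the standard library follows. The sample records are invented, `weighted_success_rates` is a hypothetical helper, and it assumes `invsqrt_task_weight` is meant to be applied as a per-run weight when averaging `score_binarized`.

```python
import json
from collections import defaultdict

# Hypothetical records using the fields above; all values are invented.
sample_runs = [
    {"task_id": "fam_a/t1", "task_family": "fam_a", "alias": "model-x",
     "score_binarized": 1, "score_cont": 1.0, "human_minutes": 4.0,
     "invsqrt_task_weight": 0.5},
    {"task_id": "fam_a/t2", "task_family": "fam_a", "alias": "model-x",
     "score_binarized": 0, "score_cont": 0.2, "human_minutes": 30.0,
     "invsqrt_task_weight": 0.5},
    {"task_id": "fam_b/t1", "task_family": "fam_b", "alias": "model-x",
     "score_binarized": 1, "score_cont": 0.9, "human_minutes": 12.0,
     "invsqrt_task_weight": 1.0},
    {"task_id": "fam_a/t1", "task_family": "fam_a", "alias": "model-y",
     "score_binarized": 0, "score_cont": 0.0, "human_minutes": 4.0,
     "invsqrt_task_weight": 1.0},
    {"task_id": "fam_b/t1", "task_family": "fam_b", "alias": "model-y",
     "score_binarized": 1, "score_cont": 1.0, "human_minutes": 12.0,
     "invsqrt_task_weight": 1.0},
]
# One JSON object per line, as in runs.jsonl.
sample_lines = [json.dumps(run) for run in sample_runs]


def weighted_success_rates(lines):
    """Return {alias: weighted mean of score_binarized} from JSONL lines."""
    num = defaultdict(float)
    den = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        run = json.loads(line)
        w = run["invsqrt_task_weight"]
        num[run["alias"]] += w * run["score_binarized"]
        den[run["alias"]] += w
    return {alias: num[alias] / den[alias] for alias in num}


rates = weighted_success_rates(sample_lines)
for alias, rate in sorted(rates.items()):
    print(alias, round(rate, 3))
```

For a real file, pass an open file handle instead of `sample_lines`, since iterating over a text file yields one line at a time.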

Key Outputs

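After running dvc repro, you'll find the following, with paths relative to the report directory (for example, reports/time-horizon-1-1/):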

data/wrangled/bootstrap/*.csv - Bootstrap samples for confidence intervals
data/wrangled/logistic_fits/*.csv - Logistic regression fits with p50/p80 horizons
plots/ - Generated visualizations
metrics/ - YAML files with key metrics

Analyzing Results

import pandas as pd

# Load logistic fits
fits = pd.read_csv("reports/time-horizon-1-0/data/wrangled/logistic_fits/headline.csv")

# See horizons by agent
print(fits[["agent", "p50", "p50q0.025", "p50q0.975"]].sort_values("p50", ascending=False))

# Load raw runs
runs = pd.read_json("reports/time-horizon-1-0/data/raw/runs.jsonl", lines=True)

# Success rate by agent
print(runs.groupby("alias")["score_binarized"].mean().sort_values(ascending=False))

Reports

time-horizon-1-0

The earlier v1.0 report: a comprehensive analysis of 48+ models using the original metr-task-standard evaluation framework, including:

Time horizon trends (p50, p80)
Bootstrap confidence intervals
Token usage analysis
Comparison overlays with time-horizon-1-1 results

time-horizon-1-1

The current primary report, using the updated v1.1 task suite (228 tasks, up from 170 in v1.0) evaluated on the Inspect framework. Includes stages for comparing doubling times with time-horizon-1-0 (compare_doubling_times_vs_th_1_0).

Citation

If you use this code or data, please cite:

@article{metr2025horizon,
  title={Measuring AI Ability to Complete Long Tasks},
  author={METR},
  journal={arXiv preprint arXiv:2503.14499},
  year={2025}
}

License

See LICENSE file for details.
