FutureCast-Bench

FutureCast-Bench is a context-aware forecasting benchmark from the FutureCast（天星台） project. It is designed to evaluate whether forecasting models can move beyond numerical extrapolation and reason with real-world context.

Instead of representing each forecasting task only as:

historical time series -> future values

FutureCast-Bench represents each task as:

historical time series + numeric exogenous variables + textual exogenous context + evidence annotations -> forecasting target

The goal is to support the next generation of time series foundation models, LLM-driven forecasting models, slow-thinking forecasting systems, and agentic forecasting workflows.
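
The task representation above can be sketched as a small data structure. This is only an illustrative sketch: the class name `ForecastTask`, its field names, and the toy values are assumptions made for this example and are not part of the FutureCast-Bench code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ForecastTask:
    """Hypothetical container for one context-aware forecasting task."""
    history: List[float]                       # historical target values
    numeric_exogenous: Dict[str, List[float]]  # numeric covariates, one list per variable
    text_exogenous: List[str]                  # textual context notes
    evidence: List[str] = field(default_factory=list)  # evidence annotations
    horizon: int = 1                           # number of future steps to forecast

    def check_alignment(self) -> bool:
        """Return True if every numeric covariate has the same length as the history."""
        return all(len(values) == len(self.history)
                   for values in self.numeric_exogenous.values())


# Toy values, invented purely for illustration.
task = ForecastTask(
    history=[10.0, 12.0, 11.5],
    numeric_exogenous={"temperature": [21.0, 24.0, 23.0]},
    text_exogenous=["Public holiday on the last day of the window."],
    horizon=2,
)
print(task.check_alignment())
```

A plain numerical benchmark would keep only `history` and `horizon`; the remaining fields carry the context that FutureCast-Bench adds.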

Quickstart

FutureCast-Bench now includes a small file-processing example and a tiny in-repository sample dataset. The sample shows the intended workflow: read a raw CSV file, write the standard FutureCast layout, then validate the generated files.

git clone https://github.com/ustc-time-series/Future-Cast.git
cd Future-Cast
pip install -e .

Process the toy raw file into the standard layout:

python scripts/prepare_toy_energy.py \
  --source examples/raw/toy_energy/toy_energy_raw.csv \
  --output examples/sample_data/toy_energy

Validate the generated FutureCast layout:

python scripts/validate_futurecast_layout.py toy_energy --data-root .

The output follows the same file organization used by every processed dataset:

examples/sample_data/toy_energy/
  processed/
    target/
    numeric_exogenous/
    text_exogenous/
  tasks/

For full processed datasets stored outside the GitHub repository, use the same validation script with the corresponding data root:

python scripts/validate_futurecast_layout.py aq_data --data-root /path/to/FutureCast

The first code release focuses on reproducible file processing, lightweight CSV layout validation, and a minimal sample dataset. Full benchmark evaluation, baselines, and leaderboard submission checks are planned next.

Why FutureCast-Bench?

Most existing time series benchmarks are built around the numerical series itself. They evaluate whether a model can forecast future values from past values across different domains, frequencies, and horizons. This is necessary, but it is not enough for real-world forecasting.

In real applications, the future is often shaped by information outside the target sequence:

electricity demand and prices are affected by weather, holidays, supply-demand balance, market rules, and grid conditions;
traffic flow is affected by accidents, weather, events, commuting patterns, and spatial structure;
retail demand is affected by promotions, prices, holidays, inventory, store location, and consumer behavior;
clinical variables are affected by patient status, treatment intervention, missingness, and medical knowledge;
macroeconomic indicators are affected by policy changes, inflation, interest rates, employment, and market expectations;
climate and hydrology variables are affected by seasonal cycles, geography, precipitation, snowpack, and local physical conditions;
air quality is affected by regional transport, meteorology, co-pollutants, station location, and seasonal environmental conditions;
cloud machine utilization is affected by workload scheduling, resource contention, business cycles, and cluster-level operations;
industrial sensor signals are affected by operating conditions, maintenance, environment, and abnormal events.

Two time series can have similar historical shapes but very different futures because their surrounding contexts are different. A benchmark that only measures numerical error cannot tell whether a model is merely fitting statistical patterns or actually understanding why the future changes.

FutureCast-Bench is designed to fill this gap. It evaluates not only whether a model predicts accurately, but also whether it can:

identify which contextual information is relevant;
align context with the correct time interval and forecasting target;
reason about how external factors affect future values;
revise predictions when new evidence appears;
generate evidence-grounded explanations for forecasting decisions.

Core Capabilities

FutureCast-Bench is organized around three core capabilities.

1. Context-Sequence Alignment and Fusion

Models should be able to align target time series with heterogeneous context, including calendar information, spatial attributes, weather, events, business rules, domain knowledge, and textual descriptions.

2. Contextual Reasoning for Forecasting

Models should not only output future values, but also reason about the potential impact of contextual factors. For example, high temperatures may increase electricity load, a promotion may increase retail demand, and a policy change may shift macroeconomic trends.

3. Dynamic Context Adaptation

Real forecasting is often an iterative process. A model may first make a prediction with incomplete information, then receive new evidence, update its reasoning, and revise the forecast. FutureCast-Bench includes tasks for evaluating this dynamic adaptation ability.

Current Benchmark Status

The current processed benchmark covers 15 datasets across 10 domains, with approximately 149K forecasting series and 277M timestamp-level records. Each dataset is stored in a lightweight CSV layout with one target file, one numeric exogenous file, and one text exogenous file for each forecasting series, plus one task YAML definition for each benchmark task.

Domain	Dataset	Forecasting Target	Forecasting Unit	Frequency	Dataset Size	Variables	Lookback / Prediction Windows
Energy	SDWPF	Wind turbine active power	Wind turbine	10 min	134 turbine series; about 11.36M timestamp records	1 target; 24 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 30 days; Prediction: 8 hours / 24 hours / 7 days
Power	AEMO NEM DispatchIS	Regional electricity price (`RRP`)	NEM region	5 min	5 region series; about 129.6K timestamp records	1 target; 11 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 28 days; Prediction: 1 hour / 24 hours / 7 days
Power	OPSD German Load	German actual electricity load	Country-level load series	1 hour	1 load series; about 50.4K hourly records	1 target; 3 numeric exogenous variables; text exogenous context	Lookback: 7 / 30 / 90 days; Prediction: 24 hours / 7 days / 30 days
Sales	M5	Daily unit sales	Item-store pair	1 day	30,490 item-store series; about 59.18M daily records	1 target; 7 numeric exogenous variables; text exogenous context	Lookback: 56 / 168 / 365 days; Prediction: 28 / 84 / 168 days
Sales	Rossmann Store Sales	Store sales	Store	1 day	1,115 store series; about 1.06M daily records	1 target; 13 numeric exogenous variables; text exogenous context	Lookback: 56 / 168 / 365 days; Prediction: 28 / 84 / 168 days
Sales	Favorita Grocery Sales	Unit sales	Store-item pair from one selected store	1 day	4,081 store-item series; about 2.62M daily records	1 target; 21 numeric exogenous variables; text exogenous context	Lookback: 56 / 168 / 365 days; Prediction: 28 / 84 / 168 days
Medical	PhysioNet 2012	ICU clinical variable value	ICU stay-variable pair	1 hour	107,188 patient-variable series; about 5.25M hourly records	1 target; 19 numeric exogenous variables; text exogenous context	Lookback: 6 / 12 / 24 hours; Prediction: 6 / 12 / 24 hours
Traffic	PEMS04	Traffic flow	Road sensor	5 min	307 sensor series; about 5.22M timestamp records	1 target; 15 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 28 days; Prediction: 1 hour / 24 hours / 7 days
Traffic	PEMS07	Traffic flow	Road sensor	5 min	883 sensor series; about 24.92M timestamp records	1 target; 13 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 28 days; Prediction: 1 hour / 24 hours / 7 days
Traffic	NYC TLC	Hourly pickup trip count	Taxi type-pickup zone pair	1 hour	513 taxi-zone series; about 3.36M hourly records	1 target; 19 numeric exogenous variables; text exogenous context	Lookback: 7 / 28 / 90 days; Prediction: 24 hours / 7 days / 14 days
Economics	FRED-MD	Transformed macroeconomic value	Macroeconomic variable	1 month	126 macroeconomic series; about 100.9K monthly records	1 target; 8 numeric exogenous variables; text exogenous context	Lookback: 36 / 60 / 120 months; Prediction: 1 / 12 / 24 months
Climate	Jena Climate	Air temperature	Weather station	10 min	1 temperature series; about 420.2K timestamp records	1 target; 5 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 30 days; Prediction: 6 hours / 24 hours / 7 days
Hydrology	Basin Streamflow	Daily streamflow	River basin	1 day	27 basin series; about 345K daily records	1 target; 16 numeric exogenous variables; text exogenous context	Lookback: 1 / 3 / 10 years; Prediction: 7 days / 30 days / 1 year
AIOps	Alibaba Cluster	CPU utilization	Machine	1 hour	100 machine series; about 19K hourly records	1 target; 6 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 30 days; Prediction: 6 hours / 24 hours / 7 days
Air Quality	AQ Data	PM2.5 concentration	Monitoring station	1 hour	3,720 station series; about 163.03M hourly records	1 target; 14 numeric exogenous variables; text exogenous context	Lookback: 24 hours / 7 days / 30 days; Prediction: 24 hours / 3 days / 7 days
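
The lookback and prediction windows in the table are given as durations, so the number of samples in each window depends on the dataset frequency. A minimal sketch of that conversion follows; the helper name `window_steps` is an assumption for this example, and the durations are taken from the SDWPF and PEMS04 rows above.

```python
from datetime import timedelta


def window_steps(window: timedelta, frequency: timedelta) -> int:
    """Number of samples covered by a window at a fixed sampling frequency."""
    steps, remainder = divmod(window, frequency)
    if remainder:
        raise ValueError("window is not a whole multiple of the frequency")
    return steps


# SDWPF: 10-minute frequency, 24-hour lookback, 8-hour prediction.
sdwpf_lookback = window_steps(timedelta(hours=24), timedelta(minutes=10))
sdwpf_prediction = window_steps(timedelta(hours=8), timedelta(minutes=10))

# PEMS04: 5-minute frequency, 7-day lookback, 24-hour prediction.
pems_lookback = window_steps(timedelta(days=7), timedelta(minutes=5))
pems_prediction = window_steps(timedelta(hours=24), timedelta(minutes=5))

print(sdwpf_lookback, sdwpf_prediction)  # 144 48
print(pems_lookback, pems_prediction)    # 2016 288
```

Calendar-based windows such as months or years do not map to a fixed `timedelta`, so this sketch only applies to the fixed-duration rows.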

Dataset Cards

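Each dataset card summarizes the forecasting task, forecasting unit, contextual variables, business meaning, recommended windows, and current release notes. In each card, the Windows line lists lookback-to-prediction pairs: for example, "24 hours to 8 hours" means a 24-hour lookback window and an 8-hour prediction window. The cards are intended to make the benchmark understandable before users inspect the raw files or task YAML definitions.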

SDWPF

Domain: energy and wind power forecasting.
Task: forecast wind turbine active power from historical SCADA signals, weather-related variables, turbine position, and timestamp context.
Forecasting unit: one wind turbine per series; 134 turbine series at 10-minute frequency.
Context variables: wind speed, wind direction, temperature, pressure, humidity, turbine operating signals, turbine coordinates, elevation, and calendar fields.
Windows: 24 hours to 8 hours, 7 days to 24 hours, and 30 days to 7 days.
Current note: useful for evaluating context-aware renewable power forecasting under turbine-level heterogeneity.

AEMO NEM DispatchIS

Domain: electricity price forecasting.
Task: forecast regional reference price (RRP) for Australian NEM regions.
Forecasting unit: one electricity market region per series; 5 region series at 5-minute frequency.
Context variables: total demand, available generation, net interchange, settlement interval, and calendar fields.
Windows: 24 hours to 1 hour, 7 days to 24 hours, and 28 days to 7 days.
Current note: demand, generation, and interchange variables are treated as historical market context aligned with each region.

OPSD German Load

Domain: electricity load forecasting.
Task: forecast German actual electricity load with renewable generation and day-ahead price context.
Forecasting unit: one country-level load series at hourly frequency.
Context variables: wind generation, solar generation, and day-ahead electricity price.
Windows: 7 days to 24 hours, 30 days to 7 days, and 90 days to 30 days.
Current note: useful as a compact power-system benchmark linking load, renewable output, and market price.

M5

Domain: retail sales forecasting.
Task: forecast daily unit sales for item-store pairs.
Forecasting unit: one item-store pair per series; 30,490 daily sales series.
Context variables: selling price, calendar fields, SNAP indicator, and event-day indicator, plus text context describing product, store, and time semantics.
Windows: 8 weeks to 28 days, 24 weeks to 12 weeks, and 1 year to 24 weeks.
Current note: large-scale daily retail benchmark for evaluating promotion, price, calendar, and item-store context.

Rossmann Store Sales

Domain: retail store sales forecasting.
Task: forecast daily store sales for Rossmann stores.
Forecasting unit: one store per series; 1,115 daily store series.
Context variables: customers, promotion, opening status, school holiday, state holiday, competition distance, and store promotion metadata.
Windows: 8 weeks to 28 days, 24 weeks to 12 weeks, and 1 year to 24 weeks.
Current note: useful for store-level demand forecasting with strong calendar, promotion, and competition effects.

Favorita Grocery Sales

Domain: grocery sales forecasting.
Task: forecast daily unit sales for store-item pairs from a single selected store.
Forecasting unit: one store-item pair per series; 4,081 daily series.
Context variables: promotion, transactions, oil price, calendar fields, store cluster, item class, item perishability, and holiday indicators.
Windows: 8 weeks to 28 days, 24 weeks to 12 weeks, and 1 year to 24 weeks.
Current note: released as a compact one-store subset to keep the benchmark manageable while preserving item-level sales diversity.

PhysioNet 2012

Domain: medical and ICU time-series forecasting.
Task: forecast ICU clinical variable values from recent patient measurements and patient-level context.
Forecasting unit: one ICU stay-variable pair per series; 107,188 hourly clinical series.
Context variables: vital signs, laboratory-style variables, hour features, age, gender, height, ICU type, SAPS-I, and SOFA.
Windows: 6 hours to 6 hours, 12 hours to 12 hours, and 24 hours to 24 hours.
Current note: useful for testing forecasting under clinical heterogeneity, sparse observations, and patient-state context.

PEMS04

Domain: road traffic forecasting.
Task: forecast 5-minute traffic flow for road sensors.
Forecasting unit: one road sensor per series; 307 sensor series.
Context variables: occupancy, speed, road-network graph features, sensor index, and synthetic 5-minute calendar fields.
Windows: 24 hours to 1 hour, 7 days to 24 hours, and 28 days to 7 days.
Current note: the released source file does not include original wall-clock timestamps, so a regular synthetic 5-minute grid is used to preserve temporal order.

PEMS07

Domain: road traffic forecasting.
Task: forecast 5-minute traffic flow for a larger road-sensor network.
Forecasting unit: one road sensor per series; 883 sensor series.
Context variables: road-network graph features, sensor index, and synthetic 5-minute calendar fields.
Windows: 24 hours to 1 hour, 7 days to 24 hours, and 28 days to 7 days.
Current note: complements PEMS04 with a larger sensor graph and the same order-preserving synthetic timestamp design.

NYC TLC

Domain: urban mobility forecasting.
Task: forecast hourly taxi pickup demand.
Forecasting unit: one taxi type and pickup-zone pair per series; 513 hourly series.
Context variables: passenger count, trip distance, fare amount, total amount, tip amount, drop-off diversity, location ID, and calendar cycles.
Windows: 7 days to 24 hours, 28 days to 7 days, and 90 days to 14 days.
Current note: useful for city-level demand forecasting with spatial zone and mobility-context signals.

FRED-MD

Domain: macroeconomic forecasting.
Task: forecast transformed monthly macroeconomic indicator values.
Forecasting unit: one macroeconomic variable per series; 126 monthly series.
Context variables: raw value, transformation code, year, month, quarter, month index, and cyclic month features.
Windows: 3 years to 1 month, 5 years to 12 months, and 10 years to 24 months.
Current note: supports long-horizon macroeconomic forecasting where variable meaning and transformation metadata matter.

Jena Climate

Domain: climate and weather forecasting.
Task: forecast air temperature at the Jena weather station.
Forecasting unit: one station-level temperature series at 10-minute frequency.
Context variables: pressure, relative humidity, wind velocity, maximum wind velocity, and wind direction.
Windows: 24 hours to 6 hours, 7 days to 24 hours, and 30 days to 7 days.
Current note: compact single-station weather benchmark with dense high-frequency observations.

Basin Streamflow

Domain: hydrology and streamflow forecasting.
Task: forecast daily streamflow for river basins.
Forecasting unit: one river basin per series; 27 daily basin series.
Context variables: precipitation, radiation, snow water equivalent, maximum and minimum temperature, vapor pressure, basin latitude, basin elevation, basin area, and water-year calendar features.
Windows: 1 year to 7 days, 3 years to 30 days, and 10 years to 1 year.
Current note: useful for scientific forecasting where physical basin attributes and hydro-meteorological forcing are central.

Alibaba Cluster

Domain: AIOps and cloud-resource forecasting.
Task: forecast machine CPU utilization.
Forecasting unit: one machine per series; first 100 machine series at hourly frequency.
Context variables: memory utilization, disk IO utilization, network IO utilization, hour, weekday, and weekend indicator.
Windows: 24 hours to 6 hours, 7 days to 24 hours, and 30 days to 7 days.
Current note: compact subset designed for standard CPU-utilization forecasting without processing the full cluster trace.

AQ Data

Domain: air quality forecasting.
Task: forecast hourly PM2.5 concentration for monitoring stations.
Forecasting unit: one monitoring station per series; 3,720 hourly station series.
Context variables: PM10, NO2, O3, SO2, CO, latitude, longitude, coarse region code, station index, and calendar fields.
Windows: 24 hours to 24 hours, 7 days to 3 days, and 30 days to 7 days.
Current note: the provided source files do not contain temperature, humidity, wind speed, wind direction, or pressure, so the first release uses available co-pollutants, station coordinates, regional context, and calendar variables.

Data Quality Analysis

A standalone data quality report is provided in docs/data_quality_report.md. It summarizes the current format-level and sample-level quality checks, including target/numeric/text file-count consistency, sampled timestamp + series_id alignment, sampled missingness, known source-data limitations, and recommended next quality-audit steps.

The current report confirms that all 15 processed datasets follow the unified lightweight layout and pass sampled alignment checks. It also documents important caveats such as sparse AQ and clinical observations, synthetic timestamps in PEMS, relative trace timestamps in Alibaba Cluster, and missing source weather variables in AQ Data.
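
The target/numeric/text file-count consistency check mentioned above can be approximated with a few lines of standard-library Python. This is a hedged sketch, not the repository's validation script: the function names are hypothetical, and the only assumption taken from the layout is that each subdirectory holds one `<series_id>.csv` per series.

```python
from pathlib import Path
import tempfile

SUBDIRS = ("target", "numeric_exogenous", "text_exogenous")


def series_ids(folder: Path) -> set:
    """Collect series IDs from the CSV file names in one subdirectory."""
    return {path.stem for path in folder.glob("*.csv")}


def check_file_consistency(processed: Path) -> dict:
    """Map each subdirectory to the series IDs it is missing."""
    ids = {name: series_ids(processed / name) for name in SUBDIRS}
    all_ids = set().union(*ids.values())
    return {name: sorted(all_ids - found) for name, found in ids.items()}


# Toy demonstration in a temporary directory: s2 exists only under target/.
with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp) / "processed"
    for name in SUBDIRS:
        (processed / name).mkdir(parents=True)
        (processed / name / "s1.csv").write_text("timestamp,series_id\n")
    (processed / "target" / "s2.csv").write_text("timestamp,series_id\n")
    missing = check_file_consistency(processed)

print(missing)
```

An empty list for every subdirectory means the three folders contain the same set of series files.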

Data Organization

Each dataset is organized into a unified lightweight CSV-based structure.

datasets/<domain>/<dataset_id>/
  processed/
    target/
      <series_id>.csv
    numeric_exogenous/
      <series_id>.csv
    text_exogenous/
      <series_id>.csv
  tasks/
    <dataset_task>.yaml

The key components are:

target/: one target-variable CSV for each forecasting series, with timestamp, series_id, and the target column;
numeric_exogenous/: structured exogenous variables aligned with the target file, such as calendar fields, prices, promotions, weather variables, graph features, clinical covariates, hydrologic forcing, and machine-resource signals;
text_exogenous/: timestamp-aligned natural-language context for the entity, time point, domain, and forecasting task;
tasks/: YAML task definitions describing the target variable, numeric and text exogenous variables, alignment rule, chronological split policy, and short-, medium-, and long-horizon forecasting windows.
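
To illustrate how the three per-series files fit together, the sketch below joins a target file with its numeric and text exogenous files on `timestamp` and `series_id`. The inline CSV contents, the column names `target`, `temperature`, and `text`, and the helper names are assumptions for this example; real files may use different column names.

```python
import csv
import io

# Toy file contents, invented for illustration.
TARGET_CSV = """timestamp,series_id,target
2024-01-01 00:00,s1,10.0
2024-01-01 01:00,s1,12.0
"""
NUMERIC_CSV = """timestamp,series_id,temperature
2024-01-01 00:00,s1,21.0
2024-01-01 01:00,s1,24.0
"""
TEXT_CSV = """timestamp,series_id,text
2024-01-01 00:00,s1,Regular weekday.
2024-01-01 01:00,s1,Heat warning issued.
"""


def read_keyed(text: str) -> dict:
    """Index CSV rows by their (timestamp, series_id) key."""
    rows = csv.DictReader(io.StringIO(text))
    return {(row["timestamp"], row["series_id"]): row for row in rows}


def join_series(target_text: str, numeric_text: str, text_text: str) -> list:
    """Join the three files row by row, failing loudly on misaligned keys."""
    target = read_keyed(target_text)
    numeric = read_keyed(numeric_text)
    text = read_keyed(text_text)
    joined = []
    for key in sorted(target):  # ISO-style timestamps sort chronologically
        if key not in numeric or key not in text:
            raise KeyError(f"missing exogenous row for {key}")
        joined.append({**target[key], **numeric[key], **text[key]})
    return joined


rows = join_series(TARGET_CSV, NUMERIC_CSV, TEXT_CSV)
for row in rows:
    print(row)
```

Raising on a missing key rather than silently dropping the row keeps alignment problems visible, which matters when context must match the correct time interval.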

Variable Types

FutureCast-Bench uses three public-facing variable categories.

Variable Type	Meaning	Examples
Target variable	The value to be forecasted	wind power, electricity price, electricity load, sales, traffic flow, clinical value, macroeconomic indicator, streamflow, CPU utilization, PM2.5
Numeric exogenous variables	Structured variables outside the target sequence	calendar features, price, promotion, weather, hydrologic forcing, sensor graph features, patient attributes, machine-resource variables, station coordinates, co-pollutants
Text exogenous variables	Natural-language context associated with entities or timestamps	holiday descriptions, region descriptions, station descriptions, basin descriptions, machine context, product-store descriptions, variable descriptions, air-quality station context

Task System

FutureCast-Bench supports a multi-layer task system from basic forecasting to context-aware reasoning.

Task Type	Goal
Context-aware forecasting	Forecast future values using historical series and contextual information
Context selection	Identify which contextual variables are relevant to the current forecasting task
Context-sequence alignment	Align events or textual context with the corresponding time intervals and target series
Trend reasoning	Infer future trend direction from contextual evidence
Event impact analysis	Estimate how external events affect future values
Counterfactual forecasting	Forecast under changed contextual assumptions
Context gap detection	Identify missing information needed for a better forecast
Forecast revision	Update forecasts when new evidence becomes available
Explanation generation	Produce evidence-grounded forecasting explanations

Evaluation Dimensions

FutureCast-Bench evaluates models along multiple dimensions.

Dimension	Example Metrics
Numerical forecasting accuracy	MAE, RMSE, WAPE, MASE, CRPS
Trend judgment	direction accuracy, trend accuracy, turning-point F1
Context understanding	context relevance accuracy, evidence selection F1
Reasoning quality	evidence-grounded score, reasoning faithfulness, counterfactual consistency
Dynamic adaptation	context gap detection, forecast revision gain
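
For the numerical accuracy dimension, the sketch below implements common textbook definitions of MAE, RMSE, WAPE, and MASE. The exact formulas used by FutureCast-Bench are not specified here, so these definitions (in particular the one-step naive scaling in MASE) are assumptions, and the sample numbers are invented.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error over total absolute actuals."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)


def mase(actual, forecast, history, season=1):
    """MAE scaled by the in-sample error of a naive forecast with the given lag."""
    naive_errors = [abs(history[i] - history[i - season])
                    for i in range(season, len(history))]
    return mae(actual, forecast) / (sum(naive_errors) / len(naive_errors))


history = [8.0, 9.0, 11.0, 10.0]
actual = [10.0, 12.0, 14.0, 13.0]
forecast = [11.0, 12.0, 13.0, 15.0]

print(mae(actual, forecast))
print(rmse(actual, forecast))
print(wape(actual, forecast))
print(mase(actual, forecast, history))
```

CRPS is omitted because it requires a probabilistic forecast rather than the point forecasts used in this example.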

Forecast Revision Gain

One distinctive metric in FutureCast-Bench is Forecast Revision Gain, which measures whether a model can improve its prediction after receiving new contextual evidence.

Forecast Revision Gain = initial forecast error - revised forecast error

A positive value indicates that the model successfully used new contextual information to revise its forecast.
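
As a worked example of the formula above, the sketch below computes Forecast Revision Gain with mean absolute error as the error measure. The choice of MAE, the function names, and the sample values are assumptions for illustration; the formula itself does not fix a particular error metric.

```python
def mean_absolute_error(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def forecast_revision_gain(actual, initial, revised):
    """Initial forecast error minus revised forecast error (positive means improvement)."""
    return mean_absolute_error(actual, initial) - mean_absolute_error(actual, revised)


# Toy values: the revised forecast moves closer to the actual series.
actual = [100.0, 120.0, 130.0]
initial = [100.0, 104.0, 110.0]   # absolute errors 0, 16, 20 -> MAE 12.0
revised = [100.0, 118.0, 126.0]   # absolute errors 0, 2, 4   -> MAE 2.0

gain = forecast_revision_gain(actual, initial, revised)
print(gain)  # 10.0
```

Swapping `initial` and `revised` gives a gain of -10.0, the case where new evidence made the forecast worse.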

Why This Benchmark Matters

FutureCast-Bench is designed for a forecasting setting where models need to understand not only what happened before, but also why the future may change.

It aims to support research on:

context-aware time series foundation models;
LLM-driven time series forecasting;
multimodal forecasting with numerical and textual context;
reasoning-enhanced forecasting models;
agentic forecasting systems that actively seek missing context;
robust forecasting under events, distribution shifts, and hard-test scenarios.

Roadmap

Planned next steps include:

expanding event-rich contextual annotations for electricity markets, transportation, retail, hydrology, air quality, AIOps, and medical forecasting;
adding more financial market and industrial operation datasets;
releasing model baselines and leaderboard protocols;
providing scripts for reproducible preprocessing and evaluation.

Repository Status

This repository is under active development. The current version includes benchmark documentation, dataset cards, a data quality report, lightweight file-processing utilities, and a toy sample dataset that demonstrates the standard FutureCast processed layout. Full data release links, packaged preprocessing scripts, baseline models, and evaluation instructions will be added as the benchmark is finalized.

Contact

FutureCast（天星台） is developed by the AGI Research Group at the State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China.

For questions, suggestions, or collaboration, please open an issue in this repository.
