Context is Key: A Benchmark for Forecasting with Essential Textual Information

📄 Paper (ICML 2025) - 🌐 Website - ✉️ Contact - 🌟 Contributors - 📝 Citation - 📦 ICML 2025 Release - 📊 Official Results - 🤗 Hugging Face Dataset

Overview of code

Baselines: All baseline code can be found here.
Tasks: All baseline code can be found here.
Metrics: All metric-related code can be found here.
Experiments: Code used to run the experiments can be found here.

Official results

Here are the updated aggregated benchmark results (equivalent to Table 1 in the paper). The full experimental results can also be found here.

This table holds for the version of the benchmark as of July 11th, 2025. See the CHANGELOG for how the benchmark was updated since the ICML 2025 release.

	Average RCPRS	Intemporal Information	Historical Information	Future Information	Covariate Information	Causal Information
Direct Prompt - Llama-3.1-405B-Instruct	0.143 ± 0.006	0.174 ± 0.010	0.146 ± 0.001	0.075 ± 0.005	0.144 ± 0.008	0.303 ± 0.037
Direct Prompt - Llama-3-70B-Instruct	0.286 ± 0.004	0.336 ± 0.006	0.180 ± 0.003	0.193 ± 0.006	0.228 ± 0.004	0.629 ± 0.019
Direct Prompt - Mixtral-8x7B-Instruct	0.519 ± 0.009	0.723 ± 0.014	0.236 ± 0.002	0.241 ± 0.001	0.353 ± 0.004	0.850 ± 0.012
Direct Prompt - Qwen-2.5-7B-Instruct	0.290 ± 0.003	0.290 ± 0.004	0.176 ± 0.003	0.287 ± 0.007	0.240 ± 0.002	0.524 ± 0.003
Direct Prompt - Qwen-2.5-0.5B-Instruct	0.463 ± 0.018	0.609 ± 0.028	0.165 ± 0.004	0.218 ± 0.012	0.476 ± 0.022	0.428 ± 0.006
Direct Prompt - GPT-4o	0.191 ± 0.004	0.218 ± 0.007	0.118 ± 0.001	0.121 ± 0.001	0.143 ± 0.003	0.363 ± 0.011
Direct Prompt - GPT-4o-mini	0.354 ± 0.004	0.475 ± 0.007	0.139 ± 0.002	0.143 ± 0.002	0.341 ± 0.004	0.645 ± 0.011
LLMP - Llama3-70B-Instruct	0.540 ± 0.013	0.438 ± 0.018	0.516 ± 0.029	0.847 ± 0.024	0.547 ± 0.016	0.396 ± 0.028
LLMP - Llama3-70B	0.237 ± 0.006	0.212 ± 0.005	0.121 ± 0.008	0.299 ± 0.017	0.194 ± 0.004	0.365 ± 0.011
LLMP - Mixtral-8x7B-Instruct	0.265 ± 0.004	0.242 ± 0.007	0.173 ± 0.004	0.324 ± 0.005	0.219 ± 0.005	0.440 ± 0.007
LLMP - Mixtral-8x7B	0.264 ± 0.008	0.250 ± 0.008	0.119 ± 0.003	0.310 ± 0.019	0.231 ± 0.006	0.466 ± 0.011
LLMP - Qwen-2.5-7B-Instruct	1.974 ± 0.019	2.509 ± 0.030	2.857 ± 0.056	1.653 ± 0.008	1.702 ± 0.023	1.333 ± 0.080
LLMP - Qwen-2.5-7B	0.910 ± 0.034	1.149 ± 0.043	1.002 ± 0.053	0.601 ± 0.071	0.640 ± 0.044	0.929 ± 0.016
LLMP - Qwen-2.5-0.5B-Instruct	1.937 ± 0.011	2.444 ± 0.017	1.960 ± 0.063	1.443 ± 0.010	1.805 ± 0.012	1.201 ± 0.017
LLMP - Qwen-2.5-0.5B	1.995 ± 0.011	2.546 ± 0.018	2.082 ± 0.052	1.579 ± 0.015	1.821 ± 0.013	1.225 ± 0.013
UniTime	0.370 ± 0.001	0.457 ± 0.002	0.155 ± 0.000	0.194 ± 0.003	0.395 ± 0.001	0.423 ± 0.001
Time-LLM (ETTh1)	0.475 ± 0.001	0.517 ± 0.002	0.183 ± 0.000	0.403 ± 0.002	0.441 ± 0.001	0.481 ± 0.002
Lag-Llama	0.327 ± 0.004	0.330 ± 0.005	0.167 ± 0.005	0.293 ± 0.008	0.294 ± 0.004	0.494 ± 0.014
Chronos-Large	0.326 ± 0.001	0.314 ± 0.002	0.179 ± 0.003	0.379 ± 0.003	0.255 ± 0.002	0.460 ± 0.004
TimeGEN	0.353 ± 0.000	0.332 ± 0.000	0.177 ± 0.000	0.405 ± 0.000	0.292 ± 0.000	0.474 ± 0.000
Moirai-Large	0.520 ± 0.006	0.596 ± 0.009	0.140 ± 0.001	0.431 ± 0.002	0.499 ± 0.007	0.438 ± 0.011
ARIMA	0.475 ± 0.006	0.557 ± 0.009	0.200 ± 0.007	0.350 ± 0.003	0.375 ± 0.006	0.440 ± 0.011
ETS	0.530 ± 0.009	0.639 ± 0.014	0.362 ± 0.014	0.315 ± 0.006	0.402 ± 0.010	0.507 ± 0.017
Exponential Smoothing	0.605 ± 0.013	0.703 ± 0.020	0.493 ± 0.016	0.397 ± 0.006	0.480 ± 0.015	0.828 ± 0.060

Setting environment variables

Here is a list of all environment variables which the Context-is-Key benchmark will access:

Variable Name	Description	Default Value
CIK_MODEL_STORE	Folder to store model weights for the baselines.	`./models`
CIK_DATA_STORE	Folder to store downloaded datasets.	`./data`
CIK_DOMINICK_STORE	Folder to store the Dominick dataset for specific tasks.	`CIK_DATA_STORE + /dominicks`
CIK_TRAFFIC_DATA_STORE	Folder to store the Traffic dataset for specific tasks.	`CIK_DATA_STORE + /traffic_data`
HF_HOME	Cache location for downloading datasets from Hugging Face.	`CIK_DATA_STORE + /hf_cache`
CIK_RESULT_CACHE	Folder to store the output of baselines to avoid recomputation.	`./inference_cache`
CIK_METRIC_SCALING_CACHE	Folder to store scaling factors for each task to avoid recomputation.	`./metric_scaling_cache`
CIK_METRIC_COMPUTE_VARIANCE	If set, computes an estimate of the variance of the metric.	Only compute metric itself by default
CIK_OPENAI_USE_AZURE	If set to "True", use Azure client instead of OpenAI client for baselines using OpenAI models.	`False`
CIK_OPENAI_API_KEY	API key for accessing OpenAI models.	None (Required for baseline)
CIK_OPENAI_API_VERSION	API version for OpenAI models when using the Azure client.	None
CIK_OPENAI_AZURE_ENDPOINT	Azure endpoint for calling OpenAI models.	None
CIK_LLAMA31_405B_URL	API URL for the Llama-3.1-405b baseline.	None (Required for baseline)
CIK_LLAMA31_405B_API_KEY	API key for the Llama-3.1-405b API.	None (Required for baseline)
CIK_NIXTLA_BASE_URL	Azure API URL for the Nixtla TimeGEN baseline.	None (Required for baseline)
CIK_NIXTLA_API_KEY	Azure API key for the Nixtla TimeGEN baseline.	None (Required for baseline)

🌟 Contributors

Citing this work

Please cite the following paper:

@inproceedings{
williams2025context,
title={Context is Key: A Benchmark for Forecasting with Essential Textual Information},
author={Andrew Robert Williams and Arjun Ashok and {\'E}tienne Marcotte and Valentina Zantedeschi and Jithendaraa Subramanian and Roland Riachi and James Requeima and Alexandre Lacoste and Irina Rish and Nicolas Chapados and Alexandre Drouin},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=ih2WuBT1Fn}
}

Name		Name	Last commit message	Last commit date
Latest commit History 842 Commits
cik_benchmark		cik_benchmark
experiments		experiments
notebooks		notebooks
results		results
scripts		scripts
tests		tests
CHANGELOG		CHANGELOG
LICENSE.md		LICENSE.md
README.md		README.md
compile_roi_results.py		compile_roi_results.py
precompute_scaling_cache.py		precompute_scaling_cache.py
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements-r.txt		requirements-r.txt
requirements.txt		requirements.txt
run_baselines.py		run_baselines.py
simple_task_example.py		simple_task_example.py
task_report.py		task_report.py
test_plot.py		test_plot.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Overview of code

Official results

Setting environment variables

🌟 Contributors

Citing this work

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 7

Uh oh!

Languages

License

ServiceNow/context-is-key-forecasting

Folders and files

Latest commit

History

Repository files navigation

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Overview of code

Official results

Setting environment variables

🌟 Contributors

Citing this work

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 7

Uh oh!

Languages

Packages