HIPE-OCRepair-2026 Data Repository

HIPE-OCRepair-2026 is an ICDAR 2026 competition on LLM-assisted OCR post-correction of historical documents, with a particular emphasis on historical newspapers.

Large language models (LLMs) have renewed interest in OCR post-correction, resulting in a growing number of models and experimental approaches. However, these efforts often rely on heterogeneous legacy datasets with important limitations, which makes systematic evaluation and meaningful comparison across approaches difficult.

A central question motivating this competition is:

To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?

The competition addresses this by providing HIPE-OCRepair-Bench, a unified multilingual benchmark for OCR post-correction, comprising curated datasets, an evaluation protocol, baseline systems, and an open leaderboard.

📋 Participation Guidelines

All information about the task, datasets, evaluation protocol, and submission instructions is available in the Participation Guidelines.

🔗 Important Links

  • 🌐 Competition website: https://hipe-eval.github.io/HIPE-OCRepair-2026/
  • 📋 Participation Guidelines: README-Participation-Guidelines.md
  • 📈 Scorer: https://github.com/hipe-eval/HIPE-OCRepair-scorer
  • 📊 Evaluation repository (after competition): https://github.com/hipe-eval/HIPE-OCRepair-2026-eval
  • 🏆 Leaderboard (to come): https://huggingface.co/spaces/hipe-ocrepair-2026-eval
  • 📝 Registration & contact: see competition website

📦 Data

The data is available:

  • in the data/ folder of this repository and in the git releases
  • on Zenodo (planned for later)

Release History

  • 20.03.2026: release of train and dev sets for the dta19 dataset (release tag v0.9.2).
  • 11.03.2026: hotfix for the impresso-snippets dataset (release tag v0.9.1).
  • 02.03.2026: first data release with overproof, icdar17, impresso-nzz and impresso-snippets (release tag v0.9).
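
Because the data changes between releases, experiments are easier to reproduce when they are pinned to one release tag. The following is a minimal sketch of how that could be done. It assumes that the repository is reachable at the URL below (derived from the repository name), that the git tags match the release tags listed above, and that `git` is installed. The helper names `clone_command` and `clone_release` are hypothetical and not part of any HIPE tooling.

```python
import subprocess

# Assumption: clone URL derived from the repository name hipe-eval/HIPE-OCRepair-2026-data.
REPO_URL = "https://github.com/hipe-eval/HIPE-OCRepair-2026-data.git"


def clone_command(tag: str, dest: str = "HIPE-OCRepair-2026-data") -> list[str]:
    """Build a shallow `git clone` command pinned to a single release tag."""
    # `--branch` also accepts tag names; `--depth 1` skips the full history.
    return ["git", "clone", "--depth", "1", "--branch", tag, REPO_URL, dest]


def clone_release(tag: str, dest: str = "HIPE-OCRepair-2026-data") -> None:
    """Run the clone (hypothetical helper; needs git and network access)."""
    subprocess.run(clone_command(tag, dest), check=True)


if __name__ == "__main__":
    # Print the command for the dta19 release instead of running it.
    print(" ".join(clone_command("v0.9.2")))
```

Calling `clone_release("v0.9.2")` would then check out the data as of the 20.03.2026 release into a local folder, with the datasets expected under its data/ folder.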

Acknowledgements

The HIPE-OCRepair-2026 organising team expresses its sincere appreciation to the ICDAR-2026 Competition Committee for the overall coordination and support.

HIPE-OCRepair-2026 is part of the HIPE-eval series of shared tasks on historical document and information processing and evaluation.

HIPE-eval editions are organised within the framework of the Impresso – Media Monitoring of the Past project, funded by the Swiss National Science Foundation under grant No. CRSII5_213585 and by the Luxembourg National Research Fund under grant No. 17498891.