The HIPE-OCRepair-scorer is a Python module for evaluating OCR post-correction.
It is developed and used in the context of the HIPE-OCRepair-2026 ICDAR competition on OCR post-correction for historical documents, part of the broader HIPEval (Historical Information Processing Evaluation) initiative, a series of shared tasks on historical document processing.
- HIPE-OCRepair: Website of the competition hosted at ICDAR-2026
- HIPE-OCRepair-2026-data: public data releases (training, validation and test sets) for the HIPE-OCRepair-2026 shared task.
- HIPE-OCRepair-2026-eval: for the Hugging Face leaderboard
- 2x Feb 2026: v0.9, initial release of the OCR post-correction scorer
- Main functionalities
- Input format, scorer entry points, and naming conventions
- Installation and usage
- About
The scorer evaluates OCR post-correction outputs against ground-truth transcriptions. It computes match error rates at character and word level (cMER/wMER) as well as preference metrics that compare the post-correction output to the raw OCR hypothesis.
All metrics are based on Match Error Rate (MER), computed as:

$$\text{MER} = \frac{S + D + I}{H + S + D + I}$$
where H = hits, S = substitutions, D = deletions, I = insertions. Unlike standard CER/WER, MER is capped in [0, 1] because insertions are included in the denominator. This reduces sensitivity to extreme hallucinations while remaining easy to interpret. MER is equivalent to the normalized CER in the sense of the OCR-D evaluation spec (see https://ocr-d.de/en/spec/ocrd_eval.html#character-error-rate-cer).
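As a concrete illustration, the formula can be computed directly from the four alignment counts (a minimal sketch; the scorer's internal API is not shown here):

```python
def mer(hits: int, subs: int, dels: int, ins: int) -> float:
    """Match Error Rate: error operations over all alignment operations.

    Insertions are counted in the denominator, so the value is
    always in [0, 1], unlike plain CER/WER.
    """
    total = hits + subs + dels + ins
    return (subs + dels + ins) / total if total else 0.0

# 90 hits, 5 substitutions, 3 deletions, 2 insertions -> 10/100
print(mer(90, 5, 3, 2))  # 0.1
```

Even a pathological output consisting entirely of insertions yields at most 1.0, which is the capping behaviour described above.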
Primary metrics
- cMER (character-level MER, micro-averaged): corpus-level character match error rate, the main evaluation metric. Micro-averaged so longer documents contribute more than shorter ones.
- Preference score (macro average): a simple sign-based metric computed per input document and then averaged unweighted across documents. For each item i: $s_i = \operatorname{sign}(\text{cMER}_{\text{in},i} - \text{cMER}_{\text{out},i})$, yielding 1 (improved), 0 (tied), or -1 (worse). This captures how consistently a system improves over the input, while cMER captures the magnitude of improvement.
Additional metrics
- wMER (word-level MER): reported for completeness, but cMER is preferred in historical OCR due to spelling variation and transcription conventions.
- Confidence intervals: 95% confidence intervals are computed for all measures to quantify statistical uncertainty.
Before scoring, text is normalized as follows:
- Case-folded to lowercase
- Unicode letters and digits are kept (including accented characters such as é, ç, ü)
- All other characters (punctuation, symbols) are replaced with space
- Whitespace is collapsed
This means evaluation is case-insensitive and punctuation-insensitive, but sensitive to accented characters (é ≠ e).
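The rules above can be sketched as a small normalization function (illustrative only; the scorer's actual implementation may differ):

```python
def normalize(text: str) -> str:
    """Lowercase; keep Unicode letters and digits; collapse the rest to single spaces."""
    text = text.lower()
    # str.isalnum() is True for Unicode letters and digits, so accents survive
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    # Collapse runs of whitespace and trim the ends
    return " ".join(text.split())

print(normalize("L'Été, 1848!"))  # "l été 1848"
```

Note that é is kept distinct from e, matching the accent-sensitive behaviour described above.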
Results can be stratified by dataset or any user-defined mapping.
The scorer accepts two entry points (the same example structure is used in both):
- A pair of JSONL files: one for reference, one for hypothesis.
- A pair of folders, containing reference and hypothesis JSONL files respectively.
Each JSONL record should contain a dictionary with these fields:
```json
{
  "document_metadata": { "document_id": "...", "primary_dataset_name": "..." },
  "ground_truth": { "transcription_unit": "..." },
  "ocr_hypothesis": { "transcription_unit": "..." },
  "ocr_postcorrection_output": { "transcription_unit": "..." }
}
```
All JSON documents conform to the HIPE-OCRepair JSON Schema (add link later).
Sample data for quick inspection is available under data/sample.
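Assuming records with the fields shown above, a JSONL file can be loaded like this (a sketch; the helper name is hypothetical):

```python
import json

def read_records(path: str):
    """Yield (ground truth, OCR hypothesis, post-correction) text triples from a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            yield (
                rec["ground_truth"]["transcription_unit"],
                rec["ocr_hypothesis"]["transcription_unit"],
                rec["ocr_postcorrection_output"]["transcription_unit"],
            )
```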
Reference files follow the HIPE-OCRepair canonical naming convention:
```
<file_basename>_<version>_<dataset>_<primary_version>_<split>_<language>.jsonl
```
Where:
- `file_basename`: always `hipe-ocrepair-bench`
- `version`: benchmark version
- `dataset`: dataset name (e.g., `icdar2017`)
- `primary_version`: primary dataset version
- `split`: data split (e.g., `train`, `test`)
- `language`: dataset language (e.g., `en`, `fr`)
Submission files to be evaluated are named as:
```
teamname_<inputfile>_runX.jsonl
```
The scorer requires Python 3.12 and can be installed as a pip package or used as an editable dependency:
```shell
pip install hipe-ocrepair-scorer
```

or, for an editable install from a clone of the repository:

```shell
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

After installation, the `hipe-ocrepair-scorer` command is available.
```shell
hipe-ocrepair-scorer \
  --reference data/sample/reference/hipe-ocrepair-bench_v0.9_icdar2017_v1.2_train_fr.sample.jsonl \
  --hypothesis data/sample/hypothesis/no_edits_baseline/no_edits_hipe-ocrepair-bench_v0.9_icdar2017_v1.2_train_fr.sample_run1.jsonl
```

```shell
hipe-ocrepair-scorer \
  --reference-dir data/sample/reference/ \
  --hypothesis-dir data/sample/hypothesis/no_edits_baseline/
```

In folder mode, the scorer matches each reference file to its corresponding hypothesis file by filename. Hypothesis files are expected to contain the reference filename stem (see naming conventions above).
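The stem-based pairing in folder mode could look roughly like this (a sketch using simple substring containment; `match_pairs` is a hypothetical helper, not the scorer's API):

```python
from pathlib import Path

def match_pairs(reference_dir: str, hypothesis_dir: str) -> dict[Path, Path]:
    """Pair each reference file with the hypothesis file whose name contains its stem."""
    hypotheses = list(Path(hypothesis_dir).glob("*.jsonl"))
    pairs: dict[Path, Path] = {}
    for ref in sorted(Path(reference_dir).glob("*.jsonl")):
        stem = ref.name.removesuffix(".jsonl")
        candidates = [h for h in hypotheses if stem in h.name]
        if len(candidates) == 1:  # skip ambiguous or missing matches
            pairs[ref] = candidates[0]
    return pairs
```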
Results are printed to stdout as JSON.
File mode returns scores for the single file pair:

```json
{
  "averaged_scores": {
    "metric_name": [score, lower_ci, upper_ci],
    ...
  },
  "fold_scores": {
    "dataset_name": {
      "metric_name": [score, lower_ci, upper_ci],
      ...
    }
  }
}
```

Folder mode returns per-file results for each reference/hypothesis pair:

```json
{
  "per_file": {
    "reference_filename_1": {
      "averaged_scores": { ... },
      "fold_scores": { ... }
    },
    "reference_filename_2": {
      "averaged_scores": { ... },
      "fold_scores": { ... }
    }
  }
}
```

Each metric is a tuple of `(score, lower_95%_CI, upper_95%_CI)`. Metrics include `cmer_micro`, `wmer_micro`, `cmer_macro`, `wmer_macro`, `pref_score_cmer_macro`, `pref_score_wmer_macro`, `pcis_cmer_macro`, and `pcis_wmer_macro`.
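Given output in this shape, the headline score and its interval can be read back programmatically (a sketch; the JSON string below is hypothetical example output, not real scores):

```python
import json

# Hypothetical scorer output, e.g. captured from stdout
raw = """
{
  "averaged_scores": {"cmer_micro": [0.12, 0.10, 0.14]},
  "fold_scores": {"icdar2017": {"cmer_micro": [0.12, 0.10, 0.14]}}
}
"""
result = json.loads(raw)
score, ci_low, ci_high = result["averaged_scores"]["cmer_micro"]
print(f"cMER = {score:.3f} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
# cMER = 0.120 (95% CI [0.100, 0.140])
```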
See the LICENSE file in the repository for details.
The HIPE-2026 organising team expresses its sincere appreciation to the ICDAR 2026 Conference and Competition Committee for hosting the task. HIPE-eval editions are organised within the framework of the Impresso - Media Monitoring of the Past project, funded by the Swiss National Science Foundation under grant No. CRSII5_213585 and by the Luxembourg National Research Fund under grant No. 17498891.