Open
Changes from 74 commits
Commits
87 commits
6c89888
Refactor CLI for consistent logging and late imports
kba Oct 17, 2025
2a1f892
expand keywords and supported Python versions
cneud Oct 17, 2025
20a9536
remove redundant parentheses
cneud Oct 17, 2025
9733d57
replace list declaration with list literal (faster)
cneud Oct 17, 2025
f212ffa
remove unnecessary backslash
cneud Oct 17, 2025
496a0e2
readme and documentation updates
cneud Oct 17, 2025
9d2dbb8
updating model based reading order detection
vahidrezanezhad Oct 20, 2025
3ec5ceb
Update flowchart
vahidrezanezhad Oct 20, 2025
c845537
updating heuristics and ocr documentation
vahidrezanezhad Oct 20, 2025
a850ef3
factor model loading in Eynollah to EynollahModelZoo
kba Oct 20, 2025
b90cfdf
adapt tests to -l being top-level option now
kba Oct 20, 2025
48d1198
move Eynollah_ocr to separate module
kba Oct 20, 2025
d609a53
organize imports mostly
kba Oct 20, 2025
062f317
Introduce model_zoo to Eynollah_ocr
kba Oct 20, 2025
6e3399f
combine Docker docs
cneud Oct 20, 2025
e5254dc
integrate training docs
cneud Oct 20, 2025
230e7cc
integrate ocrd docs
cneud Oct 20, 2025
7d70835
small fixes to main readme
cneud Oct 20, 2025
44b75eb
cli: model -> model_basedir
kba Oct 21, 2025
c6b863b
typing and asserts
kba Oct 21, 2025
a53d5fc
update docs/makefile to point to v0.6.0 models
kba Oct 21, 2025
9d2b18d
test_run: check log messages starting with eynollah
kba Oct 21, 2025
de34a15
Makefile: fix make models for OCR
kba Oct 21, 2025
bcffa2e
adopt binarizer to the zoo
kba Oct 21, 2025
f0c8667
adopt mb_ro_on_layout to the zoo
kba Oct 21, 2025
1337461
adopt image_enhancer to the zoo
kba Oct 21, 2025
4c8abfe
eynollah_ocr: actually replace the model calls
kba Oct 22, 2025
146658f
eynollah layout: fix trocr_processor model_zoo call
kba Oct 22, 2025
d94285b
rewrite model spec data structure
kba Oct 22, 2025
04bc4a6
reorganize model_zoo
kba Oct 22, 2025
883546a
eynollah models package
kba Oct 22, 2025
874cfc2
.
kba Oct 22, 2025
2fc723d
extend README
vahidrezanezhad Oct 22, 2025
ab9ddd5
OCR examples are added to README
vahidrezanezhad Oct 22, 2025
59eb4fd
images with ro are added to readme
vahidrezanezhad Oct 22, 2025
b56bb44
providing ocr model evaluation metrics
vahidrezanezhad Oct 22, 2025
7b7714a
completing ocr evaluations metric
vahidrezanezhad Oct 22, 2025
d0ad7a9
starting qualitative ocr evaluation
vahidrezanezhad Oct 22, 2025
ec1fd93
wip
kba Oct 23, 2025
6192e5b
qualitative evaluation of ocr models are added to docs
vahidrezanezhad Oct 23, 2025
51d2680
wip
kba Oct 27, 2025
294b635
wip
kba Oct 27, 2025
ef999c8
Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah i…
kba Oct 27, 2025
8822da1
Merge remote-tracking branch 'origin/updating_docs' into docs_and_min…
cneud Oct 28, 2025
22d61e8
remove newspaper images from main readme
cneud Oct 28, 2025
b6f82c7
refactor cli tests
kba Oct 29, 2025
a913bdf
make --model-basedir and --model-overrides top-level CLI options
kba Oct 29, 2025
5e22e9d
model_zoo: make type str to reduce importing overhead
kba Oct 29, 2025
de76eab
Merge branch 'cli-logging' into model-zoo
kba Oct 29, 2025
29c2736
fix merge issues
kba Oct 29, 2025
4772fd1
missed changing override mechanism in eynollah_ocr
kba Oct 29, 2025
9ab565f
model basedir might be a symlink
kba Oct 29, 2025
600ebfe
make: fix to use single-archive ZIP
kba Oct 29, 2025
15e6ecb
make models: update URL
kba Oct 29, 2025
46a45f6
Create examples.md
cneud Oct 29, 2025
f6c0f56
Update README.md
cneud Oct 29, 2025
b1e191b
reformat cli options table
cneud Oct 29, 2025
62d0591
test_layout: str(Path)
kba Oct 30, 2025
8782ef1
CI: :fire: upgrade torch for debugging
kba Oct 30, 2025
c9efbe1
refactor image layout in examples.md
cneud Oct 30, 2025
70d8577
Revert "remove redundant parentheses"
cneud Oct 30, 2025
2d35a05
Revert "replace list declaration with list literal (faster)"
cneud Oct 30, 2025
9dbac28
Revert "remove unnecessary backslash"
cneud Oct 30, 2025
d5b7089
Merge branch 'docs_and_minor_fixes' of https://github.com/qurator-spk…
cneud Oct 30, 2025
f90259d
fix docs links
cneud Oct 30, 2025
b6c7283
further debugging
kba Nov 5, 2025
2c21109
make deps-test should not depend on the models
kba Nov 5, 2025
0bef6e2
make models: unzip to the versioned directory
kba Nov 5, 2025
e449dba
make *test: fix paths
kba Nov 5, 2025
53e879e
make *test: another typo;
kba Nov 5, 2025
0d84e7d
Merge remote-tracking branch 'origin/docs_and_minor_fixes' into model…
kba Nov 6, 2025
d224b0f
try with shapely.set_precision(...mode="keep_collapsed")
kba Nov 6, 2025
44037bc
add layout marginalia test
kba Nov 6, 2025
f902756
try importing torch, then shapely, then tensorflow
kba Nov 6, 2025
8732007
.
kba Nov 6, 2025
ed5b5c1
Add test images; call TrOCR processor from the same directory as the …
vahidrezanezhad Nov 7, 2025
3afbce0
tests: adapt paths
kba Nov 13, 2025
a72be69
tests: fix model download URL
kba Nov 13, 2025
9aeff6d
tests: typo
kba Nov 13, 2025
b34329d
tests: more path fixes
kba Nov 13, 2025
b9bc8e7
github ci: cache models with model_zoo default config as key
kba Nov 13, 2025
d665490
.
kba Nov 13, 2025
67003b8
.
kba Nov 13, 2025
0149147
.
kba Nov 25, 2025
103c007
.
kba Nov 26, 2025
9d9d32d
update OCR-D bindings
kba Nov 26, 2025
0f410c2
disable tf/keras logging on first import
kba Nov 26, 2025
71 changes: 38 additions & 33 deletions .github/workflows/test-eynollah.yml
@@ -24,61 +24,63 @@ jobs:
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
df -h
- uses: actions/checkout@v4
- uses: actions/cache/restore@v4
id: seg_model_cache
with:
path: models_layout_v0_5_0
key: seg-models
- uses: actions/cache/restore@v4
id: ocr_model_cache

- name: Lint with ruff
uses: astral-sh/ruff-action@v3
with:
path: models_ocr_v0_5_1
key: ocr-models
- uses: actions/cache/restore@v4
id: bin_model_cache
src: "./src"

- name: Try to restore models_eynollah
uses: actions/cache/restore@v4
id: all_model_cache
with:
path: default-2021-03-09
key: bin-models
path: models_eynollah
key: models_eynollah

- name: Download models
if: steps.seg_model_cache.outputs.cache-hit != 'true' || steps.bin_model_cache.outputs.cache-hit != 'true' || steps.ocr_model_cache.outputs.cache-hit != true
run: make models
- uses: actions/cache/save@v4
if: steps.seg_model_cache.outputs.cache-hit != 'true'
with:
path: models_layout_v0_5_0
key: seg-models
- uses: actions/cache/save@v4
if: steps.ocr_model_cache.outputs.cache-hit != 'true'
with:
path: models_ocr_v0_5_1
key: ocr-models
if: steps.all_model_cache.outputs.cache-hit != 'true'
run: |
make models
ls -la models_eynollah

- uses: actions/cache/save@v4
if: steps.bin_model_cache.outputs.cache-hit != 'true'
if: steps.all_model_cache.outputs.cache-hit != 'true'
with:
path: default-2021-03-09
key: bin-models
path: models_eynollah
key: models_eynollah

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

# - uses: actions/cache@v4
# with:
# path: |
# path/to/dependencies
# some/other/dependencies
# key: ${{ runner.os }}-${{ hashFiles('**/lockfiles') }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
make install-dev EXTRAS=OCR,plotting
make deps-test EXTRAS=OCR,plotting
ls -l models_*
- name: Lint with ruff
uses: astral-sh/ruff-action@v3
with:
src: "./src"

- name: Hard-upgrade torch for debugging
run: |
python -m pip install --upgrade torch

- name: Test with pytest
run: make coverage PYTEST_ARGS="-vv --junitxml=pytest.xml"

- name: Get coverage results
run: |
coverage report --format=markdown >> $GITHUB_STEP_SUMMARY
coverage html
coverage json
coverage xml

- name: Store coverage results
uses: actions/upload-artifact@v4
with:
@@ -88,12 +90,15 @@ jobs:
pytest.xml
coverage.xml
coverage.json

- name: Upload coverage results
uses: codecov/codecov-action@v4
with:
files: coverage.xml
fail_ci_if_error: false

- name: Test standalone CLI
run: make smoke-test

- name: Test OCR-D CLI
run: make ocrd-test
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ output.html
*.tif
*.sw?
TAGS
uv.lock
76 changes: 25 additions & 51 deletions Makefile
@@ -6,23 +6,17 @@ EXTRAS ?=
DOCKER_BASE_IMAGE ?= docker.io/ocrd/core-cuda-tf2:latest
DOCKER_TAG ?= ocrd/eynollah
DOCKER ?= docker
WGET = wget -O

#SEG_MODEL := https://qurator-data.de/eynollah/2021-04-25/models_eynollah.tar.gz
#SEG_MODEL := https://qurator-data.de/eynollah/2022-04-05/models_eynollah_renamed.tar.gz
# SEG_MODEL := https://qurator-data.de/eynollah/2022-04-05/models_eynollah.tar.gz
#SEG_MODEL := https://github.com/qurator-spk/eynollah/releases/download/v0.3.0/models_eynollah.tar.gz
#SEG_MODEL := https://github.com/qurator-spk/eynollah/releases/download/v0.3.1/models_eynollah.tar.gz
SEG_MODEL := https://zenodo.org/records/17194824/files/models_layout_v0_5_0.tar.gz?download=1
SEG_MODELFILE = $(notdir $(patsubst %?download=1,%,$(SEG_MODEL)))
SEG_MODELNAME = $(SEG_MODELFILE:%.tar.gz=%)

BIN_MODEL := https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip
BIN_MODELFILE = $(notdir $(BIN_MODEL))
BIN_MODELNAME := default-2021-03-09

OCR_MODEL := https://zenodo.org/records/17236998/files/models_ocr_v0_5_1.tar.gz?download=1
OCR_MODELFILE = $(notdir $(patsubst %?download=1,%,$(OCR_MODEL)))
OCR_MODELNAME = $(OCR_MODELFILE:%.tar.gz=%)
#SEG_MODEL := https://zenodo.org/records/17194824/files/models_layout_v0_5_0.tar.gz?download=1
EYNOLLAH_MODELS_URL := https://zenodo.org/records/17417471/files/models_all_v0_7_0.zip
EYNOLLAH_MODELS_ZIP = $(notdir $(EYNOLLAH_MODELS_URL))
EYNOLLAH_MODELS_DIR = $(EYNOLLAH_MODELS_ZIP:%.zip=%)

PYTEST_ARGS ?= -vv --isolate

@@ -38,7 +32,7 @@ help:
@echo " install-dev Install editable with pip"
@echo " deps-test Install test dependencies with pip"
@echo " models Download and extract models to $(CURDIR):"
@echo " $(BIN_MODELNAME) $(SEG_MODELNAME) $(OCR_MODELNAME)"
@echo " $(EYNOLLAH_MODELS_DIR)"
@echo " smoke-test Run simple CLI check"
@echo " ocrd-test Run OCR-D CLI check"
@echo " test Run unit tests"
@@ -47,34 +41,22 @@ help:
@echo " EXTRAS comma-separated list of features (like 'OCR,plotting') for 'install' [$(EXTRAS)]"
@echo " DOCKER_TAG Docker image tag for 'docker' [$(DOCKER_TAG)]"
@echo " PYTEST_ARGS pytest args for 'test' (Set to '-s' to see log output during test execution, '-vv' to see individual tests. [$(PYTEST_ARGS)]"
@echo " SEG_MODEL URL of 'models' archive to download for segmentation 'test' [$(SEG_MODEL)]"
@echo " BIN_MODEL URL of 'models' archive to download for binarization 'test' [$(BIN_MODEL)]"
@echo " OCR_MODEL URL of 'models' archive to download for binarization 'test' [$(OCR_MODEL)]"
@echo " EYNOLLAH_MODELS_URL URL of archive of all models [$(EYNOLLAH_MODELS_URL)]"
@echo ""

# END-EVAL


# Download and extract models to $(PWD)/models_layout_v0_5_0
models: $(BIN_MODELNAME) $(SEG_MODELNAME) $(OCR_MODELNAME)
# Download and extract models to $(CURDIR)/$(EYNOLLAH_MODELS_DIR)
models: $(EYNOLLAH_MODELS_DIR)

# do not download these files if we already have the directories
.INTERMEDIATE: $(BIN_MODELFILE) $(SEG_MODELFILE) $(OCR_MODELFILE)

$(BIN_MODELFILE):
wget -O $@ $(BIN_MODEL)
$(SEG_MODELFILE):
wget -O $@ $(SEG_MODEL)
$(OCR_MODELFILE):
wget -O $@ $(OCR_MODEL)

$(BIN_MODELNAME): $(BIN_MODELFILE)
mkdir $@
unzip -d $@ $<
$(SEG_MODELNAME): $(SEG_MODELFILE)
tar zxf $<
$(OCR_MODELNAME): $(OCR_MODELFILE)
tar zxf $<
.INTERMEDIATE: $(EYNOLLAH_MODELS_ZIP)

$(EYNOLLAH_MODELS_ZIP):
$(WGET) $@ $(EYNOLLAH_MODELS_URL)

$(EYNOLLAH_MODELS_DIR): $(EYNOLLAH_MODELS_ZIP)
unzip $<

build:
$(PIP) install build
@@ -88,34 +70,28 @@ install:
install-dev:
$(PIP) install -e .$(and $(EXTRAS),[$(EXTRAS)])

ifeq (OCR,$(findstring OCR, $(EXTRAS)))
deps-test: $(OCR_MODELNAME)
endif
deps-test: $(BIN_MODELNAME) $(SEG_MODELNAME)
deps-test:
$(PIP) install -r requirements-test.txt
ifeq (OCR,$(findstring OCR, $(EXTRAS)))
ln -rs $(OCR_MODELNAME)/* $(SEG_MODELNAME)/
endif

smoke-test: TMPDIR != mktemp -d
smoke-test: tests/resources/kant_aufklaerung_1784_0020.tif
# layout analysis:
eynollah layout -i $< -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME)
eynollah layout -i $< -o $(TMPDIR) -m $(CURDIR)/models_eynollah
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$(basename $(<F)).xml
fgrep -c -e TextRegion -e ImageRegion -e SeparatorRegion $(TMPDIR)/$(basename $(<F)).xml
# layout, directory mode (skip one, add one):
eynollah layout -di $(<D) -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME)
eynollah layout -di $(<D) -o $(TMPDIR) -m $(CURDIR)/models_eynollah
test -s $(TMPDIR)/euler_rechenkunst01_1738_0025.xml
# mbreorder, directory mode (overwrite):
eynollah machine-based-reading-order -di $(<D) -o $(TMPDIR) -m $(CURDIR)/$(SEG_MODELNAME)
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$(basename $(<F)).xml
fgrep -c -e RegionRefIndexed $(TMPDIR)/$(basename $(<F)).xml
# binarize:
eynollah binarization -m $(CURDIR)/$(BIN_MODELNAME) -i $< -o $(TMPDIR)/$(<F)
eynollah binarization -m $(CURDIR)/models_eynollah/eynollah-binarization_20210425 -i $< -o $(TMPDIR)/$(<F)
test -s $(TMPDIR)/$(<F)
@set -x; test "$$(identify -format '%w %h' $<)" = "$$(identify -format '%w %h' $(TMPDIR)/$(<F))"
# enhance:
eynollah enhancement -m $(CURDIR)/$(SEG_MODELNAME) -sos -i $< -o $(TMPDIR) -O
eynollah enhancement -m $(CURDIR)/models_eynollah -sos -i $< -o $(TMPDIR) -O
test -s $(TMPDIR)/$(<F)
@set -x; test "$$(identify -format '%w %h' $<)" = "$$(identify -format '%w %h' $(TMPDIR)/$(<F))"
$(RM) -r $(TMPDIR)
@@ -126,18 +102,16 @@ ocrd-test: tests/resources/kant_aufklaerung_1784_0020.tif
cp $< $(TMPDIR)
ocrd workspace -d $(TMPDIR) init
ocrd workspace -d $(TMPDIR) add -G OCR-D-IMG -g PHYS_0020 -i OCR-D-IMG_0020 $(<F)
ocrd-eynollah-segment -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-SEG -P models $(CURDIR)/$(SEG_MODELNAME)
ocrd-eynollah-segment -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-SEG -P models $(CURDIR)/models_eynollah
result=$$(ocrd workspace -d $(TMPDIR) find -G OCR-D-SEG); \
fgrep -q http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 $(TMPDIR)/$$result && \
fgrep -c -e TextRegion -e ImageRegion -e SeparatorRegion $(TMPDIR)/$$result
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-BIN -P model $(CURDIR)/$(BIN_MODELNAME)
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-SEG -O OCR-D-SEG-BIN -P model $(CURDIR)/$(BIN_MODELNAME) -P operation_level region
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-IMG -O OCR-D-BIN -P model $(CURDIR)/models_eynollah/eynollah-binarization_20210425
ocrd-sbb-binarize -w $(TMPDIR) -I OCR-D-SEG -O OCR-D-SEG-BIN -P model $(CURDIR)/models_eynollah/eynollah-binarization_20210425 -P operation_level region
$(RM) -r $(TMPDIR)

# Run unit tests
test: export MODELS_LAYOUT=$(CURDIR)/$(SEG_MODELNAME)
test: export MODELS_OCR=$(CURDIR)/$(OCR_MODELNAME)
test: export MODELS_BIN=$(CURDIR)/$(BIN_MODELNAME)
test: export EYNOLLAH_MODELS_DIR := $(CURDIR)
test:
$(PYTHON) -m pytest tests --durations=0 --continue-on-collection-errors $(PYTEST_ARGS)
