🐞 Bug Report: ocr_engine and OCR-related options ignored in /v1/convert/file API (RapidOCR always used) #567

@shohart

Bug Report: OCR options ignored in /v1/convert/file

Summary

When using the /v1/convert/file endpoint with multipart/form-data, OCR-related parameters such as:

  • ocr_engine
  • ocr_lang
  • do_ocr
  • force_ocr

are ignored.

Regardless of the provided options, the pipeline consistently falls back to RapidOCR, even when:

  • Tesseract is installed and available
  • OCR options are explicitly specified in the request

However, the same parameters work correctly when using the /ui (Gradio interface).


Environment

  • docling-serve: latest (Docker image)
  • Deployment: Docker container

docker-compose.yml

docling:
  image: ghcr.io/docling-project/docling-serve-cpu:latest
  container_name: docling
  restart: unless-stopped
  ports:
    - "192.168.1.122:5001:5001"
  expose:
    - "5001"
  environment:
    - DOCLING_SERVE_ENABLE_UI=1
    - DOCLING_SERVE_LOAD_MODELS_AT_BOOT=false
    - TESSDATA_PREFIX=/opt/tessdata/
  volumes:
    - docling_cache:/root/.cache
    - docling_data:/app/data
    - ./docling-tessdata:/opt/tessdata

  • OCR engine installed: Tesseract 4.1.1
  • Languages available: eng, rus
  • API usage: curl with multipart/form-data

Steps to Reproduce

1. Ensure Tesseract is installed inside the container

docker exec -it docling tesseract --list-langs

Output:

osd
eng
rus
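To make this check repeatable, the listing can be compared against the languages the request asks for. This is a minimal sketch: `missing_languages` is a hypothetical helper, not part of docling-serve, and it assumes the output has one language code per line, possibly preceded by a header line that contains spaces.

```python
def missing_languages(list_langs_output: str, required: list[str]) -> list[str]:
    """Return the required language codes that the listing does not contain."""
    available = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        # Skip blank lines and any header line (assumed to contain spaces).
        if not line or " " in line:
            continue
        available.add(line)
    return [lang for lang in required if lang not in available]


# Output copied from the step above; "rus,eng" is the ocr_lang value in the request.
listing = "osd\neng\nrus\n"
print(missing_languages(listing, "rus,eng".split(",")))
```

An empty list means every requested language is installed, so a missing traineddata file can be ruled out as the reason Tesseract is not used.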

2. Send request using /v1/convert/file

curl.exe -X POST "http://<host>:5001/v1/convert/file" ^
  -H "accept: application/json" ^
  -F "files=@test.pdf;type=application/pdf" ^
  --form-string "options={\"do_ocr\":true,\"ocr_engine\":\"tesseract\",\"ocr_lang\":\"rus,eng\",\"pdf_backend\":\"pypdfium2\",\"to_formats\":[\"md\"]}"

Expected Behavior

  • Tesseract should be used as the OCR engine

Logs should contain:

tesseract_ocr_cli_model

Actual Behavior

  • RapidOCR is always used instead

Logs:

[RapidOCR] Using engine_name: onnxruntime
Using /opt/app-root/src/artifacts/RapidOcr/onnx/PP-OCRv4/...

  • No indication that Tesseract is initialized
  • OCR output quality matches RapidOCR behavior
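The comparison between expected and actual logs can be automated when testing many requests. The sketch below relies only on the two markers quoted in this report (`tesseract_ocr_cli_model` and `[RapidOCR]`); `detect_ocr_engine` is a hypothetical helper, and the assumption that these strings reliably identify the engine is mine.

```python
def detect_ocr_engine(log_text: str) -> str:
    """Guess which OCR engine was initialized from container log text."""
    # Markers taken from the expected and actual logs in this report.
    markers = {
        "tesseract": "tesseract_ocr_cli_model",
        "rapidocr": "[RapidOCR]",
    }
    found = [engine for engine, marker in markers.items() if marker in log_text]
    if not found:
        return "unknown"
    if len(found) > 1:
        return "both"
    return found[0]


# First log line from the actual behavior above.
sample = "[RapidOCR] Using engine_name: onnxruntime"
print(detect_ocr_engine(sample))
```

The log text could come from `docker logs docling`, captured right after each request.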

Additional Observations

  • The same document processed via /ui:

    • correctly uses Tesseract
    • produces significantly better OCR results (especially for Russian text)
  • Tesseract is confirmed working:

    • callable via CLI inside container
    • languages available
    • no runtime errors when used independently
  • When forcing OCR options:

    • force_ocr=true does not change behavior
    • ocr_engine=tesseract is ignored

Hypothesis

There may be an inconsistency between:

  • Gradio UI pipeline configuration
  • /v1/convert/file API handling of options

Possible causes:

  • options field in multipart requests is not parsed correctly
  • OCR configuration is overridden by default pipeline settings
  • Pipeline selection differs between UI and API
  • ocr_engine is treated as a hint rather than an enforced parameter
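One way to test the first possible cause is to send each option as its own form field instead of a single JSON options field. Whether /v1/convert/file expects that layout is an assumption here, not something confirmed in this report. The sketch below uses a hypothetical helper, `flatten_options`, to convert the JSON payload from the request above into one `-F` argument per option. Representing list values as repeated fields is also an assumption.

```python
import json


def flatten_options(options_json: str) -> list[tuple[str, str]]:
    """Turn a JSON options payload into one (name, value) pair per form field."""
    fields = []
    for name, value in json.loads(options_json).items():
        # Assumption: list values are sent as repeated fields with the same name.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"
            fields.append((name, str(item)))
    return fields


# Same options as in the curl command from the reproduction steps.
payload = (
    '{"do_ocr": true, "ocr_engine": "tesseract", "ocr_lang": "rus,eng", '
    '"pdf_backend": "pypdfium2", "to_formats": ["md"]}'
)
for name, value in flatten_options(payload):
    print(f'-F "{name}={value}"')
```

The printed arguments can replace the `--form-string "options=..."` line in the curl command. If Tesseract then appears in the logs, the options field itself is the part being ignored.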

Additional Tests

The issue persists even when:

  • using --form-string for options
  • using minimal options payload
  • removing unrelated parameters

Impact

  • Makes API usage unreliable for OCR workflows
  • Prevents deterministic selection of OCR engine
  • Blocks integration in automated pipelines

Request

Please clarify:

  1. Is ocr_engine expected to be strictly enforced?
  2. Is /v1/convert/file fully equivalent to /ui in terms of pipeline behavior?
  3. Is there a supported way to force Tesseract as the default OCR engine?

Suggested Fix

  • Ensure options JSON is correctly parsed from multipart requests
  • Allow strict enforcement of ocr_engine
  • Align API behavior with /ui pipeline
