🐞 Bug Report: ocr_engine and OCR-related options ignored in /v1/convert/file API (RapidOCR always used) Summary #567

# Bug Report: OCR options ignored in /v1/convert/file

## Summary

When using the `/v1/convert/file` endpoint with `multipart/form-data`, OCR-related parameters such as:

- `ocr_engine`
- `ocr_lang`
- `do_ocr`
- `force_ocr`

are ignored.

Regardless of the provided options, the pipeline consistently falls back to **RapidOCR**, even when:

- Tesseract is installed and available  
- OCR options are explicitly specified in the request  

However, the same parameters work correctly when using the `/ui` (Gradio interface).

---

## Environment

- `docling-serve`: latest (Docker image)  
- Deployment: Docker container  

### docker-compose.yml

```
docling:
  image: ghcr.io/docling-project/docling-serve-cpu:latest
  container_name: docling
  restart: unless-stopped
  ports:
    - "192.168.1.122:5001:5001"
  expose:
    - "5001"
  environment:
    - DOCLING_SERVE_ENABLE_UI=1
    - DOCLING_SERVE_LOAD_MODELS_AT_BOOT=false
    - TESSDATA_PREFIX=/opt/tessdata/
  volumes:
    - docling_cache:/root/.cache
    - docling_data:/app/data
    - ./docling-tessdata:/opt/tessdata
```

- OCR engine installed: **Tesseract 4.1.1**  
- Languages available: `eng`, `rus`  
- API usage: `curl` with `multipart/form-data`  

---

## Steps to Reproduce

### 1. Ensure Tesseract is installed inside the container

```
docker exec -it docling tesseract --list-langs
```

Output:

```
osd
eng
rus
```
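To rule out a missing language pack, the listing above can be checked against the requested `ocr_lang` value. This is a minimal sketch: `missing_languages` is a hypothetical helper name, and it assumes the comma-separated format used in the request below.

```python
def missing_languages(list_langs_output: str, requested: str) -> list[str]:
    """Return requested language codes absent from `tesseract --list-langs` output."""
    available = {
        line.strip()
        for line in list_langs_output.splitlines()
        # Skip blank lines and the "List of available languages ..." header
        # that some Tesseract builds print before the codes.
        if line.strip() and not line.lower().startswith("list of")
    }
    wanted = [code.strip() for code in requested.split(",") if code.strip()]
    return [code for code in wanted if code not in available]


# Output copied from the container listing in this report.
output = "osd\neng\nrus\n"
print(missing_languages(output, "rus,eng"))  # []
```

An empty list means every requested language is installed, so a missing language pack should not be the reason Tesseract is skipped.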

---

### 2. Send request using `/v1/convert/file`

```
curl.exe -X POST "http://<host>:5001/v1/convert/file" ^
  -H "accept: application/json" ^
  -F "files=@test.pdf;type=application/pdf" ^
  --form-string "options={\"do_ocr\":true,\"ocr_engine\":\"tesseract\",\"ocr_lang\":\"rus,eng\",\"pdf_backend\":\"pypdfium2\",\"to_formats\":[\"md\"]}"
```

---

## Expected Behavior

- Tesseract should be used as the OCR engine  

Logs should contain:

```
tesseract_ocr_cli_model
```

---

## Actual Behavior

- RapidOCR is always used instead  

Logs:

```
[RapidOCR] Using engine_name: onnxruntime
Using /opt/app-root/src/artifacts/RapidOcr/onnx/PP-OCRv4/...
```

- No indication that Tesseract is initialized  
- OCR output quality matches RapidOCR behavior  
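The log comparison can be automated when repeating the test with different options. The sketch below only looks for the two markers quoted in this report (`tesseract_ocr_cli_model` and `[RapidOCR]`); `engines_seen` is a hypothetical helper, and other engines or log formats are not covered.

```python
# Log markers taken from the Expected and Actual Behavior sections of this report.
MARKERS = {
    "tesseract": "tesseract_ocr_cli_model",
    "rapidocr": "[RapidOCR]",
}


def engines_seen(log_text: str) -> set[str]:
    """Return the names of OCR engines whose marker appears in the log text."""
    return {name for name, marker in MARKERS.items() if marker in log_text}


sample = "[RapidOCR] Using engine_name: onnxruntime\n"
print(engines_seen(sample))  # {'rapidocr'}
```

The log text can come from `docker logs docling`, captured right after each request, so that each run is attributed to the correct set of options.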

---

## Additional Observations

- The same document processed via `/ui`:
  - correctly uses Tesseract  
  - produces significantly better OCR results (especially for Russian text)  

- Tesseract is confirmed working:
  - callable via CLI inside container  
  - languages available  
  - no runtime errors when used independently  

- When forcing OCR options:
  - `force_ocr=true` does not change behavior  
  - `ocr_engine=tesseract` is ignored  

---

## Hypothesis

There may be an inconsistency between:

- Gradio UI pipeline configuration  
- `/v1/convert/file` API handling of `options`  

Possible causes:

- `options` field in multipart requests is not parsed correctly  
- OCR configuration is overridden by default pipeline settings  
- Pipeline selection differs between UI and API  
- `ocr_engine` is treated as a hint rather than an enforced parameter  
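Before suspecting server-side parsing, it is worth confirming that the `options` string is valid JSON once the `cmd` escaping is removed. The sketch below checks the payload from the reproduction step on the client side only; `parse_options` is a hypothetical helper and says nothing about how the server handles the field.

```python
import json

# The options payload from the curl command, with the cmd.exe backslash escapes removed.
RAW_OPTIONS = (
    '{"do_ocr":true,"ocr_engine":"tesseract","ocr_lang":"rus,eng",'
    '"pdf_backend":"pypdfium2","to_formats":["md"]}'
)


def parse_options(raw: str) -> dict:
    """Decode the options string and make sure it is a JSON object."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(parsed, dict):
        raise ValueError("options must decode to a JSON object")
    return parsed


options = parse_options(RAW_OPTIONS)
print(options["ocr_engine"], options["do_ocr"])  # tesseract True
```

The payload decodes cleanly, so client-side quoting is unlikely to be the cause. Note that `ocr_lang` is sent here as a single string (`"rus,eng"`); whether the server expects a list instead is not established in this report.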

---

## Additional Tests

The issue persists even when:

- using `--form-string` for options  
- using minimal options payload  
- removing unrelated parameters  
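One variation not covered by the tests above is sending each option as its own form field instead of a single JSON `options` field. Whether `/v1/convert/file` reads options in that shape is an assumption here, not something this report confirms. The hypothetical helper `options_to_form_args` only builds the extra `curl` arguments and sends nothing.

```python
def options_to_form_args(options: dict) -> list[str]:
    """Turn an options dict into one `--form-string key=value` pair per value.

    Assumption: the endpoint accepts flat form fields; list values are sent
    as repeated fields and booleans as lowercase "true"/"false".
    """
    args: list[str] = []
    for key, value in options.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, bool):
                text = "true" if item else "false"
            else:
                text = str(item)
            args += ["--form-string", f"{key}={text}"]
    return args


# Same options as in the reproduction step.
options = {
    "do_ocr": True,
    "ocr_engine": "tesseract",
    "ocr_lang": "rus,eng",
    "pdf_backend": "pypdfium2",
    "to_formats": ["md"],
}
print(" ".join(options_to_form_args(options)))
```

The printed arguments can replace the `--form-string "options=..."` part of the original command. If Tesseract shows up in the logs with this variant, the problem is narrowed to how the `options` field is handled.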

---

## Impact

- Makes API usage unreliable for OCR workflows  
- Prevents deterministic selection of OCR engine  
- Blocks integration in automated pipelines  

---

## Request

Please clarify:

1. Is `ocr_engine` expected to be strictly enforced?  
2. Is `/v1/convert/file` fully equivalent to `/ui` in terms of pipeline behavior?  
3. Is there a supported way to force Tesseract as the default OCR engine?  

---

## Suggested Fix

- Ensure `options` JSON is correctly parsed from multipart requests  
- Allow strict enforcement of `ocr_engine`  
- Align API behavior with `/ui` pipeline  

