Linked to issue #399 (which still seems to be happening for me).
When selecting pipeline type = Vlm in the docling-serve UI, inference appears to run indefinitely (100% local GPU usage). The auto-selected VLM model is Granite-docling.
I've reproduced this behavior using the following docling-serve API invocation:
curl -X 'POST' \
  'http://localhost:5001/v1/convert/source' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "sources": [{"kind": "http", "url": "https://arxiv.org/pdf/2501.17887"}],
    "options": {
      "ocr_engine": "easyocr",
      "pdf_backend": "dlparse_v4",
      "from_formats": ["pdf", "docx"],
      "force_ocr": false,
      "image_export_mode": "placeholder",
      "do_ocr": true,
      "ocr_lang": ["en", "it"],
      "table_mode": "accurate",
      "to_formats": ["md", "json", "html", "text", "doctags"],
      "abort_on_error": false,
      "pipeline": "vlm",
      "vlm_pipeline_model": "granite_docling"
    }
  }'
However, I've recently found that the problem does not reproduce if you invoke the docling-serve API the following way (HuggingFace parameters documented at ):
curl -X 'POST' \
  'http://localhost:5001/v1/convert/source' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "sources": [{"kind": "http", "url": "https://arxiv.org/pdf/2501.17887"}],
    "options": {
      "ocr_engine": "easyocr",
      "pdf_backend": "dlparse_v4",
      "from_formats": ["pdf", "docx"],
      "force_ocr": false,
      "image_export_mode": "placeholder",
      "do_ocr": true,
      "ocr_lang": ["en", "it"],
      "table_mode": "accurate",
      "to_formats": ["md", "json", "html", "text", "doctags"],
      "abort_on_error": false,
      "pipeline": "vlm",
      "vlm_pipeline_model_local": {
        "repo_id": "ibm-granite/granite-docling-258M",
        "inference_framework": "transformers",
        "transformers_model_type": "automodel-imagetexttotext",
        "temperature": 0.0,
        "prompt": "Convert this page to docling.",
        "response_format": "doctags"
      }
    }
  }'
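For anyone scripting this, here is a minimal Python sketch (stdlib only, no docling dependency) of the second, non-hanging invocation above. The endpoint path and all option names are copied verbatim from the curl example; the `base_url` default assumes the same local deployment on port 5001, so adjust it for your setup.

```python
import json
import urllib.request


def build_payload(source_url: str) -> dict:
    """Build the /v1/convert/source body with an explicit local VLM model spec
    (the variant that did NOT hang in my tests)."""
    return {
        "sources": [{"kind": "http", "url": source_url}],
        "options": {
            "ocr_engine": "easyocr",
            "pdf_backend": "dlparse_v4",
            "from_formats": ["pdf", "docx"],
            "force_ocr": False,
            "image_export_mode": "placeholder",
            "do_ocr": True,
            "ocr_lang": ["en", "it"],
            "table_mode": "accurate",
            "to_formats": ["md", "json", "html", "text", "doctags"],
            "abort_on_error": False,
            "pipeline": "vlm",
            # Explicit HuggingFace model spec instead of "vlm_pipeline_model".
            "vlm_pipeline_model_local": {
                "repo_id": "ibm-granite/granite-docling-258M",
                "inference_framework": "transformers",
                "transformers_model_type": "automodel-imagetexttotext",
                "temperature": 0.0,
                "prompt": "Convert this page to docling.",
                "response_format": "doctags",
            },
        },
    }


def convert(source_url: str, base_url: str = "http://localhost:5001") -> dict:
    """POST the payload to docling-serve and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/convert/source",
        data=json.dumps(build_payload(source_url)).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Swapping `vlm_pipeline_model_local` back to `"vlm_pipeline_model": "granite_docling"` in `build_payload` reproduces the hang for me.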
PLEASE NOTE: The hanging problem also does not reproduce when running the underlying docling library directly, as follows:
docling --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
So it seems that the docling CLI invokes the underlying library in a manner closer to the second docling-serve API example, while the docling-serve UI behaves like the first example above.
docling-serve version = 1.14.3
docling version = 2.80.0
GPU: NVIDIA RTX 4000 SFF Ada Generation