Releases: docling-project/docling

v2.34.0

22 May 18:44

Feature

  • ocr: Auto-detect rotated pages in Tesseract (#1167) (45265bf)
  • Establish confidence estimation for document and pages (#1313) (9087524)

Fix

  • Fix ZeroDivisionError for cell_bbox.area() (#1636) (c2f595d)
  • integration: Update the Apify Actor integration (#1619) (14d4f5b)
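
The ZeroDivisionError fix above concerns dividing by a cell bounding box's area when that area is zero. The release note does not show the patch, so the following is only a minimal sketch of this kind of guard. The `Box` class and `overlap_ratio` function are hypothetical names for illustration, not docling's API, and a top-left coordinate origin is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box with a top-left origin (top < bottom)."""
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # Clamp each side at zero so inverted or empty boxes report area 0.
        width = max(0.0, self.right - self.left)
        height = max(0.0, self.bottom - self.top)
        return width * height


def overlap_ratio(cell: Box, other: Box) -> float:
    """Return the fraction of `cell` covered by `other`.

    A degenerate cell (zero area) yields 0.0 instead of raising
    ZeroDivisionError.
    """
    cell_area = cell.area()
    if cell_area == 0:
        return 0.0
    intersection = Box(
        left=max(cell.left, other.left),
        top=max(cell.top, other.top),
        right=min(cell.right, other.right),
        bottom=min(cell.bottom, other.bottom),
    )
    return intersection.area() / cell_area
```

Returning 0.0 for a degenerate cell is one possible convention; skipping such cells entirely would be another.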

v2.33.0

20 May 19:54

Feature

  • Add textbox content extraction in msword_backend (#1538) (12a0e64)

Fix

  • Fix issue with detecting docx files, and files with upper case extensions (#1609) (f4d9d41)
  • Load_from_doctags static usage (#1617) (0e00a26)
  • Incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) (f2e9c07)
  • pypdfium: Resolve overlapping text when merging bounding boxes (#1549) (98b5eeb)

v2.32.0

14 May 14:28

Feature

  • Improve parallelization for remote services API calls (#1548) (3a04f2a)
  • Support image/webp file type (#1415) (12dab0a)

Fix

  • ocr: Orig field in TesseractOcrCliModel as str (#1553) (9f8b479)
  • settings: Fix nested settings load via environment variables (#1551) (2efb7a7)

Documentation

  • Add advanced chunking & serialization example (#1589) (9f28abf)

v2.31.2

13 May 10:09

Fix

v2.31.1

12 May 09:44

Fix

  • Add smoldocling in download utils (#1577) (127e386)
  • HTML: Handle row spans in header rows (#1536) (776e7ec)
  • Mime error in document streams (#1523) (f1658ed)
  • Usage of hashlib for FIPS (#1512) (7c70573)
  • Guard against attribute errors in TesseractOcrModel __del__ (#1494) (4ab7e9d)
  • Enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496) (cc45396)
  • Updated the time-recorder label for reading order (#1490) (976e92e)
  • Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459) (94d66a0)
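
For the hashlib/FIPS entry above: on FIPS-enabled systems, some hash constructors may be restricted unless the caller declares that the digest is not used for security purposes. The release note does not say how the fix was implemented. The sketch below only illustrates the standard-library `usedforsecurity` flag (Python 3.9+), and `content_fingerprint` is a hypothetical helper name.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return a hex digest used only as a content identifier.

    usedforsecurity=False (Python 3.9+) tells hashlib that the digest is
    not protecting anything, which FIPS-restricted builds can take into
    account. The choice of SHA-256 here is an assumption for illustration.
    """
    return hashlib.sha256(data, usedforsecurity=False).hexdigest()
```

The same keyword is accepted by the other `hashlib` constructors, such as `hashlib.md5`.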

Documentation

v2.31.0

25 Apr 08:28

Feature

  • Add tutorial using Milvus and Docling for RAG pipeline (#1449) (a2fbbba)

Fix

  • html: Handle address, details, and summary tags (#1436) (ed20124)
  • Treat overflowing -v flags as DEBUG (#1419) (8012a3e)
  • codecov: Fix codecov argument and yaml file (#1399) (fa7fc9e)
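
The "-v flags" fix above can be pictured as clamping a repeated-flag count to the most verbose level. This is a minimal sketch, not the CLI's actual code: the function name and the mapping of 0, 1, and 2 flags to WARNING, INFO, and DEBUG are assumptions.

```python
import logging

# Assumed mapping: no flag -> WARNING, -v -> INFO, -vv -> DEBUG.
_LEVELS = [logging.WARNING, logging.INFO, logging.DEBUG]


def level_for_verbosity(count: int) -> int:
    """Map a count of -v flags to a logging level.

    Counts beyond the last defined level (for example -vvvv) are clamped
    to DEBUG instead of causing an out-of-range lookup.
    """
    index = min(max(count, 0), len(_LEVELS) - 1)
    return _LEVELS[index]
```

With this sketch, `level_for_verbosity(5)` returns `logging.DEBUG`, the same as two flags.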

Documentation

v2.30.0

14 Apr 08:20

Feature

  • cli: Add option for html with split-page mode (#1355) (c0ba88e)
  • xlsx: Create a page for each worksheet in XLSX backend (#1332) (eef2bde)
  • OllamaVlmModel for Granite Vision 3.2 (#1337) (c605edd)

Fix

  • deps: Widen typer upper bound (#1375) (7e40ad3)
  • Auto-recognize .xlsx, .docx and .pptx files (#1340) (0de70e7)
  • docx: Declare image_data variable when handling pictures (#1359) (415b877)
  • Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248) (2503999)
  • Properly address page in pipeline _assemble_document when page_range is provided (#1334) (6b696b5)

v2.29.0

10 Apr 12:24

Feature

  • Handle <code> tags as code blocks (#1320) (0499cd1)
  • docx: Add text formatting and hyperlink support (#630) (bfcab3d)

Fix

  • docx: Adding new latex symbols, simplifying how equations are added to text (#1295) (14e9c0c)
  • pptx: Check if picture shape has an image attached (#1316) (dc3bf9c)
  • docx: Improve text parsing (#1268) (d2d6874)
  • Tesseract OCR CLI can't process images composed with numbers only (#1201) (b3d111a)

Documentation

v2.28.4

29 Mar 11:56

Fix

v2.28.3

28 Mar 18:30

Fix