[Documentation] Add OCR requirements for pymupdf.layout import in PyMuPDF4LLM
Summary
The PyMuPDF4LLM official documentation does not document that pymupdf.layout must be imported before pymupdf4llm to enable OCR functionality with Tesseract. This is a critical requirement that users cannot discover from the primary documentation source.
Problem Description
Users attempting to use OCR on scanned/image-only PDFs with PyMuPDF4LLM experience failures without understanding why. The standard PyMuPDF4LLM documentation:
- ✗ Makes no mention of OCR at all
- ✗ Doesn't mention pymupdf.layout as a requirement
- ✗ Doesn't explain OCR behavior for image-only PDFs
- ✗ Doesn't clarify how Tesseract integration works
Current Behavior
Without import pymupdf.layout:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("scanned.pdf")
# Result: Fails on image-only pages, falls back to alternative extractors
- Image-only pages fail text extraction
- Tesseract is never invoked
- Users must implement manual OCR fallback logic
With import pymupdf.layout:
import pymupdf.layout # REQUIRED for OCR support
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("scanned.pdf")
# Result: Automatically detects image-only pages and invokes Tesseract
- Layout-sensitive mode activates
- Heuristics automatically detect image-only pages
- Tesseract OCR is automatically invoked when needed
- Text extraction succeeds for scanned documents
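Because the behavior described above depends on import order, a small guard can catch the wrong order early. The following is a minimal sketch, not part of PyMuPDF4LLM: the helper name `layout_imported_first` is hypothetical, and it assumes that the insertion order of `sys.modules` reflects the order in which modules were first imported (which holds unless entries were deleted or re-added).

```python
import sys
from typing import Mapping, Optional


def layout_imported_first(modules: Optional[Mapping[str, object]] = None) -> bool:
    """Return True if pymupdf.layout was registered before pymupdf4llm.

    Hypothetical helper: relies on dict insertion order of sys.modules,
    which is an assumption about import bookkeeping, not a PyMuPDF API.
    """
    names = list(sys.modules if modules is None else modules)
    if "pymupdf.layout" not in names:
        return False  # layout module was never imported
    if "pymupdf4llm" not in names:
        return True  # layout is loaded and pymupdf4llm has not been imported yet
    return names.index("pymupdf.layout") < names.index("pymupdf4llm")
```

Calling `layout_imported_first()` right after the imports, and raising or logging when it returns `False`, would make a silent OCR failure easier to diagnose.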
Expected Documentation
The official PyMuPDF4LLM documentation should explicitly state:
OCR Support Requirements
To enable OCR support for scanned/image-only documents:
- Import pymupdf.layout before importing pymupdf4llm
- Have Tesseract OCR installed on your system
- When layout detection is enabled, PyMuPDF4LLM will automatically use Tesseract when it detects image-only pages
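The requirements above could be checked up front with a short script. This is a sketch using only the standard library: the function name `ocr_prerequisites` is hypothetical, and the `TESSDATA_PREFIX` check is an assumption about how the Tesseract language data is located, since this issue only states that Tesseract must be installed.

```python
import importlib.util
import os
import shutil


def ocr_prerequisites() -> dict:
    """Report which OCR prerequisites appear to be present (hypothetical helper)."""
    try:
        # find_spec imports the parent package (pymupdf) as a side effect
        layout_found = importlib.util.find_spec("pymupdf.layout") is not None
    except (ImportError, ValueError):
        layout_found = False  # pymupdf itself is missing or broken
    return {
        "pymupdf.layout importable": layout_found,
        "tesseract on PATH": shutil.which("tesseract") is not None,
        # Assumption: language data is found via this environment variable
        "TESSDATA_PREFIX set": bool(os.environ.get("TESSDATA_PREFIX")),
    }


if __name__ == "__main__":
    for name, ok in ocr_prerequisites().items():
        print(f"{'ok     ' if ok else 'missing'}  {name}")
```

A missing entry does not prove that OCR will fail, only that the corresponding requirement could not be confirmed on this system.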
Example Code
import pymupdf.layout # REQUIRED for OCR support
import pymupdf4llm
# Process PDF with automatic OCR for image-only pages
md_text = pymupdf4llm.to_markdown("scanned.pdf")
How It Works
The pymupdf.layout module includes heuristics that:
- Detect when a page is image-only (scanned or photo)
- Automatically invoke Tesseract OCR for those pages
- Return extracted text seamlessly
Without pymupdf.layout, PyMuPDF4LLM operates in standard mode without OCR decision logic.
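To illustrate what "detecting an image-only page" can mean, here is a rough sketch. It is not the heuristic that pymupdf.layout actually uses, which this issue does not describe. The function names and the `min_chars` threshold are hypothetical. A page is treated as image-only when it carries at least one image and almost no extractable text.

```python
from typing import List


def looks_image_only(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: at least one image and almost no extractable text."""
    return image_count > 0 and len(text.strip()) < min_chars


def image_only_pages(path: str) -> List[int]:
    """Return 0-based numbers of pages that look image-only (illustrative only)."""
    import pymupdf  # imported lazily so the heuristic itself has no dependency

    with pymupdf.open(path) as doc:
        return [
            page.number
            for page in doc
            if looks_image_only(page.get_text(), len(page.get_images()))
        ]
```

Running `image_only_pages("scanned.pdf")` before conversion gives a rough idea of which pages would need OCR, independent of whether the layout module is loaded.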
Why This Gap Exists
This appears to be a recent feature (layout module integration with automatic OCR triggering). The Artifex Blog tutorial (November 2025) explains this clearly, but the official PyMuPDF4LLM documentation has not been updated to include this critical information.
Reference
- Artifex Blog Tutorial: Clearly documents the OCR behavior when pymupdf.layout is imported
- Official Docs: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html (missing OCR documentation)
Suggested Resolution
Update the PyMuPDF4LLM documentation to include:
- A dedicated section on OCR support
- Clear requirement to import pymupdf.layout before pymupdf4llm
- System requirements (Tesseract installation)
- Working example code
- Explanation of automatic OCR heuristics