[Doc] Missing OCR setup guide: pymupdf.layout dependency not documented #4833

# [Documentation] Add OCR requirements for pymupdf.layout import in PyMuPDF4LLM

## Summary

The official PyMuPDF4LLM documentation does not state that `pymupdf.layout` must be imported before `pymupdf4llm` to enable OCR with Tesseract. This is a critical requirement, and users cannot discover it from the primary documentation.

## Problem Description

Users who try to OCR scanned or image-only PDFs with PyMuPDF4LLM hit failures with no indication of the cause. The standard PyMuPDF4LLM documentation:

- ✗ Makes **no mention** of OCR at all
- ✗ Doesn't mention `pymupdf.layout` as a requirement
- ✗ Doesn't explain OCR behavior for image-only PDFs
- ✗ Doesn't clarify how Tesseract integration works

## Current Behavior

**Without `import pymupdf.layout`:**
```python
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("scanned.pdf")
# Result: text extraction fails on image-only pages; Tesseract is never invoked
```

- Image-only pages fail text extraction
- Tesseract is never invoked
- Users must implement manual OCR fallback logic
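
For context, the manual fallback that users end up writing might look like the sketch below. It is an illustration under stated assumptions, not code from PyMuPDF4LLM: `needs_ocr` and `extract_with_ocr_fallback` are hypothetical helper names, and the 20-character threshold and 300 dpi setting are arbitrary choices. It relies on PyMuPDF's `Page.get_textpage_ocr()`, which needs Tesseract language data to be available.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as image-only when it has almost no extractable text.

    The 20-character threshold is an arbitrary assumption for illustration.
    """
    return len(text.strip()) < min_chars


def extract_with_ocr_fallback(path: str, language: str = "eng", dpi: int = 300) -> list[str]:
    """Return one text string per page, running OCR only where needed."""
    import pymupdf  # imported lazily so needs_ocr() is usable without PyMuPDF

    pages = []
    with pymupdf.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # OCR the full page image, then read text from the OCR text page
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                text = page.get_text(textpage=textpage)
            pages.append(text)
    return pages
```

This returns plain text rather than Markdown, which is part of why an undocumented automatic path is costly for users.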

**With `import pymupdf.layout`:**
```python
import pymupdf.layout  # REQUIRED for OCR support
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("scanned.pdf")
# Result: Automatically detects image-only pages and invokes Tesseract
```

- Layout-sensitive mode activates
- Heuristics automatically detect image-only pages
- Tesseract OCR is automatically invoked when needed
- Text extraction succeeds for scanned documents

## Expected Documentation

The official PyMuPDF4LLM documentation should explicitly state:

### OCR Support Requirements

To enable OCR support for scanned/image-only documents:

1. Import `pymupdf.layout` **before** importing `pymupdf4llm`
2. Install Tesseract OCR on your system
3. With both in place, layout detection is enabled and PyMuPDF4LLM automatically uses Tesseract on pages it detects as image-only
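
A preflight check along these lines could accompany the requirements. It is a minimal sketch using only the standard library. `check_ocr_prerequisites` is a hypothetical helper, and treating a `tesseract` executable on `PATH` as proof of a usable installation is an assumption; the library may locate Tesseract differently.

```python
import importlib.util
import shutil


def check_ocr_prerequisites() -> dict[str, bool]:
    """Report whether the pieces needed for OCR appear to be present."""
    try:
        layout_found = importlib.util.find_spec("pymupdf.layout") is not None
    except ImportError:
        # Raised when the parent package `pymupdf` is missing or fails to import
        layout_found = False

    return {
        "pymupdf.layout": layout_found,
        # Assumption: a `tesseract` binary on PATH indicates a usable install
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_ocr_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Running this before calling `pymupdf4llm.to_markdown()` separates "OCR was never set up" from "OCR ran but produced poor text".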

### Example Code

```python
import pymupdf.layout  # REQUIRED for OCR support
import pymupdf4llm

# Process PDF with automatic OCR for image-only pages
md_text = pymupdf4llm.to_markdown("scanned.pdf")
```

### How It Works

The `pymupdf.layout` module includes heuristics that:
- Detect when a page is image-only (scanned or photo)
- Automatically invoke Tesseract OCR for those pages
- Return extracted text seamlessly

Without `pymupdf.layout`, PyMuPDF4LLM operates in standard mode without OCR decision logic.
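
To make the idea of an image-only heuristic concrete, here is a purely illustrative sketch. It is not the algorithm used by `pymupdf.layout`, whose internals are not described in this issue. `PageStats`, `looks_image_only`, `pages_needing_ocr`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not a PyMuPDF type."""
    text_chars: int          # characters returned by ordinary text extraction
    image_area_ratio: float  # fraction of the page area covered by images (0.0 to 1.0)


def looks_image_only(stats: PageStats, max_chars: int = 20, min_image_ratio: float = 0.8) -> bool:
    """Guess whether a page is a scan: almost no text and mostly covered by images."""
    return stats.text_chars <= max_chars and stats.image_area_ratio >= min_image_ratio


def pages_needing_ocr(all_stats: list[PageStats]) -> list[int]:
    """Return zero-based indices of pages that the heuristic would send to OCR."""
    return [i for i, s in enumerate(all_stats) if looks_image_only(s)]
```

For example, given `[PageStats(1500, 0.1), PageStats(0, 0.98), PageStats(5, 0.95)]`, `pages_needing_ocr` returns `[1, 2]`: the first page has real text, while the other two are nearly textless and mostly image.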

## Why This Gap Exists

This appears to be a recent feature (layout module integration with automatic OCR triggering). The Artifex Blog tutorial (November 2025) explains it clearly, but the official PyMuPDF4LLM documentation has not yet been updated to cover it.

## Reference

- **Artifex Blog Tutorial:** Clearly documents the OCR behavior when `pymupdf.layout` is imported
- **Official Docs:** https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html (missing OCR documentation)

## Suggested Resolution

Update the PyMuPDF4LLM documentation to include:
- A dedicated section on OCR support
- A clear statement that `pymupdf.layout` must be imported before `pymupdf4llm`
- System requirements (Tesseract installation)
- Working example code
- An explanation of the automatic OCR heuristics

