# MarkItDown OCR Plugin

LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.

Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions; no new ML libraries or binary dependencies are required.

## Features

- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
- **Enhanced DOCX Converter**: OCR for images in Word documents
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
- **Context Preservation**: Maintains document structure and flow when inserting extracted text

## Installation

```bash
pip install markitdown-ocr
```

The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:

```bash
pip install openai
```

## Usage

### Command Line

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

### Python API

Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and conversion falls back to the standard built-in converters.

### Custom Prompt

Override the default extraction prompt for specialized documents:

```python
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)
```

### Any OpenAI-Compatible Client

Works with any client that follows the OpenAI API:

```python
from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)
```

## How It Works

When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:

1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
2. It calls `register_converters()`, forwarding all kwargs, including `llm_client` and `llm_model`
3. The plugin creates an `LLMVisionOCRService` from those kwargs
4. Four OCR-enhanced converters are registered at **priority -1.0**, ahead of the built-in converters at priority 0.0 (lower priority values are tried first)
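
The priority mechanics above can be sketched with a toy registry. Assuming MarkItDown tries converters in ascending priority order, a converter registered at -1.0 is consulted before one at 0.0. The class and registry names here are illustrative, not the plugin's actual internals:

```python
# Toy model of priority-based converter selection (illustrative names).
class Converter:
    def __init__(self, name):
        self.name = name
        self.priority = None

registry = []

def register_converter(converter, priority):
    """Record a converter along with its priority."""
    converter.priority = priority
    registry.append(converter)

# The built-in PDF converter sits at 0.0; the OCR-enhanced one at -1.0.
register_converter(Converter("BuiltinPdfConverter"), priority=0.0)
register_converter(Converter("OcrPdfConverter"), priority=-1.0)

# Sorting by priority puts the OCR converter first, so it gets the first
# chance to accept a PDF; the built-in converter remains as a fallback.
order = [c.name for c in sorted(registry, key=lambda c: c.priority)]
print(order)  # ['OcrPdfConverter', 'BuiltinPdfConverter']
```

Because the built-in converters are never removed, uninstalling the plugin restores the default behavior with no code changes.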

When a file is converted:

1. The OCR converter accepts the file
2. It extracts embedded images from the document
3. Each image is sent to the LLM with an extraction prompt
4. The returned text is inserted inline, preserving document structure
5. If the LLM call fails, conversion continues without that image's text
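
The fault tolerance in step 5 can be sketched as a per-image loop that never lets one failed LLM call abort the whole conversion. The `extract_fn` callable stands in for the vision API call; it is an assumption, not the plugin's real interface:

```python
import warnings

def ocr_images(images, extract_fn):
    """Return the extracted text per image, skipping images whose LLM call fails.

    `extract_fn` is a stand-in for the vision API call (hypothetical signature:
    image bytes in, extracted text out).
    """
    results = []
    for image_bytes in images:
        try:
            text = extract_fn(image_bytes)
        except Exception as exc:
            # An API failure for one image must not abort the conversion;
            # surface it as a warning and move on.
            warnings.warn(f"OCR failed for one image: {exc}")
            continue
        if text and text.strip():
            results.append(text.strip())
    return results
```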

## Supported File Formats

### PDF

- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. a truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
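
The scanned-page detection described above comes down to a simple heuristic: if a page's text layer yields nothing, treat the page as a scan and render it for full-page OCR. A minimal sketch, where the character threshold is an illustrative assumption rather than the plugin's actual cutoff:

```python
def needs_full_page_ocr(page_text, min_chars=1):
    """True when a PDF page yields too little text to trust its text layer.

    `page_text` is whatever the text extractor returned for the page
    (possibly None); `min_chars` is an assumed, tunable threshold.
    """
    return len((page_text or "").strip()) < min_chars
```

In the real converter a `True` result would gate rendering the page to a 300 DPI image (e.g. via PyMuPDF) before sending it to the LLM.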

### DOCX

- Images are extracted via document part relationships (`doc.part.rels`).
- OCR runs before the DOCX→HTML→Markdown pipeline executes. Placeholder tokens are injected into the HTML so that the markdown converter does not escape the OCR markers; after conversion, the placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks.
- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
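
The placeholder round-trip can be sketched as follows. The token format is made up for illustration; the point is that an opaque alphanumeric token survives the HTML→Markdown conversion without being escaped, and is swapped for the real OCR block afterwards:

```python
def make_placeholder(i):
    """Opaque token for image `i`; plain alphanumerics survive markdown escaping."""
    return f"MDOCRPLACEHOLDER{i}X"

def restore_placeholders(markdown, ocr_texts):
    """Replace each placeholder token with its formatted OCR block."""
    for i, text in enumerate(ocr_texts):
        block = f"*[Image OCR]\n{text}\n[End OCR]*"
        markdown = markdown.replace(make_placeholder(i), block)
    return markdown
```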

### PPTX

- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
- Shapes are processed per slide in top-to-bottom, left-to-right reading order.
- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
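
The per-slide reading order amounts to sorting shapes by their vertical offset first and horizontal offset second, assuming shapes expose `top`/`left` positions as python-pptx shapes do. A sketch using plain dicts in place of shape objects:

```python
def reading_order(shapes):
    """Sort shapes top-to-bottom, then left-to-right within a row.

    Each shape is assumed to expose `top` and `left` offsets (EMUs in
    python-pptx); dicts stand in for shape objects here.
    """
    return sorted(shapes, key=lambda s: (s["top"], s["left"]))
```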

### XLSX

- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
- Images are listed under a `### Images in this sheet:` section after the sheet's data table; they are not interleaved into the table rows.
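
The anchor-to-cell calculation can be sketched like this, assuming the anchor exposes zero-based column/row indices (as openpyxl image anchors do). The base-26 letter arithmetic handles columns past `Z`:

```python
def cell_ref(col, row):
    """Convert zero-based (col, row) anchor indices to an A1-style reference."""
    letters = ""
    col += 1  # switch to 1-based for the base-26 letter arithmetic
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"

print(cell_ref(1, 2))  # B3
```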

### Output format

Every extracted OCR block is wrapped as:

```text
*[Image OCR]
<extracted text>
[End OCR]*
```
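
As a one-line helper, the wrapping looks like this (a sketch; the function name is not part of the plugin's public API):

```python
def wrap_ocr_block(text):
    """Wrap extracted text in the plugin's *[Image OCR] ... [End OCR]* markers."""
    return f"*[Image OCR]\n{text.strip()}\n[End OCR]*"
```

The leading `*` and trailing `*` render the markers in italics in markdown, which keeps OCR output visually distinct from the document's own text.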

## Troubleshooting

### OCR text missing from output

The most likely cause is a missing `llm_client` or `llm_model`. Verify:

```python
from openai import OpenAI
from markitdown import MarkItDown

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),  # required
    llm_model="gpt-4o",   # required
)
```

### Plugin not loading

Confirm the plugin is installed and discovered:

```bash
markitdown --list-plugins  # should show: ocr
```

### API errors

The plugin surfaces LLM API errors as warnings and continues conversion. Check your API key, your quota, and that the chosen model supports vision inputs.

## Development

### Running Tests

```bash
cd packages/markitdown-ocr
pytest tests/ -v
```

### Building from Source

```bash
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-ocr
pip install -e .
```

## Contributing

Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.

## License

MIT; see [LICENSE](LICENSE).

## Changelog

### 0.1.0 (Initial Release)

- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
- Full-page OCR fallback for scanned PDFs
- Context-aware inline text insertion
- Priority-based converter replacement (no code changes required)