Commit ff863f2

Merge branch 'main' into fix/issue-1505-unicode-decode-error
2 parents d14f9b0 + 63cbbd9 commit ff863f2

50 files changed, +2871 -26 lines changed

README.md

Lines changed: 33 additions & 1 deletion
@@ -9,7 +9,7 @@

 > [!IMPORTANT]
 > Breaking changes between 0.0.1 to 0.1.0:
-> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
+> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
 > * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
 > * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.

@@ -132,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf

 To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.

+#### markitdown-ocr Plugin
+
+The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.
+
+**Installation:**
+
+```bash
+pip install markitdown-ocr
+pip install openai # or any OpenAI-compatible client
+```
+
+**Usage:**
+
+Pass the same `llm_client` and `llm_model` you would use for image descriptions:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+)
+result = md.convert("document_with_images.pdf")
+print(result.text_content)
+```
+
+If no `llm_client` is provided the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.
+
+See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
+
 ### Azure Document Intelligence

 To use Microsoft Document Intelligence for conversion:

packages/markitdown-mcp/README.md

Lines changed: 8 additions & 4 deletions
@@ -1,5 +1,9 @@
 # MarkItDown-MCP

+> [!IMPORTANT]
+> The MarkItDown-MCP package is meant for **local use**, with local trusted agents. In particular, when running the MCP server with Streamable HTTP or SSE, it binds to `localhost` by default, and is not exposed to other machines on the network or Internet. In this configuration, it is meant to be a direct alternative to the STDIO transport, which may be more convenient in some cases. DO NOT bind the server to other interfaces unless you understand the [security implications](#security-considerations) of doing so.
+
+
 [![PyPI](https://img.shields.io/pypi/v/markitdown-mcp.svg)](https://pypi.org/project/markitdown-mcp/)
 ![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown-mcp)
 [![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
@@ -18,14 +22,14 @@ pip install markitdown-mcp

 ## Usage

-To run the MCP server, using STDIO (default) use the following command:
+To run the MCP server, using STDIO (default), use the following command:


 ```bash
 markitdown-mcp
 ```

-To run the MCP server, using Streamable HTTP and SSE use the following command:
+To run the MCP server, using Streamable HTTP and SSE, use the following command:

 ```bash
 markitdown-mcp --http --host 127.0.0.1 --port 3001
@@ -96,7 +100,7 @@ If you want to mount a directory, adjust it accordingly:

 ## Debugging

-To debug the MCP server you can use the `mcpinspector` tool.
+To debug the MCP server you can use the `MCP Inspector` tool.

 ```bash
 npx @modelcontextprotocol/inspector
@@ -127,7 +131,7 @@ Finally:

 ## Security Considerations

-The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
+The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, the server binds by default to `localhost`. Even still, it is important to recognize that the server can be accessed by any process or users on the same local machine, and that the `convert_to_markdown` tool can be used to read any file that the server's user has access to, or any data from the network. If you require additional security, consider running the server in a sandboxed environment, such as a virtual machine or container, and ensure that the user permissions are properly configured to limit access to sensitive files and network segments. Above all, DO NOT bind the server to other interfaces (non-localhost) unless you understand the security implications of doing so.

 ## Trademarks

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.0.1a4"
+__version__ = "0.0.1a5"

packages/markitdown-mcp/src/markitdown_mcp/__main__.py

Lines changed: 14 additions & 1 deletion
@@ -113,10 +113,23 @@ def main():
         sys.exit(1)

     if use_http:
+        host = args.host if args.host else "127.0.0.1"
+        if args.host and args.host not in ("127.0.0.1", "localhost"):
+            print(
+                "\n"
+                "WARNING: The server is being bound to a non-localhost interface "
+                f"({host}).\n"
+                "This exposes the server to other machines on the network or Internet.\n"
+                "The server has NO authentication and runs with your user's privileges.\n"
+                "Any process or user that can reach this interface can read files and\n"
+                "fetch network resources accessible to this user.\n"
+                "Only proceed if you understand the security implications.\n",
+                file=sys.stderr,
+            )
         starlette_app = create_starlette_app(mcp_server, debug=True)
         uvicorn.run(
             starlette_app,
-            host=args.host if args.host else "127.0.0.1",
+            host=host,
             port=args.port if args.port else 3001,
         )
     else:
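
Review note: the new check compares the host string against two literal values, `"127.0.0.1"` and `"localhost"`. A minimal sketch of the same idea as a standalone function follows, so it can be exercised without starting a server. The names `is_loopback_host` and `bind_warning` are hypothetical, and using `ipaddress` to also recognise `::1` and the rest of `127.0.0.0/8` is an assumption for illustration, not something the commit does.

```python
import ipaddress
from typing import Optional


def is_loopback_host(host: str) -> bool:
    """Return True if `host` names a loopback interface.

    Hypothetical helper, not part of markitdown-mcp. The commit itself
    only compares against "127.0.0.1" and "localhost".
    """
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-loopback.
        return False


def bind_warning(host: str) -> Optional[str]:
    """Return a warning message for non-loopback hosts, otherwise None."""
    if is_loopback_host(host):
        return None
    return (
        f"WARNING: binding to non-localhost interface ({host}); "
        "the server has no authentication."
    )
```

With this variant, `--host ::1` would not trigger the warning, whereas under the literal comparison in the commit it would. Whether that is desirable is a judgment call for the maintainers.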

packages/markitdown-ocr/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) Microsoft Corporation.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE

packages/markitdown-ocr/README.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
+# MarkItDown OCR Plugin
+
+LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.
+
+Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.
+
+## Features
+
+- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
+- **Enhanced DOCX Converter**: OCR for images in Word documents
+- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
+- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
+- **Context Preservation**: Maintains document structure and flow when inserting extracted text
+
+## Installation
+
+```bash
+pip install markitdown-ocr
+```
+
+The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:
+
+```bash
+pip install openai
+```
+
+## Usage
+
+### Command Line
+
+```bash
+markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
+```
+
+### Python API
+
+Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:
+
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+)
+
+result = md.convert("document_with_images.pdf")
+print(result.text_content)
+```
+
+If no `llm_client` is provided the plugin still loads, but OCR is silently skipped — falling back to the standard built-in converter.
+
+### Custom Prompt
+
+Override the default extraction prompt for specialized documents:
+
+```python
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),
+    llm_model="gpt-4o",
+    llm_prompt="Extract all text from this image, preserving table structure.",
+)
+```
+
+### Any OpenAI-Compatible Client
+
+Works with any client that follows the OpenAI API:
+
+```python
+from openai import AzureOpenAI
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=AzureOpenAI(
+        api_key="...",
+        azure_endpoint="https://your-resource.openai.azure.com/",
+        api_version="2024-02-01",
+    ),
+    llm_model="gpt-4o",
+)
+```
+
+## How It Works
+
+When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:
+
+1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
+2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
+3. The plugin creates an `LLMVisionOCRService` from those kwargs
+4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0
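
Review note: the priority mechanism in step 4 can be illustrated with a toy registry. This is a sketch under stated assumptions, not MarkItDown's actual registry: `ToyRegistry` and the converter names are hypothetical, and the only behaviour taken from the text above is that a lower priority value is tried before a higher one.

```python
from typing import List, Tuple


class ToyRegistry:
    """Toy stand-in for a converter registry (not MarkItDown's real class)."""

    def __init__(self) -> None:
        # Each entry is (priority, insertion order, converter name).
        self._entries: List[Tuple[float, int, str]] = []

    def register(self, name: str, priority: float = 0.0) -> None:
        # The insertion counter makes ordering deterministic for equal
        # priorities in this toy; the real tie-breaking rule may differ.
        self._entries.append((priority, len(self._entries), name))

    def ordered(self) -> List[str]:
        # Lower priority values are tried first.
        return [name for _, _, name in sorted(self._entries)]


registry = ToyRegistry()
registry.register("builtin-pdf", priority=0.0)
registry.register("ocr-pdf", priority=-1.0)
print(registry.ordered())  # ['ocr-pdf', 'builtin-pdf']
```

Because the OCR converter sorts ahead of the built-in one, it gets the first chance to accept a file, which is how the plugin replaces built-in behaviour without code changes elsewhere.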
+
+When a file is converted:
+
+1. The OCR converter accepts the file
+2. It extracts embedded images from the document
+3. Each image is sent to the LLM with an extraction prompt
+4. The returned text is inserted inline, preserving document structure
+5. If the LLM call fails, conversion continues without that image's text
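
Review note: steps 3 to 5 describe a per-image loop that tolerates individual failures. A minimal sketch of that pattern follows. The function names `ocr_images` and `fake_extract` are hypothetical, and the LLM call is replaced by a plain callable so the example runs without any client or API key.

```python
import warnings
from typing import Callable, Iterable, List


def ocr_images(images: Iterable[bytes], extract: Callable[[bytes], str]) -> List[str]:
    """Run `extract` on each image, skipping images whose call fails."""
    blocks: List[str] = []
    for index, image in enumerate(images):
        try:
            text = extract(image)
        except Exception as exc:  # any client or API error
            # Report the failure as a warning and carry on with the next image.
            warnings.warn(f"OCR failed for image {index}: {exc}")
            continue
        if text.strip():
            blocks.append(text.strip())
    return blocks


def fake_extract(image: bytes) -> str:
    """Stand-in for an LLM vision call; fails on one specific input."""
    if image == b"bad":
        raise RuntimeError("simulated API error")
    return image.decode("ascii").upper()
```

Calling `ocr_images([b"one", b"bad", b"two"], fake_extract)` returns the text for the first and third images and emits one warning for the second.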
+
+## Supported File Formats
+
+### PDF
+
+- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
+- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
+- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
+
+### DOCX
+
+- Images are extracted via document part relationships (`doc.part.rels`).
+- OCR is run before the DOCX→HTML→Markdown pipeline executes: placeholder tokens are injected into the HTML so that the markdown converter does not escape the OCR markers, and the final placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks after conversion.
+- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
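
Review note: the placeholder technique described for DOCX can be sketched in a few lines. The token format `OCRPLACEHOLDER0000` and both function names are assumptions made for this example, since the README does not say what the real tokens look like. The point is only that a purely alphanumeric token passes through a Markdown converter without being escaped and can be swapped for the formatted block afterwards.

```python
from typing import Dict, List


def make_placeholders(ocr_texts: List[str]) -> Dict[str, str]:
    """Map alphanumeric tokens to formatted OCR blocks.

    Zero-padded indices keep one token from being a prefix of another.
    """
    return {
        f"OCRPLACEHOLDER{i:04d}": f"*[Image OCR]\n{text}\n[End OCR]*"
        for i, text in enumerate(ocr_texts)
    }


def restore_placeholders(markdown: str, placeholders: Dict[str, str]) -> str:
    """Replace each token in the converted Markdown with its OCR block."""
    for token, block in placeholders.items():
        markdown = markdown.replace(token, block)
    return markdown


placeholders = make_placeholders(["Total: 42"])
token = next(iter(placeholders))
# Pretend this string came out of the HTML-to-Markdown step.
converted = f"# Report\n\n{token}\n"
print(restore_placeholders(converted, placeholders))
```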
+
+### PPTX
+
+- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
+- Shapes are processed in top-to-left reading order per slide.
+- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
+
+### XLSX
+
+- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
+- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
+- Images are listed under a `### Images in this sheet:` section after the sheet's data table — they are not interleaved into the table rows.
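
Review note: the column/row to Excel letter conversion mentioned for XLSX is a bijective base-26 encoding. A small sketch follows, assuming the anchor coordinates are zero-based. The helper names `column_letter` and `cell_reference` are hypothetical and not taken from the plugin.

```python
def column_letter(index: int) -> str:
    """Convert a zero-based column index to Excel letters (0 -> "A", 26 -> "AA")."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = index + 1  # switch to one-based for the bijective base-26 conversion
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_reference(col: int, row: int) -> str:
    """Build a reference such as "B3" from zero-based column and row indices."""
    return f"{column_letter(col)}{row + 1}"
```

For example, `cell_reference(1, 2)` gives `"B3"` and `column_letter(26)` gives `"AA"`.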
+
+### Output format
+
+Every extracted OCR block is wrapped as:
+
+```text
+*[Image OCR]
+<extracted text>
+[End OCR]*
+```
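
Review note: because every block uses the same wrapper, downstream code can pull the OCR text back out of the converted Markdown. A minimal sketch follows, assuming the markers appear exactly as shown above with a newline on each side of the text. The name `extract_ocr_blocks` is hypothetical.

```python
import re
from typing import List

# Non-greedy match so that consecutive blocks are captured separately.
OCR_BLOCK = re.compile(r"\*\[Image OCR\]\n(.*?)\n\[End OCR\]\*", re.DOTALL)


def extract_ocr_blocks(markdown: str) -> List[str]:
    """Return the text of every *[Image OCR]...[End OCR]* block, in order."""
    return OCR_BLOCK.findall(markdown)


sample = "Intro\n\n*[Image OCR]\nline one\nline two\n[End OCR]*\n\nOutro"
print(extract_ocr_blocks(sample))  # ['line one\nline two']
```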
+
+## Troubleshooting
+
+### OCR text missing from output
+
+The most likely cause is a missing `llm_client` or `llm_model`. Verify:
+
+```python
+from openai import OpenAI
+from markitdown import MarkItDown
+
+md = MarkItDown(
+    enable_plugins=True,
+    llm_client=OpenAI(),  # required
+    llm_model="gpt-4o",  # required
+)
+```
+
+### Plugin not loading
+
+Confirm the plugin is installed and discovered:
+
+```bash
+markitdown --list-plugins # should show: ocr
+```
+
+### API errors
+
+The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
+
+## Development
+
+### Running Tests
+
+```bash
+cd packages/markitdown-ocr
+pytest tests/ -v
+```
+
+### Building from Source
+
+```bash
+git clone https://github.com/microsoft/markitdown.git
+cd markitdown/packages/markitdown-ocr
+pip install -e .
+```
+
+## Contributing
+
+Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.
+
+## License
+
+MIT — see [LICENSE](LICENSE).
+
+## Changelog
+
+### 0.1.0 (Initial Release)
+
+- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
+- Full-page OCR fallback for scanned PDFs
+- Context-aware inline text insertion
+- Priority-based converter replacement (no code changes required)
