Skip to content

Commit baec425

Browse files
JorjMcKiejamie-lemon
authored andcommitted
Layout Documentation changes
1 parent 8901c4e commit baec425

File tree

2 files changed

+6
-3
lines changed

2 files changed

+6
-3
lines changed

docs/pymupdf-layout/index.rst

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -123,9 +123,9 @@ OCR support
123123

124124
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
125125

126-
If a page contains no text at all, but is covered with an image or many vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart ordinary pictures (like photographies - which we don't want to OCR) from image-based text.
126+
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographies).
127127

128-
If the page does contain text but contains too many unreadable characters (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
128+
If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
129129

130130
For these heuristics to work we need both, an existing Tesseract installation and the availability of OpenCV in the Python environment. If either is missing, no OCR is attempted at all.
131131

@@ -136,6 +136,6 @@ For these heuristics to work we need both, an existing Tesseract installation an
136136
|PyMuPDF Layout| and |PyMuPDF4LLM| parameter caveats
137137
-----------------------------------------------------
138138

139-
If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in various areas. New methods become available and some features are no longer supported. Please visit `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. This web site is being kept up to date while we continue to work on improvements.
139+
If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in various areas quite significantly. New methods become available and also some features are no longer supported. Please visit `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. That web site is being kept up to date while we continue to work on improvements.
140140

141141
.. include:: ../footer.rst

docs/pymupdf4llm/api.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ The |PyMuPDF4LLM| API
1919
.. method:: to_markdown(doc: pymupdf.Document | str, *, \
2020
detect_bg_color: bool = True, \
2121
dpi: int = 150, \
22+
ocr_dpi: int = 400, \
2223
embed_images: bool = False, \
2324
extract_words: bool = False, \
2425
filename: str | None = None, \
@@ -54,6 +55,8 @@ The |PyMuPDF4LLM| API
5455

5556
:arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True` or `embed_images=True`. Default value is 150.
5657

58+
:arg int ocr_dpi: specify the desired image resolution in dots per inch for applying OCR to the intermdeiate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Large values may increase the OCR precison but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high. **Only valid in "layout mode".**
59+
5760
:arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text.
5861

5962
:arg bool extract_words: a value of `True` enforces `page_chunks=True` and adds key "words" to each page dictionary. Its value is a list of words as delivered by PyMuPDF's `Page` method `get_text("words")`. The sequence of the words in this list is the same as the extracted text. **Ignored in "layout mode".**

0 commit comments

Comments
 (0)