Layout Documentation changes

JorjMcKie · jamie-lemon · commit baec425a63ad · 2025-11-26T16:15:00.000Z
diff --git a/docs/pymupdf-layout/index.rst b/docs/pymupdf-layout/index.rst
@@ -123,9 +123,9 @@ OCR support
 
 The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
  
-If a page contains no text at all, but is covered with an image or many vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart ordinary pictures (like photographies - which we don't want to OCR) from image-based text.
+If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, `OpenCV <https://pypi.org/project/opencv-python/>`_ is used to check whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
 
-If the page does contain text but contains too many unreadable characters (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
+If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
 
 For these heuristics to work we need both, an existing Tesseract installation and the availability of OpenCV in the Python environment. If either is missing, no OCR is attempted at all.
 
@@ -136,6 +136,6 @@ For these heuristics to work we need both, an existing Tesseract installation an
 |PyMuPDF Layout| and |PyMuPDF4LLM| parameter caveats
 -----------------------------------------------------
 
-If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in various areas. New methods become available and some features are no longer supported. Please visit `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. This web site is being kept up to date while we continue to work on improvements.
+If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior significantly in various areas. New methods become available and some features are no longer supported. Please visit `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for a detailed description of the changes. That page is being kept up to date while we continue to work on improvements.
 
 .. include:: ../footer.rst
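
Note on the "OCR support" hunk above: the decision rules it describes can be sketched roughly as follows. This is only an illustration of the documented behavior, not the library's implementation. The names ``PageStats``, ``OcrAction`` and ``choose_ocr_action`` and both threshold values are hypothetical, and "full page" for the no-text case is inferred from the contrast with "affected text areas only".

```python
from dataclasses import dataclass
from enum import Enum


class OcrAction(Enum):
    NONE = "none"
    FULL_PAGE = "full page"
    TEXT_AREAS_ONLY = "affected text areas only"


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not part of any real API."""
    char_count: int                 # characters found by normal text extraction
    unreadable_count: int           # characters like U+FFFD among them
    covered_by_graphics: bool       # images or character-sized vectors dominate
    text_probably_detectable: bool  # stand-in for the OpenCV-based check


def choose_ocr_action(stats, *, have_tesseract, have_opencv,
                      max_chars_for_no_text=10, max_unreadable_ratio=0.1):
    """Mirror the documented rules; both thresholds are made-up placeholders."""
    # Both Tesseract and OpenCV are required, otherwise no OCR is attempted.
    if not (have_tesseract and have_opencv):
        return OcrAction.NONE
    # (Roughly) no text: OCR only if the page is covered by graphics and
    # the detector thinks text is probably present.
    if stats.char_count <= max_chars_for_no_text:
        if stats.covered_by_graphics and stats.text_probably_detectable:
            return OcrAction.FULL_PAGE
        return OcrAction.NONE
    # Text exists but too much of it is unreadable: OCR those areas only.
    if stats.unreadable_count / stats.char_count > max_unreadable_ratio:
        return OcrAction.TEXT_AREAS_ONLY
    return OcrAction.NONE
```

For example, a scanned page with no extractable text would map to ``OcrAction.FULL_PAGE``, while a photograph (no detectable text) would map to ``OcrAction.NONE``.
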
diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst
@@ -19,6 +19,7 @@ The |PyMuPDF4LLM| API
 .. method:: to_markdown(doc: pymupdf.Document | str, *, \
     detect_bg_color: bool = True, \
     dpi: int = 150, \
+    ocr_dpi: int = 400, \
     embed_images: bool = False, \
     extract_words: bool = False, \
     filename: str | None = None, \
@@ -54,6 +55,8 @@ The |PyMuPDF4LLM| API
 
     :arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True` or `embed_images=True`. Default value is 150.
 
+    :arg int ocr_dpi: specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (little or no text, most of the page covered by images or character-like vectors, etc.). Larger values may increase OCR precision, but also increase memory requirements and processing time. There is also a risk of over-sharpening the image, which may decrease OCR precision. The default value should therefore be sufficiently high in most cases. **Only valid in "layout mode".**
+
     :arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text.
 
     :arg bool extract_words: a value of `True` enforces `page_chunks=True` and adds key "words" to each page dictionary. Its value is a list of words as delivered by PyMuPDF's `Page` method `get_text("words")`. The sequence of the words in this list is the same as the extracted text. **Ignored in "layout mode".**
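
Note on the new ``ocr_dpi`` argument: the memory cost it mentions grows with the square of the resolution. The sketch below estimates the size of an uncompressed page image at a given DPI. It is a back-of-the-envelope calculation only: ``render_cost`` is a hypothetical helper, and the assumption of 3 bytes per pixel (RGB, no alpha) may not match what the library renders internally.

```python
def render_cost(width_pt, height_pt, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, n_bytes) for an uncompressed page image.

    Page sizes are given in points (1 point = 1/72 inch).
    bytes_per_pixel=3 assumes RGB without alpha, which is an assumption.
    """
    scale = dpi / 72  # pixels per point
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


A4_PT = (595, 842)  # A4 page size in points

for dpi in (150, 400):
    w, h, n = render_cost(*A4_PT, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {n / 2**20:.1f} MiB uncompressed")
```

Going from 150 to 400 dpi multiplies the pixel count by about (400 / 150)², i.e. roughly a factor of 7, which is why much larger values can become expensive.
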