pymupdf
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/changes.rst b/changes.rst
Lines changed: 18 additions & 5 deletions
diff --git a/docs/changes.rst b/docs/changes.rst
Lines changed: 19 additions & 4 deletions
diff --git a/docs/conf.py b/docs/conf.py
Lines changed: 1 addition & 1 deletion
diff --git a/docs/document.rst b/docs/document.rst
Lines changed: 3 additions & 3 deletions
diff --git a/docs/faq.rst b/docs/faq.rst
Lines changed: 1 addition & 1 deletion
diff --git a/docs/functions.rst b/docs/functions.rst
Lines changed: 3 additions & 2 deletions
diff --git a/docs/installation.rst b/docs/installation.rst
Lines changed: 14 additions & 1 deletion
@@ -1,8 +1,8 @@
-# PyMuPDF 1.19.0
+# PyMuPDF 1.19.1
 
 ![logo](https://github.com/pymupdf/PyMuPDF/blob/master/demo/pymupdf.jpg)
 
-Release date: October 17, 2021
+Release date: October 23, 2021
 
 On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![Downloads](https://static.pepy.tech/personalized-badge/pymupdf?period=total&units=international_system&left_color=black&right_color=orange&left_text=Downloads)](https://pepy.tech/project/pymupdf)
 
@@ -11,7 +11,7 @@ On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![Downloads]
 
 # Introduction
 
-PyMuPDF (current version 1.19.0) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
+PyMuPDF (current version 1.19.1) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
 
 MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
 
 
@@ -1,6 +1,19 @@
 Change Log
 ===========
 
+------
+
+**Changes in Version 1.19.1**
+
+* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct coordinates.
+
+* **Changed** :meth:`Page.get_textpage_ocr` -- support specifying the desired OCR quality via parameter ``dpi``, support choice between full page OCR versus only OCRing displayed images.
+
+* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. A similar change was applied to :meth:`Page.get_texttrace`.
+
+* **Changed** :meth:`Page.get_text` to support a new parameter ``sort``. If set to ``True``, the output is sorted in reading order, from top-left to bottom-right.
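
As a rough illustration of what "convert colors to RGB color tuples" means for callers, the sketch below maps GRAY and CMYK component tuples to RGB using the naive textbook formulas. The function name ``to_rgb`` is hypothetical, and MuPDF's own conversion may well differ (for instance through color management), so treat the resulting numbers as indicative only.

```python
def to_rgb(color):
    """Convert a GRAY, RGB or CMYK tuple of floats in [0, 1] to an RGB tuple.

    Naive formulas for illustration only; not MuPDF's actual conversion.
    """
    n = len(color)
    if n == 1:  # GRAY: repeat the single component three times
        g = color[0]
        return (g, g, g)
    if n == 3:  # already RGB: return unchanged
        return tuple(color)
    if n == 4:  # CMYK: simple subtractive model
        c, m, y, k = color
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    raise ValueError(f"unsupported number of color components: {n}")


print(to_rgb((0.5,)))        # a mid gray
print(to_rgb((0, 1, 1, 0)))  # CMYK with full magenta and yellow
```

Code that previously had to branch on the number of color components returned by these methods should, after this change, be able to assume three components.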
+
+
 ------
 
 **Changes in Version 1.19.0**
@@ -9,21 +22,21 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr
 
 PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.
 
-* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
-* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
+* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
+* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
 * All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.
 
 A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.
 
 A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.
 
-* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
+* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of as being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.
 
 * **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.
 
 * **Added** a **journalling facility** for PDF updates. This allows logging changes, undoing or redoing them, or saving the journal for later use. Refer to :meth:`Document.journal_enable` and friends.
 
-* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.ocr_save` and :meth:`Pixmap.ocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
+* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
 
 * **Added** :meth:`Page.get_textpage_ocr` which executes optical character recognition for the page, then extracts the results and stores them together with "normal" page content in a :ref:`TextPage`. Use or reuse this object in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text extraction methods have been extended to support a separately created textpage -- see next item.
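
For a feel of what the ``dpi`` parameter that 1.19.1 adds to :meth:`Page.get_textpage_ocr` implies: PDF page dimensions are measured in points of 1/72 inch, so rendering at a given resolution scales the page by ``dpi / 72``. The helper below is a hypothetical illustration of that arithmetic only; it does not call PyMuPDF and makes no claim about how the OCR image is produced internally.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`.

    Page dimensions are given in points (1 point = 1/72 inch).
    """
    zoom = dpi / 72.0
    return round(width_pt * zoom), round(height_pt * zoom)


# A US Letter page (612 x 792 points) at two resolutions.
print(pixel_size(612, 792, 72))
print(pixel_size(612, 792, 300))
```

Higher values give the OCR engine more pixels per character at the cost of memory and run time, which is presumably the quality trade-off the parameter exposes.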
 
@@ -212,7 +225,7 @@ This is a bug fix version only. We are publishing early because of the potential
 * **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
 * **Fixed** issue `#844 <https://github.com/pymupdf/PyMuPDF/issues/844>`_.
 * **Fixed** issue `#838 <https://github.com/pymupdf/PyMuPDF/issues/838>`_.
-* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCR-ed text output (Tesseract, ABBYY).
+* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCRed text output (Tesseract, ABBYY).
 * **Fixed** issue `#818 <https://github.com/pymupdf/PyMuPDF/issues/818>`_.
 * **Fixed** issue `#814 <https://github.com/pymupdf/PyMuPDF/issues/814>`_.
 * **Added** :meth:`Document.get_page_labels` which returns a list of page label definitions of a PDF.
 
@@ -1,6 +1,21 @@
 Change Log
 ===========
 
+------
+
+**Changes in Version 1.19.1**
+
+This is the first patch version to support MuPDF v1.19.0. Apart from one bug fix, it includes important improvements for OCR support and the option to **sort extracted text** to the standard reading order "from top-left to bottom-right".
+
+* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct ``(x0, y0)`` coordinates.
+
+* **Changed** :meth:`Page.get_textpage_ocr`: it now supports parameter ``dpi`` to control OCR quality. It is also possible to choose whether the **full page** should be OCRed or **only the images displayed** by the page.
+
+* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. A similar change was applied to :meth:`Page.get_texttrace`.
+
+* **Changed** :meth:`Page.get_text` to support a parameter ``sort``. If set to ``True``, the output is sorted in reading order, from top-left to bottom-right.
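
One plausible reading of "from top-left to bottom-right" is ordering items by vertical position first and horizontal position second. The sketch below applies that idea to word tuples shaped like ``(x0, y0, x1, y1, text)``. The sample data, the simplified tuple layout and the choice of sort key (bottom edge, then left edge) are assumptions made for illustration, not necessarily what ``sort=True`` does internally.

```python
# Sample words in arbitrary order; coordinates are made up for this sketch.
words = [
    (150.0, 100.0, 190.0, 112.0, "world"),
    (72.0, 130.0, 110.0, 142.0, "second"),
    (72.0, 100.0, 140.0, 112.0, "Hello"),
]


def reading_order(items):
    """Sort by bottom edge (y1) first, then by left edge (x0)."""
    return sorted(items, key=lambda w: (w[3], w[0]))


ordered = [w[4] for w in reading_order(words)]
print(ordered)
```

With PyMuPDF 1.19.1 itself, the equivalent request would be ``page.get_text("words", sort=True)``.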
+
+
 ------
 
 **Changes in Version 1.19.0**
@@ -9,15 +24,15 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr
 
 PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.
 
-* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
-* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
+* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
+* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
 * All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.
 
 A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.
 
 A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.
 
-* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
+* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of as being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.
 
 * **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.
 
@@ -212,7 +227,7 @@ This is a bug fix version only. We are publishing early because of the potential
 * **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
 * **Fixed** issue `#844 <https://github.com/pymupdf/PyMuPDF/issues/844>`_.
 * **Fixed** issue `#838 <https://github.com/pymupdf/PyMuPDF/issues/838>`_.
-* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCR-ed text output (Tesseract, ABBYY).
+* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCRed text output (Tesseract, ABBYY).
 * **Fixed** issue `#818 <https://github.com/pymupdf/PyMuPDF/issues/818>`_.
 * **Fixed** issue `#814 <https://github.com/pymupdf/PyMuPDF/issues/814>`_.
 * **Added** :meth:`Document.get_page_labels` which returns a list of page label definitions of a PDF.
 
@@ -43,7 +43,7 @@
 # built documents.
 #
 # The full version, including alpha/beta/rc tags.
-release = "1.19.0"
+release = "1.19.1"
 
 # The short X.Y version
 version = release
 
@@ -904,13 +904,13 @@ For details on **embedded files** refer to Appendix 3.
           * This list has no duplicate entries: the combination of :data:`xref`, *name* and *referencer* is unique.
           * In general, this is a superset of the fonts actually in use by this page. The PDF creator may e.g. have specified some global list, of which each page only makes partial use.
 
-    .. method:: get_page_text(pno, output="text")
+    .. method:: get_page_text(pno, output="text", flags=3, textpage=None, sort=False)
 
       Extracts the text of a page given its page number *pno* (zero-based). Invokes :meth:`Page.get_text`.
 
       :arg int pno: page number, 0-based, any value *-inf < pno < page_count*.
 
-      :arg str output: A string specifying the requested output format: text, html, json or xml. Default is *text*.
+      For the other parameters, refer to the page method.
 
       :rtype: str
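
The range ``-inf < pno < page_count`` suggests that negative page numbers are accepted and counted backwards from the end of the document. A minimal sketch of such a normalization follows; the helper ``normalize_pno`` is hypothetical and merely stands in for whatever the library does internally.

```python
def normalize_pno(pno, page_count):
    """Map any page number below page_count onto the range 0 .. page_count - 1."""
    if page_count <= 0:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError("page number out of range")
    # Python's modulo yields a non-negative result for a positive divisor,
    # so -1 maps to the last page, -2 to the one before it, and so on.
    return pno % page_count


print(normalize_pno(-1, 10))
print(normalize_pno(3, 10))
```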
 
@@ -1067,7 +1067,7 @@ For details on **embedded files** refer to Appendix 3.
       :arg bool attached_files: Search for 'FileAttachment' annotations and remove the file content.
       :arg bool clean_pages: Remove any comments from page painting sources. If this option is set to *False*, then this is also done for *hidden_text* and *redactions*.
       :arg bool embedded_files: Remove embedded files.
-      :arg bool hidden_text: Remove OCR-ed text and invisible text [#f7]_.
+      :arg bool hidden_text: Remove OCRed text and invisible text [#f7]_.
       :arg bool javascript: Remove JavaScript sources.
       :arg bool metadata: Remove PDF standard metadata.
       :arg bool redactions: Apply redaction annotations.
 
@@ -2134,7 +2134,7 @@ Cause
 Solution
 ^^^^^^^^
 1. Use layout preserving text extraction: ``python -m fitz gettext file.pdf``.
-2. If other text extraction tools also don't work, then the only solution again is OCR-ing the page.
+2. If other text extraction tools also don't work, then the only solution again is OCRing the page.
 
 --------------------------
 
 
@@ -421,7 +421,8 @@ Yet others are handy, general-purpose utilities.
    .. method:: Page.get_texttrace()
 
       * New in v1.18.16
-      * Changed in v1.19.0
+      * Changed in v1.19.0: added key "seqno".
+      * Changed in v1.19.1: stroke and fill colors are now always either RGB or GRAY.
 
       Return low-level text information of the page. The method is available for **all** document types. The result is a list of Python dictionaries with the following content::
 
@@ -471,7 +472,7 @@ Yet others are handy, general-purpose utilities.
          - 1: Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
          - 3: Ignored text -- equivalent to ``3 Tr`` (hidden text).
 
-      3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0,05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, respective text is represented by two consecutive spans -- which are identical in every aspect, except for their types, which are 0, resp 1. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing it for you.
+      3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0.05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, the respective text is represented by two consecutive spans which are identical in every aspect except for their types, which are 0 and 1 respectively. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing this for you.
       4. For data compactness, the character's unicode is provided here. Use built-in function ``chr()`` for the character itself.
       5. The alpha / opacity value of the span's text, ``0 <= opacity <= 1``: 0 is invisible text, 1 (100%) is opaque. Depending on ``span["type"]``, interpret this value as *fill* opacity or *stroke* opacity, respectively.
       6. *(Changed in v1.19.0)* This value is equal / close to the width of ``char["bbox"]``. However, on occasion you may find a small delta. In particular, the bbox **height** value is always computed as if **"small glyph heights"** had been requested.
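
Notes 3 and 4 leave two chores to the caller: recognizing "artificial" bold text delivered as a type 0 span followed by an otherwise identical type 1 span, and turning unicode numbers into characters with ``chr()``. The sketch below shows one way to do both. The span dictionaries are a simplified, made-up subset (the ``"unicodes"`` list and the ``"fake_bold"`` marker are hypothetical names), and real spans may differ in further fields, so the equality test may need loosening in practice.

```python
def differs_only_in_type(a, b):
    """True if two span dicts are equal once their "type" entries are ignored."""
    strip = lambda s: {k: v for k, v in s.items() if k != "type"}
    return strip(a) == strip(b)


def collapse_fake_bold(spans):
    """Merge each (type 0, type 1) pair of otherwise identical spans into one."""
    merged = []
    i = 0
    while i < len(spans):
        cur = spans[i]
        nxt = spans[i + 1] if i + 1 < len(spans) else None
        if (
            nxt is not None
            and cur.get("type") == 0
            and nxt.get("type") == 1
            and differs_only_in_type(cur, nxt)
        ):
            merged.append(dict(cur, fake_bold=True))  # hypothetical marker key
            i += 2  # skip the stroked twin
        else:
            merged.append(cur)
            i += 1
    return merged


# Sample data; the keys shown are a simplified subset chosen for this sketch.
spans = [
    {"font": "Helvetica", "size": 12.0, "type": 0, "unicodes": [66, 111, 108, 100]},
    {"font": "Helvetica", "size": 12.0, "type": 1, "unicodes": [66, 111, 108, 100]},
    {"font": "Helvetica", "size": 12.0, "type": 0, "unicodes": [116, 101, 120, 116]},
]
result = collapse_fake_bold(spans)
# Note 4: unicode numbers become characters via the built-in chr().
texts = ["".join(chr(u) for u in s["unicodes"]) for s in result]
print(texts)
```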
 
@@ -33,7 +33,7 @@ Step 2: Download and Generate PyMuPDF
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Download the sources from https://pypi.org/project/PyMuPDF/#files and decompress them.
 
-Adjust the setup.py script when necessary. Especially make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is setting the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file, that contains these two keys having a list of folder names as values::
+Adjust the setup.py script when necessary. In particular, make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is to set the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file that contains a dictionary with these two keys, each having a list of folder names as its value::
 
     {
       "include_dirs": ["folder1", "folder2", "folder3", ...],
@@ -44,3 +44,16 @@ Now perform a *python setup.py install*.
 
 .. note:: You can also install from sources of the Github repository. These **do not contain** the pre-generated files ``fitz.py`` or ``fitz_wrap.c``, which instead are generated by the installation script ``setup.py``. To use it, `SWIG <https://www.swig.org/>`_ must be installed on your system.
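
Returning to the ``PYMUPDF_DIRS`` file from Step 2: it can be written by hand, or generated with a few lines of Python as sketched below. The helper name, the file location and the folder names are placeholders chosen for this example and must be replaced by the paths of your actual MuPDF installation.

```python
import json
import os
import tempfile


def write_dirs_file(path, include_dirs, library_dirs):
    """Write a JSON dictionary with the two keys described in Step 2."""
    config = {
        "include_dirs": list(include_dirs),
        "library_dirs": list(library_dirs),
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config


# Placeholder paths; adjust them to your system.
path = os.path.join(tempfile.gettempdir(), "pymupdf_dirs.json")
write_dirs_file(path, ["/opt/mupdf/include"], ["/opt/mupdf/build/release"])

# Only affects this process and any child processes it starts.
os.environ["PYMUPDF_DIRS"] = path
```

Because the variable is set only for the current process, the installation would have to be launched from the same process (or the variable exported in the shell instead).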
 
+Step 3: Enable Tesseract-OCR Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With the above steps, PyMuPDF contains all logic to support OCR functions. Tesseract is however not a Python package, but separate software that must be installed on the system.
+
+To use it, (Py-) MuPDF needs to be told the location of Tesseract's language support folder. This is currently done by storing the folder name in the environment variable ``"TESSDATA_PREFIX"``.
+
+In Windows, a typical way to define this name is::
+
+    set TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR\tessdata
+
+On Unix systems one might execute::
+
+    export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
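
Before calling any OCR function, a script may want to verify that the variable is actually set and points to an existing folder. The following is a minimal sketch using only the standard library; the helper name ``find_tessdata`` is hypothetical, and it does not check that the folder really contains language data.

```python
import os


def find_tessdata(environ=os.environ):
    """Return the folder named by TESSDATA_PREFIX, or None if unset or missing."""
    folder = environ.get("TESSDATA_PREFIX")
    if folder and os.path.isdir(folder):
        return folder
    return None


if find_tessdata() is None:
    print("TESSDATA_PREFIX is not set or does not point to an existing folder")
```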