Commit 0ffea05

committed
upload v1.19.1
1 parent 62f5ba1 commit 0ffea05

20 files changed

Lines changed: 333 additions & 200 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
1-
# PyMuPDF 1.19.0
1+
# PyMuPDF 1.19.1
22

33
![logo](https://github.com/pymupdf/PyMuPDF/blob/master/demo/pymupdf.jpg)
44

5-
Release date: October 17, 2021
5+
Release date: October 23, 2021
66

77
On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![Downloads](https://static.pepy.tech/personalized-badge/pymupdf?period=total&units=international_system&left_color=black&right_color=orange&left_text=Downloads)](https://pepy.tech/project/pymupdf)
88

@@ -11,7 +11,7 @@ On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![Downloads]
1111

1212
# Introduction
1313

14-
PyMuPDF (current version 1.19.0) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
14+
PyMuPDF (current version 1.19.1) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
1515

1616
MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
1717

changes.rst

Lines changed: 18 additions & 5 deletions
@@ -1,6 +1,19 @@
11
Change Log
22
===========
33

4+
------
5+
6+
**Changes in Version 1.19.1**
7+
8+
* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct coordinates.
9+
10+
* **Changed** :meth:`Page.get_textpage_ocr` -- support specifying the desired OCR quality via parameter ``dpi``, support choice between full page OCR versus only OCRing displayed images.
11+
12+
* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. Similar change was applied to :meth:`Page.get_texttrace`.
13+
14+
* **Changed** :meth:`Page.get_text` to support a new parameter ``sort``. If set to ``True`` the output is conveniently sorted.
15+
16+
417
------
518

619
**Changes in Version 1.19.0**
@@ -9,21 +22,21 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr
922

1023
PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.
1124

12-
* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
13-
* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
25+
* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
26+
* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
1427
* All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.
1528

1629
A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.
1730

1831
A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.
1932

20-
* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
33+
* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
2134

2235
* **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.
2336

2437
* **Added** a **journalling facility** for PDF updates. This allows logging changes, undoing or redoing them, or saving the journal for later use. Refer to :meth:`Document.journal_enable` and friends.
2538

26-
* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.ocr_save` and :meth:`Pixmap.ocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
39+
* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
2740

2841
* **Added** :meth:`Page.get_textpage_ocr` which executes optical character recognition for the page, then extracts the results and stores them together with "normal" page content in a :ref:`TextPage`. Use or reuse this object in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text extraction methods have been extended to support a separately created textpage -- see next item.
2942

@@ -212,7 +225,7 @@ This is a bug fix version only. We are publishing early because of the potential
212225
* **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
213226
* **Fixed** issue `#844 <https://github.com/pymupdf/PyMuPDF/issues/844>`_.
214227
* **Fixed** issue `#838 <https://github.com/pymupdf/PyMuPDF/issues/838>`_.
215-
* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCR-ed text output (Tesseract, ABBYY).
228+
* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCRed text output (Tesseract, ABBYY).
216229
* **Fixed** issue `#818 <https://github.com/pymupdf/PyMuPDF/issues/818>`_.
217230
* **Fixed** issue `#814 <https://github.com/pymupdf/PyMuPDF/issues/814>`_.
218231
* **Added** :meth:`Document.get_page_labels` which returns a list of page label definitions of a PDF.

docs/changes.rst

Lines changed: 19 additions & 4 deletions
@@ -1,6 +1,21 @@
11
Change Log
22
===========
33

4+
------
5+
6+
**Changes in Version 1.19.1**
7+
8+
This is the first patch version to support MuPDF v1.19.0. Apart from one bug fix, it includes important improvements for OCR support and the option to **sort extracted text** to the standard reading order "from top-left to bottom-right".
9+
10+
* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct ``(x0, y0)`` coordinates.
11+
12+
* **Changed** :meth:`Page.get_textpage_ocr`: it now supports parameter ``dpi`` to control OCR quality. It is also possible to choose whether the **full page** should be OCRed or **only the images displayed** by the page.
13+
14+
* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. Similar change was applied to :meth:`Page.get_texttrace`.
15+
16+
* **Changed** :meth:`Page.get_text` to support a parameter ``sort``. If set to ``True`` the output is conveniently sorted.
17+
18+
419
------
520

621
**Changes in Version 1.19.0**
@@ -9,15 +24,15 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr
924

1025
PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.
1126

12-
* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
13-
* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
27+
* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
28+
* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
1429
* All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.
1530

1631
A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.
1732

1833
A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.
1934

20-
* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
35+
* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details.
2136

2237
* **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.
2338

@@ -212,7 +227,7 @@ This is a bug fix version only. We are publishing early because of the potential
212227
* **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
213228
* **Fixed** issue `#844 <https://github.com/pymupdf/PyMuPDF/issues/844>`_.
214229
* **Fixed** issue `#838 <https://github.com/pymupdf/PyMuPDF/issues/838>`_.
215-
* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCR-ed text output (Tesseract, ABBYY).
230+
* **Fixed** issue `#823 <https://github.com/pymupdf/PyMuPDF/issues/823>`_. More logic for better support of OCRed text output (Tesseract, ABBYY).
216231
* **Fixed** issue `#818 <https://github.com/pymupdf/PyMuPDF/issues/818>`_.
217232
* **Fixed** issue `#814 <https://github.com/pymupdf/PyMuPDF/issues/814>`_.
218233
* **Added** :meth:`Document.get_page_labels` which returns a list of page label definitions of a PDF.
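
The 1.19.1 entry above describes sorting extracted text into the reading order "from top-left to bottom-right". The following is a minimal sketch of what such an ordering could look like for "words" items, written in plain Python so that it runs without PyMuPDF. The sample tuples and the helper name ``sort_reading_order`` are hypothetical, and the sort key (bottom edge first, then left edge) is an assumption for illustration, not a statement of how :meth:`Page.get_text` implements ``sort=True``.

```python
# Hypothetical sample data: each tuple mimics the layout of a "words" item,
# (x0, y0, x1, y1, text, block_no, line_no, word_no).
words = [
    (200.0, 100.0, 240.0, 112.0, "world", 0, 0, 1),
    (72.0, 130.0, 110.0, 142.0, "second", 1, 0, 0),
    (72.0, 100.0, 100.0, 112.0, "Hello", 0, 0, 0),
]


def sort_reading_order(items):
    """Return items ordered top to bottom, then left to right."""
    # Assumption: compare the bottom edge (y1) first, then the left edge (x0).
    return sorted(items, key=lambda w: (w[3], w[0]))


ordered = sort_reading_order(words)
print([w[4] for w in ordered])  # prints ['Hello', 'world', 'second']
```

With a real page, the equivalent request would presumably be ``page.get_text("words", sort=True)``, using the ``sort`` parameter named in the change log.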

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@
4343
# built documents.
4444
#
4545
# The full version, including alpha/beta/rc tags.
46-
release = "1.19.0"
46+
release = "1.19.1"
4747

4848
# The short X.Y version
4949
version = release

docs/document.rst

Lines changed: 3 additions & 3 deletions
@@ -904,13 +904,13 @@ For details on **embedded files** refer to Appendix 3.
904904
* This list has no duplicate entries: the combination of :data:`xref`, *name* and *referencer* is unique.
905905
* In general, this is a superset of the fonts actually in use by this page. The PDF creator may e.g. have specified some global list, of which each page only makes partial use.
906906

907-
.. method:: get_page_text(pno, output="text")
907+
.. method:: get_page_text(pno, output="text", flags=3, textpage=None, sort=False)
908908

909909
Extracts the text of a page given its page number *pno* (zero-based). Invokes :meth:`Page.get_text`.
910910

911911
:arg int pno: page number, 0-based, any value *-inf < pno < page_count*.
912912

913-
:arg str output: A string specifying the requested output format: text, html, json or xml. Default is *text*.
913+
For other parameter refer to the page method.
914914

915915
:rtype: str
916916

@@ -1067,7 +1067,7 @@ For details on **embedded files** refer to Appendix 3.
10671067
:arg bool attached_files: Search for 'FileAttachment' annotations and remove the file content.
10681068
:arg bool clean_pages: Remove any comments from page painting sources. If this option is set to *False*, then this is also done for *hidden_text* and *redactions*.
10691069
:arg bool embedded_files: Remove embedded files.
1070-
:arg bool hidden_text: Remove OCR-ed text and invisible text [#f7]_.
1070+
:arg bool hidden_text: Remove OCRed text and invisible text [#f7]_.
10711071
:arg bool javascript: Remove JavaScript sources.
10721072
:arg bool metadata: Remove PDF standard metadata.
10731073
:arg bool redactions: Apply redaction annotations.

docs/faq.rst

Lines changed: 1 addition & 1 deletion
@@ -2134,7 +2134,7 @@ Cause
21342134
Solution
21352135
^^^^^^^^
21362136
1. Use layout preserving text extraction: ``python -m fitz gettext file.pdf``.
2137-
2. If other text extraction tools also don't work, then the only solution again is OCR-ing the page.
2137+
2. If other text extraction tools also don't work, then the only solution again is OCRing the page.
21382138

21392139
--------------------------
21402140

docs/functions.rst

Lines changed: 3 additions & 2 deletions
@@ -421,7 +421,8 @@ Yet others are handy, general-purpose utilities.
421421
.. method:: Page.get_texttrace()
422422
423423
* New in v1.18.16
424-
* Changed in v1.19.0
424+
* Changed in v1.19.0: added key "seqno".
425+
* Changed in v1.19.1: stroke and fill colors now always are either RGB or GRAY
425426
426427
Return low-level text information of the page. The method is available for **all** document types. The result is a list of Python dictionaries with the following content::
427428
@@ -471,7 +472,7 @@ Yet others are handy, general-purpose utilities.
471472
- 1: Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
472473
- 3: Ignored text -- equivalent to ``3 Tr`` (hidden text).
473474

474-
3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0,05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, respective text is represented by two consecutive spans -- which are identical in every aspect, except for their types, which are 0, resp 1. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing it for you.
475+
3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0,05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, respective text is represented by two consecutive spans -- which are identical in every aspect, except for their types, which are 0, resp 1. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing this for you.
475476
4. For data compactness, the character's unicode is provided here. Use built-in function ``chr()`` for the character itself.
476477
5. The alpha / opacity value of the span's text, ``0 <= opacity <= 1``, 0 is invisible text, 1 (100%) is intransparent. Depending in ``span["type"]``, interpret this value as *fill* opacity or, resp. *stroke* opacity.
477478
6. *(Changd in v1.19.0)* This value is equal / close to the width of ``char["bbox"]``. However, on occasion you may find a small delta. In particular, the bbox **height** value is always computed as if **"small glyph heights"** had been requested.

docs/installation.rst

Lines changed: 14 additions & 1 deletion
@@ -33,7 +33,7 @@ Step 2: Download and Generate PyMuPDF
3333
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3434
Download the sources from https://pypi.org/project/PyMuPDF/#files and decompress them.
3535

36-
Adjust the setup.py script when necessary. Especially make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is setting the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file, that contains these two keys having a list of folder names as values::
36+
Adjust the setup.py script when necessary. Especially make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is setting the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file, that contains a dictionary with these two keys having a list of folder names as values::
3737

3838
{
3939
"include_dirs": ["folder1", "folder2", "folder3", ...],
@@ -44,3 +44,16 @@ Now perform a *python setup.py install*.
4444

4545
.. note:: You can also install from sources of the Github repository. These **do not contain** the pre-generated files ``fitz.py`` or ``fitz_wrap.c``, which instead are generated by the installation script ``setup.py``. To use it, `SWIG <https://www.swig.org/>`_ must be installed on your system.
4646

47+
Step 3: Enable Tesseract-OCR Support
48+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
49+
With the above steps, PyMuPDF contains all logic to support OCR functions. Tesseract is however not a Python package, but separate software that must be installed on the system.
50+
51+
To use it, (Py-) MuPDF needs to be told the location of Tesseract's language support folder. This currently happens via storing this folder name in the environment variable ``"TESSDATA_PREFIX"``.
52+
53+
In Windows, a typical way to define this name is::
54+
55+
set TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR\tessdata
56+
57+
On Unix systems one might execute::
58+
59+
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
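
Because OCR support depends on ``TESSDATA_PREFIX`` being set as described in Step 3, it can help to check the variable from Python before calling any OCR method. The sketch below uses only the standard library. The helper name ``tessdata_status`` is hypothetical and not part of PyMuPDF, and the check only confirms that the folder exists, not that it contains usable language data.

```python
import os


def tessdata_status(env=None):
    """Return (folder, exists) for TESSDATA_PREFIX; folder is None if unset."""
    # Accept a mapping for testing; default to the real process environment.
    env = os.environ if env is None else env
    folder = env.get("TESSDATA_PREFIX")
    if not folder:
        return None, False
    return folder, os.path.isdir(folder)


folder, exists = tessdata_status()
if folder is None:
    print("TESSDATA_PREFIX is not set")
elif not exists:
    print(f"TESSDATA_PREFIX points to a missing folder: {folder}")
else:
    print(f"Using tessdata folder: {folder}")
```

Running this in the same shell session where the variable was set shows whether the Python process actually inherits it.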
