On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [](https://pepy.tech/project/pymupdf)
@@ -11,7 +11,7 @@ On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![Downloads]

 # Introduction

-PyMuPDF (current version 1.19.0) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
+PyMuPDF (current version 1.19.1) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.

 MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
changes.rst (18 additions & 5 deletions)
@@ -1,6 +1,19 @@
 Change Log
 ===========

+------
+
+**Changes in Version 1.19.1**
+
+* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct coordinates.
+
+* **Changed** :meth:`Page.get_textpage_ocr` -- support specifying the desired OCR quality via parameter ``dpi``, support choice between full page OCR versus only OCRing displayed images.
+
+* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. Similar change was applied to :meth:`Page.get_texttrace`.
+
+* **Changed** :meth:`Page.get_text` to support a new parameter ``sort``. If set to ``True`` the output is conveniently sorted.
+
+
 ------

 **Changes in Version 1.19.0**
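The 1.19.1 entry above states that :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` now convert colors to RGB tuples. The sketch below only illustrates what such a normalization can look like for gray, RGB and CMYK inputs. The helper ``to_rgb`` is hypothetical (it is not part of PyMuPDF), and the CMYK formula is the naive device conversion, which may differ from what MuPDF actually computes.

```python
def to_rgb(color):
    """Normalize a gray (1 value), RGB (3) or CMYK (4) float tuple to RGB.

    Hypothetical helper for illustration only, not PyMuPDF API.
    All components are expected in the range 0..1.
    """
    n = len(color)
    if n == 1:  # gray: repeat the single value for r, g and b
        g = color[0]
        return (g, g, g)
    if n == 3:  # already RGB
        return tuple(color)
    if n == 4:  # CMYK: naive conversion, an assumption of this sketch
        c, m, y, k = color
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    raise ValueError("expected 1, 3 or 4 color components")


print(to_rgb((0.5,)))
print(to_rgb((1, 0, 0, 0)))
```

With a uniform RGB representation, downstream code no longer needs to branch on the number of color components.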
@@ -9,21 +22,21 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr

 PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.

-* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
-* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
+* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
+* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
 * All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.

 A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.

 A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.

-* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.
+* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.

 * **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.

 * **Added** a **journalling facility** for PDF updates. This allows logging changes, undoing or redoing them, or saving the journal for later use. Refer to :meth:`Document.journal_enable` and friends.

-* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.ocr_save` and :meth:`Pixmap.ocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
+* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.

 * **Added** :meth:`Page.get_textpage_ocr` which executes optical character recognition for the page, then extracts the results and stores them together with "normal" page content in a :ref:`TextPage`. Use or reuse this object in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text extraction methods have been extended to support a separately created textpage -- see next item.
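The "open" rectangle idea in the geometry entry above can be made concrete with a small point-containment test. This is a sketch under a stated assumption: the changelog only says that not all corners and sides belong to the rectangle, and the code below assumes it is the right and bottom edges that are excluded. The function ``contains_point`` is hypothetical; the :ref:`Rect` section remains the authoritative definition.

```python
def contains_point(rect, point):
    """Check whether a point lies in a half-open rectangle.

    rect is (x0, y0, x1, y1), point is (x, y).
    Assumption of this sketch: the left and top edges belong to the
    rectangle, the right and bottom edges do not.
    """
    x0, y0, x1, y1 = rect
    x, y = point
    return x0 <= x < x1 and y0 <= y < y1


rect = (0, 0, 10, 10)
print(contains_point(rect, (0, 0)))    # top-left corner
print(contains_point(rect, (10, 10)))  # bottom-right corner
```

One consequence of this convention is that two rectangles sharing an edge have no point in common, so adjacent rectangles tile an area without overlap.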
@@ -212,7 +225,7 @@ This is a bug fix version only. We are publishing early because of the potential
 * **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
docs/changes.rst (19 additions & 4 deletions)
@@ -1,6 +1,21 @@
 Change Log
 ===========

+------
+
+**Changes in Version 1.19.1**
+
+This is the first patch version to support MuPDF v1.19.0. Apart from one bug fix, it includes important improvements for OCR support and the option to **sort extracted text** to the standard reading order "from top-left to bottom-right".
+
+* **Fixed** `#1328 <https://github.com/pymupdf/PyMuPDF/issues/1328>`_. "words" text extraction again returns correct ``(x0, y0)`` coordinates.
+
+* **Changed** :meth:`Page.get_textpage_ocr`: it now supports parameter ``dpi`` to control OCR quality. It is also possible to choose whether the **full page** should be OCRed or **only the images displayed** by the page.
+
+* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 <https://github.com/pymupdf/PyMuPDF/discussions/1332>`_. Similar change was applied to :meth:`Page.get_texttrace`.
+
+* **Changed** :meth:`Page.get_text` to support a parameter ``sort``. If set to ``True`` the output is conveniently sorted.
+
+
 ------

 **Changes in Version 1.19.0**
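The 1.19.1 notes above describe sorting extracted text into the reading order "from top-left to bottom-right". The sketch below shows one way such an ordering can be produced for word tuples. It assumes each word is a tuple starting with ``(x0, y0, x1, y1, text, ...)`` and uses the bottom coordinate followed by the left coordinate as sort key. Both the sample data and the choice of key are assumptions of this example, not a statement about how ``sort=True`` is implemented.

```python
# Hypothetical word tuples: (x0, y0, x1, y1, text, block_no, line_no, word_no)
words = [
    (120.0, 50.0, 160.0, 62.0, "world", 0, 0, 1),
    (72.0, 80.0, 110.0, 92.0, "next", 1, 0, 0),
    (72.0, 50.0, 110.0, 62.0, "Hello", 0, 0, 0),
]


def reading_order(words):
    """Return the words sorted top to bottom, then left to right.

    The key (y1, x0) is an assumption of this sketch: words on the same
    baseline share y1, so they end up ordered by their left edge.
    """
    return sorted(words, key=lambda w: (w[3], w[0]))


text = " ".join(w[4] for w in reading_order(words))
print(text)
```

A purely geometric key like this works for simple single-column pages. Multi-column layouts usually need extra grouping before sorting.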
@@ -9,15 +24,15 @@ This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It intr

 PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.

-* Supported images can be OCR-ed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
-* All supported document pages (i.e. not only PDFs), can be OCR-ed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCR-ing) that can be searched and extracted.
+* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer.
+* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
 * All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``.

 A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.

 A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image.

-* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.
+* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the rectangle. Please do read the :ref:`Rect` section for details.

 * **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either.
@@ -212,7 +227,7 @@ This is a bug fix version only. We are publishing early because of the potential
 * **Implemented** request `#843 <https://github.com/pymupdf/PyMuPDF/Discussions/843>`_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
docs/document.rst (3 additions & 3 deletions)
@@ -904,13 +904,13 @@ For details on **embedded files** refer to Appendix 3.
 * This list has no duplicate entries: the combination of :data:`xref`, *name* and *referencer* is unique.
 * In general, this is a superset of the fonts actually in use by this page. The PDF creator may e.g. have specified some global list, of which each page only makes partial use.
 Extracts the text of a page given its page number *pno* (zero-based). Invokes :meth:`Page.get_text`.

 :arg int pno: page number, 0-based, any value *-inf < pno < page_count*.

-:arg str output: A string specifying the requested output format: text, html, json or xml. Default is *text*.
+For other parameters, refer to the page method.

 :rtype: str
@@ -1067,7 +1067,7 @@ For details on **embedded files** refer to Appendix 3.
 :arg bool attached_files: Search for 'FileAttachment' annotations and remove the file content.
 :arg bool clean_pages: Remove any comments from page painting sources. If this option is set to *False*, then this is also done for *hidden_text* and *redactions*.
 :arg bool embedded_files: Remove embedded files.
-:arg bool hidden_text: Remove OCR-ed text and invisible text [#f7]_.
+:arg bool hidden_text: Remove OCRed text and invisible text [#f7]_.
docs/functions.rst (3 additions & 2 deletions)
@@ -421,7 +421,8 @@ Yet others are handy, general-purpose utilities.
 .. method:: Page.get_texttrace()

 * New in v1.18.16
-* Changed in v1.19.0
+* Changed in v1.19.0: added key "seqno".
+* Changed in v1.19.1: stroke and fill colors now always are either RGB or GRAY

 Return low-level text information of the page. The method is available for **all** document types. The result is a list of Python dictionaries with the following content::
@@ -471,7 +472,7 @@ Yet others are handy, general-purpose utilities.
 - 1: Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
 - 3: Ignored text -- equivalent to ``3 Tr`` (hidden text).

-3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0.05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, respective text is represented by two consecutive spans -- which are identical in every aspect, except for their types, which are 0, resp 1. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing it for you.
+3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of the character's border line. This value may not be provided at all with the text data. In this case, a value of 5% of the fontsize (``span["size"] * 0.05``) is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent span type for this case. Instead, respective text is represented by two consecutive spans -- which are identical in every aspect, except for their types, which are 0, resp 1. It is your responsibility to handle this type of situation - in :meth:`Page.get_text`, MuPDF is doing this for you.
 4. For data compactness, the character's unicode is provided here. Use built-in function ``chr()`` for the character itself.
 5. The alpha / opacity value of the span's text, ``0 <= opacity <= 1``, 0 is invisible text, 1 (100%) is opaque. Depending on ``span["type"]``, interpret this value as *fill* opacity or, resp. *stroke* opacity.
 6. *(Changed in v1.19.0)* This value is equal / close to the width of ``char["bbox"]``. However, on occasion you may find a small delta. In particular, the bbox **height** value is always computed as if **"small glyph heights"** had been requested.
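Notes 3 and 4 above describe three things a consumer of :meth:`Page.get_texttrace` may need to do: fall back to 5% of the font size when no line width is present, recognize "artificial" bold text as two consecutive spans of types 0 and 1, and turn unicode values into characters with ``chr()``. The sketch below works on plain dictionaries. The keys ``"type"`` and ``"size"`` are taken from the notes, while the key ``"linewidth"``, the layout of ``"chars"`` (unicode as first tuple item) and all three helper functions are assumptions made for illustration.

```python
def effective_linewidth(span):
    """Return the span's line width, or 5% of the font size if it is missing."""
    width = span.get("linewidth")  # key name is an assumption of this sketch
    if width is None:
        return span["size"] * 0.05
    return width


def is_fake_bold_pair(first, second):
    """True if two consecutive spans differ only in their types 0 and 1."""
    if first["type"] != 0 or second["type"] != 1:
        return False
    # Compare everything except the type.
    rest_a = {k: v for k, v in first.items() if k != "type"}
    rest_b = {k: v for k, v in second.items() if k != "type"}
    return rest_a == rest_b


def span_text(span):
    """Join the characters of a span; each char item starts with its unicode."""
    return "".join(chr(item[0]) for item in span["chars"])


filled = {"type": 0, "size": 12.0, "chars": [(72,), (105,)]}
stroked = dict(filled, type=1)
print(span_text(filled), is_fake_bold_pair(filled, stroked))
```

When such a pair is detected, a caller would typically keep only one of the two spans and mark its text as bold.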
docs/installation.rst (14 additions & 1 deletion)
@@ -33,7 +33,7 @@ Step 2: Download and Generate PyMuPDF
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Download the sources from https://pypi.org/project/PyMuPDF/#files and decompress them.

-Adjust the setup.py script when necessary. Especially make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is setting the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file, that contains these two keys having a list of folder names as values::
+Adjust the setup.py script when necessary. Especially make sure that ``include_dirs`` and ``library_dirs`` point to the folders of your MuPDF installation. The easiest way to do this is setting the environment variable ``"PYMUPDF_DIRS"`` to the name of a JSON file, that contains a dictionary with these two keys having a list of folder names as values::
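As a hedged illustration of the ``PYMUPDF_DIRS`` mechanism described above, the following sketch writes a JSON file holding a dictionary with the two keys ``include_dirs`` and ``library_dirs`` and points the environment variable at it. The folder names and the file name ``pymupdf_dirs.json`` are placeholders invented for this example; substitute the folders of your own MuPDF installation.

```python
import json
import os
import tempfile

# Placeholder folders, to be replaced by the real MuPDF locations.
dirs = {
    "include_dirs": ["/usr/local/include/mupdf", "/usr/local/include"],
    "library_dirs": ["/usr/local/lib"],
}

# Write the dictionary to a JSON file (a temporary folder is used here).
json_path = os.path.join(tempfile.mkdtemp(), "pymupdf_dirs.json")
with open(json_path, "w") as f:
    json.dump(dirs, f, indent=2)

# Make the file name available to processes started from this one.
os.environ["PYMUPDF_DIRS"] = json_path

# Read the file back the way a build script might, to verify its shape.
with open(os.environ["PYMUPDF_DIRS"]) as f:
    loaded = json.load(f)
print(sorted(loaded))
```

Setting ``os.environ`` only affects the current Python process and its children, so the build would have to be started from the same process, or the variable set in the shell instead.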
@@ -44,3 +44,16 @@ Now perform a *python setup.py install*.

 .. note:: You can also install from sources of the Github repository. These **do not contain** the pre-generated files ``fitz.py`` or ``fitz_wrap.c``, which instead are generated by the installation script ``setup.py``. To use it, `SWIG <https://www.swig.org/>`_ must be installed on your system.

+Step 3: Enable Tesseract-OCR Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With the above steps, PyMuPDF contains all logic to support OCR functions. Tesseract is however not a Python package, but separate software that must be installed on the system.
+
+To use it, (Py-) MuPDF needs to be told the location of Tesseract's language support folder. This currently happens via storing this folder name in the environment variable ``"TESSDATA_PREFIX"``.
+
+In Windows, a typical way to define this name is::
+
+    set TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR\tessdata
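Before calling any OCR function, a script can check whether ``TESSDATA_PREFIX`` is defined and names an existing folder. The sketch below is one way to do that with the standard library only. The helper ``tessdata_dir`` is hypothetical and not part of PyMuPDF, and it does not check whether the folder actually contains language data.

```python
import os


def tessdata_dir(environ=os.environ):
    """Return the folder named by TESSDATA_PREFIX, or None if it is unusable.

    Hypothetical helper: only checks that the variable is set and that
    it points to an existing directory.
    """
    folder = environ.get("TESSDATA_PREFIX")
    if not folder or not os.path.isdir(folder):
        return None
    return folder


folder = tessdata_dir()
if folder is None:
    print("TESSDATA_PREFIX is not set or does not name a folder")
else:
    print("Tesseract language data expected in", folder)
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary instead of the real process environment.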