Commit 82fb1fc

Debugging Documentation
Corrections to document fitz.Matrix and updates on performance.
1 parent 51d7d9b commit 82fb1fc

75 files changed

+880
-758
lines changed

doc/PyMuPDF.pdf

6.28 KB
Binary file not shown.

doc/html/.buildinfo

+1-1
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
11
# Sphinx build info version 1
22
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
3-
config: 444d4db8bf57147137dc4b3675fd4de0
3+
config: 5870aaa6ecde0074afae47aa34341ab0
44
tags: 645f666f9bcd5a90fca523b33c5a78b7

doc/html/.doctrees/app1.doctree

45 KB
Binary file not shown.

doc/html/.doctrees/app2.doctree

29.8 KB
Binary file not shown.

doc/html/.doctrees/changes.doctree

9.95 KB
Binary file not shown.

doc/html/.doctrees/classes.doctree

3.14 KB
Binary file not shown.

doc/html/.doctrees/colorspace.doctree

8.74 KB
Binary file not shown.

doc/html/.doctrees/device.doctree

13.1 KB
Binary file not shown.
18.6 KB
Binary file not shown.

doc/html/.doctrees/document.doctree

96.3 KB
Binary file not shown.

doc/html/.doctrees/environment.pickle

55.3 KB
Binary file not shown.

doc/html/.doctrees/functions.doctree

11.2 KB
Binary file not shown.

doc/html/.doctrees/identity.doctree

6.7 KB
Binary file not shown.

doc/html/.doctrees/index.doctree

3.53 KB
Binary file not shown.
23.3 KB
Binary file not shown.

doc/html/.doctrees/intro.doctree

16.1 KB
Binary file not shown.

doc/html/.doctrees/irect.doctree

35.8 KB
Binary file not shown.

doc/html/.doctrees/link.doctree

14.7 KB
Binary file not shown.

doc/html/.doctrees/linkdest.doctree

38.5 KB
Binary file not shown.

doc/html/.doctrees/matrix.doctree

68.2 KB
Binary file not shown.

doc/html/.doctrees/outline.doctree

28.6 KB
Binary file not shown.

doc/html/.doctrees/page.doctree

46.1 KB
Binary file not shown.

doc/html/.doctrees/pixmap.doctree

124 KB
Binary file not shown.

doc/html/.doctrees/point.doctree

16.2 KB
Binary file not shown.

doc/html/.doctrees/rect.doctree

47 KB
Binary file not shown.

doc/html/.doctrees/textpage.doctree

28 KB
Binary file not shown.

doc/html/.doctrees/textsheet.doctree

3.77 KB
Binary file not shown.

doc/html/.doctrees/tutorial.doctree

95.3 KB
Binary file not shown.

doc/html/.doctrees/vars.doctree

32.3 KB
Binary file not shown.

doc/html/_images/render_speed.png

4.32 KB

doc/html/_images/textperformance.png

-2.79 KB

doc/html/_sources/app1.txt

+35-22
@@ -24,28 +24,28 @@ Here is the list of files we are using. Each file name is accompanied by further
2424
Part 1: Parsing
2525
~~~~~~~~~~~~~~~~
2626

27-
How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot directly be compared, because batch utilities always execute a requested task in one go, front to end. ``pdfrw`` too, has a ``lazy`` strategy for parsing, meaning it only parses those parts of a document that are required in any moment.
27+
How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot directly be compared, because batch utilities always execute a requested task completely, in one go, front to end. ``pdfrw``, too, has a ``lazy`` parsing strategy, meaning it only parses those parts of a document that are required at any given moment.
2828

29-
We therefore measure the time to copy a PDF file to an output file, and doing nothing else.
29+
To answer the question nonetheless, we measure the time each tool needs to copy a PDF file to an output file, doing nothing else.
3030

3131
**These were the tools**
3232

33-
All tools are either platform independant, or at least can run on Windows and Unix / Linux (pdftk).
33+
All tools are either platform independent, or can at least run on both Windows and Unix / Linux (pdftk).
3434

35-
**Poppler** is missing here, because it specifically is a Linux tool set, although we know there exist Windows ports (created with considerable effort apparently). Technically, it is a C/C++ library, for which a Python binding exists - in so far it is somewhat comparable to PyMuPDF. But Poppler in contrast is tightly coupled to **Qt** and **Cairo**. We may still include it in future, when a more handy Windows installation is available. We have seen however some `analysis <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_, that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also show a much lower performance of Poppler's PDF code base **Xpdf**.
35+
**Poppler** is missing here, because it specifically is a Linux tool set, although we know there exist Windows ports (created with considerable effort apparently). Technically, it is a C/C++ library, for which a Python binding exists - in that respect it is somewhat comparable to PyMuPDF. But Poppler in contrast is tightly coupled to **Qt** and **Cairo**. We may still include it in future, when a more handy Windows installation is available. We have seen however some `analysis <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_, that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also shows a much lower performance of Poppler's PDF code base **Xpdf**.
3636

3737
Image rendering of MuPDF also is about three times faster than the one of Xpdf when comparing the command line tools ``mudraw`` of MuPDF and ``pdftopng`` of Xpdf - see part 3 of this chapter.
3838

39-
========= =====================================================================
39+
========= ==========================================================================
4040
Tool Description
41-
========= =====================================================================
41+
========= ==========================================================================
4242
PyMuPDF tool of this manual, appearing as "fitz" in reports
43-
pdfrw a pure Python tool, can be used as frontend to ReportLab and rst2pdf
43+
pdfrw a pure Python tool, is being used by rst2pdf, has interface to ReportLab
4444
PyPDF2 a pure Python tool with a very complete function set
4545
pdftk a command line utility with numerous functions
46-
========= =====================================================================
46+
========= ==========================================================================
4747

48-
This is how each of the tools is being used with the test:
48+
This is how each of the tools was used:
4949

5050
**PyMuPDF**:
5151
::
@@ -81,17 +81,17 @@ If we leave out the Adobe manual, this table looks like
8181

8282
.. image:: copy_speed_2.png
8383

84-
PyMuPDF is by far the fastest: on average 2.4 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and 10 times faster than the command line tool pdftk.
84+
PyMuPDF is by far the fastest: on average 4.5 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and almost 20 times faster than the command line tool pdftk.
8585

86-
Where PyMuPDF only requires less than 24 seconds to process all files, pdftk affords itself almost 4 minutes.
86+
Where PyMuPDF requires less than 13 seconds to process all files, pdftk takes almost 4 minutes.
8787

88-
By far the slowest tool is PyPDF2 - it is more than 35 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's bad look comes from the Adobe manual. It obviously is slowed down by the linear file structure and the immense amount of bookmarks of this file. If we take out this special case, then PyPDF2 is 10.5 times slower than PyMuPDF, 4.4 times slower than pdfrw and 1.2 times slower than pdftk.
88+
By far the slowest tool is PyPDF2 - it is more than 66 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's bad look comes from the Adobe manual. It obviously is slowed down by the linear file structure and the immense amount of bookmarks of this file. If we take out this special case, then PyPDF2 is only 21.5 times slower than PyMuPDF, 4.5 times slower than pdfrw and 1.2 times slower than pdftk.
8989

9090
If we look at the output PDFs, there is one surprise:
9191

9292
Each tool created a PDF of similar size as the original. Apart from the Adobe case, PyMuPDF always created the smallest output.
9393

94-
Adobe's manual is an exception: The pure Python tools **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
94+
Adobe's manual is an exception: The pure Python tools pdfrw and PyPDF2 **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
9595

9696
PyMuPDF and pdftk in contrast **drastically increased** the size by 40% to about 50 MB (also no longer linearized).
9797

@@ -122,11 +122,9 @@ Here are the results using the same test files as above (again: decimal point an
122122

123123
.. image:: textperformance.png
124124

125-
Again, (Py-) MuPDF is the fastest around. It is two times faster than xpdf.
125+
Again, (Py-) MuPDF is the fastest around. It is between 2.3 and 2.6 times faster than xpdf.
126126

127-
JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version is 1.6 times faster.
128-
129-
``pdfminer``, as a pure Python solution, of course is comparatively slow: MuPDF is 75 (64, 58) times faster and xpdf is 37 times faster. These observations in order of magnitude coincide with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
127+
``pdfminer``, as a pure Python solution, of course is comparatively slow: MuPDF is 50 to 60 times faster and xpdf is 23 times faster. These observations in order of magnitude coincide with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
130128

131129

132130
.. raw:: pdf
@@ -136,19 +134,34 @@ JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version i
136134

137135
Part 3: Image Rendering
138136
~~~~~~~~~~~~~~~~~~~~~~~~
139-
We have tested rendering speed of MuPDF against the ``pdftopng.exe``, a command lind tool of the **Xpdf** toolset, which is the PDF code basis of **Poppler**.
137+
We have tested the rendering speed of MuPDF against ``pdftopng.exe``, a command line tool of the **Xpdf** toolset (the PDF code basis of **Poppler**).
140138

141-
MuPDF invocation using a resolution of 150 pixels (Xpdf default):
139+
**MuPDF invocation using a resolution of 150 pixels (Xpdf default):**
142140
::
143141
mutool draw -o t%d.png -r 150 file.pdf
144142

145-
146-
Xpdf invocation:
143+
**PyMuPDF invocation:**
144+
::
145+
zoom = 150.0 / 72.0
146+
mat = fitz.Matrix(1,1).preScale(zoom, zoom)
147+
def ProcessFile(datei):
148+
    print "processing:", datei
149+
    doc=fitz.Document(datei)
150+
    for i in range(doc.pageCount):
151+
        pix = doc.getPagePixmap(i, matrix=mat)
152+
        pix.writePNG("t-%s.png" % i)
153+
        pix = None
154+
    doc.close()
155+
    return
156+
157+
**Xpdf invocation:**
147158
::
148159
pdftopng.exe file.pdf ./
149160

150161
The resulting runtimes can be found here (again: meaning of decimal point and comma reversed):
151162

152163
.. image:: render_speed.png
153164

154-
MuPDF is between 2.7 and 4.7 (on average 3.0) times faster than Xpdf.
165+
* MuPDF and PyMuPDF are both about 3 times faster than Xpdf.
166+
167+
* The 2% speed difference between MuPDF (a utility written in C) and PyMuPDF is the Python overhead.
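
The zoom factor used in the PyMuPDF invocation above follows from the PDF unit of 72 points per inch: rendering at 150 pixels per inch means scaling by 150 / 72. The following is a minimal sketch of that arithmetic only; the helper names ``zoom_for_dpi`` and ``pixel_size`` are hypothetical and not part of PyMuPDF, and a real renderer may round the resulting dimensions differently.

```python
def zoom_for_dpi(dpi, base_dpi=72.0):
    """Scale factor that maps PDF points (1/72 inch) to pixels at `dpi`."""
    return dpi / base_dpi


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at `dpi` (rounded)."""
    zoom = zoom_for_dpi(dpi)
    return int(round(width_pt * zoom)), int(round(height_pt * zoom))


# Page size in points as it appears in the JSON sample of this documentation
# (595.2756 x 841.8898, i.e. an A4 page).
print(pixel_size(595.2756, 841.8898, 150))  # (1240, 1754)
```

At 150 dpi every page therefore holds roughly 4.3 times as many pixels as at the 72 dpi default, which is worth keeping in mind when comparing render timings taken at different resolutions.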

doc/html/_sources/app2.txt

+11-22
@@ -33,7 +33,9 @@ A **span** consists of characters with the same properties. E.g. a different fon
3333
Output of ``getText(output="text")``
3434
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3535

36-
This is the plain text output of a page of this tutorial's PDF version:
36+
This function extracts a page's plain **text in original order** as specified by the creator of the document (which may not be equal to a natural reading order!).
37+
38+
An example output of this tutorial's PDF version:
3739
::
3840
Tutorial
3941

@@ -47,7 +49,7 @@ This is the plain text output of a page of this tutorial's PDF version:
4749
Output of ``getText(output="html")``
4850
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4951

50-
The HTML version looks like this:
52+
HTML output reflects the structure of the page's ``TextPage`` - without adding much other benefit. Again an example:
5153
::
5254
<div class="page">
5355
<div class="block"><p>
@@ -65,7 +67,7 @@ The HTML version looks like this:
6567
Output of ``getText(output="json")``
6668
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6769

68-
JSON output looks like so:
70+
JSON output reflects the structure of a ``TextPage`` and provides position details (``bbox`` - boundary boxes in pixel units) for every block, line and span. This is enough information to present a page's text in any required reading order (e.g. from top-left to bottom-right). The output can be converted to a Python dictionary by ``text_dict = json.loads(text)``. Have a look at our example program ``PDF2textJS.py``. Here is what it looks like:
6971
::
7072
{
7173
"len":35,"width":595.2756,"height":841.8898,
@@ -98,7 +100,7 @@ JSON output looks like so:
98100
Output of ``getText(output="xml")``
99101
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
100102

101-
Now the XML version:
103+
The XML version goes considerably deeper in its level of detail: every single character is provided with its position detail, and every span also contains font information:
102104
::
103105
<page width="595.2756" height="841.8898">
104106
<block bbox="40.01575 53.730354 98.68775 76.08236">
@@ -131,27 +133,14 @@ Now the XML version:
131133
<char bbox="81.695755 79.300354 83.91576 93.04035" x="81.695755" y="90.050354" c="i"/>
132134
...
133135

134-
135-
Resource Requirements
136-
~~~~~~~~~~~~~~~~~~~~~
137-
The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
138-
139-
For testing performance, we have run several example PDFs through these methods and found the following information. This data is not statistically secured in any way - just take it as an idea for what you should expect to see.
140-
141-
As a low end example we took this manual's PDF version (45+ pages, text oriented, 500 KB). The high end case was Adobe's PDF manual (1310 pages, text oriented, 32 MB). The other test cases were `Spektrum <http://www.spektrum.de/>`_ magazines of the year 2015 (the German version of Scientific American, 100+ pages, text with lots of complex interspersed images, 10 to 25 MB each).
136+
The method's output can be processed by one of Python's XML modules. We have successfully tested ``lxml``. See the demo program ``fontlister.py``. It creates a list of all fonts of a document including font size and where used on pages.
142137

143138
Performance
144139
~~~~~~~~~~~~
145-
Performance of text extraction has improved significantly in MuPDF 1.8! As of updating this documentation (mid November 2015), data hint at an improvement factor greater than 2. Especially the complex extraction methods now have a much lower effort penalty.
146-
147-
On a higher level Win10 machine (8 processors at 4 GHz, 8 GB RAM), ``extractXML()`` needs anything between 0.2 and 0.5 seconds per page. This means that you can extract extremely detailed text information of a complex 100-page magazine in less than a minute. This is faster than some other free text extraction tools like e.g. `Nitro 3 <https://www.gonitro.com/pdf-reader>`_.
148-
149-
With ``PDF2TextJS.py`` of the example directory, you have a high performance text extraction utility with a high layout faithfulness!
140+
The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
150141

151-
Data Sizes
152-
~~~~~~~~~~~
153-
The sizes of the returned text strings follow this pattern (``extractText()`` is set to 1):
142+
To begin with, all four methods are **very** fast compared to what is available on the market. In terms of processing speed, we couldn't find a faster (free) tool.
154143

155-
``(Text : HTML : JSON : XML) ~ (1 : 4 : 6 : 87)``
144+
Relative to each other, ``xml`` is about 2 times slower than ``text``, and the other two range between them. E.g. ``json`` needs about 13% - 14% more time than ``text``.
156145

157-
The number 87 for ``extractXML()`` corresponds to values between 200 and 400 KB per page.
146+
Look into the previous chapter **Appendix 1** for more performance information.
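
The remark about JSON output carrying enough position detail to rebuild a reading order can be sketched as follows. This is not the code of ``PDF2textJS.py``; it is a minimal illustration under stated assumptions: the key names ``blocks``, ``lines``, ``spans``, ``bbox`` and ``text`` are assumed from the block / line / span structure described above, the sample data is made up, and the function ``text_in_reading_order`` is hypothetical.

```python
import json

# Made-up sample in the assumed layout; the lines are deliberately stored
# out of reading order, as a document creator might have written them.
sample = json.dumps({
    "width": 595.2756, "height": 841.8898,
    "blocks": [
        {"bbox": [300.0, 100.4, 500.0, 112.0], "lines": [
            {"bbox": [300.0, 100.4, 500.0, 112.0],
             "spans": [{"bbox": [300.0, 100.4, 500.0, 112.0],
                        "text": "right column"}]}]},
        {"bbox": [40.0, 53.7, 98.7, 76.1], "lines": [
            {"bbox": [40.0, 53.7, 98.7, 76.1],
             "spans": [{"bbox": [40.0, 53.7, 98.7, 76.1],
                        "text": "Tutorial"}]}]},
        {"bbox": [40.0, 100.2, 200.0, 112.0], "lines": [
            {"bbox": [40.0, 100.2, 200.0, 112.0],
             "spans": [{"bbox": [40.0, 100.2, 200.0, 112.0],
                        "text": "left column"}]}]},
    ],
})


def text_in_reading_order(text):
    """Return line texts sorted top-to-bottom, then left-to-right."""
    text_dict = json.loads(text)
    keyed = []
    for block in text_dict.get("blocks", []):
        for line in block.get("lines", []):
            x0, y0 = line["bbox"][0], line["bbox"][1]
            content = "".join(span["text"] for span in line.get("spans", []))
            # Rounding y lets lines with nearly equal tops count as one row.
            keyed.append((round(y0), x0, content))
    keyed.sort()
    return [content for _, _, content in keyed]


print(text_in_reading_order(sample))
```

Rounding the vertical coordinate is a deliberately crude way of grouping lines into rows; a real utility would likely need a tolerance tuned to the font size.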

doc/html/_sources/changes.txt

+4-3
@@ -1,7 +1,7 @@
11
=========================
22
Changes in Version 1.9.0
33
=========================
4-
This version of PyMuPDF is based on MuPDF library source code version 1.9 published on April 18, 2016.
4+
This version of PyMuPDF is based on MuPDF library source code version 1.9 published on April 18, 2016.
55

66
Please have a look at MuPDF's website to see which changes and enhancements are contained herein.
77

@@ -12,5 +12,6 @@ Changes in these bindings compared to version 1.8.0 are the following:
1212
* The Pixmap constructor ``fitz.Pixmap(data, len(data))`` has been extended accordingly to support the above image formats as well (not just PNG as it did in version 1.8.0).
1313
* Various improvements and new members in our demo and examples collections have been applied or added. Perhaps most prominently: ``PDF_display`` now supports scrolling with the mouse wheel, and there is a new example program ``wxTableExtract`` which allows to graphically identify and extract table data in documents.
1414
* ``fitz.Rect`` objects can now be created with all possible combinations of points and coordinates.
15-
* PyMuPDF classes and methods now all contain __doc__ strings, which were automatically created by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot using the bindings as a programmer.
16-
* A new method of ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
15+
* PyMuPDF classes and methods now all contain __doc__ strings, which were mostly automatically created by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot when programming in Python-aware IDEs.
16+
* A new method ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
17+
* The identity matrix ``fitz.Identity`` is now **immutable**.

doc/html/_sources/document.txt

+1-1
@@ -208,7 +208,7 @@ This class represents a document. It can be constructed from a file or from memo
208208

209209
<TZ> is a time zone value (time interval relative to GMT) containing a sign ('+' or '-'), the hour (``hh``), and the minute (``'mm'``, attention: enclose in apostrophes!).
210210

211-
For example, a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
211+
E.g. a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
212212

213213
:rtype: dict
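
The date format described above can be turned into a timezone-aware Python value with the standard library alone. The sketch below is written for Python 3 and is not part of PyMuPDF; the function name ``parse_pdf_date`` is hypothetical, and it only accepts the complete form with an explicit ``+`` or ``-`` time zone, as in the Venezuelan example.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by sign, hh and 'mm' (minutes in apostrophes).
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'$"
)


def parse_pdf_date(value):
    """Parse a full PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date string: %r" % value)
    year, month, day, hour, minute, second = (
        int(group) for group in match.groups()[:6]
    )
    offset = timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    if match.group(7) == "-":
        offset = -offset  # west of GMT
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))


stamp = parse_pdf_date("D:20150415131602-04'30'")
print(stamp.isoformat())  # 2015-04-15T13:16:02-04:30
```

Shortened date strings (for example without a time zone) raise ``ValueError`` here rather than being guessed at.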
214214

doc/html/_sources/functions.txt

+1-1
@@ -5,7 +5,7 @@
55
============
66
Functions
77
============
8-
The following are miscelleneas functions directly available under the binding name, i.e. can be invoked as ``fitz.function``.
8+
The following are miscellaneous functions directly available under the binding name, i.e. can be invoked as ``fitz.function``.
99

1010
============================= ==============================================
1111
**Function** **Short Description**

doc/html/_sources/identity.txt

+6-9
@@ -10,15 +10,12 @@ Identity
1010

1111
Identity is just a :ref:`Matrix` that performs no action, to be used whenever the syntax requires a :ref:`Matrix`, but no actual transformation should take place.
1212

13-
**Caution:** ``Identity`` is a constant in the C code and therefore **readonly, do not try to modify** its properties in any way, i.e. you must not manipulate its ``[a,b,c,d,e,f]``, neither apply any method.
13+
Identity is a constant, an "immutable" object. So, all of its matrix properties and methods are hidden.
1414

15-
``Matrix(1, 1)`` creates a matrix that acts like ``Identity``, but it may be changed. Use this when you need a starting point for further modification, e.g. by one of the :ref:`Matrix` methods.
16-
17-
In other words:
15+
If you need a do-nothing matrix as a starting point, use ``fitz.Matrix(1, 1)`` or ``fitz.Matrix(0)`` instead, like so:
1816
::
19-
# the following will not work - the interpreter will crash!
20-
m = fitz.Identity.preRotate(90)
21-
22-
# do this instead:
23-
m = fitz.Matrix(1, 1).preRotate(90)
17+
>>> m = fitz.Matrix(0).preRotate(45)
18+
>>> m
19+
fitz.Matrix(0.707106769085, 0.707106769085, -0.707106769085, 0.707106769085, 0.0, 0.0)
20+
>>>
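
The six values in the output above are the ``[a,b,c,d,e,f]`` entries of a 45 degree rotation: ``cos``, ``sin``, ``-sin``, ``cos``, 0, 0 (shown by PyMuPDF in single precision). The following pure-Python sketch reproduces them without ``fitz``; the functions ``rotation`` and ``concat`` are hypothetical helpers for illustration and are not the PyMuPDF API.

```python
import math

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # a, b, c, d, e, f


def rotation(degrees):
    """Matrix values [a, b, c, d, e, f] for a counter-clockwise rotation."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad), -math.sin(rad), math.cos(rad),
            0.0, 0.0)


def concat(m1, m2):
    """Product of two 2D transformation matrices given as 6-tuples."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


rot45 = rotation(45)
print(rot45)
# Combining with the do-nothing matrix leaves the rotation unchanged,
# which is why it serves as a neutral starting point.
print(concat(IDENTITY, rot45) == rot45)  # True
```

The double precision values differ from the single precision output above only from the eighth decimal place onwards.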
2421
