doc/html/_sources/app1.txt
+35-22
Original file line number
Diff line number
Diff line change
@@ -24,28 +24,28 @@ Here is the list of files we are using. Each file name is accompanied by further
24
24
Part 1: Parsing
25
25
~~~~~~~~~~~~~~~~
26
26
27
-
How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot directly be compared, because batch utilities always execute a requested task in one go, front to end. ``pdfrw`` too, has a ``lazy`` strategy for parsing, meaning it only parses those parts of a document that are required in any moment.
27
+
How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot be compared directly, because batch utilities always execute a requested task completely, in one go, front to end. ``pdfrw``, too, has a ``lazy`` parsing strategy, meaning it only parses those parts of a document that are required at any given moment.
28
28
29
-
We therefore measure the time to copy a PDF file to an output file, and doing nothing else.
29
+
To still find an answer to the question, we measure the time each tool needs to copy a PDF file to an output file, doing nothing else.
30
30
31
31
**These were the tools**
32
32
33
-
All tools are either platform independant, or at least can run on Windows and Unix / Linux (pdftk).
33
+
All tools are either platform independent, or at least can run on both Windows and Unix / Linux (pdftk).
34
34
35
-
**Poppler** is missing here, because it specifically is a Linux tool set, although we know there exist Windows ports (created with considerable effort apparently). Technically, it is a C/C++ library, for which a Python binding exists - in so far it is somewhat comparable to PyMuPDF. But Poppler in contrast is tightly coupled to **Qt** and **Cairo**. We may still include it in future, when a more handy Windows installation is available. We have seen however some `analysis <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_, that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also show a much lower performance of Poppler's PDF code base **Xpdf**.
35
+
**Poppler** is missing here because it is specifically a Linux tool set, although we know that Windows ports exist (apparently created with considerable effort). Technically, it is a C/C++ library for which a Python binding exists - in that respect it is somewhat comparable to PyMuPDF. Poppler, in contrast, is tightly coupled to **Qt** and **Cairo**. We may still include it in the future, when a handier Windows installation is available. We have, however, seen an `analysis <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_ that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also shows a much lower performance of Poppler's PDF code base **Xpdf**.
36
36
37
37
Image rendering of MuPDF also is about three times faster than the one of Xpdf when comparing the command line tools ``mudraw`` of MuPDF and ``pdftopng`` of Xpdf - see part 3 of this chapter.
This is how each of the tools is being used with the test:
48
+
This is how each of the tools was used:
49
49
50
50
**PyMuPDF**:
51
51
::
@@ -81,17 +81,17 @@ If we leave out the Adobe manual, this table looks like
81
81
82
82
.. image:: copy_speed_2.png
83
83
84
-
PyMuPDF is by far the fastest: on average 2.4 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and 10 times faster than the command line tool pdftk.
84
+
PyMuPDF is by far the fastest: on average 4.5 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and almost 20 times faster than the command line tool pdftk.
85
85
86
-
Where PyMuPDF only requires less than 24 seconds to process all files, pdftk affords itself almost 4 minutes.
86
+
While PyMuPDF requires less than 13 seconds to process all files, pdftk takes almost 4 minutes.
87
87
88
-
By far the slowest tool is PyPDF2 - it is more than 35 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's bad look comes from the Adobe manual. It obviously is slowed down by the linear file structure and the immense amount of bookmarks of this file. If we take out this special case, then PyPDF2 is 10.5 times slower than PyMuPDF, 4.4 times slower than pdfrw and 1.2 times slower than pdftk.
88
+
By far the slowest tool is PyPDF2 - it is more than 66 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's poor showing is the Adobe manual: it is obviously slowed down by the linear file structure and the immense number of bookmarks in this file. If we take out this special case, then PyPDF2 is only 21.5 times slower than PyMuPDF, 4.5 times slower than pdfrw and 1.2 times slower than pdftk.
89
89
90
90
If we look at the output PDFs, there is one surprise:
91
91
92
92
Each tool created a PDF of similar size as the original. Apart from the Adobe case, PyMuPDF always created the smallest output.
93
93
94
-
Adobe's manual is an exception: The pure Python tools **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
94
+
Adobe's manual is an exception: The pure Python tools pdfrw and PyPDF2 **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
95
95
96
96
PyMuPDF and pdftk in contrast **drastically increased** the size by 40% to about 50 MB (also no longer linearized).
97
97
@@ -122,11 +122,9 @@ Here are the results using the same test files as above (again: decimal point an
122
122
123
123
.. image:: textperformance.png
124
124
125
-
Again, (Py-) MuPDF is the fastest around. It is two times faster than xpdf.
125
+
Again, (Py-) MuPDF is the fastest around. It is between 2.3 and 2.6 times faster than xpdf.
126
126
127
-
JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version is 1.6 times faster.
128
-
129
-
``pdfminer``, as a pure Python solution, of course is comparatively slow: MuPDF is 75 (64, 58) times faster and xpdf is 37 times faster. These observations in order of magnitude coincide with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
127
+
``pdfminer``, as a pure Python solution, of course is comparatively slow: MuPDF is 50 to 60 times faster and xpdf is 23 times faster. These observations in order of magnitude coincide with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
130
128
131
129
132
130
.. raw:: pdf
@@ -136,19 +134,34 @@ JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version i
136
134
137
135
Part 3: Image Rendering
138
136
~~~~~~~~~~~~~~~~~~~~~~~~
139
-
We have tested rendering speed of MuPDF against the ``pdftopng.exe``, a command lind tool of the **Xpdf** toolset, which is the PDF code basis of **Poppler**.
137
+
We have tested the rendering speed of MuPDF against ``pdftopng.exe``, a command line tool of the **Xpdf** toolset (the PDF code basis of **Poppler**).
140
138
141
-
MuPDF invocation using a resolution of 150 pixels (Xpdf default):
139
+
**MuPDF invocation using a resolution of 150 pixels (Xpdf default):**
142
140
::
143
141
mutool draw -o t%d.png -r 150 file.pdf
144
142
145
-
146
-
Xpdf invocation:
143
+
**PyMuPDF invocation:**
144
+
::
145
+
zoom = 150.0 / 72.0
146
+
mat = fitz.Matrix(1,1).preScale(zoom, zoom)
147
+
def ProcessFile(datei):
148
+
print "processing:", datei
149
+
doc=fitz.Document(datei)
150
+
for i in range(doc.pageCount):
151
+
pix = doc.getPagePixmap(i, matrix=mat)
152
+
pix.writePNG("t-%s.png" % i)
153
+
pix = None
154
+
doc.close()
155
+
return
156
+
157
+
**Xpdf invocation:**
147
158
::
148
159
pdftopng.exe file.pdf ./
149
160
150
161
The resulting runtimes can be found here (again: meaning of decimal point and comma reversed):
151
162
152
163
.. image:: render_speed.png
153
164
154
-
MuPDF is between 2.7 and 4.7 (on average 3.0) times faster than Xpdf.
165
+
* MuPDF and PyMuPDF are both about 3 times faster than Xpdf.
166
+
167
+
* The 2% speed difference between MuPDF (a utility written in C) and PyMuPDF is the Python overhead.
doc/html/_sources/app2.txt
+11-22
@@ -33,7 +33,9 @@ A **span** consists of characters with the same properties. E.g. a different fon
33
33
Output of ``getText(output="text")``
34
34
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
35
35
36
-
This is the plain text output of a page of this tutorial's PDF version:
36
+
This function extracts a page's plain **text in original order** as specified by the creator of the document (which may not be equal to a natural reading order!).
37
+
38
+
An example output of this tutorial's PDF version:
37
39
::
38
40
Tutorial
39
41
@@ -47,7 +49,7 @@ This is the plain text output of a page of this tutorial's PDF version:
47
49
Output of ``getText(output="html")``
48
50
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
49
51
50
-
The HTML version looks like this:
52
+
HTML output reflects the structure of the page's ``TextPage`` - without adding much other benefit. Again an example:
51
53
::
52
54
<div class="page">
53
55
<div class="block"><p>
@@ -65,7 +67,7 @@ The HTML version looks like this:
65
67
Output of ``getText(output="json")``
66
68
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
67
69
68
-
JSON output looks like so:
70
+
JSON output reflects the structure of a ``TextPage`` and provides position details (``bbox`` - bounding boxes in pixel units) for every block, line and span. This is enough information to present a page's text in any required reading order (e.g. from top-left to bottom-right). The output can be converted to a Python dictionary with ``text_dict = json.loads(text)``. Have a look at our example program ``PDF2textJS.py``. Here is what it looks like:
69
71
::
70
72
{
71
73
"len":35,"width":595.2756,"height":841.8898,
@@ -98,7 +100,7 @@ JSON output looks like so:
98
100
Output of ``getText(output="xml")``
99
101
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
100
102
101
-
Now the XML version:
103
+
The XML version takes the level of detail even a lot deeper: every single character is provided with its position detail, and every span also contains font information:
The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
138
-
139
-
For testing performance, we have run several example PDFs through these methods and found the following information. This data is not statistically secured in any way - just take it as an idea for what you should expect to see.
140
-
141
-
As a low end example we took this manual's PDF version (45+ pages, text oriented, 500 KB). The high end case was Adobe's PDF manual (1310 pages, text oriented, 32 MB). The other test cases were `Spektrum <http://www.spektrum.de/>`_ magazines of the year 2015 (the German version of Scientific American, 100+ pages, text with lots of complex interspersed images, 10 to 25 MB each).
136
+
The method's output can be processed by one of Python's XML modules. We have successfully tested ``lxml``. See the demo program ``fontlister.py``. It creates a list of all fonts of a document including font size and where used on pages.
142
137
143
138
Performance
144
139
~~~~~~~~~~~~
145
-
Performance of text extraction has improved significantly in MuPDF 1.8! As of updating this documentation (mid November 2015), data hint at an improvement factor greater than 2. Especially the complex extraction methods now have a much lower effort penalty.
146
-
147
-
On a higher level Win10 machine (8 processors at 4 GHz, 8 GB RAM), ``extractXML()`` needs anything between 0.2 and 0.5 seconds per page. This means that you can extract extremely detailed text information of a complex 100-page magazine in less than a minute. This is faster than some other free text extraction tools like e.g. `Nitro 3 <https://www.gonitro.com/pdf-reader>`_.
148
-
149
-
With ``PDF2TextJS.py`` of the example directory, you have a high performance text extraction utility with a high layout faithfulness!
140
+
The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
150
141
151
-
Data Sizes
152
-
~~~~~~~~~~~
153
-
The sizes of the returned text strings follow this pattern (``extractText()`` is set to 1):
142
+
To begin with, all four methods are **very** fast compared to what is available on the market. In terms of processing speed, we couldn't find a faster (free) tool.
154
143
155
-
``(Text : HTML : JSON : XML) ~ (1 : 4 : 6 : 87)``
144
+
Relative to each other, ``xml`` is about 2 times slower than ``text``; the other two range between them. E.g. ``json`` needs about 13% - 14% more time than ``text``.
156
145
157
-
The number 87 for ``extractXML()`` corresponds to values between 200 and 400 KB per page.
146
+
Look into the previous chapter **Appendix 1** for more performance information.
This version of PyMuPDF is based on MuPDF library source code version 1.9 published on April 18, 2016.
4
+
This version of PyMuPDF is based on MuPDF library source code version 1.9 published in April 18, 2016.
5
5
6
6
Please have a look at MuPDF's website to see which changes and enhancements are contained herein.
7
7
@@ -12,5 +12,6 @@ Changes in these bindings compared to version 1.8.0 are the following:
12
12
* The Pixmap constructor ``fitz.Pixmap(data, len(data))`` has been extended accordingly to support the above image formats as well (not just PNG as it did in version 1.8.0).
13
13
* Various improvements and new members in our demo and examples collections have been applied or added. Perhaps most prominently: ``PDF_display`` now supports scrolling with the mouse wheel, and there is a new example program ``wxTableExtract`` which allows to graphically identify and extract table data in documents.
14
14
* ``fitz.Rect`` objects can now be created with all possible combinations of points and coordinates.
15
-
* PyMuPDF classes and methods now all contain __doc__ strings, which were automatically created by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot using the bindings as a programmer.
16
-
* A new method of ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
15
+
* PyMuPDF classes and methods now all contain __doc__ strings, which were mostly automatically created by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot when programming in Python-aware IDEs.
16
+
* A new method of ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
17
+
* The identity matrix ``fitz.Identity`` is now **immutable**.
doc/html/_sources/document.txt
+1-1
@@ -208,7 +208,7 @@ This class represents a document. It can be constructed from a file or from memo
208
208
209
209
<TZ> is a time zone value (time interval relative to GMT) containing a sign ('+' or '-'), the hour (``hh``), and the minute (``'mm'``, attention: enclose in apostrophes!).
210
210
211
-
For example, a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
211
+
E.g. a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
doc/html/_sources/identity.txt
+6-9
@@ -10,15 +10,12 @@ Identity
10
10
11
11
Identity is just a :ref:`Matrix` that performs no action, to be used whenever the syntax requires a :ref:`Matrix`, but no actual transformation should take place.
12
12
13
-
**Caution:** ``Identity`` is a constant in the C code and therefore **readonly, do not try to modify** its properties in any way, i.e. you must not manipulate its ``[a,b,c,d,e,f]``, neither apply any method.
13
+
Identity is a constant, an "immutable" object. So, all of its matrix properties and methods are hidden.
14
14
15
-
``Matrix(1, 1)`` creates a matrix that acts like ``Identity``, but it may be changed. Use this when you need a starting point for further modification, e.g. by one of the :ref:`Matrix` methods.
16
-
17
-
In other words:
15
+
If you need a do-nothing matrix as a starting point, use ``fitz.Matrix(1, 1)`` or ``fitz.Matrix(0)`` instead, like so:
18
16
::
19
-
# the following will not work - the interpreter will crash!