pymupdf
diff --git a/doc/PyMuPDF.pdf b/doc/PyMuPDF.pdf
6.28 KB
diff --git a/doc/html/.buildinfo b/doc/html/.buildinfo
+1-1
diff --git a/doc/html/.doctrees/app1.doctree b/doc/html/.doctrees/app1.doctree
45 KB
diff --git a/doc/html/.doctrees/app2.doctree b/doc/html/.doctrees/app2.doctree
29.8 KB
diff --git a/doc/html/.doctrees/changes.doctree b/doc/html/.doctrees/changes.doctree
9.95 KB
diff --git a/doc/html/.doctrees/classes.doctree b/doc/html/.doctrees/classes.doctree
3.14 KB
diff --git a/doc/html/.doctrees/colorspace.doctree b/doc/html/.doctrees/colorspace.doctree
8.74 KB
diff --git a/doc/html/.doctrees/device.doctree b/doc/html/.doctrees/device.doctree
13.1 KB
diff --git a/doc/html/.doctrees/displaylist.doctree b/doc/html/.doctrees/displaylist.doctree
18.6 KB
diff --git a/doc/html/.doctrees/document.doctree b/doc/html/.doctrees/document.doctree
96.3 KB
diff --git a/doc/html/.doctrees/environment.pickle b/doc/html/.doctrees/environment.pickle
55.3 KB
diff --git a/doc/html/.doctrees/functions.doctree b/doc/html/.doctrees/functions.doctree
11.2 KB
diff --git a/doc/html/.doctrees/identity.doctree b/doc/html/.doctrees/identity.doctree
6.7 KB
diff --git a/doc/html/.doctrees/index.doctree b/doc/html/.doctrees/index.doctree
3.53 KB
diff --git a/doc/html/.doctrees/installation.doctree b/doc/html/.doctrees/installation.doctree
23.3 KB
diff --git a/doc/html/.doctrees/intro.doctree b/doc/html/.doctrees/intro.doctree
16.1 KB
diff --git a/doc/html/.doctrees/irect.doctree b/doc/html/.doctrees/irect.doctree
35.8 KB
diff --git a/doc/html/.doctrees/link.doctree b/doc/html/.doctrees/link.doctree
14.7 KB
diff --git a/doc/html/.doctrees/linkdest.doctree b/doc/html/.doctrees/linkdest.doctree
38.5 KB
diff --git a/doc/html/.doctrees/matrix.doctree b/doc/html/.doctrees/matrix.doctree
68.2 KB
diff --git a/doc/html/.doctrees/outline.doctree b/doc/html/.doctrees/outline.doctree
28.6 KB
diff --git a/doc/html/.doctrees/page.doctree b/doc/html/.doctrees/page.doctree
46.1 KB
diff --git a/doc/html/.doctrees/pixmap.doctree b/doc/html/.doctrees/pixmap.doctree
124 KB
diff --git a/doc/html/.doctrees/point.doctree b/doc/html/.doctrees/point.doctree
16.2 KB
diff --git a/doc/html/.doctrees/rect.doctree b/doc/html/.doctrees/rect.doctree
47 KB
diff --git a/doc/html/.doctrees/textpage.doctree b/doc/html/.doctrees/textpage.doctree
28 KB
diff --git a/doc/html/.doctrees/textsheet.doctree b/doc/html/.doctrees/textsheet.doctree
3.77 KB
diff --git a/doc/html/.doctrees/tutorial.doctree b/doc/html/.doctrees/tutorial.doctree
95.3 KB
diff --git a/doc/html/.doctrees/vars.doctree b/doc/html/.doctrees/vars.doctree
32.3 KB
diff --git a/doc/html/_images/render_speed.png b/doc/html/_images/render_speed.png
4.32 KB
diff --git a/doc/html/_images/textperformance.png b/doc/html/_images/textperformance.png
-2.79 KB
diff --git a/doc/html/_sources/app1.txt b/doc/html/_sources/app1.txt
+35-22
diff --git a/doc/html/_sources/app2.txt b/doc/html/_sources/app2.txt
+11-22
diff --git a/doc/html/_sources/changes.txt b/doc/html/_sources/changes.txt
+4-3
diff --git a/doc/html/_sources/document.txt b/doc/html/_sources/document.txt
+1-1
diff --git a/doc/html/_sources/functions.txt b/doc/html/_sources/functions.txt
+1-1
diff --git a/doc/html/_sources/identity.txt b/doc/html/_sources/identity.txt
+6-9
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 444d4db8bf57147137dc4b3675fd4de0
+config: 5870aaa6ecde0074afae47aa34341ab0
 tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -24,28 +24,28 @@ Here is the list of files we are using. Each file name is accompanied by further
 Part 1: Parsing
 ~~~~~~~~~~~~~~~~
 
-How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot directly be compared, because batch utilities always execute a requested task in one go, front to end. ``pdfrw`` too, has a ``lazy`` strategy for parsing, meaning it only parses those parts of a document that are required in any moment.
+How fast is a PDF file read and its content parsed for further processing? The sheer parsing performance cannot be compared directly, because batch utilities always execute a requested task completely, in one go, from start to end. ``pdfrw``, too, has a ``lazy`` parsing strategy, meaning it only parses those parts of a document that are required at any given moment.
 
-We therefore measure the time to copy a PDF file to an output file, and doing nothing else.
+To still answer the question, we measure the time each tool needs to copy a PDF file to an output file, doing nothing else.
 
 **These were the tools**
 
-All tools are either platform independant, or at least can run on Windows and Unix / Linux (pdftk).
+All tools are either platform independent or can at least run on both Windows and Unix / Linux (pdftk).
 
-**Poppler** is missing here, because it specifically is a Linux tool set, although we know there exist Windows ports (created with considerable effort apparently). Technically, it is a C/C++ library, for which a Python binding exists - in so far it is somewhat comparable to PyMuPDF. But Poppler in contrast is tightly coupled to **Qt** and **Cairo**. We may still include it in future, when a more handy Windows installation is available. We have seen however some `analysis  <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_, that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also show a much lower performance of Poppler's PDF code base **Xpdf**.
+**Poppler** is missing here, because it is specifically a Linux tool set, although we know that Windows ports exist (apparently created with considerable effort). Technically, it is a C/C++ library for which a Python binding exists - in this respect it is somewhat comparable to PyMuPDF. In contrast, however, Poppler is tightly coupled to **Qt** and **Cairo**. We may still include it in the future, when a handier Windows installation is available. However, we have seen an `analysis <http://hzqtc.github.io/2012/04/poppler-vs-mupdf.html>`_ that hints at a much lower performance than MuPDF. Our comparison of text extraction speeds also shows a much lower performance of Poppler's PDF code base **Xpdf**.
 
 Image rendering of MuPDF also is about three times faster than the one of Xpdf when comparing the command line tools ``mudraw`` of MuPDF and ``pdftopng`` of Xpdf - see part 3 of this chapter.
 
-========= =====================================================================
+========= ==========================================================================
 Tool      Description
-========= =====================================================================
+========= ==========================================================================
 PyMuPDF   tool of this manual, appearing as "fitz" in reports
-pdfrw     a pure Python tool, can be used as frontend to ReportLab and rst2pdf
+pdfrw     a pure Python tool, used by rst2pdf, has an interface to ReportLab
 PyPDF2    a pure Python tool with a very complete function set
 pdftk     a command line utility with numerous functions
-========= =====================================================================
+========= ==========================================================================
 
-This is how each of the tools is being used with the test:
+This is how each of the tools was used:
 
 **PyMuPDF**:
 ::
@@ -81,17 +81,17 @@ If we leave out the Adobe manual, this table looks like
 
 .. image:: copy_speed_2.png
 
-PyMuPDF is by far the fastest: on average 2.4 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and 10 times faster than the command line tool pdftk.
+PyMuPDF is by far the fastest: on average 4.5 times faster than the second best (the pure Python tool pdfrw, **chapeau pdfrw!**), and almost 20 times faster than the command line tool pdftk.
 
-Where PyMuPDF only requires less than 24 seconds to process all files, pdftk affords itself almost 4 minutes.
+Where PyMuPDF requires less than 13 seconds to process all files, pdftk takes almost 4 minutes.
 
-By far the slowest tool is PyPDF2 - it is more than 35 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's bad look comes from the Adobe manual. It obviously is slowed down by the linear file structure and the immense amount of bookmarks of this file. If we take out this special case, then PyPDF2 is 10.5 times slower than PyMuPDF, 4.4 times slower than pdfrw and 1.2 times slower than pdftk.
+By far the slowest tool is PyPDF2 - it is more than 66 times slower than PyMuPDF and 15 times slower than pdfrw! The main reason for PyPDF2's poor showing is the Adobe manual. PyPDF2 is obviously slowed down by the linear file structure and the immense number of bookmarks in this file. If we take out this special case, then PyPDF2 is only 21.5 times slower than PyMuPDF, 4.5 times slower than pdfrw and 1.2 times slower than pdftk.
 
 If we look at the output PDFs, there is one surprise:
 
 Each tool created a PDF of similar size as the original. Apart from the Adobe case, PyMuPDF always created the smallest output.
 
-Adobe's manual is an exception: The pure Python tools **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
+Adobe's manual is an exception: The pure Python tools pdfrw and PyPDF2 **reduced** its size by more than 20% (and yielded a document which is no longer linearized)!
 
 PyMuPDF and pdftk in contrast **drastically increased** the size by 40% to about 50 MB (also no longer linearized).
 
@@ -122,11 +122,9 @@ Here are the results using the same test files as above (again: decimal point an
 
 .. image:: textperformance.png
 
-Again, (Py-) MuPDF is the fastest around. It is two times faster than xpdf.
+Again, (Py-) MuPDF is the fastest around. It is between 2.3 and 2.6 times faster than xpdf.
 
-JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version is 1.6 times faster.
-
-``pdfminer``, as a pure Python solution, of course is comparatively slow: MuPDF is 75 (64, 58) times faster and xpdf is 37 times faster. These observations in order of magnitude coincide with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
+``pdfminer``, as a pure Python solution, is of course comparatively slow: MuPDF is 50 to 60 times faster and xpdf is 23 times faster. In order of magnitude, these observations agree with the statements on this `web site <http://www.unixuser.org/~euske/python/pdfminer/>`_.
 
 
 .. raw:: pdf
@@ -136,19 +134,34 @@ JSON output is 1.7 times faster than xpdf, and even the "re-arranging" version i
 
 Part 3: Image Rendering
 ~~~~~~~~~~~~~~~~~~~~~~~~
-We have tested rendering speed of MuPDF against the ``pdftopng.exe``, a command lind tool of the **Xpdf** toolset, which is the PDF code basis of **Poppler**.
+We have tested the rendering speed of MuPDF against ``pdftopng.exe``, a command line tool of the **Xpdf** toolset (the PDF code basis of **Poppler**).
 
-MuPDF invocation using a resolution of 150 pixels (Xpdf default):
+**MuPDF invocation using a resolution of 150 dpi (Xpdf default):**
 ::
  mutool draw -o t%d.png -r 150 file.pdf
 
-
-Xpdf invocation:
+**PyMuPDF invocation:**
+::
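+ import fitz
+
+ zoom = 150.0 / 72.0                          # 150 dpi relative to the PDF base of 72 dpi
+ mat = fitz.Matrix(1,1).preScale(zoom, zoom)  # scaling matrix used for rendering
+ def ProcessFile(datei):
+     print("processing: %s" % datei)          # works in Python 2 and Python 3
+     doc=fitz.Document(datei)
+     for i in range(doc.pageCount):
+         pix = doc.getPagePixmap(i, matrix=mat)   # render page i with the zoom matrix
+         pix.writePNG("t-%s.png" % i)
+         pix = None                               # drop the reference so the pixmap can be freed
+     doc.close()
+     return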
+
+**Xpdf invocation:**
 ::
  pdftopng.exe file.pdf ./
 
 The resulting runtimes can be found here (again: meaning of decimal point and comma reversed):
 
 .. image:: render_speed.png
 
-MuPDF is between 2.7 and 4.7 (on average 3.0) times faster than Xpdf.
+* MuPDF and PyMuPDF are both about 3 times faster than Xpdf.
+
+* The 2% speed difference between MuPDF (a utility written in C) and PyMuPDF is the Python overhead.
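
The zoom factor in the PyMuPDF invocation above follows from the PDF base resolution of 72 units per inch. As a rough sketch of that arithmetic, the pixel dimensions of a rendered page can be estimated as shown here. The helper name ``pixel_size`` is hypothetical, the page size is the one that appears in the JSON sample of this documentation, and the actual renderer may round slightly differently.

```python
def pixel_size(width_pt, height_pt, dpi=150.0):
    """Estimate the pixel dimensions of a page rendered at the given resolution."""
    zoom = dpi / 72.0  # same factor as in the benchmark script: 150 / 72
    return int(round(width_pt * zoom)), int(round(height_pt * zoom))


# Page size in points, as found in the JSON sample (an A4 page).
a4 = pixel_size(595.2756, 841.8898)
print(a4)
```

At 72 dpi the zoom factor is 1, so the estimate simply returns the page size in points, rounded to integers.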
@@ -33,7 +33,9 @@ A **span** consists of characters with the same properties. E.g. a different fon
 Output of ``getText(output="text")``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-This is the plain text output of a page of this tutorial's PDF version:
+This function extracts a page's plain **text in original order** as specified by the creator of the document (which may differ from the natural reading order!).
+
+An example output of this tutorial's PDF version:
 ::
  Tutorial
 
@@ -47,7 +49,7 @@ This is the plain text output of a page of this tutorial's PDF version:
 Output of ``getText(output="html")``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The HTML version looks like this:
+HTML output reflects the structure of the page's ``TextPage`` - without adding much other benefit. Again an example:
 ::
  <div class="page">
  <div class="block"><p>
@@ -65,7 +67,7 @@ The HTML version looks like this:
 Output of ``getText(output="json")``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-JSON output looks like so:
+JSON output reflects the structure of a ``TextPage`` and provides position details (``bbox`` - bounding boxes in pixel units) for every block, line and span. This is enough information to present a page's text in any required reading order (e.g. from top-left to bottom-right). The output can be converted to a Python dictionary with ``text_dict = json.loads(text)``. Have a look at our example program ``PDF2textJS.py``. Here is what it looks like:
 ::
  {
   "len":35,"width":595.2756,"height":841.8898,
@@ -98,7 +100,7 @@ JSON output looks like so:
 Output of ``getText(output="xml")``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Now the XML version:
+The XML version goes considerably deeper in its level of detail: every single character is provided with its position details, and every span also contains font information:
 ::
  <page width="595.2756" height="841.8898">
  <block bbox="40.01575 53.730354 98.68775 76.08236">
@@ -131,27 +133,14 @@ Now the XML version:
  <char bbox="81.695755 79.300354 83.91576 93.04035" x="81.695755" y="90.050354" c="i"/>
  ...
 
-
-Resource Requirements
-~~~~~~~~~~~~~~~~~~~~~
-The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
-
-For testing performance, we have run several example PDFs through these methods and found the following information. This  data is not statistically secured in any way - just take it as an idea for what you should expect to see.
-
-As a low end example we took this manual's PDF version (45+ pages, text oriented, 500 KB). The high end case was Adobe's PDF manual (1310 pages, text oriented, 32 MB). The other test cases were `Spektrum <http://www.spektrum.de/>`_ magazines of the year 2015 (the German version of Scientific American, 100+ pages, text with lots of complex interspersed images, 10 to 25 MB each).
+The method's output can be processed by one of Python's XML modules. We have successfully tested ``lxml``. See the demo program ``fontlister.py``, which creates a list of all fonts of a document, including font sizes and the pages where they are used.
 
 Performance
 ~~~~~~~~~~~~
-Performance of text extraction has improved significantly in MuPDF 1.8! As of updating this documentation (mid November 2015), data hint at an improvement factor greater than 2. Especially the complex extraction methods now have a much lower effort penalty.
-
-On a higher level Win10 machine (8 processors at 4 GHz, 8 GB RAM), ``extractXML()`` needs anything between 0.2 and 0.5 seconds per page. This means that you can extract extremely detailed text information of a complex 100-page magazine in less than a minute. This is faster than some other free text extraction tools like e.g. `Nitro 3 <https://www.gonitro.com/pdf-reader>`_.
-
-With ``PDF2TextJS.py`` of the example directory, you have a high performance text extraction utility with a high layout faithfulness!
+The four text extraction methods of a :ref:`TextPage` differ significantly: in terms of information they supply (see above), and in terms of resource requirements. More information of course means that more processing is required and a higher data volume is generated.
 
-Data Sizes
-~~~~~~~~~~~
-The sizes of the returned text strings follow this pattern (``extractText()`` is set to 1):
+To begin with, all four methods are **very** fast compared to what is available on the market. In terms of processing speed, we couldn't find a faster (free) tool.
 
-``(Text : HTML : JSON : XML) ~ (1 : 4 : 6 : 87)``
+Relative to each other, ``xml`` is about 2 times slower than ``text``; the other two (``html`` and ``json``) range between them. E.g. ``json`` needs about 13% - 14% more time than ``text``.
 
-The number 87 for ``extractXML()`` corresponds to values between 200 and 400 KB per page.
+Look into the previous chapter **Appendix 1** for more performance information.
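
The JSON section above states that the ``bbox`` values are enough to present text in any reading order. A minimal sketch of that idea follows, using a hand-written sample instead of real output. The key names ``blocks``, ``lines``, ``spans`` and ``text``, as well as the helper name ``reading_order``, are assumptions made for illustration; only ``bbox``, ``width`` and ``height`` are taken from this documentation.

```python
import json

# Hand-written sample string. The key names "blocks", "lines", "spans" and
# "text" are assumptions about the JSON layout, not copied from real output.
text = """
{"width": 595.2756, "height": 841.8898, "blocks": [
  {"bbox": [300.0, 50.0, 500.0, 70.0],
   "lines": [{"bbox": [300.0, 50.0, 500.0, 70.0],
              "spans": [{"bbox": [300.0, 50.0, 500.0, 70.0], "text": "top right"}]}]},
  {"bbox": [40.0, 200.0, 200.0, 220.0],
   "lines": [{"bbox": [40.0, 200.0, 200.0, 220.0],
              "spans": [{"bbox": [40.0, 200.0, 200.0, 220.0], "text": "bottom left"}]}]},
  {"bbox": [40.0, 50.0, 200.0, 70.0],
   "lines": [{"bbox": [40.0, 50.0, 200.0, 70.0],
              "spans": [{"bbox": [40.0, 50.0, 200.0, 70.0], "text": "top left"}]}]}
]}
"""


def reading_order(text_dict):
    """Return the block texts sorted top-to-bottom, then left-to-right."""
    # bbox is assumed to be [x0, y0, x1, y1] with y growing downwards.
    blocks = sorted(text_dict["blocks"],
                    key=lambda b: (b["bbox"][1], b["bbox"][0]))
    result = []
    for block in blocks:
        spans = [s["text"] for line in block["lines"] for s in line["spans"]]
        result.append(" ".join(spans))
    return result


text_dict = json.loads(text)
ordered = reading_order(text_dict)
print(ordered)
```

Sorting on the upper-left corner is the simplest possible choice; multi-column layouts would need a more careful grouping of blocks.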
@@ -1,7 +1,7 @@
 =========================
 Changes in Version 1.9.0
 =========================
-This version of PyMuPDF is based on MuPDF library source code version 1.9 published on April 18, 2016.
+This version of PyMuPDF is based on MuPDF library source code version 1.9 published on April 18, 2016.
 
 Please have a look at MuPDF's website to see which changes and enhancements contained herein.
 
@@ -12,5 +12,6 @@ Changes in these bindings compared to version 1.8.0 are the following:
 * The Pixmap constructor ``fitz.Pixmap(data, len(data))`` has been extended accordingly to support the above image formats as well (not just PNG as it did in version 1.8.0).
 * Various improvements and new members in our demo and examples collections have been applied or added. Perhaps most prominently: ``PDF_display`` now supports scrolling with the mouse wheel, and there is a new example program ``wxTableExtract`` which allows to graphically identify and extract table data in documents.
 * ``fitz.Rect`` objects can now be created with all possible combinations of points and coordinates.
-* PyMuPDF classes and methods now all contain  __doc__ strings, which were automatically created by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot using the bindings as a programmer.
-* A new method of ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
+* PyMuPDF classes and methods now all contain ``__doc__`` strings, most of which were created automatically by SWIG. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot when programming in Python-aware IDEs.
+* A new method of ``fitz.Document.getPermits()`` returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
+* The identity matrix ``fitz.Identity`` is now **immutable**.
@@ -208,7 +208,7 @@ This class represents a document. It can be constructed from a file or from memo
 
       <TZ> is a time zone value (time intervall relative to GMT) containing a sign ('+' or '-'), the hour (``hh``), and the minute (``'mm'``, attention: enclose in apostrophies!).
 
-      For example, a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
+      E.g. a Venezuelan value might look like ``D:20150415131602-04'30'``, which corresponds to the timestamp April 15, 2015, at 1:16:02 pm local time Venezuela.
 
       :rtype: dict
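
The timestamp format described above can be converted into a Python ``datetime`` object. The following is only a sketch: the helper name ``parse_pdf_date`` is hypothetical, and it handles just the complete form used in the Venezuelan example (date, time, and a signed ``hh'mm'`` time zone), not abbreviated variants.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the full form only, e.g. D:20150415131602-04'30'
_PDF_DATE = re.compile(r"D:(\d{14})([+-])(\d{2})'(\d{2})'$")


def parse_pdf_date(value):
    """Parse a complete PDF timestamp string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date string: %r" % value)
    stamp, sign, hh, mm = match.groups()
    offset = timedelta(hours=int(hh), minutes=int(mm))
    if sign == "-":
        offset = -offset  # west of GMT
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return naive.replace(tzinfo=timezone(offset))


ts = parse_pdf_date("D:20150415131602-04'30'")
print(ts.isoformat())  # 2015-04-15T13:16:02-04:30
```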
 
 
@@ -5,7 +5,7 @@
 ============
 Functions
 ============
-The following are miscelleneas functions directly available under the binding name, i.e. can be invoked as ``fitz.function``.
+The following are miscellaneous functions directly available under the binding name, i.e. can be invoked as ``fitz.function``.
 
 ============================= ==============================================
 **Function**                  **Short Description**
 
@@ -10,15 +10,12 @@ Identity
 
 Identity is just a :ref:`Matrix` that performs no action, to be used whenever the syntax requires a :ref:`Matrix`, but no actual transformation should take place.
 
-**Caution:** ``Identity`` is a constant in the C code and therefore **readonly, do not try to modify** its properties in any way, i.e. you must not manipulate its ``[a,b,c,d,e,f]``, neither apply any method.
+Identity is a constant, an "immutable" object. So, all of its matrix properties and methods are hidden.
 
-``Matrix(1, 1)`` creates a matrix that acts like ``Identity``, but it may be changed. Use this when you need a starting point for further modification, e.g. by one of the :ref:`Matrix` methods.
-
-In other words:
+If you need a do-nothing matrix as a starting point, use ``fitz.Matrix(1, 1)`` or ``fitz.Matrix(0)`` instead, like so:
 ::
- # the following will not work - the interpreter will crash!
- m = fitz.Identity.preRotate(90)
-
- # do this instead:
- m = fitz.Matrix(1, 1).preRotate(90)
+ >>> m = fitz.Matrix(0).preRotate(45)
+ >>> m
+ fitz.Matrix(0.707106769085, 0.707106769085, -0.707106769085, 0.707106769085, 0.0, 0.0)
+ >>>
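
The numbers in the session above can be checked without PyMuPDF. The sketch below assumes the usual layout of a rotation in the six-value form ``(a, b, c, d, e, f)``, namely ``(cos, sin, -sin, cos, 0, 0)``; the function name ``rotation_matrix`` is hypothetical and not part of the bindings.

```python
import math


def rotation_matrix(deg):
    """Return (a, b, c, d, e, f) of a pure rotation by ``deg`` degrees."""
    rad = math.radians(deg)
    # Assumed layout: a = cos, b = sin, c = -sin, d = cos, no translation.
    return (math.cos(rad), math.sin(rad), -math.sin(rad), math.cos(rad), 0.0, 0.0)


m = rotation_matrix(45)
print(m)
```

Python computes these values in double precision, so they agree with the output shown above only to about seven digits, presumably because the underlying C library stores single-precision floats.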