Commit 00f2309

Document mask in TextPage
This is a documentation-only PR that shows how transparent images are identified in TextPage output.
1 parent 423e059 commit 00f2309

2 files changed

+50
-44
lines changed

docs/page.rst

Lines changed: 32 additions & 29 deletions
@@ -306,7 +306,7 @@ In a nutshell, this is what you can do with PyMuPDF:

 :arg int align: the horizontal alignment for the replacing text. See :meth:`insert_textbox` for available values. The vertical alignment is (approximately) centered if a PDF built-in font is used (CJK or :ref:`Base-14-Fonts`). (New in v1.16.12)

-:arg sequence fill: the fill color of the rectangle **after applying** the redaction. The default is *white = (1, 1, 1)*, which is also taken if *None* is specified. To suppress a fill color altogether, specify *False*. In this cases the rectangle remains transparent. (New in v1.16.12)
+:arg sequence fill: the fill color of the rectangle **after applying** the redaction. The default is *white = (1, 1, 1)*, which is also taken if ``None`` is specified. To suppress a fill color altogether, specify ``False``. In this cases the rectangle remains transparent. (New in v1.16.12)

 :arg sequence text_color: the color of the replacing text. Default is *black = (0, 0, 0)*. (New in v1.16.12)

@@ -349,7 +349,7 @@ In a nutshell, this is what you can do with PyMuPDF:

 * For option `images=PDF_REDACT_IMAGE_PIXELS` a new image of format PNG is created, which the page will use in place of the original one. The original image is not deleted or replaced as part of this process, so other pages may still show the original. In addition, the new, modified PNG image currently is **stored uncompressed**. Do keep these aspects in mind when choosing the right garbage collection method and compression options during save.

-* **Text removal** is done by character: A character is removed if its bbox has a **non-empty overlap** with a redaction rectangle (changed in MuPDF v1.17). Depending on the font properties and / or the chosen line height, deletion may occur for undesired text parts. Using :meth:`Tools.set_small_glyph_heights` with a *True* argument before text search may help to prevent this.
+* **Text removal** is done by character: A character is removed if its bbox has a **non-empty overlap** with a redaction rectangle (changed in MuPDF v1.17). Depending on the font properties and / or the chosen line height, deletion may occur for undesired text parts. Using :meth:`Tools.set_small_glyph_heights` with a ``True`` argument before text search may help to prevent this.

 * Redactions are a simple way to replace single words in a PDF, or to just physically remove them. Locate the word "secret" using some text extraction or search method and insert a redaction using "xxxxxx" as replacement text for each occurrence.

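The word-replacement idea in the last bullet of the hunk above can be sketched roughly as follows. This is a minimal sketch, not a recipe taken from the commit: `redact_word` is a hypothetical helper name, and `page` is assumed to be a PyMuPDF `Page` of a PDF document. It relies on `Page.search_for`, `Page.add_redact_annot` and `Page.apply_redactions`.

```python
def redact_word(page, word, replacement):
    """Replace every occurrence of `word` on `page` via redactions.

    Hypothetical helper: `page` is assumed to be a PyMuPDF Page of a PDF.
    Returns the number of occurrences that were marked.
    """
    hits = page.search_for(word)  # rectangles surrounding each match
    for rect in hits:
        # One redaction per hit; `text` is written into the redacted rectangle.
        page.add_redact_annot(rect, text=replacement)
    if hits:
        # Nothing is physically removed until the redactions are applied.
        page.apply_redactions()
    return len(hits)
```

A call such as `redact_word(page, "secret", "xxxxxx")` would cover the case described in the bullet. Because removal works by character bbox overlap, neighbouring text can be affected, which is what the `Tools.set_small_glyph_heights` remark in the same hunk is about.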
@@ -414,14 +414,14 @@ In a nutshell, this is what you can do with PyMuPDF:
 the location(s) -- rectangle(s) or quad(s) -- to be marked. (Changed in v1.14.20)
 A list or tuple must consist of :data:`rect_like` or :data:`quad_like` items (or even a mixture of either).
 Every item must be finite, convex and not empty (as applicable).
-**Set this parameter to** *None* if you want to use the following arguments (Changed in v1.16.14).
-And vice versa: if not *None*, the remaining parameters must be *None*.
+**Set this parameter to** ``None`` if you want to use the following arguments (Changed in v1.16.14).
+And vice versa: if not ``None``, the remaining parameters must be ``None``.

-:arg point_like start: start text marking at this point. Defaults to the top-left point of *clip*. Must be provided if `quads` is *None*. (New in v1.16.14)
-:arg point_like stop: stop text marking at this point. Defaults to the bottom-right point of *clip*. Must be used if `quads` is *None*. (New in v1.16.14)
+:arg point_like start: start text marking at this point. Defaults to the top-left point of *clip*. Must be provided if `quads` is ``None``. (New in v1.16.14)
+:arg point_like stop: stop text marking at this point. Defaults to the bottom-right point of *clip*. Must be used if `quads` is ``None``. (New in v1.16.14)
 :arg rect_like clip: only consider text lines intersecting this area. Defaults to the page rectangle. Only use if `start` and `stop` are provided. (New in v1.16.14)

-:rtype: :ref:`Annot` or *None* (changed in v1.16.14).
+:rtype: :ref:`Annot` or ``None`` (changed in v1.16.14).
 :returns: the created annotation. If *quads* is an empty list, **no annotation** is created (changed in v1.16.14).

 .. note::
@@ -1544,8 +1544,8 @@ In a nutshell, this is what you can do with PyMuPDF:

 For paths other than groups or clips, key `"type"` takes one of the following values:

-* **"f"** -- this is a *fill-only* path. Only key-values relevant for this operation have a meaning, not applicable ones are present with a value of *None*: `"color"`, `"lineCap"`, `"lineJoin"`, `"width"`, `"closePath"`, `"dashes"` and should be ignored.
-* **"s"** -- this is a *stroke-only* path. Similar to previous, key `"fill"` is present with value *None*.
+* **"f"** -- this is a *fill-only* path. Only key-values relevant for this operation have a meaning, not applicable ones are present with a value of ``None``: `"color"`, `"lineCap"`, `"lineJoin"`, `"width"`, `"closePath"`, `"dashes"` and should be ignored.
+* **"s"** -- this is a *stroke-only* path. Similar to previous, key `"fill"` is present with value ``None``.
 * **"fs"** -- this is a path performing combined *fill* and *stroke* operations.

 Each item in `path["items"]` is one of the following:
@@ -1670,24 +1670,27 @@ In a nutshell, this is what you can do with PyMuPDF:
 :arg bool xrefs: **PDF only.** Try to find the :data:`xref` for each image. Implies `hashes=True`. Adds the `"xref"` key to the dictionary. If not found, the value is 0, which means, the image is either "inline" or its xref is undetectable for some reason. Please note that this option has an extended response time, because the MD5 hashcode will be computed at least two times for each image with an xref. (New in v1.18.13)

 :rtype: list[dict]
-:returns: A list of dictionaries. This includes information for **exactly those** images, that are shown on the page -- including *"inline images"*. In contrast to images included in :meth:`Page.get_text`, image **binary content** is not loaded, which drastically reduces memory usage. The dictionary layout is similar to that of image blocks in `page.get_text("dict")`.
+:returns: A list of dictionaries. This includes information for **exactly those** images, that are shown on the page -- including *"inline images"*. The dictionary layout is similar to that of image blocks in `page.get_text("dict")`.
+
+In contrast to images included in :meth:`Page.get_text`, image **binary content** is not loaded by this method, which drastically reduces memory usage. Another difference is that image detection is not restricted to the visible part of the page or any ``clip`` parameter: method :meth:`Page.get_text` will only extract images **fully contained** in the provided ``clip``.

 =============== ===============================================================
 **Key**         **Value**
 =============== ===============================================================
-number          block number *(int)*
+number          block number (``int``)
 bbox            image bbox on page, :data:`rect_like`
-width           original image width *(int)*
-height          original image height *(int)*
-cs-name         colorspace name *(str)*
-colorspace      colorspace.n *(int)*
-xres            resolution in x-direction *(int)*
-yres            resolution in y-direction *(int)*
-bpc             bits per component *(int)*
-size            storage occupied by image *(int)*
-digest          MD5 hashcode *(bytes)*, if *hashes* is true
+width           original image width (``int``)
+height          original image height (``int``)
+cs-name         colorspace name (``str``)
+colorspace      colorspace.n (``int``)
+xres            resolution in x-direction (``int``)
+yres            resolution in y-direction (``int``)
+bpc             bits per component (``int``)
+size            storage occupied by image (``int``)
+digest          MD5 hashcode (``bytes``), if ``hashes`` is true
 xref            image :data:`xref` or 0, if *xrefs* is true
 transform       matrix transforming image rect to bbox, :data:`matrix_like`
+has-mask        whether the image is transparent and has a mask (``bool``)
 =============== ===============================================================

 Multiple occurrences of the same image are always reported. You can detect duplicates by comparing their `digest` values.
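
The new `has-mask` key and the duplicate check via `digest` described in the hunk above could be used along these lines. This is only a sketch: `summarize_image_info` is a hypothetical helper, and its input is assumed to be the list returned by `page.get_image_info(hashes=True)`, with the keys listed in the table.

```python
from collections import defaultdict


def summarize_image_info(infos):
    """Summarize image dictionaries (assumed layout: the table in this hunk).

    Returns a tuple (masked, duplicates):
    - masked: block numbers of images whose "has-mask" value is true
    - duplicates: digest -> block numbers, for digests seen more than once
    """
    masked = [info["number"] for info in infos if info.get("has-mask")]
    by_digest = defaultdict(list)
    for info in infos:
        digest = info.get("digest")  # only present if hashes were requested
        if digest is not None:
            by_digest[digest].append(info["number"])
    duplicates = {d: nums for d, nums in by_digest.items() if len(nums) > 1}
    return masked, duplicates
```

Since this method does not load image content, a summary like this stays cheap even for image-heavy pages.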
@@ -1771,7 +1774,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 Create an SVG image from the page. Only full page images are currently supported.

 :arg matrix_like matrix: a matrix, default is :ref:`Identity`.
-:arg bool text_as_path: -- controls how text is represented. *True* outputs each character as a series of elementary draw commands, which leads to a more precise text display in browsers, but a **very much larger** output for text-oriented pages. Display quality for *False* relies on the presence of the referenced fonts on the current system. For missing fonts, the internet browser will fall back to some default -- leading to unpleasant appearances. Choose *False* if you want to parse the text of the SVG. (New in v1.17.5)
+:arg bool text_as_path: -- controls how text is represented. ``True`` outputs each character as a series of elementary draw commands, which leads to a more precise text display in browsers, but a **very much larger** output for text-oriented pages. Display quality for ``False`` relies on the presence of the referenced fonts on the current system. For missing fonts, the internet browser will fall back to some default -- leading to unpleasant appearances. Choose ``False`` if you want to parse the text of the SVG. (New in v1.17.5)

 :returns: a UTF-8 encoded string that contains the image. Because SVG has XML syntax it can be saved in a text file, the standard extension is `.svg`.

@@ -1796,12 +1799,12 @@ In a nutshell, this is what you can do with PyMuPDF:
 :arg colorspace: The desired colorspace, one of "GRAY", "RGB" or "CMYK" (case insensitive). Or specify a :ref:`Colorspace`, ie. one of the predefined ones: :data:`csGRAY`, :data:`csRGB` or :data:`csCMYK`.
 :type colorspace: str or :ref:`Colorspace`
 :arg irect_like clip: restrict rendering to the intersection of this area with the page's rectangle.
-:arg bool alpha: whether to add an alpha channel. Always accept the default *False* if you do not really need transparency. This will save a lot of memory (25% in case of RGB ... and pixmaps are typically **large**!), and also processing time. Also note an **important difference** in how the image will be rendered: with *True* the pixmap's samples area will be pre-cleared with *0x00*. This results in **transparent** areas where the page is empty. With *False* the pixmap's samples will be pre-cleared with *0xff*. This results in **white** where the page has nothing to show.
+:arg bool alpha: whether to add an alpha channel. Always accept the default ``False`` if you do not really need transparency. This will save a lot of memory (25% in case of RGB ... and pixmaps are typically **large**!), and also processing time. Also note an **important difference** in how the image will be rendered: with ``True`` the pixmap's samples area will be pre-cleared with *0x00*. This results in **transparent** areas where the page is empty. With ``False`` the pixmap's samples will be pre-cleared with *0xff*. This results in **white** where the page has nothing to show.

 |history_begin|

 Changed in v1.14.17
-The default alpha value is now *False*.
+The default alpha value is now ``False``.

 * Generated with *alpha=True*

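The "25% in case of RGB" remark on the `alpha` argument in the hunk above can be checked with simple arithmetic. The sketch below assumes one byte per component per pixel; `pixmap_sample_bytes` is a hypothetical helper, not a PyMuPDF function, and the 595 x 842 size is just an A4 page at 72 dpi taken as an example.

```python
def pixmap_sample_bytes(width, height, components=3, alpha=False):
    """Estimated size of a pixmap's samples area in bytes.

    Assumes 8 bits per component; an alpha channel adds one byte per pixel.
    """
    return width * height * (components + (1 if alpha else 0))


rgb = pixmap_sample_bytes(595, 842)               # RGB: 3 bytes per pixel
rgba = pixmap_sample_bytes(595, 842, alpha=True)  # RGB + alpha: 4 bytes per pixel
saving = 1 - rgb / rgba                           # share saved by omitting alpha
```

Dropping the alpha byte removes one of four bytes per pixel, which is where the quoted 25% comes from.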
@@ -1881,7 +1884,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 :arg str,int ident: the annotation name or xref.

 :rtype: :ref:`Annot`
-:returns: the annotation or *None*.
+:returns: the annotation or ``None``.

 .. note:: Methods :meth:`Page.annot_names`, :meth:`Page.annot_xrefs` provide lists of names or xrefs, respectively, from where an item may be picked and loaded via this method.

@@ -1898,7 +1901,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 :arg int xref: the field's xref.

 :rtype: :ref:`Widget`
-:returns: the field or *None*.
+:returns: the field or ``None``.

 .. note:: This is similar to the analogous method :meth:`Page.load_annot` -- except that here only the xref is supported as identifier.

@@ -1913,7 +1916,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 Return the first link on a page. Synonym of property :attr:`first_link`.

 :rtype: :ref:`Link`
-:returns: first link on the page (or *None*).
+:returns: first link on the page (or ``None``).

 .. index::
 pair: rotate; set_rotation
@@ -2187,19 +2190,19 @@ In a nutshell, this is what you can do with PyMuPDF:

 .. attribute:: first_link

-Contains the first :ref:`Link` of a page (or *None*).
+Contains the first :ref:`Link` of a page (or ``None``).

 :type: :ref:`Link`

 .. attribute:: first_annot

-Contains the first :ref:`Annot` of a page (or *None*).
+Contains the first :ref:`Annot` of a page (or ``None``).

 :type: :ref:`Annot`

 .. attribute:: first_widget

-Contains the first :ref:`Widget` of a page (or *None*).
+Contains the first :ref:`Widget` of a page (or ``None``).

 :type: :ref:`Widget`

docs/textpage.rst

Lines changed: 18 additions & 15 deletions
@@ -202,28 +202,25 @@ Block Dictionaries
 ~~~~~~~~~~~~~~~~~~
 Block dictionaries come in two different formats for **image blocks** and for **text blocks**.

-* *(Changed in v1.18.0)* -- new dict key *number*, the block number.
-* *(Changed in v1.18.11)* -- new dict key *transform*, the image transformation matrix for image blocks.
-* *(Changed in v1.18.11)* -- new dict key *size*, the size of the image in bytes for image blocks.
-
 **Image block:**

 =============== ===============================================================
 **Key**         **Value**
 =============== ===============================================================
-type            1 = image *(int)*
+type            1 = image (``int``)
 bbox            image bbox on page (:data:`rect_like`)
-number          block count *(int)*
-ext             image type *(str)*, as file extension, see below
-width           original image width *(int)*
-height          original image height *(int)*
-colorspace      colorspace component count *(int)*
-xres            resolution in x-direction *(int)*
-yres            resolution in y-direction *(int)*
-bpc             bits per component *(int)*
+number          block count (``int``)
+ext             image type (``str``), as file extension, see below
+width           original image width (``int``)
+height          original image height (``int``)
+colorspace      colorspace component count (``int``)
+xres            resolution in x-direction (``int``)
+yres            resolution in y-direction (``int``)
+bpc             bits per component (``int``)
 transform       matrix transforming image rect to bbox (:data:`matrix_like`)
-size            size of the image in bytes *(int)*
-image           image content *(bytes)*
+size            size of the image in bytes (``int``)
+image           image content (``bytes``)
+mask            image mask content (``bytes``) for transparent images
 =============== ===============================================================

 Possible values of the "ext" key are "bmp", "gif", "jpeg", "jpx" (JPEG 2000), "jxr" (JPEG XR), "png", "pnm", and "tiff".
@@ -241,6 +238,12 @@ Possible values of the "ext" key are "bmp", "gif", "jpeg", "jpx" (JPEG 2000), "j

 3. The image's "transformation matrix" is defined as the matrix, for which the expression `bbox / transform == pymupdf.Rect(0, 0, 1, 1)` is true, lookup details here: :ref:`ImageTransformation`.

+4. A transparent image may be accompanied by a mask image. This is stored under key `"mask"` and has the format of a `DeviceGray` PNG image. Otherwise the value of this key is ``None``. If present, you may be able to recover (an equivalent of) the original image -- i.e. with transparency -- by creating :ref:`Pixmap` objects from the "image", respectively "mask" values and overlay them. This is not guaranteed to always work because mask images come in multiple formats, of which not all qualify for the conditions under which overlaying Pixmaps are supported. Here is a code snippet:
+
+>>> base = pymupdf.Pixmap(block["image"])
+>>> mask = pymupdf.Pixmap(block["mask"])
+>>> result = pymupdf.Pixmap(base, mask)
+

 **Text block:**

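Note 4 in the hunk above says that `"mask"` is ``None`` for images without a mask and that overlaying the two pixmaps may not always succeed. A slightly more defensive variant of the committed snippet might look like the sketch below. Both function names are hypothetical, `page_dict` is assumed to be the result of `page.get_text("dict")`, and catching a broad `Exception` is an assumption, since the commit does not say which error an unsupported mask format raises.

```python
def image_blocks_with_mask(page_dict):
    """Yield image blocks (type 1) whose "mask" value is not None."""
    for block in page_dict.get("blocks", []):
        if block.get("type") == 1 and block.get("mask") is not None:
            yield block


def recombine(block):
    """Try to overlay the mask on the base image; return None if that fails."""
    import pymupdf  # imported lazily so the filter above needs no dependency

    base = pymupdf.Pixmap(block["image"])
    mask = pymupdf.Pixmap(block["mask"])
    try:
        return pymupdf.Pixmap(base, mask)
    except Exception:
        # Assumption: not every mask format can be overlaid (see note 4).
        return None
```

Filtering first keeps `recombine` from being called with a ``None`` mask, and the ``None`` return value lets callers fall back to the plain `"image"` content.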