Commit 6994abd

some docu updates
1 parent 5034c88 commit 6994abd

2 files changed: +12 -55 lines changed

docs/document.rst

+10 -10
@@ -404,16 +404,16 @@ For details on **embedded files** refer to Appendix 3.

 .. method:: getPageImageList(pno, full=False)

-PDF only: Return a list of all images referenced by the page.
+PDF only: Return a list of all images (directly or indirectly) referenced by the page.

 :arg int pno: page number, 0-based, *-inf < pno < pageCount*.
-:arg bool full: whether to also include the invoker's :data:`xref` (which is zero if this is the page).
+:arg bool full: whether to also include the referencer's :data:`xref` (which is zero if this is the page).

 :rtype: list

 :returns: a list of images shown on this page. Each item looks like

-**(xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter, invoker)**
+**(xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter, referencer)**

 Where

@@ -425,7 +425,7 @@ For details on **embedded files** refer to Appendix 3.
 * **alt. colorspace** (*str*) is any alternate colorspace depending on the value of **colorspace**
 * **name** (*str*) is the symbolic name by which the image is referenced
 * **filter** (*str*) is the decode filter of the image (:ref:`AdobeManual`, pp. 65).
-* **invoker** (*int*) the :data:`xref` of the invoker. Zero if directly referenced by the page. Only present if *full=True*.
+* **referencer** (*int*) the :data:`xref` of the referencer. Zero if directly referenced by the page. Only present if *full=True*.

 See below how this information can be used to extract PDF images as separate files. Another demonstration::
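
Reviewer sketch (not part of the commit, and separate from the demonstration that the documentation continues with): the renamed *referencer* item could be used to group images by the object that references them. The helper name ``group_by_referencer`` and the sample entries are hypothetical; the code only assumes the 10-item layout described above for *full=True* and works on plain tuples, so it does not call the library itself.

```python
from collections import defaultdict


def group_by_referencer(image_list):
    """Map each referencer xref to the image xrefs it references.

    Expects entries in the 10-item layout produced with full=True,
    where the last item is the referencer (0 means the page itself).
    """
    groups = defaultdict(list)
    for item in image_list:
        if len(item) != 10:
            raise ValueError("expected 10-item entries (full=True)")
        groups[item[-1]].append(item[0])
    return dict(groups)


# Hypothetical entries, invented for illustration only.
sample_images = [
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode", 0),
    (15, 0, 100, 100, 8, "DeviceGray", "", "Im2", "FlateDecode", 31),
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im7", "DCTDecode", 31),
]
grouped = group_by_referencer(sample_images)
```

Key 0 collects the images referenced directly by the page; any other key is the xref of the object that references them.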

@@ -438,16 +438,16 @@ For details on **embedded files** refer to Appendix 3.

 .. method:: getPageFontList(pno, full=False)

-PDF only: Return a list of all fonts referenced by the page.
+PDF only: Return a list of all fonts (directly or indirectly) referenced by the page.

 :arg int pno: page number, 0-based, -inf < pno < pageCount.
-:arg bool full: whether to also include the invoker's :data:`xref` (which is zero if directly referenced by the page).
+:arg bool full: whether to also include the referencer's :data:`xref`. If *True*, the returned items are one entry longer. Use this option if you need to know, whether the page directly references the font. In this case the last entry is 0. If the font is referenced by an ``/XObject`` of the page, you will find its :data:`xref` here.

 :rtype: list

 :returns: a list of fonts referenced by this page. Each entry looks like
-
-**(xref, ext, type, basefont, name, encoding, invoker)**,
+
+**(xref, ext, type, basefont, name, encoding, referencer)**,

 where

@@ -457,7 +457,7 @@ For details on **embedded files** refer to Appendix 3.
 * **basefont** (*str*) is the base font name,
 * **name** (*str*) is the symbolic name, by which the font is referenced
 * **encoding** (*str*) the font's character encoding if different from its built-in encoding (:ref:`AdobeManual`, p. 414):
-* **invoker** (*int* optional) the :data:`xref` of the invoker. Zero if directly referenced by the page. Only present if *full=True*.
+* **referencer** (*int* optional) the :data:`xref` of the referencer. Zero if directly referenced by the page, otherwise the xref of an XObject. Only present if *full=True*.

 Example::

@@ -469,7 +469,7 @@ For details on **embedded files** refer to Appendix 3.
 [28, 'ttf', 'TrueType', 'NOHSJV+Calibri-Light', 'R12', '']
 [8, 'ttf', 'Type0', 'ECPLRU+Calibri', 'R23', 'Identity-H']

-.. note:: This list has no duplicate entries: the combination of :data:`xref` and *name* is unique. But by themselves, each of the two may occur multiple times. Duplicate *name* entries indicate the presence of "Form XObjects" on the page, e.g. generated by :meth:`Page.showPDFpage`.
+.. note:: This list has no duplicate entries: the combination of :data:`xref`, *name* and *referencer* is unique.

 .. method:: getPageText(pno, output="text")
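
Reviewer sketch (not part of the commit): the revised note says that the combination of xref, *name* and *referencer* is unique. A minimal way to check that property, and to pick out the fonts the page references directly, might look like the code below. Both helper names are hypothetical, and the referencer values appended to the two example entries from the documentation (0, 0 and 40) are invented for illustration.

```python
def directly_referenced(font_list):
    """Return the entries whose referencer (last item) is 0, i.e. the page."""
    return [font for font in font_list if font[-1] == 0]


def has_duplicates(font_list):
    """True if any (xref, name, referencer) combination occurs twice.

    Assumes 7-item entries as returned with full=True:
    (xref, ext, type, basefont, name, encoding, referencer).
    """
    keys = [(font[0], font[4], font[6]) for font in font_list]
    return len(keys) != len(set(keys))


# Example entries from the documentation, extended with hypothetical
# referencer values; xref 40 stands for an imaginary XObject.
sample_fonts = [
    (28, "ttf", "TrueType", "NOHSJV+Calibri-Light", "R12", "", 0),
    (8, "ttf", "Type0", "ECPLRU+Calibri", "R23", "Identity-H", 0),
    (8, "ttf", "Type0", "ECPLRU+Calibri", "R23", "Identity-H", 40),
]
page_fonts = directly_referenced(sample_fonts)
duplicates_found = has_duplicates(sample_fonts)
```

In this sample, xref 8 and name ``R23`` both occur twice, yet the entries still differ in their referencer, which matches the wording of the new note.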

docs/faq.rst

+2 -45
@@ -560,7 +560,7 @@ This script will take a document filename and generate a text file from all of i

 The document can be any supported type like PDF, XPS, etc.

-The script works as a command line tool which expects the document filename supplied as a parameter. It generates one text file named "filename.txt" in the script directory. Text of pages is separated by a line "-----"::
+The script works as a command line tool which expects the document filename supplied as a parameter. It generates one text file named "filename.txt" in the script directory. Text of pages is separated by a form feed character::

 import sys, fitz
 fname = sys.argv[1] # get document filename
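
Reviewer sketch (not part of the commit): with a form feed as the page separator, the generated text file can be split back into pages. The rest of the script is not visible in this hunk, so the code assumes that one form feed follows every page, including the last one; ``join_pages`` and ``split_pages`` are hypothetical helper names.

```python
FORM_FEED = "\f"  # ASCII 12, the separator named in the revised sentence


def join_pages(page_texts):
    """Concatenate page texts, writing a form feed after every page."""
    return "".join(text + FORM_FEED for text in page_texts)


def split_pages(document_text):
    """Split text produced by join_pages back into per-page strings."""
    pages = document_text.split(FORM_FEED)
    # The form feed after the last page leaves one empty trailing piece.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


pages = split_pages(join_pages(["first page\n", "second page\n"]))
```

Unlike a line of dashes, a form feed is unlikely to occur inside ordinary page text, which makes splitting less ambiguous.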
@@ -588,50 +588,7 @@ See the following two section for examples and further explanations.

 How to Extract Text from within a Rectangle
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Please refer to the script `textboxtract.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/textboxtract.py>`_.
-
-It demonstrates ways to extract text contained in the following red rectangle,
-
-.. image:: images/img-textboxtract.png
-:scale: 75
-
-.. highlight:: text
-
-by using more or less restrictive conditions to find the relevant words::
-
-Select the words strictly contained in rectangle
-------------------------------------------------
-Die Altersübereinstimmung deutete darauf hin,
-engen, nur 50 Millionen Jahre großen
-Gesteinshagel auf den Mond traf und dabei
-hinterließ – einige größer als Frankreich.
-es sich um eine letzte, infernalische Welle
-Geburt des Sonnensystems. Daher tauften die
-das Ereignis »lunare Katastrophe«. Später
-die Bezeichnung Großes Bombardement durch.
-
-Or, more forgiving, respectively::
-
-Select the words intersecting the rectangle
--------------------------------------------
-Die Altersübereinstimmung deutete darauf hin, dass
-einem engen, nur 50 Millionen Jahre großen Zeitfenster
-ein Gesteinshagel auf den Mond traf und dabei unzählige
-Krater hinterließ – einige größer als Frankreich. Offenbar
-handelte es sich um eine letzte, infernalische Welle nach
-der Geburt des Sonnensystems. Daher tauften die Caltech-
-Forscher das Ereignis »lunare Katastrophe«. Später setzte
-sich die Bezeichnung Großes Bombardement durch.
-
-The latter output also includes words *intersecting* the rectangle.
-
-.. highlight:: python
-
-What if your **rectangle spans across more than one page**? Follow this recipe:
-
-* Create a common list of all words of all pages which your rectangle intersects.
-* When adding word items to this common list, increase their **y-coordinates** by the accumulated height of all previous pages.
-
+There is now (v1.18.0) more than one way to achieve this. We therefore have created a `folder <https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction>`_ in the PyMuPDF-Utilities repository specifically dealing with this topic.

 ----------
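
Reviewer sketch (not part of the commit): the FAQ text removed here distinguished words strictly contained in a rectangle from words merely intersecting it. That distinction can be illustrated on plain tuples as follows. The helper names, the rectangle and the word coordinates are hypothetical; the code assumes word items that start with ``(x0, y0, x1, y1, text)`` and is not taken from the scripts in the linked folder.

```python
def contains(rect, box):
    """True if box lies completely inside rect; both are (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = rect
    x0, y0, x1, y1 = box
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def intersects(rect, box):
    """True if box and rect overlap in an area of non-zero size."""
    rx0, ry0, rx1, ry1 = rect
    x0, y0, x1, y1 = box
    return x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1


def select_words(words, rect, strict=True):
    """Return the text of the words contained in (or intersecting) rect."""
    test = contains if strict else intersects
    return [word[4] for word in words if test(rect, word[:4])]


# Hypothetical rectangle and word boxes, invented for illustration.
rect = (100, 100, 200, 120)
words = [
    (90, 102, 130, 114, "einem"),
    (135, 102, 170, 114, "engen,"),
    (175, 102, 210, 114, "nur"),
    (300, 102, 330, 114, "Mond"),
]
strict_words = select_words(words, rect, strict=True)
loose_words = select_words(words, rect, strict=False)
```

The strict test keeps only words that lie fully inside the rectangle, while the intersecting test also keeps words cut by its border, which mirrors the two outputs shown in the removed text.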
