You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/document.rst
+2-2
Original file line number
Diff line number
Diff line change
@@ -1366,13 +1366,13 @@ For details on **embedded files** refer to Appendix 3.
1366
1366
1367
1367
PDF only: Insert an empty page.
1368
1368
1369
-
:arg int pno: page number in front of which the new page should be inserted. Must be in *1 < pno <= page_count*. Special values -1 and *doc.page_count* insert **after** the last page.
1369
+
:arg int pno: page number in front of which the new page should be inserted. Must be in `1 < pno <= page_count`. Special values -1 and *doc.page_count* insert **after** the last page.
1370
1370
1371
1371
:arg float width: page width.
1372
1372
:arg float height: page height.
1373
1373
1374
1374
:rtype::ref:`Page`
1375
-
:returns: the created page object.
1375
+
:returns: the created page object. Be aware that the page numbers of pages after the inserted one will have changed after method execution. For the same reason, **all existing page objects will be invalidated.** Using them will lead to exceptions.
Read the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.
* To always read full pages **(default)**, use `margins=0`.
48
48
49
49
:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:
50
50
@@ -62,23 +62,27 @@ The |PyMuPDF4LLM| API
62
62
63
63
- **"words"** - if `extract_words=True` was used. This is a list of tuples `(x0, y0, x1, y1, "wordstring", bno, lno, wno)` as delivered by `page.get_text("words")`. The **sequence** of these tuples however is the same as produced in the markdown text string and thus honors multi-column text. This is also true for text in tables: words are extracted in the sequence of table row cells.
64
64
65
+
:arg str filename: (New in v.0.0.19) Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent name).
66
+
65
67
:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.
66
68
67
69
:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.
68
70
69
71
:arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.
70
72
71
-
:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Typically, scientific documents or pages simulating text using graphics commands may contain tens of thousands of these objects. As vector graphics are used for table detection mainly, analyzing pages of this kind may result in excessive runtimes. You can exclude problematic pages via for instance `graphics_limit=5000` or even a smaller value if desired. The respective pages will then be ignored and be represented by one message line in the output text.
73
+
:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Scientific documents, or pages simulating text via graphics commands may contain tens of thousands of these objects. As vector graphics are analyzed for multiple purposes, runtime may quickly become intolerable. With this parameter, all vector graphics will be ignored if their count exceeds the threshold. **Changed in v0.0.19:** The page will still be processed, and text, tables and images should be extracted.
72
74
73
-
:arg bool ignore_code: if `True` then mono-spaced text does not receive special formatting treatment. Code blocks will no longer be generated. This value is set to `True` if `extract_words=True` is used.
75
+
:arg bool ignore_code: if `True` then mono-spaced text does not receive special formatting. Code blocks will no longer be generated. This value is set to `True` if `extract_words=True` is used.
74
76
75
77
:arg bool extract_words: a value of `True` enforces `page_chunks=True` and adds key "words" to each page dictionary. Its value is a list of words as delivered by PyMuPDF's `Page` method `get_text("words")`. The sequence of the words in this list is the same as the extracted text.
76
78
77
-
:arg bool show_progress: a value of `True` (the default) displays a text-based progress bar as pages are being converted to Markdown. It will look similar to the following::
79
+
:arg bool show_progress: Default is `False`. A value of `True` displays a text-based progress bar as pages are being converted to Markdown. It will look similar to the following::
78
80
79
81
Processing input.pdf...
80
82
[==================== ] (148/291)
81
83
84
+
:arg bool use_glyphs: (New in v.0.0.19) Default is `False`. A value of `True` will use the glyph number of the characters instead of the character itself.
85
+
82
86
:returns: Either a string of the combined text of all selected document pages, or a list of dictionaries.
83
87
84
88
.. method:: LlamaMarkdownReader(*args, **kwargs)
@@ -103,6 +107,9 @@ The |PyMuPDF4LLM| API
103
107
104
108
:returns: a list of `LlamaIndexDocument` documents - one for each page.
105
109
110
+
-----
111
+
112
+
For a list of changes, please see file `CHANGES.md <https://github.com/pymupdf/RAG/blob/main/CHANGES.md>`_.
0 commit comments