README.md
Lines changed: 5 additions & 6 deletions
@@ -80,8 +80,8 @@ pip install marker-pdf[full]
 First, some configuration:

 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
-- Some PDFs, even digital ones, have bad text in them. Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
-- If you care about inline math, set `format_lines` to automatically convert inline math to LaTeX.
+- Some PDFs, even digital ones, have bad text in them. Set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+- If you care about inline math, set `force_ocr` to convert inline math to LaTeX.

 ## Interactive App
@@ -106,8 +106,7 @@ Options:
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
-- `--format_lines`: Reformat all lines using a local OCR model (inline math, underlines, bold, etc.). This will give very good quality math output.
-- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
+- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
 - `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
 - `--redo_inline_math`: If you want the absolute highest quality inline math conversion, use this along with `--use_llm`.
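
Reviewer's sketch for the `--paginate_output` context line in the hunk above: going only by the option's description (`\n\n{PAGE_NUMBER}`, then 48 dashes, then `\n\n`), paginated output could be split back into pages roughly as follows. `PAGE_SEPARATOR`, `make_separator`, and `split_pages` are hypothetical names, not part of marker, and whether the braces around the page number are literal is an assumption left open here, so the pattern accepts both forms.

```python
import re

# Separator as described for --paginate_output: "\n\n{PAGE_NUMBER}", then 48
# dashes, then "\n\n". Assumption: the braces may or may not be literal in the
# real output, so both forms are accepted.
PAGE_SEPARATOR = re.compile(r"\n\n\{?(\d+)\}?-{48}\n\n")


def make_separator(page_number: int) -> str:
    """Build a separator in the brace-free form (illustrative only)."""
    return f"\n\n{page_number}" + "-" * 48 + "\n\n"


def split_pages(paginated: str) -> dict[int, str]:
    """Map each page number to the text that follows its separator."""
    parts = PAGE_SEPARATOR.split(paginated)
    # With one capture group, re.split yields:
    # [text_before_first_separator, page_no, page_text, page_no, page_text, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


sample = make_separator(0) + "First page" + make_separator(1) + "Second page"
pages = split_pages(sample)
```

Any text before the first separator is ignored by this sketch; a real consumer would need to decide how to handle it.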
-If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes. You can also set `--force_ocr` and `--format_lines` with this converter.
+If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes.

 ```python
 from marker.converters.ocr import OCRConverter
@@ -556,4 +555,4 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
 - Very complex layouts, with nested tables and forms, may not work
 - Forms may not be rendered well

-Note: Passing the `--use_llm` and `--format_lines` flags will mostly solve these issues.
+Note: Passing the `--use_llm` and `--force_ocr` flags will mostly solve these issues.
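
Reviewer's sketch of the flag set after this change: since `--format_lines` is removed and `--force_ocr` now covers inline math, a command line would combine `--use_llm` and `--force_ocr` only. The helper below just assembles arguments and an environment and executes nothing. The `marker_single` entry point and the helper names `build_marker_command` and `marker_env` are assumptions for illustration; none of them appear in this diff.

```python
import os
import shlex


def build_marker_command(pdf_path, output_dir=None, *, force_ocr=False,
                         use_llm=False, paginate_output=False):
    """Assemble an argument list using only flags named in this diff."""
    # Assumption: "marker_single" is the CLI entry point; it is not shown in the diff.
    cmd = ["marker_single", str(pdf_path)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    for enabled, flag in (
        (force_ocr, "--force_ocr"),
        (use_llm, "--use_llm"),
        (paginate_output, "--paginate_output"),
    ):
        if enabled:
            cmd.append(flag)
    return cmd


def marker_env(torch_device=None):
    """Copy the current environment, optionally overriding TORCH_DEVICE."""
    env = dict(os.environ)
    if torch_device is not None:
        env["TORCH_DEVICE"] = torch_device
    return env


cmd = build_marker_command("paper.pdf", "out", force_ocr=True, use_llm=True)
env = marker_env("cuda")
print(shlex.join(cmd))
```

To actually run it, the pair could be passed to `subprocess.run(cmd, env=env, check=True)`, assuming the CLI is installed under that name.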