Commit e4a2d2b

Merge pull request #821 from datalab-to/dev
Bump surya version
2 parents 2f7f197 + 0cf939b commit e4a2d2b

26 files changed, +166 -638 lines changed

README.md

Lines changed: 5 additions & 6 deletions
@@ -80,8 +80,8 @@ pip install marker-pdf[full]
 First, some configuration:

 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
-- Some PDFs, even digital ones, have bad text in them. Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
-- If you care about inline math, set `format_lines` to automatically convert inline math to LaTeX.
+- Some PDFs, even digital ones, have bad text in them. Set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+- If you care about inline math, set `force_ocr` to convert inline math to LaTeX.

 ## Interactive App

@@ -106,8 +106,7 @@ Options:
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
-- `--format_lines`: Reformat all lines using a local OCR model (inline math, underlines, bold, etc.). This will give very good quality math output.
-- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
+- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
 - `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
 - `--redo_inline_math`: If you want the absolute highest quality inline math conversion, use this along with `--use_llm`.
@@ -232,7 +231,7 @@ marker_single FILENAME --use_llm --force_layout_block Table --converter_cls mark

 ### OCR Only

-If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes. You can also set `--force_ocr` and `--format_lines` with this converter.
+If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes.

 ```python
 from marker.converters.ocr import OCRConverter
@@ -556,4 +555,4 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
 - Very complex layouts, with nested tables and forms, may not work
 - Forms may not be rendered well

-Note: Passing the `--use_llm` and `--format_lines` flags will mostly solve these issues.
+Note: Passing the `--use_llm` and `--force_ocr` flags will mostly solve these issues.
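
Review note: code that still passes `format_lines` in a config dict will need updating once this lands. The sketch below shows one way a caller might migrate such a dict. `migrate_marker_config` is a hypothetical helper, not part of marker, and mapping a truthy `format_lines` to `force_ocr=True` is an assumption based only on the README wording in this diff.

```python
def migrate_marker_config(config: dict) -> dict:
    """Return a copy of a config dict without the removed `format_lines` key.

    Hypothetical helper (not part of marker). Assumption: callers who set
    `format_lines` wanted formatted lines and inline math, which the updated
    README attributes to `force_ocr`.
    """
    migrated = dict(config)  # shallow copy so the caller's dict is untouched
    wanted_line_formatting = bool(migrated.pop("format_lines", False))
    if wanted_line_formatting:
        migrated["force_ocr"] = True
    return migrated


old_config = {"format_lines": True, "page_range": [0, 1]}
print(migrate_marker_config(old_config))
# → {'page_range': [0, 1], 'force_ocr': True}
```

A config that never set `format_lines` passes through unchanged, so the helper is safe to apply unconditionally.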

benchmarks/throughput/main.py

Lines changed: 0 additions & 5 deletions
@@ -25,7 +25,6 @@ def get_next_pdf(ds: datasets.Dataset, i: int):

 def single_batch(
     batch_size: int,
-    format_lines: bool,
     num_threads: int,
     force_ocr: bool,
     quantize: bool,
@@ -83,7 +82,6 @@ def single_batch(
         artifact_dict=model_dict,
         config={
             "disable_tqdm": worker_id > 0,
-            "format_lines": format_lines,
             "page_range": page_range,
             "force_ocr": force_ocr,
         },
@@ -104,14 +102,12 @@ def single_batch(
 @click.command(help="Benchmark PDF to MD conversion throughput.")
 @click.option("--workers", default=1, help="Number of workers to use.")
 @click.option("--batch_size", default=1, help="Batch size for inference.")
-@click.option("--format_lines", is_flag=True, help="Format lines in the output.")
 @click.option("--force_ocr", is_flag=True, help="Force OCR on all pages.")
 @click.option("--quantize", is_flag=True, help="Use quantized model.")
 @click.option("--compile", is_flag=True, help="Use compiled model.")
 def main(
     workers: int,
     batch_size: int,
-    format_lines: bool,
     force_ocr: bool,
     quantize: bool,
     compile: bool,
@@ -127,7 +123,6 @@ def main(
             executor.submit(
                 single_batch,
                 batch_size,
-                format_lines,
                 cpus_per_worker,
                 force_ocr,
                 quantize,
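
Review note: `single_batch` receives these arguments positionally through `executor.submit`, so the parameter had to be dropped from the signature and the call site in the same position, as this commit does. A possible way to make such removals less fragile is to submit by keyword. The sketch below is illustrative only: `single_batch_stub` and `submit_batch` are hypothetical stand-ins that cover just the parameters visible in these hunks, and `ThreadPoolExecutor` is an assumption since the executor type is not shown in the diff.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


def single_batch_stub(
    batch_size: int, num_threads: int, force_ocr: bool, quantize: bool
) -> dict:
    # Hypothetical stand-in for the benchmark's single_batch: it only echoes
    # the arguments it received so the binding can be inspected.
    return {
        "batch_size": batch_size,
        "num_threads": num_threads,
        "force_ocr": force_ocr,
        "quantize": quantize,
    }


def submit_batch(
    executor: Executor,
    batch_size: int,
    cpus_per_worker: int,
    force_ocr: bool,
    quantize: bool,
) -> Future:
    # Keyword arguments bind by name, so dropping a parameter from the worker
    # signature cannot silently shift the remaining values.
    return executor.submit(
        single_batch_stub,
        batch_size=batch_size,
        num_threads=cpus_per_worker,
        force_ocr=force_ocr,
        quantize=quantize,
    )


with ThreadPoolExecutor(max_workers=1) as executor:
    result = submit_batch(executor, 1, 2, True, False).result()
print(result)
```

With keyword submission, a stale argument such as `format_lines=...` would raise a `TypeError` in the worker instead of being bound to the wrong parameter.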
