Commit 9c7c8d3

Merge pull request #132 from RapidAI/align_to_rapid_table
Align to rapid table
2 parents f492e8f + 0bb5bb0 commit 9c7c8d3

42 files changed, +3108 -1544 lines changed

.github/workflows/lineless_table_rec.yml

Lines changed: 2 additions & 10 deletions
@@ -29,11 +29,7 @@ jobs:
         run: |
           pip install -r requirements.txt
           pip install pytest
-
-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
-          unzip lineless_table_rec_models.zip
-          mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
-
+          pip install rapidocr
           pytest tests/test_lineless_table_rec.py

   GenerateWHL_PushPyPi:
@@ -54,11 +50,7 @@ jobs:
           pip install -r requirements.txt
           python -m pip install --upgrade pip
           pip install wheel get_pypi_latest_version
-
-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
-          unzip lineless_table_rec_models.zip
-          mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
-
+          pip install rapidocr
           python setup_lineless.py bdist_wheel "${{ github.ref_name }}"

       # - name: Publish distribution 📦 to Test PyPI

.github/workflows/table_cls.yml

Lines changed: 0 additions & 8 deletions
@@ -29,10 +29,6 @@ jobs:
           pip install -r requirements.txt
           pip install pytest beautifulsoup4

-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
-          unzip table_cls_models.zip
-          mv table_cls_models/*.onnx table_cls/models/
-
           pytest tests/test_table_cls.py

   GenerateWHL_PushPyPi:
@@ -54,10 +50,6 @@ jobs:
           python -m pip install --upgrade pip
           pip install wheel get_pypi_latest_version

-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
-          unzip table_cls_models.zip
-          mv table_cls_models/*.onnx table_cls/models/
-
           python setup_table_cls.py bdist_wheel "${{ github.ref_name }}"

       - name: Publish distribution 📦 to PyPI

.github/workflows/wired_table_rec.yml

Lines changed: 2 additions & 10 deletions
@@ -28,11 +28,7 @@ jobs:
         run: |
           pip install -r requirements.txt
           pip install pytest beautifulsoup4
-
-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
-          unzip wired_table_rec_models.zip
-          mv wired_table_rec_models/*.onnx wired_table_rec/models/
-
+          pip install rapidocr
           pytest tests/test_wired_table_rec.py

   GenerateWHL_PushPyPi:
@@ -53,11 +49,7 @@ jobs:
           pip install -r requirements.txt
           python -m pip install --upgrade pip
           pip install wheel get_pypi_latest_version
-
-          wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
-          unzip wired_table_rec_models.zip
-          mv wired_table_rec_models/*.onnx wired_table_rec/models/
-
+          pip install rapidocr
           python setup_wired.py bdist_wheel "${{ github.ref_name }}"

       - name: Publish distribution 📦 to PyPI
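
Review note: all three workflows drop the `wget`/`unzip`/`mv` steps, which is consistent with the README changelog entry in this commit about automatic model download. The packages' actual download logic is not visible in this part of the diff. The sketch below only illustrates the general download-if-missing pattern; `ensure_model` is a hypothetical helper, and the injected `fetch` callable stands in for a real HTTP download.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_model(model_path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return model_path, calling fetch(model_path) only when the file is missing."""
    if not model_path.exists():
        model_path.parent.mkdir(parents=True, exist_ok=True)
        fetch(model_path)
    return model_path


calls: List[Path] = []


def fake_fetch(dest: Path) -> None:
    # Stand-in for a real download (assumption): writes placeholder bytes.
    calls.append(dest)
    dest.write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "example.onnx"
    ensure_model(target, fake_fetch)  # file missing, so fetch runs
    ensure_model(target, fake_fetch)  # file present, so fetch is skipped
    fetch_count = len(calls)

print(fetch_count)
```

With this pattern a CI job needs no explicit download step, because the first inference call populates the model directory.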

README.md

Lines changed: 130 additions & 66 deletions
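@@ -15,12 +15,16 @@
 </div>

 ### Recent updates
-- **2024.11.22**
-  - Support a single-character matching scheme; requires RapidOCR>=1.4.0
 - **2024.12.25**
   - Add document dewarping/deblurring/shadow removal/binarization, usable as a preprocessing step [RapidUnDistort](https://github.com/Joker1212/RapidUnWrap)
 - **2025.1.9**
-  - RapidTable now supports the unitable model, with higher accuracy and torch inference; evaluation data added
+  - RapidTable now supports the unitable model, with higher accuracy and torch inference; evaluation data added
+- **2025.3.30**
+  - Align input/output format with RapidTable
+  - Support automatic model download
+  - Add a new table classification model from paddle
+  - Add evaluation scores for the latest PaddleX table recognition model
+  - Support rapidocr 2.0 and remove duplicate OCR detection

 ### Introduction
 💖 This repository is an inference library for structured recognition of tables in documents. It includes the wired and lineless table recognition models from Alibaba DuGuang, a wired table model contributed by llaipython (WeChat), the table classification model built into NetEase Qanything, and more.\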
@@ -54,18 +58,19 @@
 Surya-Tabled uses its built-in OCR module; its table model recognizes rows and columns only and cannot detect merged cells, which lowers its score

 | Method | TEDS | TEDS-only-structure |
-|:---------------------------------------------------------------------------------------------------------|:-----------:|:-------------------:|
-| [surya-tabled(--skip-detect)](https://github.com/VikParuchuri/tabled) | 0.33437 | 0.65865 |
-| [surya-tabled](https://github.com/VikParuchuri/tabled) | 0.33940 | 0.67103 |
-| [deepdoctection(table-transformer)](https://github.com/deepdoctection/deepdoctection?tab=readme-ov-file) | 0.59975 | 0.69918 |
-| [ppstructure_table_master](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.61606 | 0.73892 |
-| [ppsturcture_table_engine](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.67924 | 0.78653 |
-| [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) | 0.67310 | 0.81210 |
-| [RapidTable(SLANet)](https://github.com/RapidAI/RapidTable) | 0.71654 | 0.81067 |
-| table_cls + wired_table_rec v1 + lineless_table_rec | 0.75288 | 0.82574 |
-| table_cls + wired_table_rec v2 + lineless_table_rec | 0.77676 | 0.84580 |
-| [RapidTable(SLANet-plus)](https://github.com/RapidAI/RapidTable) | 0.84481 | 0.91369 |
-| [RapidTable(unitable)](https://github.com/RapidAI/RapidTable) | **0.86200** | **0.91813** |
+|:---------------------------------------------------------------------------------------------------------|:-----------:|:-----------------:|
+| [surya-tabled(--skip-detect)](https://github.com/VikParuchuri/tabled) | 0.33437 | 0.65865 |
+| [surya-tabled](https://github.com/VikParuchuri/tabled) | 0.33940 | 0.67103 |
+| [deepdoctection(table-transformer)](https://github.com/deepdoctection/deepdoctection?tab=readme-ov-file) | 0.59975 | 0.69918 |
+| [ppstructure_table_master](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.61606 | 0.73892 |
+| [ppsturcture_table_engine](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.67924 | 0.78653 |
+| [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) | 0.67310 | 0.81210 |
+| [RapidTable(SLANet)](https://github.com/RapidAI/RapidTable) | 0.71654 | 0.81067 |
+| table_cls + wired_table_rec v1 + lineless_table_rec | 0.75288 | 0.82574 |
+| table_cls + wired_table_rec v2 + lineless_table_rec | 0.77676 | 0.84580 |
+| [PaddleX(SLANetXt+RT-DERT)](https://github.com/PaddlePaddle/PaddleX) | 0.79900 | **0.92222** |
+| [RapidTable(SLANet-plus)](https://github.com/RapidAI/RapidTable) | 0.84481 | 0.91369 |
+| [RapidTable(unitable)](https://github.com/RapidAI/RapidTable) | **0.86200** | 0.91813 |

 ### Usage recommendations
 wired_table_rec_v2 (highest accuracy for wired tables): wired tables in general scenarios (papers, magazines, journals, receipts, invoices, bills)
@@ -75,63 +80,93 @@ wired_table_rec_v2 works best on images within 1500px, so resolutions exceeding
 SLANet-plus/unitable (highest overall accuracy): tables in document scenarios (tables in papers, magazines, and journals)

 ### Installation
-
+rapidocr 2.0 and later supports switching between multiple engines such as torch, onnx, paddle, and openvino; see the [rapidocr documentation](https://rapidai.github.io/RapidOCRDocs/main/install_usage/rapidocr/usage/) for details
 ``` python {linenos=table}
 pip install wired_table_rec lineless_table_rec table_cls
+pip install rapidocr
 ```

 ### Quick start
-
+> ⚠️ Note: from `wired_table_rec/table_cls` >= 1.2.0 and `lineless_table_rec` > 0.1.0 onward, input and output use exactly the same format as RapidTable
 ``` python {linenos=table}
-import os
+from pathlib import Path

-from lineless_table_rec import LinelessTableRecognition
-from lineless_table_rec.utils_table_recover import format_html, plot_rec_box_with_logic_info, plot_rec_box
+from wired_table_rec.utils.utils import VisTable
 from table_cls import TableCls
-from wired_table_rec import WiredTableRecognition
-from rapidocr_onnxruntime import RapidOCR
-
-lineless_engine = LinelessTableRecognition()
-wired_engine = WiredTableRecognition()
-# Default: small yolo model (0.1s); can switch to the more accurate yolox (0.25s) or the faster qanything (0.07s) model
-table_cls = TableCls() # TableCls(model_type="yolox"),TableCls(model_type="q")
-img_path = f'images/img14.jpg'
-
-cls,elasp = table_cls(img_path)
-if cls == 'wired':
-    table_engine = wired_engine
-else:
-    table_engine = lineless_engine
-
-html, elasp, polygons, logic_points, ocr_res = table_engine(img_path)
-print(f"elasp: {elasp}")
-
-# Use another OCR model
-#ocr_engine =RapidOCR(det_model_path="xxx/det_server_infer.onnx",rec_model_path="xxx/rec_server_infer.onnx")
-#ocr_res, _ = ocr_engine(img_path)
-#html, elasp, polygons, logic_points, ocr_res = table_engine(img_path, ocr_result=ocr_res)
-# output_dir = f'outputs'
-# complete_html = format_html(html)
-# os.makedirs(os.path.dirname(f"{output_dir}/table.html"), exist_ok=True)
-# with open(f"{output_dir}/table.html", "w", encoding="utf-8") as file:
-# file.write(complete_html)
-# # Visualize table recognition boxes + logical row/column info
-# plot_rec_box_with_logic_info(
-# img_path, f"{output_dir}/table_rec_box.jpg", logic_points, polygons
-# )
-# # Visualize OCR recognition boxes
-# plot_rec_box(img_path, f"{output_dir}/ocr_box.jpg", ocr_res)
+from wired_table_rec.main import WiredTableInput, WiredTableRecognition
+from lineless_table_rec.main import LinelessTableInput, LinelessTableRecognition
+from rapidocr import RapidOCR
+
+
+if __name__ == "__main__":
+    # Init
+    wired_input = WiredTableInput()
+    lineless_input = LinelessTableInput()
+    wired_engine = WiredTableRecognition(wired_input)
+    lineless_engine = LinelessTableRecognition(lineless_input)
+    viser = VisTable()
+    # Default: small yolo model (0.1s); can switch to the more accurate yolox (0.25s), the faster qanything (0.07s) model, or the paddle model (0.03s)
+    table_cls = TableCls()
+    img_path = f"tests/test_files/table.jpg"
+
+    cls, elasp = table_cls(img_path)
+    if cls == "wired":
+        table_engine = wired_engine
+    else:
+        table_engine = lineless_engine
+
+    # Use RapidOCR output as input
+    ocr_engine = RapidOCR()
+    rapid_ocr_output = ocr_engine(img_path, return_word_box=True)
+    ocr_result = list(
+        zip(rapid_ocr_output.boxes, rapid_ocr_output.txts, rapid_ocr_output.scores)
+    )
+    table_results = table_engine(
+        img_path, ocr_result=ocr_result
+    )
+
+    # Use single-character recognition
+    # word_results = rapid_ocr_output.word_results
+    # ocr_result = [
+    #     [word_result[2], word_result[0], word_result[1]] for word_result in word_results
+    # ]
+    # table_results = table_engine(
+    #     img_path, ocr_result=ocr_result, enhance_box_line=False
+    # )
+
+    # Save
+    # save_dir = Path("outputs")
+    # save_dir.mkdir(parents=True, exist_ok=True)
+    #
+    # save_html_path = f"outputs/{Path(img_path).stem}.html"
+    # save_drawed_path = f"outputs/{Path(img_path).stem}_table_vis{Path(img_path).suffix}"
+    # save_logic_path = (
+    #     f"outputs/{Path(img_path).stem}_table_vis_logic{Path(img_path).suffix}"
+    # )
+
+    # Visualize table rec result
+    # vis_imged = viser(
+    #     img_path, table_results, save_html_path, save_drawed_path, save_logic_path
+    # )
+
+
+
+
+
 ```

 #### Single-character OCR matching
+
 ```python
 # Convert single-character boxes into the same structure as line-level recognition
-from rapidocr_onnxruntime import RapidOCR
-from wired_table_rec.utils_table_recover import trans_char_ocr_res
+from rapidocr import RapidOCR
 img_path = "tests/test_files/wired/table4.jpg"
-ocr_engine =RapidOCR()
-ocr_res, _ = ocr_engine(img_path, return_word_box=True)
-ocr_res = trans_char_ocr_res(ocr_res)
+ocr_engine = RapidOCR()
+rapid_ocr_output = ocr_engine(img_path, return_word_box=True)
+word_results = rapid_ocr_output.word_results
+ocr_result = [
+    [word_result[2], word_result[0], word_result[1]] for word_result in word_results
+]
 ```

 #### Table rotation and perspective correction
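
Review note: the single-character path in this hunk reorders every entry of `word_results` before handing it to the table engine. A minimal sketch of that reshaping on made-up data follows. It assumes each entry is laid out as `(text, score, box)` and that the engine expects `[box, text, score]`; this is inferred from the index order in the diff and from the `zip(boxes, txts, scores)` call, not confirmed against the RapidOCR documentation. `to_table_ocr_result` is a hypothetical helper name.

```python
from typing import Any, List, Sequence


def to_table_ocr_result(word_results: Sequence[Sequence[Any]]) -> List[List[Any]]:
    """Reorder (text, score, box) entries into [box, text, score]."""
    return [[entry[2], entry[0], entry[1]] for entry in word_results]


# Made-up sample standing in for rapid_ocr_output.word_results (assumed layout).
sample_word_results = [
    ("A", 0.98, [[0, 0], [10, 0], [10, 12], [0, 12]]),
    ("B", 0.95, [[12, 0], [22, 0], [22, 12], [12, 12]]),
]

ocr_result = to_table_ocr_result(sample_word_results)
print(ocr_result[0])
```

Keeping the reorder in one small function makes it easy to adjust if the entry layout differs between rapidocr versions.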
@@ -177,24 +212,53 @@ for i, res in enumerate(result):

 ### Core parameters
 ```python
-wired_table_rec = WiredTableRecognition()
-html, elasp, polygons, logic_points, ocr_res = wired_table_rec(
+# Input (WiredTableInput/LinelessTableInput)
+@dataclass
+class WiredTableInput:
+    model_type: Optional[str] = "unet" #unet/cycle_center_net
+    model_path: Union[str, Path, None, Dict[str, str]] = None
+    use_cuda: bool = False
+    device: str = "cpu"
+
+@dataclass
+class LinelessTableInput:
+    model_type: Optional[str] = "lore" #lore
+    model_path: Union[str, Path, None, Dict[str, str]] = None
+    use_cuda: bool = False
+    device: str = "cpu"
+
+# Output (WiredTableOutput/LinelessTableOutput)
+@dataclass
+class WiredTableOutput:
+    pred_html: Optional[str] = None
+    cell_bboxes: Optional[np.ndarray] = None
+    logic_points: Optional[np.ndarray] = None
+    elapse: Optional[float] = None
+
+@dataclass
+class LinelessTableOutput:
+    pred_html: Optional[str] = None
+    cell_bboxes: Optional[np.ndarray] = None
+    logic_points: Optional[np.ndarray] = None
+    elapse: Optional[float] = None
+```
+
+```python
+wired_table_rec = WiredTableRecognition(WiredTableInput())
+table_results = wired_table_rec(
     img, # Image: Union[str, np.ndarray, bytes, Path, PIL.Image.Image]
     ocr_result, # RapidOCR recognition result; if omitted, the internal rapidocr model is used
-    version="v2", # Defaults to the v2 line-frame model; set to v1 to switch to the Alibaba DuGuang model
     enhance_box_line=True, # Enhanced box splitting (disable to avoid extra splits, enable to reduce missed splits); default True
     col_threshold=15, # Boxes whose left-edge x coordinates differ by less than col_threshold are treated as the same column
     row_threshold=10, # Boxes whose top-edge y coordinates differ by less than row_threshold are treated as the same row
     rotated_fix=True, # Supported by wiredV2: corrects mild rotation (-45° to 45°); default True
     need_ocr=True, # Whether to run OCR; default True
-    rec_again=True, # Whether to crop and re-recognize table cells where no text was detected; default True
 )
-lineless_table_rec = LinelessTableRecognition()
-html, elasp, polygons, logic_points, ocr_res = lineless_table_rec(
+lineless_table_rec = LinelessTableRecognition(LinelessTableInput())
+table_results = lineless_table_rec(
     img, # Image: Union[str, np.ndarray, bytes, Path, PIL.Image.Image]
     ocr_result, # RapidOCR recognition result; if omitted, the internal rapidocr model is used
     need_ocr=True, # Whether to run OCR; default True
-    rec_again=True, # Whether to crop and re-recognize table cells where no text was detected; default True
 )
 ```
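
Review note: the `col_threshold` and `row_threshold` comments describe treating boxes as the same column or row when their edge coordinates differ by less than a threshold. The sketch below shows one simple way such grouping can work on made-up coordinates. It is an illustration under that reading of the comments, not the library's implementation, and `group_by_threshold` is a hypothetical name. Each value is compared with the first member of the current group after sorting.

```python
from typing import List


def group_by_threshold(coords: List[float], threshold: float) -> List[List[float]]:
    """Group sorted coordinates; a value joins the current group when it is
    less than `threshold` away from that group's first member."""
    groups: List[List[float]] = []
    for value in sorted(coords):
        if groups and value - groups[-1][0] < threshold:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


# Left-edge x coordinates of six hypothetical cell boxes.
left_edges = [12, 101, 15, 98, 205, 9]
columns = group_by_threshold(left_edges, threshold=15)
print(columns)
```

A larger threshold merges slightly misaligned cells into one column, at the cost of possibly merging genuinely separate narrow columns.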

@@ -225,7 +289,7 @@ html, elasp, polygons, logic_points, ocr_res = lineless_table_rec(
 ```mermaid
 flowchart TD
     A[/Table image/] --> B([Table classification table_cls])
-    B --> C([Wired table recognition wired_table_rec]) & D([Lineless table recognition lineless_table_rec]) --> E([Text recognition rapidocr_onnxruntime])
+    B --> C([Wired table recognition wired_table_rec]) & D([Lineless table recognition lineless_table_rec]) --> E([Text recognition rapidocr])
     E --> F[/Structured HTML output/]
 ```
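
Review note: the flowchart routes an image through classification and then to either the wired or the lineless engine. A minimal sketch of that dispatch follows, using stub callables in place of the real engines so that it runs without any models. `TableOutput` mirrors the four output fields listed in the README diff (with plain lists instead of `np.ndarray` to avoid a numpy dependency); `recognize` and the stub functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class TableOutput:
    # Mirrors the output fields listed in the README diff.
    pred_html: Optional[str] = None
    cell_bboxes: Optional[list] = None
    logic_points: Optional[list] = None
    elapse: Optional[float] = None


def recognize(
    img_path: str,
    classify: Callable[[str], Tuple[str, float]],
    engines: Dict[str, Callable[[str], TableOutput]],
) -> TableOutput:
    """Classify the table, then dispatch to the matching engine."""
    label, _elapse = classify(img_path)
    engine = engines["wired"] if label == "wired" else engines["lineless"]
    return engine(img_path)


def stub_classify(img_path: str) -> Tuple[str, float]:
    # Stand-in for TableCls: decides from the file name only.
    return ("wired" if "wired" in img_path else "lineless", 0.0)


stub_engines = {
    "wired": lambda p: TableOutput(pred_html="<table>wired</table>"),
    "lineless": lambda p: TableOutput(pred_html="<table>lineless</table>"),
}

wired_out = recognize("wired_sample.jpg", stub_classify, stub_engines)
other_out = recognize("scan.jpg", stub_classify, stub_engines)
print(wired_out.pred_html, other_out.pred_html)
```

Returning one dataclass from both engines is what lets callers read `pred_html` without caring which branch ran.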
