</div>

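### Recent updates
- - **2024.11.22**
- - Added a single-character matching scheme; requires RapidOCR>=1.4.0
- **2024.12.25**
- Added document dewarping, deblurring, shadow removal, and binarization, which can be used as preprocessing: [RapidUnDistort](https://github.com/Joker1212/RapidUnWrap)
- **2025.1.9**
- - RapidTable now supports the unitable model, which is more accurate and supports torch inference; benchmark data added
+ - RapidTable now supports the unitable model, which is more accurate and supports torch inference; benchmark data added
+ - **2025.3.30**
+ - Input and output formats aligned with RapidTable
+ - Models are downloaded automatically
+ - Added a new table classification model from paddle
+ - Added benchmark scores for the latest PaddleX table recognition model
+ - Supports rapidocr 2.0 and removes the duplicate OCR detection pass
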
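### Introduction
💖 This repository is an inference library for structured recognition of tables in documents. It includes the wired and lineless table recognition models from Alibaba DuGuang, a wired table model contributed by llaipython (WeChat), and the table classification model built into NetEase Qanything.\

Surya-Tabled uses its built-in OCR module. Its table model recognizes rows and columns only and cannot detect merged cells, which lowers its scores.
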
| Method | TEDS | TEDS-only-structure |
- | :---------------------------------------------------------------------------------------------------------| :-----------:| :-------------------:|
- | [surya-tabled(--skip-detect)](https://github.com/VikParuchuri/tabled) | 0.33437 | 0.65865 |
- | [surya-tabled](https://github.com/VikParuchuri/tabled) | 0.33940 | 0.67103 |
- | [deepdoctection(table-transformer)](https://github.com/deepdoctection/deepdoctection?tab=readme-ov-file) | 0.59975 | 0.69918 |
- | [ppstructure_table_master](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.61606 | 0.73892 |
- | [ppstructure_table_engine](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.67924 | 0.78653 |
- | [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) | 0.67310 | 0.81210 |
- | [RapidTable(SLANet)](https://github.com/RapidAI/RapidTable) | 0.71654 | 0.81067 |
- | table_cls + wired_table_rec v1 + lineless_table_rec | 0.75288 | 0.82574 |
- | table_cls + wired_table_rec v2 + lineless_table_rec | 0.77676 | 0.84580 |
- | [RapidTable(SLANet-plus)](https://github.com/RapidAI/RapidTable) | 0.84481 | 0.91369 |
- | [RapidTable(unitable)](https://github.com/RapidAI/RapidTable) | **0.86200** | **0.91813** |
+ | :---------------------------------------------------------------------------------------------------------| :-----------:| :-----------------:|
+ | [surya-tabled(--skip-detect)](https://github.com/VikParuchuri/tabled) | 0.33437 | 0.65865 |
+ | [surya-tabled](https://github.com/VikParuchuri/tabled) | 0.33940 | 0.67103 |
+ | [deepdoctection(table-transformer)](https://github.com/deepdoctection/deepdoctection?tab=readme-ov-file) | 0.59975 | 0.69918 |
+ | [ppstructure_table_master](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.61606 | 0.73892 |
+ | [ppstructure_table_engine](https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure) | 0.67924 | 0.78653 |
+ | [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) | 0.67310 | 0.81210 |
+ | [RapidTable(SLANet)](https://github.com/RapidAI/RapidTable) | 0.71654 | 0.81067 |
+ | table_cls + wired_table_rec v1 + lineless_table_rec | 0.75288 | 0.82574 |
+ | table_cls + wired_table_rec v2 + lineless_table_rec | 0.77676 | 0.84580 |
+ | [PaddleX(SLANetXt+RT-DETR)](https://github.com/PaddlePaddle/PaddleX) | 0.79900 | **0.92222** |
+ | [RapidTable(SLANet-plus)](https://github.com/RapidAI/RapidTable) | 0.84481 | 0.91369 |
+ | [RapidTable(unitable)](https://github.com/RapidAI/RapidTable) | **0.86200** | 0.91813 |

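### Usage recommendations
wired_table_rec_v2 (highest accuracy on wired tables): wired tables in general scenarios (papers, magazines, journals, receipts, vouchers, bills)
@@ -75,63 +80,93 @@ wired_table_rec_v2 works best on images within 1500px in size, so resolutions over
SLANet-plus/unitable (highest overall accuracy): tables in document scenarios (tables in papers, magazines, and journals)
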
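The hunk context above says that wired_table_rec_v2 works best on images within 1500px. A minimal sketch of how a caller might compute a downscaled size before recognition is shown below. It assumes the limit applies to the longer side, which the text does not confirm; `MAX_SIDE` and `fit_within` are hypothetical names for illustration and are not part of wired_table_rec.

```python
from typing import Tuple

# Assumption: "within 1500px" is read as "longer side at most 1500 pixels".
MAX_SIDE = 1500


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side.

    Images that already fit are returned unchanged; aspect ratio is preserved.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a zero-sized side for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))


new_size = fit_within(3000, 1500)
```

The resulting size can be passed to any image library's resize call before handing the image to the table engine.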
### Installation
-
+ rapidocr 2.0 and later supports switching among multiple inference engines, including torch, onnx, paddle, and openvino. See the [rapidocr documentation](https://rapidai.github.io/RapidOCRDocs/main/install_usage/rapidocr/usage/) for details.
``` python {linenos=table}
pip install wired_table_rec lineless_table_rec table_cls
+ pip install rapidocr
```

### Quick start
-
+ > ⚠️ Note: from `wired_table_rec/table_cls` >=1.2.0 and `lineless_table_rec` > 0.1.0 onward, the input and output formats are identical to RapidTable's
``` python {linenos=table}
- import os
+ from pathlib import Path

- from lineless_table_rec import LinelessTableRecognition
- from lineless_table_rec.utils_table_recover import format_html, plot_rec_box_with_logic_info, plot_rec_box
+ from wired_table_rec.utils.utils import VisTable
from table_cls import TableCls
- from wired_table_rec import WiredTableRecognition
- from rapidocr_onnxruntime import RapidOCR
-
- lineless_engine = LinelessTableRecognition()
- wired_engine = WiredTableRecognition()
- # Default is the small yolo model (0.1s); can switch to the more accurate yolox (0.25s) or the faster qanything (0.07s) model
- table_cls = TableCls()  # TableCls(model_type="yolox"), TableCls(model_type="q")
- img_path = f'images/img14.jpg'
-
- cls, elasp = table_cls(img_path)
- if cls == 'wired':
-     table_engine = wired_engine
- else:
-     table_engine = lineless_engine
-
- html, elasp, polygons, logic_points, ocr_res = table_engine(img_path)
- print(f"elasp: {elasp}")
-
- # Use another OCR model
- # ocr_engine = RapidOCR(det_model_path="xxx/det_server_infer.onnx", rec_model_path="xxx/rec_server_infer.onnx")
- # ocr_res, _ = ocr_engine(img_path)
- # html, elasp, polygons, logic_points, ocr_res = table_engine(img_path, ocr_result=ocr_res)
- # output_dir = f'outputs'
- # complete_html = format_html(html)
- # os.makedirs(os.path.dirname(f"{output_dir}/table.html"), exist_ok=True)
- # with open(f"{output_dir}/table.html", "w", encoding="utf-8") as file:
- #     file.write(complete_html)
- # # Visualize table recognition boxes + logical row/column info
- # plot_rec_box_with_logic_info(
- #     img_path, f"{output_dir}/table_rec_box.jpg", logic_points, polygons
- # )
- # # Visualize OCR recognition boxes
- # plot_rec_box(img_path, f"{output_dir}/ocr_box.jpg", ocr_res)
+ from wired_table_rec.main import WiredTableInput, WiredTableRecognition
+ from lineless_table_rec.main import LinelessTableInput, LinelessTableRecognition
+ from rapidocr import RapidOCR
+
+
+ if __name__ == "__main__":
+     # Init
+     wired_input = WiredTableInput()
+     lineless_input = LinelessTableInput()
+     wired_engine = WiredTableRecognition(wired_input)
+     lineless_engine = LinelessTableRecognition(lineless_input)
+     viser = VisTable()
+     # Default is the small yolo model (0.1s); can switch to the more accurate yolox (0.25s), the faster qanything (0.07s) model, or the paddle model (0.03s)
+     table_cls = TableCls()
+     img_path = f"tests/test_files/table.jpg"
+
+     cls, elasp = table_cls(img_path)
+     if cls == "wired":
+         table_engine = wired_engine
+     else:
+         table_engine = lineless_engine
+
+     # Use RapidOCR output as the input
+     ocr_engine = RapidOCR()
+     rapid_ocr_output = ocr_engine(img_path, return_word_box=True)
+     ocr_result = list(
+         zip(rapid_ocr_output.boxes, rapid_ocr_output.txts, rapid_ocr_output.scores)
+     )
+     table_results = table_engine(
+         img_path, ocr_result=ocr_result
+     )
+
+     # Use single-character recognition
+     # word_results = rapid_ocr_output.word_results
+     # ocr_result = [
+     #     [word_result[2], word_result[0], word_result[1]] for word_result in word_results
+     # ]
+     # table_results = table_engine(
+     #     img_path, ocr_result=ocr_result, enhance_box_line=False
+     # )
+
+     # Save
+     # save_dir = Path("outputs")
+     # save_dir.mkdir(parents=True, exist_ok=True)
+     #
+     # save_html_path = f"outputs/{Path(img_path).stem}.html"
+     # save_drawed_path = f"outputs/{Path(img_path).stem}_table_vis{Path(img_path).suffix}"
+     # save_logic_path = (
+     #     f"outputs/{Path(img_path).stem}_table_vis_logic{Path(img_path).suffix}"
+     # )
+
+     # Visualize table rec result
+     # vis_imged = viser(
+     #     img_path, table_results, save_html_path, save_drawed_path, save_logic_path
+     # )
+
+
+
+
+
```

#### Single-character OCR matching
+
``` python
# Convert single-character boxes into the same structure as line-level recognition results
- from rapidocr_onnxruntime import RapidOCR
- from wired_table_rec.utils_table_recover import trans_char_ocr_res
+ from rapidocr import RapidOCR
img_path = "tests/test_files/wired/table4.jpg"
- ocr_engine = RapidOCR()
- ocr_res, _ = ocr_engine(img_path, return_word_box=True)
- ocr_res = trans_char_ocr_res(ocr_res)
+ ocr_engine = RapidOCR()
+ rapid_ocr_output = ocr_engine(img_path, return_word_box=True)
+ word_results = rapid_ocr_output.word_results
+ ocr_result = [
+     [word_result[2], word_result[0], word_result[1]] for word_result in word_results
+ ]
```

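The reordering in the snippet above can be tried without installing rapidocr. The sketch below assumes each word entry is a `(text, score, box)` tuple, which is inferred from the index order `[2], [0], [1]` together with the `(box, text, score)` rows built in the quick-start code; it is not taken from the rapidocr documentation. `to_table_ocr_result` and `sample_words` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Box = List[List[float]]  # four [x, y] corner points of one character box


def to_table_ocr_result(word_results: Sequence[Tuple[str, float, Box]]) -> List[list]:
    """Reorder (text, score, box) entries into [box, text, score] rows."""
    return [[box, text, score] for text, score, box in word_results]


# Hypothetical sample standing in for `rapid_ocr_output.word_results`.
sample_words = [
    ("A", 0.99, [[0, 0], [10, 0], [10, 12], [0, 12]]),
    ("B", 0.97, [[12, 0], [22, 0], [22, 12], [12, 12]]),
]

ocr_result = to_table_ocr_result(sample_words)
```

Each row now has the same `[box, text, score]` layout as the line-level result, so it can be passed as `ocr_result` in the same way.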
#### Table rotation and perspective correction
@@ -177,24 +212,53 @@ for i, res in enumerate(result):

### Core parameters
``` python
- wired_table_rec = WiredTableRecognition()
- html, elasp, polygons, logic_points, ocr_res = wired_table_rec(
+ # Input (WiredTableInput/LinelessTableInput)
+ @dataclass
+ class WiredTableInput:
+     model_type: Optional[str] = "unet"  # unet/cycle_center_net
+     model_path: Union[str, Path, None, Dict[str, str]] = None
+     use_cuda: bool = False
+     device: str = "cpu"
+
+ @dataclass
+ class LinelessTableInput:
+     model_type: Optional[str] = "lore"  # lore
+     model_path: Union[str, Path, None, Dict[str, str]] = None
+     use_cuda: bool = False
+     device: str = "cpu"
+
+ # Output (WiredTableOutput/LinelessTableOutput)
+ @dataclass
+ class WiredTableOutput:
+     pred_html: Optional[str] = None
+     cell_bboxes: Optional[np.ndarray] = None
+     logic_points: Optional[np.ndarray] = None
+     elapse: Optional[float] = None
+
+ @dataclass
+ class LinelessTableOutput:
+     pred_html: Optional[str] = None
+     cell_bboxes: Optional[np.ndarray] = None
+     logic_points: Optional[np.ndarray] = None
+     elapse: Optional[float] = None
+ ```
+
+ ``` python
+ wired_table_rec = WiredTableRecognition(WiredTableInput())
+ table_results = wired_table_rec(
    img,  # Image: Union[str, np.ndarray, bytes, Path, PIL.Image.Image]
    ocr_result,  # rapidOCR recognition result; if omitted, the built-in rapidocr model is used
-     version="v2",  # Uses the v2 wireframe model by default; set to v1 to switch to the Alibaba DuGuang model
    enhance_box_line=True,  # Box-splitting enhancement (off avoids extra splits, on reduces missed splits); default True
    col_threshold=15,  # Boxes whose left-edge x coordinates differ by less than col_threshold are treated as the same column
    row_threshold=10,  # Boxes whose top-edge y coordinates differ by less than row_threshold are treated as the same row
    rotated_fix=True,  # Supported by wiredV2; corrects slight rotation (-45° to 45°); default True
    need_ocr=True,  # Whether to run OCR; default True
-     rec_again=True,  # Whether to crop and re-recognize table boxes in which no text was recognized; default True
)
- lineless_table_rec = LinelessTableRecognition()
- html, elasp, polygons, logic_points, ocr_res = lineless_table_rec(
+ lineless_table_rec = LinelessTableRecognition(LinelessTableInput())
+ table_results = lineless_table_rec(
    img,  # Image: Union[str, np.ndarray, bytes, Path, PIL.Image.Image]
    ocr_result,  # rapidOCR recognition result; if omitted, the built-in rapidocr model is used
    need_ocr=True,  # Whether to run OCR; default True
-     rec_again=True,  # Whether to crop and re-recognize table boxes in which no text was recognized; default True
)
```

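The comments on `col_threshold` and `row_threshold` describe a distance rule: boxes whose left edges (or top edges) differ by less than the threshold are treated as the same column (or row). The sketch below illustrates one way such a rule can be applied to a list of coordinates. It is an illustration of the idea only and is not the grouping code used inside wired_table_rec; `group_by_threshold` is a hypothetical helper, and anchoring each group on its smallest coordinate is an assumption.

```python
from typing import List, Optional


def group_by_threshold(coords: List[float], threshold: float) -> List[int]:
    """Assign a group index to every coordinate.

    Coordinates are visited in ascending order. A coordinate joins the current
    group while it is less than `threshold` away from the group's first
    (smallest) coordinate; otherwise it starts a new group.
    """
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    labels = [0] * len(coords)
    group = -1
    anchor: Optional[float] = None
    for i in order:
        if anchor is None or coords[i] - anchor >= threshold:
            group += 1
            anchor = coords[i]
        labels[i] = group
    return labels


# Left-edge x coordinates of five cell boxes, grouped with col_threshold=15.
columns = group_by_threshold([12, 18, 120, 131, 240], threshold=15)
# Top-edge y coordinates of four cell boxes, grouped with row_threshold=10.
rows = group_by_threshold([5, 9, 40, 52], threshold=10)
```

With these inputs the left edges 12 and 18 land in one column, while the top edges 40 and 52 are 12 pixels apart and end up in different rows. Raising the threshold merges more boxes into one row or column; lowering it splits them.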
@@ -225,7 +289,7 @@ html, elasp, polygons, logic_points, ocr_res = lineless_table_rec(
``` mermaid
flowchart TD
    A[/Table image/] --> B([Table classification table_cls])
-     B --> C([Wired table recognition wired_table_rec]) & D([Lineless table recognition lineless_table_rec]) --> E([Text recognition rapidocr_onnxruntime])
+     B --> C([Wired table recognition wired_table_rec]) & D([Lineless table recognition lineless_table_rec]) --> E([Text recognition rapidocr])
    E --> F[/Structured HTML output/]
```