diff --git a/README.md b/README.md
index e2d4563..43b7a4e 100644
--- a/README.md
+++ b/README.md
@@ -1,86 +1,88 @@
# WebMainBench
-WebMainBench 是一个专门用于端到端评测网页正文抽取质量的基准测试工具。
+[简体中文](README_zh.md) | English
-## 功能特点
+WebMainBench is a specialized benchmark tool for end-to-end evaluation of web main content extraction quality.
-### 🎯 **核心功能**
-- **多抽取器支持**: 支持 trafilatura,resiliparse 等多种抽取工具
-- **全面的评测指标**: 包含文本编辑距离、表格结构相似度(TEDS)、公式抽取质量等多维度指标
-- **人工标注支持**: 评测数据集100%人工标注
+## Features
-#### 指标详细说明
+### 🎯 **Core Features**
+- **Multiple Extractor Support**: Supports various extraction tools such as trafilatura, resiliparse, and more
+- **Comprehensive Evaluation Metrics**: Includes multi-dimensional metrics such as text edit distance, table structure similarity (TEDS), formula extraction quality, etc.
+- **Manual Annotation Support**: 100% manually annotated evaluation dataset
-| 指标名称 | 计算方式 | 取值范围 | 说明 |
+#### Metric Details
+
+| Metric Name | Calculation Method | Value Range | Description |
|---------|----------|----------|------|
-| `overall` | 所有成功指标的平均值 | 0.0-1.0 | 综合质量评分,分数越高质量越好 |
-| `text_edit` | `1 - (编辑距离 / 最大文本长度)` | 0.0-1.0 | 纯文本相似度,分数越高质量越好 |
-| `code_edit` | `1 - (编辑距离 / 最大代码长度)` | 0.0-1.0 | 代码内容相似度,分数越高质量越好 |
-| `table_TEDS` | `1 - (树编辑距离 / 最大节点数)` | 0.0-1.0 | 表格结构相似度,分数越高质量越好 |
-| `table_edit` | `1 - (编辑距离 / 最大表格长度)` | 0.0-1.0 | 表格内容相似度,分数越高质量越好 |
-| `formula_edit` | `1 - (编辑距离 / 最大公式长度)` | 0.0-1.0 | 公式内容相似度,分数越高质量越好 |
+| `overall` | Average of all successful metrics | 0.0-1.0 | Comprehensive quality score, higher is better |
+| `text_edit` | `1 - (edit distance / max text length)` | 0.0-1.0 | Plain text similarity, higher is better |
+| `code_edit` | `1 - (edit distance / max code length)` | 0.0-1.0 | Code content similarity, higher is better |
+| `table_TEDS` | `1 - (tree edit distance / max nodes)` | 0.0-1.0 | Table structure similarity, higher is better |
+| `table_edit` | `1 - (edit distance / max table length)` | 0.0-1.0 | Table content similarity, higher is better |
+| `formula_edit` | `1 - (edit distance / max formula length)` | 0.0-1.0 | Formula content similarity, higher is better |
-### 🏗️ **系统架构**
+### 🏗️ **System Architecture**

-### 🔧 **核心模块**
-1. **data 模块**: 评测集文件和结果的读写管理
-2. **extractors 模块**: 各种抽取工具的统一接口
-3. **metrics 模块**: 评测指标的计算实现
-4. **evaluator 模块**: 评测任务的执行和结果输出
+### 🔧 **Core Modules**
+1. **data module**: Read/write management of evaluation sets and results
+2. **extractors module**: Unified interface for various extraction tools
+3. **metrics module**: Implementation of evaluation metrics calculation
+4. **evaluator module**: Execution and result output of evaluation tasks
-## 快速开始
+## Quick Start
-### 安装
+### Installation
```bash
-# 基础安装
+# Basic installation
pip install webmainbench
-# 安装所有可选依赖
+# Install with all optional dependencies
pip install webmainbench[all]
-# 开发环境安装
+# Development environment installation
pip install webmainbench[dev]
```
-### 基本使用
+### Basic Usage
```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory
-# 1. 加载评测数据集
+# 1. Load evaluation dataset
dataset = DataLoader.load_jsonl("your_dataset.jsonl")
-# 2. 创建抽取器
+# 2. Create extractor
extractor = ExtractorFactory.create("trafilatura")
-# 3. 运行评测
+# 3. Run evaluation
evaluator = Evaluator()
result = evaluator.evaluate(dataset, extractor)
-# 4. 查看结果
+# 4. View results
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```
-### 数据格式
+### Data Format
-评测数据集应包含以下字段:
+Evaluation datasets should contain the following fields:
```jsonl
{
"track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
- "html": "
这是标题
", # 人工标注带cc-select="true" 属性
+ "html": "This is a title
", # Manually annotated with cc-select="true" attribute
"url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
- "main_html": "这是标题
", # 从html中剪枝得到的正文html
- "convert_main_content": "# 这是标题", # 从main_html+html2text转化来
- "groundtruth_content": "# 这是标题", # 人工校准的markdown(部分提供)
+ "main_html": "This is a title
", # Main content HTML pruned from html
+ "convert_main_content": "# This is a title", # Converted from main_html + html2text
+ "groundtruth_content": "# This is a title", # Manually calibrated markdown (partially provided)
"meta": {
- "language": "en", # 网页的语言
- "style": "artical", # 网页的文体
+ "language": "en", # Web page language
+ "style": "artical", # Web page style
"table": [], # [], ["layout"], ["data"], ["layout", "data"]
"equation": [], # [], ["inline"], ["interline"], ["inline", "interline"]
"code": [], # [], ["inline"], ["interline"], ["inline", "interline"]
@@ -89,17 +91,17 @@ print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
}
```
-## 支持的抽取器
+## Supported Extractors
-- **trafilatura**: trafilatura抽取器
-- **resiliparse**: resiliparse抽取器
-- **llm-webkit**: llm-webkit 抽取器
-- **magic-html**: magic-html 抽取器
-- **自定义抽取器**: 通过继承 `BaseExtractor` 实现
+- **trafilatura**: trafilatura extractor
+- **resiliparse**: resiliparse extractor
+- **llm-webkit**: llm-webkit extractor
+- **magic-html**: magic-html extractor
+- **Custom extractors**: Implement by inheriting from `BaseExtractor`
-## 评测榜单
+## Evaluation Leaderboard
-| extractor | extractor_version | dataset | total_samples | overall(macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
+| extractor | extractor_version | dataset | total_samples | overall (macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
|-----------|-------------------|---------|---------------|---------------------|-----------|--------------|------------|-----------|-----------|
| llm-webkit | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
@@ -107,12 +109,12 @@ print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |
-## 高级功能
+## Advanced Features
-### 多抽取器对比评估
+### Multi-Extractor Comparison
```python
-# 对比多个抽取器
+# Compare multiple extractors
extractors = ["trafilatura", "resiliparse"]
results = evaluator.compare_extractors(dataset, extractors)
@@ -120,34 +122,34 @@ for name, result in results.items():
print(f"{name}: {result.overall_metrics['overall']:.4f}")
```
-#### 具体示例
+#### Detailed Example
```python
python examples/multi_extractor_compare.py
```
-这个例子演示了如何:
+This example demonstrates how to:
-1. **加载测试数据集**:使用包含代码、公式、表格、文本等多种内容类型的样本数据
-2. **创建多个抽取器**:
- - `magic-html`:基于 magic-html 库的抽取器
- - `trafilatura`:基于 trafilatura 库的抽取器
- - `resiliparse`:基于 resiliparse 库的抽取器
-3. **批量评估对比**:使用 `evaluator.compare_extractors()` 同时评估所有抽取器
-4. **生成对比报告**:自动保存多种格式的评估结果
+1. **Load test dataset**: Use sample data containing multiple content types such as code, formulas, tables, text, etc.
+2. **Create multiple extractors**:
+ - `magic-html`: Extractor based on magic-html library
+ - `trafilatura`: Extractor based on trafilatura library
+ - `resiliparse`: Extractor based on resiliparse library
+3. **Batch evaluation comparison**: Use `evaluator.compare_extractors()` to evaluate all extractors simultaneously
+4. **Generate comparison report**: Automatically save evaluation results in multiple formats
-#### 输出文件说明
+#### Output File Description
-评估完成后会在 `results/` 目录下生成三个重要文件:
+After evaluation is complete, three important files will be generated in the `results/` directory:
-| 文件名 | 格式 | 内容描述 |
+| File Name | Format | Content Description |
|--------|------|----------|
-| `leaderboard.csv` | CSV | **排行榜文件**:包含各抽取器的整体排名和分项指标对比,便于快速查看性能差异 |
-| `evaluation_results.json` | JSON | **详细评估结果**:包含每个抽取器的完整评估数据、指标详情和元数据信息 |
-| `dataset_with_results.jsonl` | JSONL | **增强数据集**:原始测试数据加上所有抽取器的提取结果,便于人工检查和分析 |
+| `leaderboard.csv` | CSV | **Leaderboard file**: Contains overall rankings and sub-metric comparisons for each extractor, for quick performance comparison |
+| `evaluation_results.json` | JSON | **Detailed evaluation results**: Contains complete evaluation data, metric details and metadata for each extractor |
+| `dataset_with_results.jsonl` | JSONL | **Enhanced dataset**: Original test data plus extraction results from all extractors, for manual inspection and analysis |
-`leaderboard.csv` 内容示例:
+`leaderboard.csv` content example:
```csv
extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
@@ -155,7 +157,7 @@ resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
```
-### 自定义指标
+### Custom Metrics
```python
from webmainbench.metrics import BaseMetric, MetricResult
@@ -165,7 +167,7 @@ class CustomMetric(BaseMetric):
pass
def _calculate_score(self, predicted, groundtruth, **kwargs):
- # 实现自定义评测逻辑
+ # Implement custom evaluation logic
score = your_calculation(predicted, groundtruth)
return MetricResult(
metric_name=self.name,
@@ -173,22 +175,22 @@ class CustomMetric(BaseMetric):
details={"custom_info": "value"}
)
-# 添加到评测器
+# Add to evaluator
evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))
```
-### 自定义抽取器
+### Custom Extractors
```python
from webmainbench.extractors import BaseExtractor, ExtractionResult
class MyExtractor(BaseExtractor):
def _setup(self):
- # 初始化抽取器
+ # Initialize extractor
pass
def _extract_content(self, html, url=None):
- # 实现抽取逻辑
+ # Implement extraction logic
content = your_extraction_logic(html)
return ExtractionResult(
@@ -197,34 +199,35 @@ class MyExtractor(BaseExtractor):
success=True
)
-# 注册自定义抽取器
+# Register custom extractor
ExtractorFactory.register("my-extractor", MyExtractor)
```
-## 项目架构
+## Project Architecture
```
webmainbench/
-├── data/ # 数据处理模块
-│ ├── dataset.py # 数据集类
-│ ├── loader.py # 数据加载器
-│ └── saver.py # 数据保存器
-├── extractors/ # 抽取器模块
-│ ├── base.py # 基础接口
-│ ├── factory.py # 工厂模式
-│ └── ... # 具体实现
-├── metrics/ # 指标模块
-│ ├── base.py # 基础接口
-│ ├── text_metrics.py # 文本指标
-│ ├── table_metrics.py # 表格指标
-│ └── calculator.py # 指标计算器
-├── evaluator/ # 评估器模块
-│ └── evaluator.py # 主评估器
-└── utils/ # 工具模块
- └── helpers.py # 辅助函数
+├── data/ # Data processing module
+│ ├── dataset.py # Dataset class
+│ ├── loader.py # Data loader
+│ └── saver.py # Data saver
+├── extractors/ # Extractor module
+│ ├── base.py # Base interface
+│ ├── factory.py # Factory pattern
+│ └── ... # Specific implementations
+├── metrics/ # Metrics module
+│ ├── base.py # Base interface
+│ ├── text_metrics.py # Text metrics
+│ ├── table_metrics.py # Table metrics
+│ └── calculator.py # Metric calculator
+├── evaluator/ # Evaluator module
+│ └── evaluator.py # Main evaluator
+└── utils/ # Utility module
+ └── helpers.py # Helper functions
```
-## 许可证
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-本项目采用 MIT 许可证 - 查看 [LICENSE](LICENSE) 文件了解详情。
diff --git a/README_zh.md b/README_zh.md
new file mode 100644
index 0000000..e2d4563
--- /dev/null
+++ b/README_zh.md
@@ -0,0 +1,230 @@
+# WebMainBench
+
+WebMainBench 是一个专门用于端到端评测网页正文抽取质量的基准测试工具。
+
+## 功能特点
+
+### 🎯 **核心功能**
+- **多抽取器支持**: 支持 trafilatura,resiliparse 等多种抽取工具
+- **全面的评测指标**: 包含文本编辑距离、表格结构相似度(TEDS)、公式抽取质量等多维度指标
+- **人工标注支持**: 评测数据集100%人工标注
+
+#### 指标详细说明
+
+| 指标名称 | 计算方式 | 取值范围 | 说明 |
+|---------|----------|----------|------|
+| `overall` | 所有成功指标的平均值 | 0.0-1.0 | 综合质量评分,分数越高质量越好 |
+| `text_edit` | `1 - (编辑距离 / 最大文本长度)` | 0.0-1.0 | 纯文本相似度,分数越高质量越好 |
+| `code_edit` | `1 - (编辑距离 / 最大代码长度)` | 0.0-1.0 | 代码内容相似度,分数越高质量越好 |
+| `table_TEDS` | `1 - (树编辑距离 / 最大节点数)` | 0.0-1.0 | 表格结构相似度,分数越高质量越好 |
+| `table_edit` | `1 - (编辑距离 / 最大表格长度)` | 0.0-1.0 | 表格内容相似度,分数越高质量越好 |
+| `formula_edit` | `1 - (编辑距离 / 最大公式长度)` | 0.0-1.0 | 公式内容相似度,分数越高质量越好 |
+
+
+### 🏗️ **系统架构**
+
+
+
+### 🔧 **核心模块**
+1. **data 模块**: 评测集文件和结果的读写管理
+2. **extractors 模块**: 各种抽取工具的统一接口
+3. **metrics 模块**: 评测指标的计算实现
+4. **evaluator 模块**: 评测任务的执行和结果输出
+
+
+## 快速开始
+
+### 安装
+
+```bash
+# 基础安装
+pip install webmainbench
+
+# 安装所有可选依赖
+pip install webmainbench[all]
+
+# 开发环境安装
+pip install webmainbench[dev]
+```
+
+### 基本使用
+
+```python
+from webmainbench import DataLoader, Evaluator, ExtractorFactory
+
+# 1. 加载评测数据集
+dataset = DataLoader.load_jsonl("your_dataset.jsonl")
+
+# 2. 创建抽取器
+extractor = ExtractorFactory.create("trafilatura")
+
+# 3. 运行评测
+evaluator = Evaluator()
+result = evaluator.evaluate(dataset, extractor)
+
+# 4. 查看结果
+print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
+```
+
+### 数据格式
+
+评测数据集应包含以下字段:
+
+```jsonl
+{
+ "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
+ "html": "这是标题
", # 人工标注带cc-select="true" 属性
+ "url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
+ "main_html": "这是标题
", # 从html中剪枝得到的正文html
+ "convert_main_content": "# 这是标题", # 从main_html+html2text转化来
+ "groundtruth_content": "# 这是标题", # 人工校准的markdown(部分提供)
+ "meta": {
+ "language": "en", # 网页的语言
+ "style": "artical", # 网页的文体
+ "table": [], # [], ["layout"], ["data"], ["layout", "data"]
+ "equation": [], # [], ["inline"], ["interline"], ["inline", "interline"]
+ "code": [], # [], ["inline"], ["interline"], ["inline", "interline"]
+ "level": "mid" # simple, mid, hard
+ }
+}
+```
+
+## 支持的抽取器
+
+- **trafilatura**: trafilatura抽取器
+- **resiliparse**: resiliparse抽取器
+- **llm-webkit**: llm-webkit 抽取器
+- **magic-html**: magic-html 抽取器
+- **自定义抽取器**: 通过继承 `BaseExtractor` 实现
+
+## 评测榜单
+
+| extractor | extractor_version | dataset | total_samples | overall(macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
+|-----------|-------------------|---------|---------------|---------------------|-----------|--------------|------------|-----------|-----------|
+| llm-webkit | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
+| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
+| trafilatura_md | 2.0.0 | WebMainBench1.0 | 545 | 0.3858 | 0.1305 | 0.6242 | 0.3203 | 0.1653 | 0.6887 |
+| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
+| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |
+
+## 高级功能
+
+### 多抽取器对比评估
+
+```python
+# 对比多个抽取器
+extractors = ["trafilatura", "resiliparse"]
+results = evaluator.compare_extractors(dataset, extractors)
+
+for name, result in results.items():
+ print(f"{name}: {result.overall_metrics['overall']:.4f}")
+```
+
+#### 具体示例
+
+```python
+python examples/multi_extractor_compare.py
+```
+
+这个例子演示了如何:
+
+1. **加载测试数据集**:使用包含代码、公式、表格、文本等多种内容类型的样本数据
+2. **创建多个抽取器**:
+ - `magic-html`:基于 magic-html 库的抽取器
+ - `trafilatura`:基于 trafilatura 库的抽取器
+ - `resiliparse`:基于 resiliparse 库的抽取器
+3. **批量评估对比**:使用 `evaluator.compare_extractors()` 同时评估所有抽取器
+4. **生成对比报告**:自动保存多种格式的评估结果
+
+#### 输出文件说明
+
+评估完成后会在 `results/` 目录下生成三个重要文件:
+
+| 文件名 | 格式 | 内容描述 |
+|--------|------|----------|
+| `leaderboard.csv` | CSV | **排行榜文件**:包含各抽取器的整体排名和分项指标对比,便于快速查看性能差异 |
+| `evaluation_results.json` | JSON | **详细评估结果**:包含每个抽取器的完整评估数据、指标详情和元数据信息 |
+| `dataset_with_results.jsonl` | JSONL | **增强数据集**:原始测试数据加上所有抽取器的提取结果,便于人工检查和分析 |
+
+
+`leaderboard.csv` 内容示例:
+```csv
+extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
+magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
+resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
+trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
+```
+
+### 自定义指标
+
+```python
+from webmainbench.metrics import BaseMetric, MetricResult
+
+class CustomMetric(BaseMetric):
+ def _setup(self):
+ pass
+
+ def _calculate_score(self, predicted, groundtruth, **kwargs):
+ # 实现自定义评测逻辑
+ score = your_calculation(predicted, groundtruth)
+ return MetricResult(
+ metric_name=self.name,
+ score=score,
+ details={"custom_info": "value"}
+ )
+
+# 添加到评测器
+evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))
+```
+
+### 自定义抽取器
+
+```python
+from webmainbench.extractors import BaseExtractor, ExtractionResult
+
+class MyExtractor(BaseExtractor):
+ def _setup(self):
+ # 初始化抽取器
+ pass
+
+ def _extract_content(self, html, url=None):
+ # 实现抽取逻辑
+ content = your_extraction_logic(html)
+
+ return ExtractionResult(
+ content=content,
+ content_list=[...],
+ success=True
+ )
+
+# 注册自定义抽取器
+ExtractorFactory.register("my-extractor", MyExtractor)
+```
+
+## 项目架构
+
+```
+webmainbench/
+├── data/ # 数据处理模块
+│ ├── dataset.py # 数据集类
+│ ├── loader.py # 数据加载器
+│ └── saver.py # 数据保存器
+├── extractors/ # 抽取器模块
+│ ├── base.py # 基础接口
+│ ├── factory.py # 工厂模式
+│ └── ... # 具体实现
+├── metrics/ # 指标模块
+│ ├── base.py # 基础接口
+│ ├── text_metrics.py # 文本指标
+│ ├── table_metrics.py # 表格指标
+│ └── calculator.py # 指标计算器
+├── evaluator/ # 评估器模块
+│ └── evaluator.py # 主评估器
+└── utils/ # 工具模块
+ └── helpers.py # 辅助函数
+```
+
+
+## 许可证
+
+本项目采用 MIT 许可证 - 查看 [LICENSE](LICENSE) 文件了解详情。