Stop Coding Scrapers, Start Getting Data: from Hours to Seconds
The SWDE dataset covers 8 vertical fields, 80 websites, and 124,291 pages
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
# 1. Install package
pip install web2json-agent
# 2. Initialize configuration
web2json setup

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent
# 2. Install in editable mode
pip install -e .
# 3. Initialize configuration
web2json setup

Web2JSON provides five simple APIs for different use cases. All examples are ready to run!
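
Every example below reads its input from a local directory passed as `html_path`. Before invoking the agent, it can help to confirm that the directory exists and actually contains pages. The following is a minimal sketch using only the standard library; `list_html_files` is a hypothetical helper, not part of web2json, and it assumes input pages use a `.html` or `.htm` extension:

```python
from pathlib import Path


def list_html_files(html_path: str) -> list[Path]:
    """Return the HTML files directly inside html_path, sorted by path.

    Assumption: input pages end in .html or .htm (case-insensitive).
    """
    directory = Path(html_path)
    if not directory.is_dir():
        raise NotADirectoryError(f"html_path is not a directory: {directory}")
    files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )
    if not files:
        raise FileNotFoundError(f"no .html/.htm files found in {directory}")
    return files
```

Calling `list_html_files("html_samples/")` before building a config surfaces a wrong or empty path early, rather than partway through a run.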
Auto Mode - Let AI automatically filter fields and extract data:
from web2json import Web2JsonConfig, extract_html_to_json
config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    output_path="output/"
)
result = extract_html_to_json(config)
# Output: output/my_project/result/*.json
print(f"✓ Results saved to: {result}")

Predefined Mode - Extract only specific fields:
from web2json import Web2JsonConfig, extract_html_to_json
config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    output_path="output/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    }
)
result = extract_html_to_json(config)
# Output: output/articles/result/*.json
print(f"✓ Results saved to: {result}")

Generate a parser once, use it many times:
from web2json import Web2JsonConfig, generate_html_parser
config = Web2JsonConfig(
    name="product_parser",
    html_path="training_samples/",
    output_path="parsers/"
)
parser_path = generate_html_parser(config)
# Output: parsers/product_parser/final_parser.py
print(f"✓ Parser saved: {parser_path}")

Reuse a trained parser on new HTML files:
from web2json import Web2JsonConfig, parse_html_with_parser
config = Web2JsonConfig(
    name="batch_001",
    html_path="new_html_files/",
    output_path="results/",
    parser_path="parsers/product_parser/final_parser.py"
)
result = parse_html_with_parser(config)
# Output: results/batch_001/result/*.json
print(f"✓ Parsed data saved to: {result}")

Generate a JSON Schema containing field descriptions and XPath:
from web2json import Web2JsonConfig, infer_html_to_schema
import json
config = Web2JsonConfig(
    name="schema_exploration",
    html_path="html_samples/",
    output_path="schemas/"
)
schema_path = infer_html_to_schema(config)
# Output: schemas/schema_exploration/final_schema.json
# View the learned schema
with open(schema_path) as f:
    schema = json.load(f)
print(json.dumps(schema, indent=2))

Group HTML files with different layouts into separate directories (the Agent then needs to be called only once per directory):
from web2json import Web2JsonConfig, cluster_html_files
config = Web2JsonConfig(
    name="clustered_pages",
    html_path="html_samples/",
    output_path="output/"
)
result = cluster_html_files(config)
# Output: output/clustered_pages/cluster_0/, cluster_1/, noise/, cluster_info.txt
print(f"✓ Found {len(result['clusters'])} layout types")
print(f"✓ Cluster info: {result['cluster_info_file']}")

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | Required | Project name (creates subdirectory) |
| `html_path` | `str` | Required | Directory with HTML files |
| `output_path` | `str` | `"output"` | Output directory |
| `iteration_rounds` | `int` | `3` | Number of samples for learning |
| `schema` | `Dict` | `None` | Predefined fields (None = auto mode) |
| `parser_path` | `str` | `None` | Parser file (for parse_html_with_parser) |
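
Judging by the output comments in the examples above, extraction results land as JSON files under `<output_path>/<name>/result/`. The sketch below collects those files back into Python. It assumes that directory layout holds and that each file contains a single JSON document; `load_results` is a hypothetical helper written for illustration, not a web2json function:

```python
import json
from pathlib import Path
from typing import Any


def load_results(output_path: str, name: str) -> dict[str, Any]:
    """Map each result file's stem to its parsed JSON content.

    Assumed layout (taken from the example comments):
    <output_path>/<name>/result/*.json
    """
    result_dir = Path(output_path) / name / "result"
    results: dict[str, Any] = {}
    for json_file in sorted(result_dir.glob("*.json")):
        with json_file.open(encoding="utf-8") as f:
            results[json_file.stem] = json.load(f)
    return results
```

For the Auto Mode example, this would be called as `load_results("output/", "my_project")`, giving one dictionary entry per extracted page.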
# Need JSON data immediately? → extract_html_to_json
extract_html_to_json(config)
# Want to inspect schema first? → infer_html_to_schema
infer_html_to_schema(config)
# Need reusable parser? → generate_html_parser
generate_html_parser(config)
# Have parser, need to parse more files? → parse_html_with_parser
parse_html_with_parser(config)
# Input HTML from different domains/with different layouts? → cluster_html_files
cluster_html_files(config)

Apache-2.0 License
Made with ❤️ by the web2json-agent team
Star us on GitHub | Report Issues | Documentation