🌐 web2json-agent

Stop Coding Scrapers, Start Getting Data — from Hours to Seconds

📋 Demo

Demo video: 20260108134043.mp4

📊 SWDE Benchmark Results

The SWDE dataset covers 8 vertical domains, 80 websites, and 124,291 pages.

Method	Precision	Recall	F1 Score
COT	87.75	79.90	76.95
Reflexion	93.28	82.76	82.40
AUTOSCRAPER	92.49	89.13	88.69
Web2JSON-Agent	91.50	90.46	89.93
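
For reference, F1 for a single evaluation unit is the harmonic mean of its precision and recall. The sketch below applies that formula to the averaged precision and recall in the table above. The reported F1 values come out lower than this, which would be consistent with F1 being computed per unit (for example per website) and then averaged. That aggregation is an assumption, since the source does not state how the scores are averaged. `harmonic_f1` is a hypothetical helper name and not part of web2json.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table.
reported = {
    "COT": (87.75, 79.90, 76.95),
    "Reflexion": (93.28, 82.76, 82.40),
    "AUTOSCRAPER": (92.49, 89.13, 88.69),
    "Web2JSON-Agent": (91.50, 90.46, 89.93),
}

for method, (p, r, f1) in reported.items():
    # Compare the harmonic mean of the averaged columns with the reported F1.
    print(f"{method}: harmonic mean {harmonic_f1(p, r):.2f}, reported F1 {f1:.2f}")
```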

🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup

Install for Developers

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup

🐍 API Usage

Web2JSON provides five simple APIs for different use cases. All examples are ready to run!

Example 1: Directly Obtain Structured Data

Auto Mode - Let the AI select the fields automatically and extract the data:

from web2json import Web2JsonConfig, extract_html_to_json

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    output_path="output/"
)

result = extract_html_to_json(config)
# Output: output/my_project/result/*.json
print(f"✓ Results saved to: {result}")

Predefined Mode - Extract only specific fields:

from web2json import Web2JsonConfig, extract_html_to_json

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    output_path="output/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    }
)

result = extract_html_to_json(config)
# Output: output/articles/result/*.json
print(f"✓ Results saved to: {result}")
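
Once extraction has finished, the JSON files can be read back with the standard library alone. This is a minimal sketch that assumes each file under the result directory holds one JSON document. `load_results` is a hypothetical helper, not part of web2json, and the directory comes from the "Output:" comment in the Predefined Mode example.

```python
import json
from pathlib import Path


def load_results(result_dir: str) -> dict[str, object]:
    """Return {file name: parsed JSON} for every *.json file in result_dir."""
    records = {}
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records[path.name] = json.load(f)
    return records


# Directory taken from the "Output:" comment above; adjust to your project name.
records = load_results("output/articles/result")
print(f"Loaded {len(records)} result files")
```

The keys of each loaded record depend on the schema used. In Predefined Mode they would presumably match the fields passed in `schema`, but the source does not show a sample output file.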

Example 2: Generate Reusable Parser

Generate a parser once, use it many times:

from web2json import Web2JsonConfig, generate_html_parser

config = Web2JsonConfig(
    name="product_parser",
    html_path="training_samples/",
    output_path="parsers/"
)

parser_path = generate_html_parser(config)
# Output: parsers/product_parser/final_parser.py
print(f"✓ Parser saved: {parser_path}")
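
The source does not document what `final_parser.py` exposes, so one way to find out is to import the file and list what it defines. The sketch below uses only `importlib` from the standard library. `load_module_from_path` and `public_callables` are hypothetical helper names. Importing executes the generated file, so any third-party packages it imports must be installed.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: str) -> ModuleType:
    """Import a Python source file as a module object (this executes the file)."""
    file_path = Path(path)
    spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def public_callables(module: ModuleType) -> list[str]:
    """Names of public functions and classes defined in the module itself."""
    return sorted(
        name
        for name, value in vars(module).items()
        if callable(value)
        and not name.startswith("_")
        and getattr(value, "__module__", None) == module.__name__
    )


# Path taken from the "Output:" comment above; skipped if no parser exists yet.
parser_file = Path("parsers/product_parser/final_parser.py")
if parser_file.exists():
    print(public_callables(load_module_from_path(str(parser_file))))
```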

Example 3: Parse with Existing Parser

Reuse a trained parser on new HTML files:

from web2json import Web2JsonConfig, parse_html_with_parser

config = Web2JsonConfig(
    name="batch_001",
    html_path="new_html_files/",
    output_path="results/",
    parser_path="parsers/product_parser/final_parser.py"
)

result = parse_html_with_parser(config)
# Output: results/batch_001/result/*.json
print(f"✓ Parsed data saved to: {result}")

Example 4: Generate Schema Only

Generate a JSON Schema containing field descriptions and XPath expressions:

from web2json import Web2JsonConfig, infer_html_to_schema
import json

config = Web2JsonConfig(
    name="schema_exploration",
    html_path="html_samples/",
    output_path="schemas/"
)

schema_path = infer_html_to_schema(config)
# Output: schemas/schema_exploration/final_schema.json

# View the learned schema
with open(schema_path) as f:
    schema = json.load(f)
    print(json.dumps(schema, indent=2))

Example 5: Cluster HTML Files by Layout

Group HTML files with different layouts into separate directories, so that the Agent only needs to be called once per directory:

from web2json import Web2JsonConfig, cluster_html_files

config = Web2JsonConfig(
    name="clustered_pages",
    html_path="html_samples/",
    output_path="output/"
)

result = cluster_html_files(config)
# Output: output/clustered_pages/cluster_0/, cluster_1/, noise/, cluster_info.txt
print(f"✓ Found {len(result['clusters'])} layout types")
print(f"✓ Cluster info: {result['cluster_info_file']}")
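
To see how the pages were distributed, the output directories can be inspected directly. This is a minimal sketch that assumes the clustered HTML files sit directly inside each `cluster_N/` or `noise/` directory with an `.html` or `.htm` extension, which the source does not confirm. `count_html_per_cluster` is a hypothetical helper, not part of web2json.

```python
from pathlib import Path


def count_html_per_cluster(cluster_root: str) -> dict[str, int]:
    """Map each cluster_* (or noise) subdirectory to its number of HTML files."""
    counts = {}
    for sub in sorted(Path(cluster_root).iterdir()):
        if sub.is_dir() and (sub.name.startswith("cluster_") or sub.name == "noise"):
            counts[sub.name] = sum(
                1 for p in sub.iterdir() if p.suffix.lower() in {".html", ".htm"}
            )
    return counts


# Directory taken from the "Output:" comment above; skipped if it does not exist.
root = Path("output/clustered_pages")
if root.is_dir():
    for name, n_files in count_html_per_cluster(str(root)).items():
        print(f"{name}: {n_files} HTML files")
```

Each `cluster_N/` directory can then be passed as `html_path` to one of the other APIs, so that a parser is learned once per layout.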

Configuration Reference

Parameter	Type	Default	Description
`name`	`str`	Required	Project name (creates subdirectory)
`html_path`	`str`	Required	Directory with HTML files
`output_path`	`str`	`"output"`	Output directory
`iteration_rounds`	`int`	`3`	Number of samples for learning
`schema`	`Dict`	`None`	Predefined fields (None = auto mode)
`parser_path`	`str`	`None`	Parser file (for `parse_html_with_parser`)

Which API Should I Use?

# Need JSON data immediately? → extract_html_to_json
extract_html_to_json(config)

# Want to inspect schema first? → infer_html_to_schema
infer_html_to_schema(config)

# Need reusable parser? → generate_html_parser
generate_html_parser(config)

# Have parser, need to parse more files? → parse_html_with_parser
parse_html_with_parser(config)

# Input HTML from different domains/with different layouts? → cluster_html_files
cluster_html_files(config)

📄 License

Apache-2.0 License

Made with ❤️ by the web2json-agent team

⭐ Star us on GitHub | 🐛 Report Issues | 📖 Documentation
