ccprocessor/web2json-agent

🌐 web2json-agent

Stop Coding Scrapers, Start Getting Data: from Hours to Seconds

Python LangChain OpenAI PyPI

English | 中文


📋 Demo

20260108134043.mp4

📊 SWDE Benchmark Results

The SWDE dataset covers 8 vertical domains, 80 websites, and 124,291 pages.

| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
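A note for reading the table: F1 is conventionally the harmonic mean of precision and recall, yet the F1 column is lower than what that formula gives when applied to the Precision and Recall columns (for COT it is even below both). That would be consistent with the metrics being averaged per website or per field before reporting, but the README does not say so; treat that as an assumption. A minimal sketch of the check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for Web2JSON-Agent, copied from the table above.
pooled_f1 = f1_score(91.50, 90.46)
print(f"F1 from the aggregate columns: {pooled_f1:.2f} (table reports 89.93)")
```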

🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup

Install for Developers

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup

🐍 API Usage

Web2JSON provides five simple APIs for different use cases. All examples are ready to run!

Example 1: Directly obtain structured data

Auto Mode - Let AI automatically filter fields and extract data:

from web2json import Web2JsonConfig, extract_html_to_json

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    output_path="output/"
)

result = extract_html_to_json(config)
# Output: output/my_project/result/*.json
print(f"✓ Results saved to: {result}")

Predefined Mode - Extract only specific fields:

from web2json import Web2JsonConfig, extract_html_to_json

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    output_path="output/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    }
)

result = extract_html_to_json(config)
# Output: output/articles/result/*.json
print(f"✓ Results saved to: {result}")
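Both modes write their output as `*.json` files under the `result/` directory. To work with that output in Python, something like the following could be used. This is a sketch using only the standard library; `load_results` is a hypothetical helper, not part of web2json, and it assumes each file holds one complete JSON document.

```python
import json
from pathlib import Path


def load_results(result_dir):
    """Load every *.json file in result_dir into a dict keyed by file stem."""
    records = {}
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records[path.stem] = json.load(f)
    return records
```

For the Predefined Mode example above, `load_results("output/articles/result")` would return one entry per generated file.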

Example 2: Generate Reusable Parser

Generate a parser once, use it many times:

from web2json import Web2JsonConfig, generate_html_parser

config = Web2JsonConfig(
    name="product_parser",
    html_path="training_samples/",
    output_path="parsers/"
)

parser_path = generate_html_parser(config)
# Output: parsers/product_parser/final_parser.py
print(f"✓ Parser saved: {parser_path}")

Example 3: Parse with Existing Parser

Reuse a trained parser on new HTML files:

from web2json import Web2JsonConfig, parse_html_with_parser

config = Web2JsonConfig(
    name="batch_001",
    html_path="new_html_files/",
    output_path="results/",
    parser_path="parsers/product_parser/final_parser.py"
)

result = parse_html_with_parser(config)
# Output: results/batch_001/result/*.json
print(f"✓ Parsed data saved to: {result}")
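When several batches of new HTML arrive, the same parser can be applied to each of them. The sketch below builds one set of `Web2JsonConfig` keyword arguments per batch directory, using only the parameters shown above. The helper names `batch_config_kwargs` and `parse_batches` are hypothetical, and naming each project after its directory is an assumption made for illustration.

```python
from pathlib import Path


def batch_config_kwargs(parser_path, batch_dirs, output_path="results/"):
    """Build one set of Web2JsonConfig keyword arguments per batch directory."""
    return [
        {
            "name": Path(batch_dir).name,  # project name taken from the directory name
            "html_path": str(batch_dir),
            "output_path": output_path,
            "parser_path": parser_path,
        }
        for batch_dir in batch_dirs
    ]


def parse_batches(parser_path, batch_dirs, output_path="results/"):
    """Run parse_html_with_parser once per batch, reusing the same parser."""
    # Imported here so the helper above stays usable without web2json installed.
    from web2json import Web2JsonConfig, parse_html_with_parser

    return [
        parse_html_with_parser(Web2JsonConfig(**kwargs))
        for kwargs in batch_config_kwargs(parser_path, batch_dirs, output_path)
    ]
```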

Example 4: Generate Schema Only

Generate a JSON Schema containing field descriptions and XPath:

from web2json import Web2JsonConfig, infer_html_to_schema
import json

config = Web2JsonConfig(
    name="schema_exploration",
    html_path="html_samples/",
    output_path="schemas/"
)

schema_path = infer_html_to_schema(config)
# Output: schemas/schema_exploration/final_schema.json

# View the learned schema
with open(schema_path) as f:
    schema = json.load(f)
    print(json.dumps(schema, indent=2))

Example 5: Cluster HTML Files by Layout

Group HTML files with different layouts into separate directories, so that the Agent only needs to be called once per directory:

from web2json import Web2JsonConfig, cluster_html_files

config = Web2JsonConfig(
    name="clustered_pages",
    html_path="html_samples/",
    output_path="output/"
)

result = cluster_html_files(config)
# Output: output/clustered_pages/cluster_0/, cluster_1/, noise/, cluster_info.txt
print(f"✓ Found {len(result['clusters'])} layout types")
print(f"✓ Cluster info: {result['cluster_info_file']}")
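After clustering, each `cluster_N/` directory can be handed to one of the other APIs on its own. The sketch below finds those directories and prepares one set of `Web2JsonConfig` keyword arguments per cluster. The helper names are hypothetical, and it assumes the HTML files sit directly inside each cluster directory, which the README does not confirm. The `noise/` directory and `cluster_info.txt` are skipped.

```python
from pathlib import Path


def cluster_dirs(clustered_root):
    """Return the cluster_* subdirectories under clustered_root, sorted by name."""
    root = Path(clustered_root)
    # is_dir() filters out cluster_info.txt, which also matches the pattern.
    return sorted(p for p in root.glob("cluster_*") if p.is_dir())


def per_cluster_config_kwargs(clustered_root, output_path="output/"):
    """Build one set of Web2JsonConfig keyword arguments per layout cluster."""
    return [
        {"name": d.name, "html_path": str(d), "output_path": output_path}
        for d in cluster_dirs(clustered_root)
    ]
```

Each resulting dictionary can then be passed as `Web2JsonConfig(**kwargs)` to `generate_html_parser` or `extract_html_to_json`, one call per layout.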

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | Required | Project name (creates subdirectory) |
| html_path | str | Required | Directory with HTML files |
| output_path | str | "output" | Output directory |
| iteration_rounds | int | 3 | Number of samples for learning |
| schema | Dict | None | Predefined fields (None = auto mode) |
| parser_path | str | None | Parser file (for parse_html_with_parser) |

Which API Should I Use?

# Need JSON data immediately? → extract_html_to_json
extract_html_to_json(config)

# Want to inspect schema first? → infer_html_to_schema
infer_html_to_schema(config)

# Need reusable parser? → generate_html_parser
generate_html_parser(config)

# Have parser, need to parse more files? → parse_html_with_parser
parse_html_with_parser(config)

# Input HTML from different domains/with different layouts? → cluster_html_files
cluster_html_files(config)

📄 License

Apache-2.0 License


Made with ❤️ by the web2json-agent team

⭐ Star us on GitHub | 🐛 Report Issues | 📖 Documentation
