Scrape an HTML table from any URL with a CSS selector, optionally clean it with an LLM, and persist both raw and cleaned JSONB into PostgreSQL.
- Python 3.10+
- PostgreSQL instance you can reach (local is fine).
- (Optional) OpenAI API key for cloud LLM cleaning, or a running Ollama server for local models.
- Clone or copy this folder, then create and activate a virtual env:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables (export them in your shell or use a `.env` file):
  - `DATABASE_URL` (required), e.g. `postgresql+psycopg2://postgres:postgres@localhost:5432/postgres`
  - `OPENAI_API_KEY` (for `--llm openai`)
  - `OPENAI_MODEL` (optional, default `gpt-4o-mini`)
  - `OLLAMA_BASE_URL` and `OLLAMA_MODEL` (for `--llm ollama`; defaults `http://localhost:11434` and `llama3`)
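If you use a `.env` file, you can sanity-check that it loads before running the pipeline. A minimal sketch, assuming `python-dotenv` is installed (check `requirements.txt` for the loader this project actually uses):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory, if present
print("DATABASE_URL   :", os.getenv("DATABASE_URL", "<unset>"))
print("OPENAI_MODEL   :", os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
print("OLLAMA_BASE_URL:", os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"))
print("OLLAMA_MODEL   :", os.getenv("OLLAMA_MODEL", "llama3"))
```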
If you prefer a standalone one-time bootstrap instead of `--setup`, you can still run:

```bash
python setup_db.py --host localhost --port 5432 --db-name scrape_results --user postgres --password secret
```

Local URL example: `postgresql+psycopg2://postgres@localhost:5432/yourdb`
`tools/db.py` auto-creates `scrape_results` with:

- `id SERIAL PRIMARY KEY`
- `url TEXT`
- `scraped_at TIMESTAMP DEFAULT now()`
- `raw_json JSONB`
- `cleaned_json JSONB`
- `llm_summary TEXT`
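For reference, a sketch of roughly equivalent bootstrap code, assuming SQLAlchemy (the URL format above is SQLAlchemy-style); the actual statements in `tools/db.py` may differ:

```python
import os

from sqlalchemy import create_engine, text

# Columns mirror the list above.
DDL = """
CREATE TABLE IF NOT EXISTS scrape_results (
    id           SERIAL PRIMARY KEY,
    url          TEXT,
    scraped_at   TIMESTAMP DEFAULT now(),
    raw_json     JSONB,
    cleaned_json JSONB,
    llm_summary  TEXT
);
"""

engine = create_engine(os.environ["DATABASE_URL"])
with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(text(DDL))
```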
```bash
python main.py \
  --url "https://example.com" \
  --selector "table.my-table" \
  --llm openai  # or ollama or none
```

- `--selector` is optional if you have an LLM; otherwise the first `<table>` on the page is used. Change the CSS selector via `--selector` (e.g. `#pricing table`).
- Switch LLM providers with `--llm openai|ollama|none`; override models with `--model NAME`.
- To skip LLM cleaning entirely, use `--llm none` (basic normalization is applied).
- An optional `--hint "cinema listings"` (or similar) gives the LLM context when inferring structure from non-table pages.
- For JS-heavy pages, use headless rendering:
```bash
python main.py \
  --url "https://example.com/weather" \
  --llm ollama \
  --model llama3:latest \
  --render \
  --hint "weather forecasts"
```

Install Playwright browsers first (one-time): `python -m playwright install chromium`.
- Ingest a local CSV instead of scraping:

  ```bash
  python main.py \
    --csv ./data/my_table.csv \
    --llm none  # or openai/ollama if you want LLM cleaning
  ```

- Ingest a JSON API (a list, or an object containing a list):
  ```bash
  python main.py \
    --api "https://api.example.com/v1/weather/forecast?city=Paris" \
    --json-path data.daily \
    --llm none
  ```

  Use `--json-path` to point at the list inside the JSON (dot-delimited), or omit it if the top level is already a list; the sketch below shows what the dot-delimited path resolves to.
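Illustratively, a resolver for a path like `data.daily` might look like this (`resolve_json_path` is invented for this sketch; the app's own resolver may differ):

```python
def resolve_json_path(payload, path):
    """Walk nested dicts by dot-delimited keys and return the list found there."""
    node = payload
    for key in path.split("."):
        node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"--json-path {path!r} did not resolve to a list")
    return node

# Shape matching the example above: the forecast list lives under data.daily.
payload = {"data": {"daily": [{"day": "Mon", "temp_c": 21}, {"day": "Tue", "temp_c": 19}]}}
print(resolve_json_path(payload, "data.daily"))
```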
- When an LLM is enabled (`--llm openai` or `--llm ollama`), `--selector` (for HTML) and `--json-path` (for API JSON) become optional: the app will attempt to auto-detect the first table or list before handing it to the cleaner, which can infer schema and normalize fields.
- For HTML pages without tables (e.g., card/list layouts), with an LLM enabled the app will try to infer repeated items directly from the HTML; a short `--hint` can improve results (e.g., "cinema listings", "weather predictions").
- For JS-rendered content, add `--render` to fetch with Playwright; combine with `--hint` to guide structure inference.
- Automatically ensure the database exists (create it if missing) by passing `--setup` and database connection options:
  ```bash
  python main.py \
    --url "https://example.com" \
    --selector "table.data" \
    --llm none \
    --setup \
    --db-host localhost \
    --db-port 5432 \
    --db-name scrape_results \
    --db-user postgres \
    --db-password secret
  ```

  If `DATABASE_URL` is already set, `--setup` is optional. Without `--setup`, the app uses `DATABASE_URL` or the defaults in `tools/db.py`.
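A `--setup`-style bootstrap typically amounts to connecting to the default `postgres` database and creating the target database if it is missing. A hedged sketch of that pattern with `psycopg2` (connection values mirror the flags above; the app's actual logic may differ):

```python
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="secret")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", ("scrape_results",))
    if cur.fetchone() is None:
        cur.execute(sql.SQL("CREATE DATABASE {}").format(sql.Identifier("scrape_results")))
conn.close()
```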
- Static table:

  ```bash
  python main.py --url "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)" --selector "table.wikitable" --llm none
  ```

- Simple list page (no table) with LLM inference:

  ```bash
  python main.py --url "https://example.com/events" --llm ollama --model llama3:latest --hint "music events with date and venue"
  ```

- API data:

  ```bash
  python main.py --api "https://api.example.com/v1/stocks?symbols=AAPL,GOOG" --json-path data.prices --llm none
  ```

- CSV ingestion:

  ```bash
  python main.py --csv ./data/sales.csv --llm none
  ```

- SPA with rendering + inference:

  ```bash
  python main.py --url "https://example.com/spa-products" --render --llm ollama --model llama3:latest --hint "product cards with name, price, rating"
  ```
Assuming `DATABASE_URL` points at your DB and the `scrape_results` table exists:

- Show recent rows (raw/cleaned pretty-printed):

  ```sql
  SELECT id, url, scraped_at, jsonb_pretty(raw_json) AS raw, jsonb_pretty(cleaned_json) AS cleaned
  FROM scrape_results
  ORDER BY scraped_at DESC
  LIMIT 5;
  ```

- Extract flattened cleaned rows (one row per JSON element):
  ```sql
  SELECT r.id,
         r.url,
         r.scraped_at,
         elem ->> 'title' AS title,
         elem ->> 'date'  AS date,
         elem ->> 'price' AS price
  FROM scrape_results r
  CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
  ORDER BY r.scraped_at DESC, r.id;
  ```

- Filter by a value inside cleaned JSON (case-insensitive match on title):
  ```sql
  SELECT r.id, r.url, elem
  FROM scrape_results r
  CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
  WHERE elem ->> 'title' ILIKE '%weather%'
  ORDER BY r.id DESC;
  ```

- Count cleaned rows per source:
  ```sql
  SELECT url, SUM(jsonb_array_length(cleaned_json)) AS rows_count
  FROM scrape_results
  GROUP BY url
  ORDER BY rows_count DESC;
  ```

On success, the script prints how many rows were stored and the inserted `scrape_results.id`.
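You can also read results back programmatically. A minimal sketch with `psycopg2` (note that a SQLAlchemy-style URL such as `postgresql+psycopg2://...` needs its driver suffix stripped before libpq will accept it):

```python
import os

import psycopg2

# psycopg2 expects a plain libpq URL, so drop SQLAlchemy's "+psycopg2" suffix.
dsn = os.environ["DATABASE_URL"].replace("postgresql+psycopg2://", "postgresql://")
conn = psycopg2.connect(dsn)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, url, scraped_at, jsonb_array_length(cleaned_json) AS n_rows "
        "FROM scrape_results ORDER BY scraped_at DESC LIMIT 1"
    )
    print(cur.fetchone())  # (id, url, scraped_at, number of cleaned rows)
conn.close()
```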
- Override LLM defaults in `llm_provider.py`.
- Adjust scraping logic in `tools/scraper.py` (e.g., custom headers or auth); see the sketch below.
- Add more tasks/agents in `tasks.py` and `agents.py` to enrich the pipeline.
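As an illustration of the kind of change that slots into `tools/scraper.py`, a hypothetical fetch helper with custom headers and bearer auth (the function name and signature are invented for this sketch):

```python
import requests

def fetch_html(url: str, token: str | None = None) -> str:
    """Hypothetical helper; the real tools/scraper.py may be organized differently."""
    headers = {"User-Agent": "my-scraper/1.0"}  # custom header example
    if token:
        headers["Authorization"] = f"Bearer {token}"  # simple bearer auth
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text
```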