
CrewAI Table Scraper

Scrape an HTML table from any URL with a CSS selector, optionally clean it with an LLM, and persist both raw and cleaned JSONB into PostgreSQL.

Prerequisites

  • Python 3.10+
  • PostgreSQL instance you can reach (local is fine).
  • (Optional) OpenAI API key for cloud LLM cleaning, or a running Ollama server for local models.

Setup

  1. Clone or copy this folder, then create and activate a virtual env:
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables (export them in your shell or use a .env file; a sketch of how they are read follows this list):
  • DATABASE_URL (required): e.g. postgresql+psycopg2://postgres:postgres@localhost:5432/postgres
  • OPENAI_API_KEY (for --llm openai)
  • OPENAI_MODEL (optional, default gpt-4o-mini)
  • OLLAMA_BASE_URL and OLLAMA_MODEL (for --llm ollama; defaults http://localhost:11434 and llama3, respectively)
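
For reference, a minimal sketch of how these variables might be read in Python; the exact loading code in this repo may differ, and the defaults below simply mirror the values listed above:

import os

DATABASE_URL = os.environ["DATABASE_URL"]                         # required
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")                      # only needed for --llm openai
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")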

One-time database bootstrap

If you prefer a standalone one-time bootstrap instead of --setup, you can still run:

python setup_db.py --host localhost --port 5432 --db-name scrape_results --user postgres --password secret

The corresponding local DATABASE_URL takes the form: postgresql+psycopg2://postgres@localhost:5432/yourdb
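
The core of such a bootstrap, creating the database only if it is missing, can be sketched with psycopg2 roughly as follows (illustrative only; the actual setup_db.py may behave differently):

import psycopg2
from psycopg2 import sql

def ensure_database(host, port, user, password, db_name):
    # Connect to the default maintenance database, then create the target DB if absent.
    conn = psycopg2.connect(host=host, port=port, user=user, password=password, dbname="postgres")
    conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
    with conn.cursor() as cur:
        cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", (db_name,))
        if cur.fetchone() is None:
            cur.execute(sql.SQL("CREATE DATABASE {}").format(sql.Identifier(db_name)))
    conn.close()

ensure_database("localhost", 5432, "postgres", "secret", "scrape_results")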

PostgreSQL table

tools/db.py auto-creates the scrape_results table with the following columns (an equivalent SQLAlchemy definition is sketched after the list):

  • id SERIAL PRIMARY KEY
  • url TEXT
  • scraped_at TIMESTAMP DEFAULT now()
  • raw_json JSONB
  • cleaned_json JSONB
  • llm_summary TEXT
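
A sketch of an equivalent table definition using SQLAlchemy; the actual code in tools/db.py may differ in detail:

import os
from sqlalchemy import Column, Integer, MetaData, Table, Text, TIMESTAMP, create_engine, func
from sqlalchemy.dialects.postgresql import JSONB

metadata = MetaData()

scrape_results = Table(
    "scrape_results",
    metadata,
    Column("id", Integer, primary_key=True),                    # SERIAL PRIMARY KEY in PostgreSQL
    Column("url", Text),
    Column("scraped_at", TIMESTAMP, server_default=func.now()),
    Column("raw_json", JSONB),
    Column("cleaned_json", JSONB),
    Column("llm_summary", Text),
)

engine = create_engine(os.environ["DATABASE_URL"])
metadata.create_all(engine)  # creates the table only if it does not already exist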

Running the workflow

python main.py \
  --url "https://example.com" \
  --selector "table.my-table" \
  --llm openai
  • Change the CSS selector via --selector (e.g. #pricing table); it is optional when an LLM is enabled, in which case the first <table> on the page is used.
  • Switch LLM providers with --llm openai|ollama|none; override models with --model NAME.
  • To skip LLM cleaning entirely, use --llm none (basic normalization is applied).
  • Optional --hint "cinema listings" (or similar) gives the LLM context when inferring structure from non-table pages.
  • For JS-heavy pages, use headless rendering:
python main.py \
  --url "https://example.com/weather" \
  --llm ollama \
  --model llama3:latest \
  --render \
  --hint "weather forecasts"

Install Playwright browsers first (one-time): python -m playwright install chromium.
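
Under the hood, --render fetches the page with headless Chromium via Playwright. A minimal sketch of that kind of rendered fetch (the exact implementation in this repo may differ):

from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Launch headless Chromium, wait for network activity to settle, and return the rendered HTML.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

print(fetch_rendered_html("https://example.com/weather")[:500])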

  • Ingest a local CSV instead of scraping:
python main.py \
  --csv ./data/my_table.csv \
  --llm none  # or openai/ollama if you want LLM cleaning
  • Ingest a JSON API response (a list, or an object containing a list), for example:
python main.py \
  --api "https://api.example.com/v1/weather/forecast?city=Paris" \
  --json-path data.daily \
  --llm none

Use --json-path to point at the list inside the JSON (dot-delimited), or omit if the top-level is already a list.
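
For illustration, dot-delimited path resolution of this kind can be sketched as follows (a hypothetical helper, not necessarily how the repo implements it):

import requests

def resolve_json_path(payload, json_path=None):
    # Walk a dot-delimited path (e.g. "data.daily") into the payload and return the list found there.
    node = payload
    if json_path:
        for key in json_path.split("."):
            node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"--json-path did not resolve to a list: {json_path!r}")
    return node

payload = requests.get("https://api.example.com/v1/weather/forecast?city=Paris", timeout=30).json()
rows = resolve_json_path(payload, "data.daily")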

  • When an LLM is enabled (--llm openai or --llm ollama), --selector (for HTML) and --json-path (for API JSON) become optional: the app attempts to auto-detect the first table or list before handing the data to the cleaner, which can infer the schema and normalize fields.
  • For HTML pages without tables (e.g., card/list layouts), with an LLM enabled the app will try to infer repeated items directly from the HTML; a short --hint can improve results (e.g., "cinema listings", "weather predictions").
  • For JS-rendered content, add --render to fetch with Playwright; combine with --hint to guide structure inference.
  • To create the database automatically if it is missing, pass --setup together with the database connection options:
python main.py \
  --url "https://example.com" \
  --selector "table.data" \
  --llm none \
  --setup \
  --db-host localhost \
  --db-port 5432 \
  --db-name scrape_results \
  --db-user postgres \
  --db-password secret

If DATABASE_URL is already set, --setup is optional. Without --setup, the app uses DATABASE_URL or the defaults in tools/db.py.

Example scenarios

  • Static table: python main.py --url "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)" --selector "table.wikitable" --llm none
  • Simple list page (no table) with LLM inference: python main.py --url "https://example.com/events" --llm ollama --model llama3:latest --hint "music events with date and venue"
  • API data: python main.py --api "https://api.example.com/v1/stocks?symbols=AAPL,GOOG" --json-path data.prices --llm none
  • CSV ingestion: python main.py --csv ./data/sales.csv --llm none
  • SPA with rendering + inference: python main.py --url "https://example.com/spa-products" --render --llm ollama --model llama3:latest --hint "product cards with name, price, rating"

Querying results in PostgreSQL

Assuming DATABASE_URL points at your DB and the scrape_results table exists:

  • Show recent rows (raw and cleaned JSON pretty-printed):
SELECT id, url, scraped_at, jsonb_pretty(raw_json) AS raw, jsonb_pretty(cleaned_json) AS cleaned
FROM scrape_results
ORDER BY scraped_at DESC
LIMIT 5;
  • Extract flattened cleaned rows (one row per JSON element):
SELECT r.id,
       r.url,
       r.scraped_at,
       elem ->> 'title' AS title,
       elem ->> 'date' AS date,
       elem ->> 'price' AS price
FROM scrape_results r
     CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
ORDER BY r.scraped_at DESC, r.id;
  • Filter by a value inside cleaned JSON (case-insensitive match on title):
SELECT r.id, r.url, elem
FROM scrape_results r
     CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
WHERE elem ->> 'title' ILIKE '%weather%'
ORDER BY r.id DESC;
  • Count cleaned rows per source URL:
SELECT url, SUM(jsonb_array_length(cleaned_json)) AS rows_count
FROM scrape_results
GROUP BY url
ORDER BY rows_count DESC;

On success, the script prints how many rows were stored and the inserted scrape_results.id.
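
If you would rather inspect results from Python than psql, here is a small sketch using SQLAlchemy, assuming DATABASE_URL is set as above:

import json
import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["DATABASE_URL"])
with engine.connect() as conn:
    result = conn.execute(text(
        "SELECT id, url, scraped_at, cleaned_json "
        "FROM scrape_results ORDER BY scraped_at DESC LIMIT 5"
    ))
    for row in result:
        # cleaned_json is returned as a Python list/dict because the column is JSONB.
        print(row.id, row.url, row.scraped_at, json.dumps(row.cleaned_json)[:120])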

Extending

  • Override LLM defaults in llm_provider.py.
  • Adjust scraping logic in tools/scraper.py (e.g., custom headers or auth); a sketch of that kind of customization follows this list.
  • Add more tasks/agents in tasks.py and agents.py to enrich the pipeline.
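
As a starting point for header/auth customization, a hedged sketch using requests and BeautifulSoup (illustrative only; tools/scraper.py may be structured differently and use other libraries):

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; crewai-table-scraper)",
    # "Authorization": "Bearer <token>",  # add auth headers here if the site requires them
}

def fetch_table_html(url: str, selector: str = "table") -> str:
    # Fetch the page with custom headers, then return the first element matching the CSS selector.
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        raise ValueError(f"No element matched selector {selector!r}")
    return str(table)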
