Scrape an HTML table from any URL with a CSS selector, optionally clean it with an LLM, and persist both raw and cleaned JSONB into PostgreSQL.
- Python 3.10+
- PostgreSQL instance you can reach (local is fine).
- (Optional) OpenAI API key for cloud LLM cleaning, or a running Ollama server for local models.
- Clone or copy this folder, then create and activate a virtual env:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables (export them in your shell or use a `.env` file):
  - `DATABASE_URL` (required), e.g. `postgresql+psycopg2://postgres:postgres@localhost:5432/postgres`
  - `OPENAI_API_KEY` (for `--llm openai`)
  - `OPENAI_MODEL` (optional, default `gpt-4o-mini`)
  - `OLLAMA_BASE_URL` and `OLLAMA_MODEL` (for `--llm ollama`; defaults `http://localhost:11434` and `llama3`)
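If you use a `.env` file, you can sanity-check that it loads before running the pipeline. A minimal sketch, assuming `python-dotenv` is installed (check `requirements.txt` for the loader this project actually uses):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory, if present
print("DATABASE_URL   :", os.getenv("DATABASE_URL", "<unset>"))
print("OPENAI_MODEL   :", os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
print("OLLAMA_BASE_URL:", os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"))
print("OLLAMA_MODEL   :", os.getenv("OLLAMA_MODEL", "llama3"))
```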
If you prefer a standalone one-time bootstrap instead of `--setup`, you can still run:

```bash
python setup_db.py --host localhost --port 5432 --db-name scrape_results --user postgres --password secret
```

Local URL example: `postgresql+psycopg2://postgres@localhost:5432/yourdb`
`tools/db.py` auto-creates `scrape_results` with:

- `id SERIAL PRIMARY KEY`
- `url TEXT`
- `scraped_at TIMESTAMP DEFAULT now()`
- `raw_json JSONB`
- `cleaned_json JSONB`
- `llm_summary TEXT`
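For reference, a sketch of roughly equivalent bootstrap code, assuming SQLAlchemy (the URL format above is SQLAlchemy-style); the actual statements in `tools/db.py` may differ:

```python
import os

from sqlalchemy import create_engine, text

# Columns mirror the list above.
DDL = """
CREATE TABLE IF NOT EXISTS scrape_results (
    id           SERIAL PRIMARY KEY,
    url          TEXT,
    scraped_at   TIMESTAMP DEFAULT now(),
    raw_json     JSONB,
    cleaned_json JSONB,
    llm_summary  TEXT
);
"""

engine = create_engine(os.environ["DATABASE_URL"])
with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(text(DDL))
```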
```bash
python main.py \
  --url "https://example.com" \
  --selector "table.my-table" \
  --llm openai  # or ollama or none
```

- `--selector` is optional if you have an LLM; otherwise the first `<table>` on the page is used. Change the CSS selector via `--selector` (e.g. `#pricing table`).
- Switch LLM providers with `--llm openai|ollama|none`; override models with `--model NAME`.
- To skip LLM cleaning entirely, use `--llm none` (basic normalization is applied).
- An optional `--hint "cinema listings"` (or similar) gives the LLM context when inferring structure from non-table pages.
- For JS-heavy pages, use headless rendering:
```bash
python main.py \
  --url "https://example.com/weather" \
  --llm ollama \
  --model llama3:latest \
  --render \
  --hint "weather forecasts"
```

Install Playwright browsers first (one-time): `python -m playwright install chromium`.
- Ingest a local CSV instead of scraping:

  ```bash
  python main.py \
    --csv ./data/my_table.csv \
    --llm none  # or openai/ollama if you want LLM cleaning
  ```

- Ingest a JSON API (a list, or an object containing a list):
  ```bash
  python main.py \
    --api "https://api.example.com/v1/weather/forecast?city=Paris" \
    --json-path data.daily \
    --llm none
  ```

  Use `--json-path` to point at the list inside the JSON (dot-delimited), or omit it if the top level is already a list; the sketch below shows what the dot-delimited path resolves to.
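Illustratively, a resolver for a path like `data.daily` might look like this (`resolve_json_path` is invented for this sketch; the app's own resolver may differ):

```python
def resolve_json_path(payload, path):
    """Walk nested dicts by dot-delimited keys and return the list found there."""
    node = payload
    for key in path.split("."):
        node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"--json-path {path!r} did not resolve to a list")
    return node

# Shape matching the example above: the forecast list lives under data.daily.
payload = {"data": {"daily": [{"day": "Mon", "temp_c": 21}, {"day": "Tue", "temp_c": 19}]}}
print(resolve_json_path(payload, "data.daily"))
```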
- When an LLM is enabled (`--llm openai` or `--llm ollama`), `--selector` (for HTML) and `--json-path` (for API JSON) become optional: the app will attempt to auto-detect the first table or list before handing it to the cleaner, which can infer schema and normalize fields.
- For HTML pages without tables (e.g., card/list layouts), with an LLM enabled the app will try to infer repeated items directly from the HTML; a short `--hint` can improve results (e.g., "cinema listings", "weather predictions").
- For JS-rendered content, add `--render` to fetch with Playwright; combine with `--hint` to guide structure inference.
- Automatically ensure the database exists (create it if missing) by passing `--setup` and database connection options:
  ```bash
  python main.py \
    --url "https://example.com" \
    --selector "table.data" \
    --llm none \
    --setup \
    --db-host localhost \
    --db-port 5432 \
    --db-name scrape_results \
    --db-user postgres \
    --db-password secret
  ```

  If `DATABASE_URL` is already set, `--setup` is optional. Without `--setup`, the app uses `DATABASE_URL` or the defaults in `tools/db.py`.
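A `--setup`-style bootstrap typically amounts to connecting to the default `postgres` database and creating the target database if it is missing. A hedged sketch of that pattern with `psycopg2` (connection values mirror the flags above; the app's actual logic may differ):

```python
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="secret")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", ("scrape_results",))
    if cur.fetchone() is None:
        cur.execute(sql.SQL("CREATE DATABASE {}").format(sql.Identifier("scrape_results")))
conn.close()
```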
- Static table:

  ```bash
  python main.py --url "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)" --selector "table.wikitable" --llm none
  ```

- Simple list page (no table) with LLM inference:

  ```bash
  python main.py --url "https://example.com/events" --llm ollama --model llama3:latest --hint "music events with date and venue"
  ```

- API data:

  ```bash
  python main.py --api "https://api.example.com/v1/stocks?symbols=AAPL,GOOG" --json-path data.prices --llm none
  ```

- CSV ingestion:

  ```bash
  python main.py --csv ./data/sales.csv --llm none
  ```

- SPA with rendering + inference:

  ```bash
  python main.py --url "https://example.com/spa-products" --render --llm ollama --model llama3:latest --hint "product cards with name, price, rating"
  ```
Assuming `DATABASE_URL` points at your DB and the `scrape_results` table exists:

- Show recent rows (raw/cleaned pretty-printed):

  ```sql
  SELECT id, url, scraped_at, jsonb_pretty(raw_json) AS raw, jsonb_pretty(cleaned_json) AS cleaned
  FROM scrape_results
  ORDER BY scraped_at DESC
  LIMIT 5;
  ```

- Extract flattened cleaned rows (one row per JSON element):
  ```sql
  SELECT r.id,
         r.url,
         r.scraped_at,
         elem ->> 'title' AS title,
         elem ->> 'date'  AS date,
         elem ->> 'price' AS price
  FROM scrape_results r
  CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
  ORDER BY r.scraped_at DESC, r.id;
  ```

- Filter by a value inside cleaned JSON (case-insensitive match on title):
  ```sql
  SELECT r.id, r.url, elem
  FROM scrape_results r
  CROSS JOIN LATERAL jsonb_array_elements(r.cleaned_json) AS elem
  WHERE elem ->> 'title' ILIKE '%weather%'
  ORDER BY r.id DESC;
  ```

- Count cleaned rows per source:
  ```sql
  SELECT url, SUM(jsonb_array_length(cleaned_json)) AS rows_count
  FROM scrape_results
  GROUP BY url
  ORDER BY rows_count DESC;
  ```

On success, the script prints how many rows were stored and the inserted `scrape_results.id`.
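You can also read results back programmatically. A minimal sketch with `psycopg2` (note that a SQLAlchemy-style URL such as `postgresql+psycopg2://...` needs its driver suffix stripped before libpq will accept it):

```python
import os

import psycopg2

# psycopg2 expects a plain libpq URL, so drop SQLAlchemy's "+psycopg2" suffix.
dsn = os.environ["DATABASE_URL"].replace("postgresql+psycopg2://", "postgresql://")
conn = psycopg2.connect(dsn)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, url, scraped_at, jsonb_array_length(cleaned_json) AS n_rows "
        "FROM scrape_results ORDER BY scraped_at DESC LIMIT 1"
    )
    print(cur.fetchone())  # (id, url, scraped_at, number of cleaned rows)
conn.close()
```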
- Override LLM defaults in `llm_provider.py`.
- Adjust scraping logic in `tools/scraper.py` (e.g., custom headers or auth); see the sketch below.
- Add more tasks/agents in `tasks.py` and `agents.py` to enrich the pipeline.
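As an illustration of the kind of change that slots into `tools/scraper.py`, a hypothetical fetch helper with custom headers and bearer auth (the function name and signature are invented for this sketch):

```python
import requests

def fetch_html(url: str, token: str | None = None) -> str:
    """Hypothetical helper; the real tools/scraper.py may be organized differently."""
    headers = {"User-Agent": "my-scraper/1.0"}  # custom header example
    if token:
        headers["Authorization"] = f"Bearer {token}"  # simple bearer auth
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text
```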