A production-ready web scraping pipeline that combines advanced stealth techniques with LLM-powered structured data extraction. This tool loads JavaScript-heavy pages using a headless browser with anti-detection features, converts them to token-efficient Markdown, and uses an LLM to extract structured data based on schemas you define with Pydantic—enforced at the API boundary via Instructor.
Traditional scrapers break when layouts change because they rely on fixed CSS selectors and regex. This project takes a different approach: it sends cleaned page content to an LLM with your custom schema and system prompt, so extraction follows semantic meaning rather than DOM structure. You define the fields (URLs, titles, prices, dates, etc.) using Pydantic models, and the pipeline handles the rest.
- Advanced stealth browser — Custom `StealthPlaywrightScraper` class with rotating user agents, screen resolutions, and timezones
- Anti-detection measures — Removes webdriver flags, injects Chrome runtime objects, and randomizes browser fingerprints
- Playwright stealth plugin — Integrated `playwright-stealth` for maximum stealth capabilities
- Proxy support — Built-in proxy configuration for IP rotation and geo-targeting
- Concurrent scraping — Semaphore-based concurrency control for scalable multi-page extraction
- Retry logic — Automatic retries with exponential backoff for resilient scraping
- HTML → Markdown conversion — Strips scripts, styles, navigation, and other noise before conversion to reduce token usage
- Token-aware chunking — Intelligently splits long pages at line boundaries using `tiktoken` to stay within model limits
- Schema-driven extraction — Define Pydantic models in `scripts/main.py`; Instructor validates every LLM response
- CSV export — Aggregated results written to `data/data.csv` for easy analysis
- Flexible wait conditions — Support for `load`, `domcontentloaded`, `networkidle`, and custom selectors
- Cloud or local LLMs — Groq (remote) or Ollama (local) via a single `is_local` flag
- Structured output — Guaranteed valid JSON conforming to your Pydantic schema
- Usage reporting — Per-chunk prompt, completion, and total token counts for cost tracking
- Zero-temperature extraction — Minimizes hallucinations for consistent, reliable results
1. Clone
git clone https://github.com/sadmanhsakib/llm_structured_scraper.git
cd llm_structured_scraper
2. Install dependencies
This project uses uv for fast, reliable dependency management:
uv sync
Playwright browser installation (required):
uv run playwright install chromium
3. Configure environment
copy example.env .env # Windows
cp example.env .env # macOS / Linux
Edit .env with your settings:
| Variable | Description | Example |
|---|---|---|
| `API_KEY` | Groq API key (remote only) | `gsk_...` |
| `SITE_URL` | Page to scrape | `https://example.com` |
| `LOCAL_MODEL_NAME` | Ollama model (local only) | `llama3.2` |
| `REMOTE_MODEL_NAME` | Groq model (remote only) | `llama-3.3-70b-versatile` |
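As a rough illustration of how these variables fit together, the sketch below picks the model variable that matches the chosen backend and fails early when something is missing. It is not the project's actual loader: `Settings` and `load_settings` are hypothetical names, and the rule that `API_KEY` is mandatory only in remote mode is inferred from the table above. In practice the values would first be loaded from `.env` (for example with python-dotenv) into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values read from .env."""
    site_url: str
    model_name: str
    api_key: Optional[str]  # only needed for remote (Groq) mode


def load_settings(is_local: bool, env: Mapping[str, str] = os.environ) -> Settings:
    """Pick the model variable for the chosen backend and validate it."""
    site_url = env.get("SITE_URL", "").strip()
    if not site_url:
        raise ValueError("SITE_URL is not set")

    # Assumption: local mode reads LOCAL_MODEL_NAME, remote mode REMOTE_MODEL_NAME.
    model_var = "LOCAL_MODEL_NAME" if is_local else "REMOTE_MODEL_NAME"
    model_name = env.get(model_var, "").strip()
    if not model_name:
        raise ValueError(f"Model name not configured: set {model_var}")

    api_key = env.get("API_KEY", "").strip() or None
    if not is_local and api_key is None:
        raise ValueError("API_KEY is required for remote mode")

    return Settings(site_url=site_url, model_name=model_name, api_key=api_key)
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of real environment variables.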
4. Customize extraction schema
Edit scripts/main.py to define what data you want to extract:
from typing import Optional
from pydantic import BaseModel, HttpUrl

class Schema(BaseModel):
    """Schema for a single extracted data point."""
    title: str
    url: HttpUrl
    price: Optional[str] = None

Update the SYSTEM_PROMPT to match your schema:
SYSTEM_PROMPT = """
You are a data extraction assistant.
Respond ONLY with a valid JSON object. No explanation, no markdown fences.
Extract product information: title, url, and price (if available).
"""
5. Run
From the repository root:
uv run python scripts/main.py
Output files:
- `data/webpage.md` — Cleaned markdown of the scraped page
- `data/data.csv` — Extracted structured data
SITE_URL (.env)
│
▼
┌───────────────────┐ data/webpage.md ┌─────────────┐ data/data.csv
│ scraper.py │ ──────────────────────► │ parser.py │ ─────────────────►
│ StealthPlaywright│ │ chunk → │
│ + Anti-Detection │ │ LLM → CSV │
│ + Markdown │ │ │
└───────────────────┘ └─────────────┘
▲ ▲
│ │
scripts/main.py (Schema, SYSTEM_PROMPT, orchestration)
- Fetch — `StealthPlaywrightScraper` launches a headless Chromium instance with randomized fingerprints (user agent, screen resolution, timezone). It applies stealth techniques to bypass bot detection and waits for the specified page load condition (`networkidle` by default).
- Convert — `export_as_markdown()` strips non-content HTML tags (scripts, styles, navigation, ads) and converts the remaining content to Markdown, preserving structure while dramatically reducing token count.
- Chunk — `chunk_text()` uses `tiktoken` to split the Markdown into chunks that fit within the model's context window (1,500 tokens for local, 6,000 for remote), splitting at line boundaries to preserve context.
- Extract — For each chunk, `generate_output()` sends the content to the LLM with your system prompt and Pydantic schema. Instructor enforces structured output, ensuring every response is valid JSON matching your schema.
- Export — All extracted records are aggregated into a single `SchemaCollection` and written to `data/data.csv` using pandas.
Temperature is set to 0.0 to minimize hallucinations and ensure consistent, factual extraction.
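The Convert step can be pictured with a small standard-library sketch. It is not the project's `export_as_markdown()`; it only illustrates the idea of dropping noise elements before conversion. `NOISE_TAGS`, `NoiseStripper`, and `strip_noise` are hypothetical names, the tag list is an assumption, and the output here is plain text rather than Markdown.

```python
from html.parser import HTMLParser

# Assumed set of non-content tags; the real pipeline may use a different list.
NOISE_TAGS = {"script", "style", "nav", "footer", "noscript"}


class NoiseStripper(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # >0 while inside a noise element
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside noise elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def strip_noise(html: str) -> str:
    """Return the visible text of `html` with noise elements removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Removing these elements before conversion is what keeps the token count low: inline scripts and navigation menus often outweigh the actual content of a page.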
The pipeline is designed to be generic—you control what gets extracted by editing scripts/main.py.
Fields must match the data you want the LLM to extract:
from pydantic import BaseModel, HttpUrl
from typing import Optional

class Schema(BaseModel):
    """Schema for a single extracted data point."""
    title: str
    duration: str
    url: HttpUrl
    author: Optional[str] = None

The parser wraps your schema in SchemaCollection (a list of Schema objects) to handle multiple records per chunk.
Keep instructions strict and focused. The LLM should return only JSON—no markdown fences, no explanations:
SYSTEM_PROMPT = """
You are a data extraction assistant.
Respond ONLY with a valid JSON object. No explanation, no markdown fences.
Extract video metadata from the provided content:
- title: video title
- duration: video length
- url: video URL
- author: channel or creator name (if available)
"""
In the main() function, adjust scraping parameters as needed:
async def main():
    html_content = await scraper.fetch_page(
        url=SITE_URL,
        wait_until="networkidle",  # Options: "load", "domcontentloaded", "networkidle"
        timeout=30000,  # 30 seconds
        wait_for_selector=".content",  # Optional: wait for specific element
        wait_for_timeout=2000,  # Optional: additional wait in ms
    )
    output_path = scraper.export_as_markdown(html_content)
    parser.extract_data_from_markdown(
        md_path=output_path,
        SYSTEM_PROMPT=SYSTEM_PROMPT,
        is_local=False,  # True → Ollama, False → Groq
    )

For more control, use the StealthPlaywrightScraper class directly:
from scraper import StealthPlaywrightScraper

async def advanced_scraping():
    async with StealthPlaywrightScraper(
        headless=True,
        max_concurrent_browsers=3,
        proxy={"server": "http://proxy.example.com:8080"},
        use_stealth=True,
    ) as scraper:
        html = await scraper.fetch_page(
            url="https://example.com",
            wait_until="networkidle",
            retry_count=3,
            delay_between_retries=2.0,
        )
        # Process html...

Default chunk sizes are 1,500 tokens (local) and 6,000 tokens (remote). To override:
# In parser.py, modify the chunk_text call:
chunks = chunk_text(markdown_content, max_tokens=4000)

llm-structured-scraper/
├── scripts/
│ ├── main.py # Entry point: schema, system prompt, orchestration
│ ├── scraper.py # StealthPlaywrightScraper class + HTML → Markdown
│ ├── parser.py # Token-aware chunking, LLM calls, CSV export
│ └── test.py # Optional utilities for data/data.csv analysis
├── data/ # Generated at runtime (gitignored)
│ ├── webpage.md # Intermediate markdown output
│ └── data.csv # Final structured data export
├── .venv/ # Virtual environment (uv managed)
├── pyproject.toml # Project metadata and dependencies
├── uv.lock # Locked dependency versions
├── example.env # Environment variable template
├── .env # Your actual configuration (gitignored)
├── .python-version # Python version specification
└── README.md # This file
Important: Always run commands from the repository root so relative paths like data/webpage.md resolve correctly.
- Python 3.14+ (specified in `pyproject.toml`)
- uv package manager — Installation guide
- Playwright Chromium — Install via `uv run playwright install chromium`
- Groq API key (for remote mode) — Get yours at groq.com
- Ollama (for local mode, optional) — Download and install
If you want to use local LLMs:
- Install Ollama from ollama.com
- Pull a model: `ollama pull llama3.2`
- Start the server: `ollama serve` (usually runs automatically)
- Set `LOCAL_MODEL_NAME=llama3.2` in `.env`
- Use `is_local=True` in your code
Run the complete scraping and extraction pipeline:
uv run python scripts/main.py
This will:
- Fetch the page specified in `SITE_URL`
- Convert HTML to markdown and save to `data/webpage.md`
- Extract structured data according to your schema
- Export results to `data/data.csv`
If you only need to fetch and convert a page to markdown:
import asyncio
from scripts.scraper import fetch_page, export_as_markdown

async def scrape_only():
    html = await fetch_page(
        url="https://example.com",
        wait_until="networkidle",
        timeout=30000,
    )
    output_path = export_as_markdown(html)
    print(f"Markdown saved to: {output_path}")

asyncio.run(scrape_only())

If you already have a markdown file and want to extract data from it:
from scripts import parser, main

# Requires data/webpage.md to exist
parser.extract_data_from_markdown(
    md_path="data/webpage.md",
    SYSTEM_PROMPT=main.SYSTEM_PROMPT,
    is_local=False,  # or True for Ollama
)

For maximum control over scraping behavior:
import asyncio
from scripts.scraper import StealthPlaywrightScraper

async def advanced_scrape():
    scraper = StealthPlaywrightScraper(
        headless=True,
        max_concurrent_browsers=5,
        proxy=None,  # or {"server": "http://proxy:8080", "username": "user", "password": "pass"}
        use_stealth=True,
    )
    await scraper.start()
    try:
        html = await scraper.fetch_page(
            url="https://example.com",
            wait_until="load",
            timeout=60000,
            retry_count=5,
            delay_between_retries=3.0,
            wait_for_selector=".main-content",  # Optional
            wait_for_timeout=1000,  # Optional additional wait in ms
        )
        print(f"Fetched {len(html)} characters")
    finally:
        await scraper.close()

asyncio.run(advanced_scrape())

Scrape multiple pages concurrently using the context manager:
import asyncio
from scripts.scraper import StealthPlaywrightScraper

async def scrape_multiple():
    urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
    async with StealthPlaywrightScraper(
        headless=True,
        max_concurrent_browsers=3,
        use_stealth=True,
    ) as scraper:
        tasks = [scraper.fetch_page(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                print(f"Failed to scrape {url}: {result}")
            else:
                print(f"Successfully scraped {url}: {len(result)} chars")

asyncio.run(scrape_multiple())

| Feature | Groq (remote) | Ollama (local) |
|---|---|---|
| Speed | Fast (cloud GPUs) | Hardware-dependent |
| Privacy | Data sent to Groq | Stays on your machine |
| Cost | API usage fees | Free (your compute) |
| Setup | `API_KEY` + `REMOTE_MODEL_NAME` | Ollama + `LOCAL_MODEL_NAME` |
| Default chunk size | 6,000 tokens | 1,500 tokens |
| Best for | Production, large-scale | Development, sensitive data |
Groq:
- `llama-3.3-70b-versatile` — Best balance of speed and quality
- `llama-3.1-8b-instant` — Fastest, good for simple extraction

Ollama:
- `llama3.2` — Good general-purpose model
- `mistral` — Fast and efficient
- `qwen2.5` — Excellent for structured output
Local chunks use a smaller default (1,500 tokens) to accommodate typical 7B–14B model context windows. Remote models with larger contexts can handle 6,000+ tokens per chunk.
The pipeline uses tiktoken with the cl100k_base encoding to estimate token counts. While this encoding is designed for OpenAI models, it provides a reasonable approximation for other models and helps prevent context window overflows.
- Line-based splitting — Text is split at line boundaries to preserve markdown structure
- Token counting — Each line is counted using `tiktoken` before being added to a chunk
- Boundary respect — A new chunk is started whenever adding the next line would push the current chunk past `max_tokens`
- Oversized line handling — A single line exceeding `max_tokens` is placed in its own chunk, which is the one case where a chunk can exceed the limit
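The splitting rules above can be sketched as follows. This is a minimal illustration, not the code in `parser.py`: the signature is assumed, and the default counter `approx_token_count` is a hypothetical whitespace-based stand-in so the example runs without extra dependencies.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Stand-in counter (whitespace-separated words); not a real tokenizer."""
    return len(text.split())


def chunk_text(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Group whole lines into chunks whose summed line counts stay within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for line in text.splitlines():
        line_tokens = count_tokens(line)
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        # A line larger than max_tokens lands here with an empty `current`,
        # so it ends up alone in its own chunk.
        current.append(line)
        current_tokens += line_tokens

    if current:
        chunks.append("\n".join(current))
    return chunks
```

To mirror the tokenizer described above, one could pass `count_tokens=lambda s: len(enc.encode(s))` with `enc = tiktoken.get_encoding("cl100k_base")`. Summing per-line counts slightly underestimates the joined text because newlines are not counted, so leaving some headroom below the model's context limit is prudent.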
| Scenario | Recommended max_tokens | Reason |
|---|---|---|
| Local 7B models | 1,500 | Fits comfortably in 4K context |
| Local 13B+ models | 3,000 | Can handle larger contexts |
| Groq cloud models | 6,000 | Take advantage of larger contexts |
| Dense content | Lower values | Ensures extraction doesn't miss items |
| Sparse content | Higher values | Reduces API calls |
Each chunk logs its estimated token count and the LLM usage:
Processing chunk 1/3 (Estimated tokens: 1847)
Usage: P:1847 C:324 T:2171
- P = Prompt tokens (your chunk + system prompt)
- C = Completion tokens (LLM's response)
- T = Total tokens (P + C)
Use this to optimize chunk sizes and estimate API costs.
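For a rough cost estimate, the usage lines can be summed and multiplied by per-token prices. The sketch below assumes the log format shown above (`Usage: P:… C:… T:…`). `summarize_usage` and `estimate_cost` are hypothetical helpers, and the prices are parameters you supply from your provider's current pricing page; none are built in.

```python
import re
from typing import Dict

# Matches lines such as "Usage: P:1847 C:324 T:2171".
USAGE_RE = re.compile(r"Usage:\s*P:(\d+)\s+C:(\d+)\s+T:(\d+)")


def summarize_usage(log_text: str) -> Dict[str, int]:
    """Sum prompt, completion, and total tokens over every usage line."""
    totals = {"prompt": 0, "completion": 0, "total": 0}
    for prompt, completion, total in USAGE_RE.findall(log_text):
        totals["prompt"] += int(prompt)
        totals["completion"] += int(completion)
        totals["total"] += int(total)
    return totals


def estimate_cost(
    usage: Dict[str, int],
    prompt_price_per_million: float,
    completion_price_per_million: float,
) -> float:
    """Cost in the same currency unit as the supplied per-million-token prices."""
    return (
        usage["prompt"] * prompt_price_per_million
        + usage["completion"] * completion_price_per_million
    ) / 1_000_000
```

Prompt and completion tokens are priced separately because providers commonly charge different rates for each.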
The StealthPlaywrightScraper class implements multiple anti-detection techniques to bypass bot detection systems:
Each request uses randomized characteristics to avoid fingerprint tracking:
- User agents — Rotates between 5 recent Chrome, Firefox, and Safari user agents
- Screen resolutions — Randomly selects from common resolutions (1920×1080, 2560×1440, etc.)
- Timezones — Randomizes timezone IDs across major regions (US, Europe, Asia)
- Geolocation — Sets realistic geolocation coordinates (defaults to NYC)
- Webdriver flag removal — Overrides `navigator.webdriver` property
- Chrome runtime injection — Adds `window.chrome` object to mimic real Chrome
- Plugin fingerprinting — Provides realistic plugin list instead of empty array
- Permission queries — Intercepts and properly handles permission API calls
- Language headers — Sets realistic Accept-Language headers
- Custom headers — Adds standard browser headers (Accept-Encoding, DNT, etc.)
- Stealth plugin — Applies `playwright-stealth` for additional protections
- Launch arguments — Disables automation flags and site isolation features
- Retry logic — Automatic retries with exponential backoff on failures
- Semaphore-based concurrency — Prevents resource exhaustion during concurrent scraping
- Set `headless=False` for debugging (see the actual browser)
- Adjust `max_concurrent_browsers` based on your system resources
- Use proxies for IP rotation: `proxy={"server": "http://...", "username": "...", "password": "..."}`
- Set `use_stealth=False` only if the target doesn't have bot detection
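The retry and concurrency items in the list above can be sketched with `asyncio` alone. This is not the scraper's internal code: `with_retries` and `fetch_all` are hypothetical helpers, `fetch` stands in for any coroutine such as `scraper.fetch_page`, and treating `retry_count` as the number of retries after the first attempt is an assumption.

```python
import asyncio
import random
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def with_retries(
    op: Callable[[], Awaitable[T]],
    retry_count: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run op(); after a failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(retry_count + 1):
        try:
            return await op()
        except Exception:
            if attempt == retry_count:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            # Small random jitter so parallel tasks do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
    retry_count: int = 3,
    base_delay: float = 1.0,
) -> list:
    """Fetch every URL with at most max_concurrent fetches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> str:
        # The slot is held across retries, so backoff waits also count
        # against the concurrency limit.
        async with semaphore:
            return await with_retries(lambda: fetch(url), retry_count, base_delay)

    # return_exceptions=True keeps one failing URL from cancelling the rest.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
```

Holding the semaphore during backoff is a deliberate simplification. Releasing it between attempts would give better throughput at the cost of slightly more code.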
| Library | Version | Role |
|---|---|---|
| Playwright | 1.60+ | Headless browser automation |
| playwright-stealth | 2.0+ | Anti-detection and bot evasion |
| markdownify | 1.2+ | HTML to Markdown conversion |
| Instructor | 1.15+ | Structured LLM output validation |
| Pydantic | 2.13+ | Schema definition and validation |
| Groq | 1.2+ | Remote LLM inference (cloud) |
| OpenAI | — | Ollama client compatibility |
| tiktoken | 0.13+ | Token counting and estimation |
| pandas | 3.0+ | CSV export and data manipulation |
| python-dotenv | 0.9+ | Environment variable management |
All dependencies are specified in pyproject.toml and locked in uv.lock for reproducible builds.
| Issue | Solution |
|---|---|
| `Input file not found: data/webpage.md` | Run the scraper first, or ensure you're running from the repository root. |
| `Model name not configured` | Set `LOCAL_MODEL_NAME` (+ Ollama running) or `REMOTE_MODEL_NAME` + `API_KEY` in `.env`. |
| `playwright._impl._errors.Error: Executable doesn't exist` | Install Chromium: `uv run playwright install chromium` |
| Empty CSV or missing rows | 1) Check `data/webpage.md` has content 2) Adjust `SYSTEM_PROMPT` to be more specific 3) Increase chunk size 4) Verify target page actually contains the data |
| `Connection refused` (Ollama) | 1) Start Ollama: `ollama serve` 2) Pull model: `ollama pull llama3.2` 3) Verify it's running on port 11434 |
| `TimeoutError` on `page.goto()` | 1) Use `wait_until="load"` instead of `"networkidle"` 2) Increase `timeout` parameter 3) Use `wait_for_selector` for dynamic content |
| Invalid JSON or validation errors | 1) Make `SYSTEM_PROMPT` more strict 2) Add examples to the prompt 3) Set `max_retries=3` in `generate_output()` for production 4) Try a different model |
| Bot detection / 403 errors | 1) Ensure `use_stealth=True` 2) Add proxy configuration 3) Increase delays between requests 4) Try `headless=False` to debug |
| High memory usage | 1) Reduce `max_concurrent_browsers` 2) Decrease chunk size 3) Process fewer pages at once |
| Slow extraction | 1) Use Groq instead of local Ollama 2) Increase chunk size to reduce API calls 3) Use faster model (e.g., `llama-3.1-8b-instant`) |
| Import errors | Ensure you're running from repo root: `uv run python scripts/main.py` (not `cd scripts && python main.py`) |
- Be specific — Define exact field types (str, int, HttpUrl, datetime, etc.)
- Use Optional — Mark fields that may not always be present as `Optional[type]`
- Add descriptions — Use docstrings and Field descriptions for better LLM understanding
- Keep it simple — Complex nested schemas are harder for LLMs to fill correctly
- Be explicit — Tell the model exactly what format you want ("valid JSON object")
- Prohibit extras — Explicitly forbid markdown fences, explanations, and conversational text
- Provide examples — Show 1-2 example outputs in the prompt for complex schemas
- Define edge cases — Explain how to handle missing data, ambiguous fields, etc.
- Right-size chunks — Larger chunks = fewer API calls but may miss edge cases
- Use remote for production — Groq is significantly faster than local Ollama
- Cache markdown — Save `webpage.md` and iterate on extraction without re-scraping
- Batch similar pages — Process multiple pages with the same schema together
- Monitor token usage — Use the logged token counts to optimize costs
- Respect robots.txt — Check if scraping is allowed before targeting a site
- Rate limiting — Add delays between requests to avoid overwhelming servers
- Error handling — Always handle exceptions and implement retry logic
- Session persistence — For authenticated scraping, reuse browser contexts
- Legal compliance — Ensure your use case complies with terms of service and local laws
- Validate output — Check `data/data.csv` after extraction for accuracy
- Test on samples — Try different pages to ensure schema generalization
- Handle duplicates — Implement deduplication if processing multiple pages
- Version your schemas — Track changes to extraction schemas over time
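The deduplication advice in the list above could look like the sketch below when records from several pages are merged. It uses the standard `csv` module rather than pandas to stay dependency-free. `dedupe_records` and `to_csv` are hypothetical helpers, and using the `url` field as the identity key (ignoring a trailing slash) is an assumption that may not suit every schema.

```python
import csv
import io
from typing import Dict, Iterable, List


def dedupe_records(
    records: Iterable[Dict[str, str]], key: str = "url"
) -> List[Dict[str, str]]:
    """Keep the first record for each key value; records without a key are all kept."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for record in records:
        # Light normalization only: surrounding whitespace and a trailing slash.
        value = record.get(key, "").strip().rstrip("/")
        if value and value in seen:
            continue
        if value:
            seen.add(value)
        unique.append(record)
    return unique


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize records to CSV text, taking the header from the first record."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Keeping the first occurrence preserves page order, which makes it easier to trace a row in the output back to the chunk it came from.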
Approximate performance on a typical product listing page (50KB HTML, 5KB markdown):
| Configuration | Scrape time | Extract time | Total | Cost (per page) |
|---|---|---|---|---|
| Groq (llama-3.3-70b) | 5s | 2s | 7s | ~$0.0001 |
| Groq (llama-3.1-8b) | 5s | 1s | 6s | ~$0.00005 |
| Ollama (llama3.2, M1 Mac) | 5s | 15s | 20s | Free |
| Ollama (llama3.2, RTX 4090) | 5s | 8s | 13s | Free |
Times and costs are estimates and will vary based on page complexity and infrastructure.
This project is near completion. If you encounter bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.
Provided as-is for educational and personal use. No warranty is expressed or implied.