# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A Model Context Protocol (MCP) server for efficient web scraping. Built in Python on FastMCP, it gives AI tools standardized scraping capabilities through four main tools: raw HTML scraping, markdown conversion, text extraction, and link extraction. All tools support both single-URL and batch operations with intelligent retry logic.

## Development Commands

### Environment Setup

```bash
# Install dependencies (uses the uv package manager)
uv pip install -e ".[dev]"
```

### Running the Server

```bash
# Run locally with default settings
python -m scraper_mcp

# Run with a specific transport, host, and port
python -m scraper_mcp streamable-http 0.0.0.0 8000

# Run with Docker
docker-compose up -d
docker-compose logs -f
docker-compose down
```

### Testing

```bash
# Run all tests with coverage
pytest

# Run a specific test file
pytest tests/test_server.py

# Run a specific test class
pytest tests/test_server.py::TestScrapeUrlTool

# Run a specific test function
pytest tests/test_server.py::TestScrapeUrlTool::test_scrape_url_success

# Run with verbose output
pytest -v

# Run without the coverage report
pytest --no-cov
```

### Code Quality

```bash
# Type checking
mypy src/

# Linting
ruff check .

# Auto-fix linting issues
ruff check . --fix

# Format code
ruff format .
```

## Architecture

### Provider Pattern

The server uses an extensible provider architecture for different scraping backends:

- `ScraperProvider` (`providers/base.py`): abstract interface defining the `scrape()` and `supports_url()` methods
- `RequestsProvider` (`providers/requests_provider.py`): default HTTP scraper built on the `requests` library with exponential backoff retry logic
- Future extensibility: Playwright, Selenium, or Scrapy providers can be added for JavaScript-heavy sites or specialized scraping

The `get_provider()` function in `server.py` routes URLs to the appropriate provider; it currently defaults to `RequestsProvider` for all HTTP/HTTPS URLs.
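
A minimal sketch of this contract, with names and signatures assumed from the description above (the actual definitions in `providers/base.py` are authoritative; the real `get_provider()` takes only a URL):

```python
# Sketch of the provider contract and routing; names and signatures are
# assumptions based on the description above, not the actual source.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ScrapeResult:
    url: str
    content: str
    status_code: int
    metadata: dict = field(default_factory=dict)


class ScraperProvider(ABC):
    @abstractmethod
    async def scrape(self, url: str, timeout: float = 30.0) -> ScrapeResult:
        """Fetch the URL and return raw HTML plus request metadata."""

    @abstractmethod
    def supports_url(self, url: str) -> bool:
        """Return True if this provider can handle the given URL."""


def get_provider(url: str, providers: list[ScraperProvider]) -> ScraperProvider:
    # Return the first registered provider that claims the URL; the project's
    # version simply defaults to RequestsProvider for all HTTP/HTTPS URLs.
    for provider in providers:
        if provider.supports_url(url):
            return provider
    raise ValueError(f"No provider supports {url}")
```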

### Tool Architecture

All four MCP tools (`scrape_url`, `scrape_url_markdown`, `scrape_url_text`, `scrape_extract_links`) follow a dual-mode pattern:

1. Single-URL mode: returns a `ScrapeResponse` or `LinksResponse` directly
2. Batch mode: accepts a `list[str]` of URLs and returns a `BatchScrapeResponse` or `BatchLinksResponse` with individual results and success/failure counts

Batch operations use an `asyncio.Semaphore` (default concurrency: 5) to limit concurrent requests and `asyncio.gather()` for parallel execution.
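
A sketch of that fan-out pattern; the helper name and the dict result shape are illustrative, not the actual helpers in `server.py`:

```python
# Illustrative batch helper; the real helpers in server.py may differ.
import asyncio


async def scrape_batch(provider, urls: list[str], concurrency: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def _scrape_one(url: str) -> dict:
        async with semaphore:  # at most `concurrency` requests in flight at once
            try:
                result = await provider.scrape(url)
                return {"url": url, "success": True, "content": result.content}
            except Exception as exc:  # one failure must not abort the whole batch
                return {"url": url, "success": False, "error": str(exc)}

    # gather() preserves input order, so results line up with `urls`
    return await asyncio.gather(*(_scrape_one(u) for u in urls))
```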

### HTML Processing Utilities

All utilities in `utils.py` use BeautifulSoup with the `lxml` parser (sketched after this list):

- `html_to_markdown()`: converts HTML to Markdown using `markdownify` with the ATX heading style
- `html_to_text()`: extracts plain text, stripping `script`/`style`/`meta`/`link`/`noscript` tags by default
- `extract_links()`: extracts all `<a>` tags, resolving URLs with `urllib.parse.urljoin()`
- `extract_metadata()`: extracts the `<title>` and all `<meta>` tags (`name`/`property` attributes)
- `filter_html_by_selector()`: filters HTML using CSS selectors via BeautifulSoup's `.select()` method and returns a tuple of `(filtered_html, element_count)`
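
A sketch of how two of these helpers might look with BeautifulSoup and the lxml parser; the real code in `utils.py` may differ in details:

```python
# Illustrative versions of two utils.py helpers; assumptions, not the real code.
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "lxml")
    links = []
    for a in soup.find_all("a", href=True):
        links.append({
            "url": urljoin(base_url, a["href"]),  # resolve relative hrefs
            "text": a.get_text(strip=True),
        })
    return links


def filter_html_by_selector(html: str, css_selector: str) -> tuple[str, int]:
    soup = BeautifulSoup(html, "lxml")
    matches = soup.select(css_selector)  # Soup Sieve handles the CSS syntax
    filtered = "".join(str(el) for el in matches)
    return filtered, len(matches)
```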

### CSS Selector Filtering

All four tools accept an optional `css_selector` parameter for targeted content extraction (sketched after this list):

- Purpose: "what to KEEP" filtering (inclusive), complementary to `strip_tags` (exclusive)
- Implementation: applied BEFORE any other processing (markdown/text conversion, link extraction)
- Selector support: full CSS selector syntax via Soup Sieve (tags, classes, IDs, attributes, combinators, pseudo-classes)
- Processing order:
  1. Scrape HTML from the provider
  2. Apply the CSS selector filter (if provided)
  3. Apply `strip_tags` (if provided)
  4. Convert to markdown/text or extract links
- Metadata: when a selector is provided, `css_selector_applied` and `elements_matched` are added to the response metadata
- Batch support: works in both single-URL and batch modes
- Empty results: returns an empty string with `count=0` if no elements match (graceful degradation)
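
A sketch of that order for the markdown tool; it reuses `filter_html_by_selector` from the earlier sketch plus a hypothetical `html_to_markdown`, and simplifies the response shape to a dict:

```python
# Hypothetical pipeline showing the processing order; not the actual server.py.
async def scrape_markdown_single(
    provider,
    url: str,
    css_selector: str | None = None,
    strip_tags: list[str] | None = None,
) -> dict:
    result = await provider.scrape(url)                      # 1. scrape HTML
    html = result.content
    metadata: dict = {}

    if css_selector:                                          # 2. keep only matches
        html, count = filter_html_by_selector(html, css_selector)
        metadata["css_selector_applied"] = css_selector
        metadata["elements_matched"] = count
        if count == 0:                                        # graceful degradation
            return {"content": "", "metadata": metadata}

    markdown = html_to_markdown(html, strip_tags=strip_tags)  # 3-4. strip, then convert
    return {"content": markdown, "metadata": metadata}
```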

### Retry Logic

All scraping operations implement exponential backoff (sketched below):

- Defaults: 3 retries, 30s timeout, 1s initial delay
- Backoff schedule: 1s, 2s, 4s (exponential: `retry_delay * 2 ** (attempt - 1)`)
- Retryable errors: `requests.Timeout`, `requests.ConnectionError`, `requests.HTTPError`
- Metadata tracking: all responses include `attempts`, `retries`, and `elapsed_ms` fields
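
A minimal sketch of that loop, assuming the parameter names above; the actual implementation lives in `RequestsProvider.scrape()`:

```python
# Illustrative retry loop with exponential backoff; a sketch, not the exact
# code in RequestsProvider.scrape().
import time

import requests

RETRYABLE = (requests.Timeout, requests.ConnectionError, requests.HTTPError)


def fetch_with_retries(url: str, max_retries: int = 3, retry_delay: float = 1.0,
                       timeout: float = 30.0) -> requests.Response:
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except RETRYABLE:
            if attempt == max_retries:
                raise  # out of attempts, surface the last error
            # 1s, 2s, 4s with the default retry_delay of 1.0
            time.sleep(retry_delay * 2 ** (attempt - 1))
```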

### Pydantic Models

Responses are strongly typed with Pydantic v2 (sketched after this list):

- `ScrapeResult` (dataclass in `providers/base.py`): provider return type
- `ScrapeResponse` (Pydantic model): single scrape tool response
- `LinksResponse` (Pydantic model): single link extraction response
- `ScrapeResultItem`/`LinkResultItem`: individual batch operation results with a success flag and optional error
- `BatchScrapeResponse`/`BatchLinksResponse`: batch operation responses with totals and a results array
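
A sketch of plausible shapes for the single and batch scrape responses; field names beyond those mentioned above (e.g. `total`, `succeeded`) are assumptions:

```python
# Hypothetical response model shapes; field names are assumptions, not the
# project's actual definitions.
from pydantic import BaseModel


class ScrapeResponse(BaseModel):
    url: str
    content: str
    status_code: int
    metadata: dict  # attempts, retries, elapsed_ms, css_selector_applied, ...


class ScrapeResultItem(BaseModel):
    url: str
    success: bool
    content: str | None = None
    error: str | None = None


class BatchScrapeResponse(BaseModel):
    total: int
    succeeded: int
    failed: int
    results: list[ScrapeResultItem]
```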

## Testing Approach

Tests use pytest-asyncio with pytest-mock for mocking. Key patterns:

- Fixtures (`tests/conftest.py`): provide sample HTML with various features (links, metadata, scripts)
- Mocking pattern: mock `get_provider()` to return a provider with a mocked `scrape()` method
- Batch test pattern: test both successful batch operations and partial failures
- Backward compatibility: ensure single-URL mode still works after adding batch support

When adding new tools:

1. Create fixtures for test HTML in `conftest.py`
2. Add a test class following the `Test<ToolName>Tool` naming pattern
3. Test single-URL mode, batch mode, error cases, and parameter variations
4. Mock at the provider level, not the requests level (see the sketch below)
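
A sketch of that provider-level mocking pattern; the import paths, fixture names, and `ScrapeResult` fields are assumptions about the project layout:

```python
# Illustrative test following the provider-level mocking pattern; module paths,
# fixtures, and field names are assumptions about the project layout.
from unittest.mock import AsyncMock

# Assumed import paths:
from scraper_mcp.providers.base import ScrapeResult
from scraper_mcp.server import scrape_url


async def test_scrape_url_success(mocker, sample_html):
    provider = mocker.Mock()
    provider.scrape = AsyncMock(return_value=ScrapeResult(
        url="https://example.com", content=sample_html, status_code=200, metadata={}
    ))
    # Patch routing so the tool never makes a real HTTP request
    mocker.patch("scraper_mcp.server.get_provider", return_value=provider)

    response = await scrape_url("https://example.com")

    assert response.status_code == 200
    provider.scrape.assert_awaited_once()
```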

## Common Development Tasks

### Adding a New Scraping Tool

1. Define a Pydantic response model in `server.py`
2. Add a utility function to `utils.py` if needed
3. Create an `@mcp.tool()`-decorated function with dual-mode support (single/batch); a skeleton follows this list
4. Add a batch operation helper function following the existing patterns
5. Add comprehensive tests in `tests/test_server.py`
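
A skeleton for step 3; it assumes the existing `mcp` FastMCP instance, the response models, and single/batch helper functions already defined in `server.py`:

```python
# Skeleton of a dual-mode tool registration; assumes `mcp`, the response
# models, and the helper functions already exist in server.py.
@mcp.tool()
async def scrape_url_new_format(
    url: str | list[str],
    timeout: float = 30.0,
) -> ScrapeResponse | BatchScrapeResponse:
    """Scrape one URL or a batch of URLs and return the new format."""
    if isinstance(url, list):
        return await _scrape_new_format_batch(url, timeout=timeout)  # batch mode
    return await _scrape_new_format_single(url, timeout=timeout)     # single mode
```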

### Adding a New Provider

1. Create a new file in `providers/` (e.g., `playwright_provider.py`)
2. Subclass `ScraperProvider` and implement `scrape()` and `supports_url()` (sketched after this list)
3. Update `get_provider()` in `server.py` to route the relevant URL patterns
4. Add provider-specific tests in `tests/test_providers.py`
5. Update `pyproject.toml` dependencies if needed
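
A sketch of steps 2 and 3 for a hypothetical Playwright-backed provider; only the shape is shown (the Playwright calls are elided), and `ScraperProvider`, `ScrapeResult`, `RequestsProvider`, and `needs_javascript` are assumed to exist:

```python
# Hypothetical provider skeleton and routing update; a sketch, not real code.
class PlaywrightProvider(ScraperProvider):
    """Would render JavaScript-heavy pages before returning their HTML."""

    def supports_url(self, url: str) -> bool:
        return url.startswith(("http://", "https://"))

    async def scrape(self, url: str, timeout: float = 30.0) -> ScrapeResult:
        raise NotImplementedError("render the page with Playwright here")


def get_provider(url: str) -> ScraperProvider:
    # Route JavaScript-heavy targets to the new provider, everything else to
    # the default requests-based provider.
    if needs_javascript(url):  # hypothetical predicate, e.g. a domain allowlist
        return PlaywrightProvider()
    return RequestsProvider()
```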

### Modifying Retry Behavior

Retry logic is centralized in `RequestsProvider.scrape()` (`providers/requests_provider.py:78-127`). Key parameters:

- `max_retries`: maximum number of attempts (default: 3)
- `retry_delay`: initial backoff delay (default: 1.0s)
- Backoff calculation: `delay = self.retry_delay * (2 ** (attempt - 1))`

To modify retry behavior, adjust the retry loop or add retry parameters to tool signatures.

## Project Configuration

- Python version: 3.12+ (uses modern type hints such as `str | None`)
- Package manager: uv for dependency management
- Build system: Hatchling
- Line length: 100 characters (ruff)
- Pytest config: async mode `auto`, coverage enabled by default, `testpaths: tests/`
- Mypy: strict mode enabled

## Docker Configuration

- Base image: `python:3.12-slim`
- Default port: 8000
- Transport: `streamable-http` (configurable via environment variables)
- Environment variables: `TRANSPORT`, `HOST`, `PORT`
- Restart policy: `unless-stopped` (docker-compose)

## MCP Integration

To connect from Claude Desktop, add the server to your MCP settings:

```json
{
  "mcpServers": {
    "scraper": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

The server uses FastMCP, which handles transport negotiation and tool registration automatically.