This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
A Model Context Protocol (MCP) server for efficient web scraping. Built with Python using FastMCP, providing AI tools with standardized web scraping capabilities through four main tools: raw HTML scraping, markdown conversion, text extraction, and link extraction. All tools support both single URL and batch operations with intelligent retry logic.
```bash
# Install dependencies (uses uv package manager)
uv pip install -e ".[dev]"

# Run locally with default settings
python -m scraper_mcp

# Run with specific transport and port
python -m scraper_mcp streamable-http 0.0.0.0 8000

# Run with Docker
docker-compose up -d
docker-compose logs -f
docker-compose down
```

```bash
# Run all tests with coverage
pytest

# Run a specific test file
pytest tests/test_server.py

# Run a specific test class
pytest tests/test_server.py::TestScrapeUrlTool

# Run a specific test function
pytest tests/test_server.py::TestScrapeUrlTool::test_scrape_url_success

# Run with verbose output
pytest -v

# Run without coverage report
pytest --no-cov
```

```bash
# Type checking
mypy src/

# Linting
ruff check .

# Auto-fix linting issues
ruff check . --fix

# Format code
ruff format .
```

The server uses an extensible provider architecture for different scraping backends:
- `ScraperProvider` (`providers/base.py`): Abstract interface defining `scrape()` and `supports_url()` methods
- `RequestsProvider` (`providers/requests_provider.py`): Default HTTP scraper using the `requests` library with exponential backoff retry logic
- Future extensibility: Easy to add Playwright, Selenium, or Scrapy providers for JavaScript-heavy sites or specialized scraping
The `get_provider()` function in `server.py` routes URLs to the appropriate provider. Currently it defaults to `RequestsProvider` for all HTTP/HTTPS URLs.
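A minimal sketch of that routing (the class and function names come from the source, but the bodies are assumptions, since only the default HTTP/HTTPS path is documented):

```python
class RequestsProvider:
    """Stand-in for the default HTTP provider."""

    def supports_url(self, url: str) -> bool:
        return url.startswith(("http://", "https://"))


def get_provider(url: str) -> RequestsProvider:
    # Hypothetical routing sketch: today every HTTP/HTTPS URL
    # is handled by RequestsProvider.
    provider = RequestsProvider()
    if provider.supports_url(url):
        return provider
    raise ValueError(f"No provider supports URL: {url}")
```

Adding a new backend then means adding another `supports_url()` check ahead of the default.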
All four MCP tools (`scrape_url`, `scrape_url_markdown`, `scrape_url_text`, `scrape_extract_links`) follow a dual-mode pattern:

- Single URL mode: Returns `ScrapeResponse` or `LinksResponse` directly
- Batch mode: Accepts `list[str]` URLs; returns `BatchScrapeResponse` or `BatchLinksResponse` with individual results and success/failure counts
Batch operations use `asyncio.Semaphore` (default concurrency: 5) to limit concurrent requests and `asyncio.gather()` for parallel execution.
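A minimal sketch of that concurrency pattern (the `sleep` stands in for a real provider call):

```python
import asyncio


async def fetch_page(url: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an actual HTTP request
    return f"scraped:{url}"


async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:  # at most `concurrency` requests in flight
            return await fetch_page(url)

    # gather() runs the bounded tasks in parallel and preserves input order
    return list(await asyncio.gather(*(bounded(u) for u in urls)))


results = asyncio.run(scrape_batch([f"https://example.com/{i}" for i in range(8)]))
```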
All utilities in `utils.py` use BeautifulSoup with the lxml parser:

- `html_to_markdown()`: Converts HTML to markdown using `markdownify` with ATX heading style
- `html_to_text()`: Extracts plain text, stripping `script`/`style`/`meta`/`link`/`noscript` tags by default
- `extract_links()`: Extracts all `<a>` tags, resolving URLs with `urllib.parse.urljoin()`
- `extract_metadata()`: Extracts the `<title>` and all `<meta>` tags (name/property attributes)
- `filter_html_by_selector()`: Filters HTML using CSS selectors via BeautifulSoup's `.select()` method; returns a tuple of (filtered_html, element_count)
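The real `extract_links()` uses BeautifulSoup; the same idea can be shown with only the standard library (the class below is illustrative, not the project's code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> tags, resolving relative hrefs."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin resolves both relative and root-relative hrefs
                self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("https://example.com/docs/")
collector.feed('<a href="intro.html">Intro</a> <a href="/api">API</a>')
print(collector.links)
# ['https://example.com/docs/intro.html', 'https://example.com/api']
```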
An optional `css_selector` parameter is available on all four tools for targeted content extraction:

- Purpose: "what to KEEP" filtering (inclusive), complementary to `strip_tags` (exclusive)
- Implementation: Applied BEFORE any other processing (markdown/text conversion, link extraction)
- Selector support: Full CSS selector syntax via Soup Sieve (tags, classes, IDs, attributes, combinators, pseudo-classes)
- Processing order:
  1. Scrape HTML from the provider
  2. Apply the CSS selector filter (if provided)
  3. Apply `strip_tags` (if provided)
  4. Convert to markdown/text or extract links
- Metadata: When a selector is provided, `css_selector_applied` and `elements_matched` are added to the response metadata
- Batch support: Works in both single-URL and batch modes
- Empty results: Returns an empty string with count=0 if no elements match (graceful degradation)
All scraping operations implement exponential backoff:

- Default: 3 retries, 30s timeout, 1s initial delay
- Backoff schedule: 1s, 2s, 4s (exponential: `retry_delay * 2 ** (attempt - 1)`)
- Retryable errors: `requests.Timeout`, `requests.ConnectionError`, `requests.HTTPError`
- Metadata tracking: All responses include `attempts`, `retries`, and `elapsed_ms` fields
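A sketch of the backoff loop (stdlib exceptions stand in for the `requests` ones; the real implementation lives in `RequestsProvider.scrape()`):

```python
import time

RETRYABLE = (TimeoutError, ConnectionError)  # stand-ins for requests.Timeout etc.


def fetch_with_retry(fetch, url, max_retries=3, retry_delay=1.0):
    """Call fetch(url); retry with exponential backoff on retryable errors."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url), attempt  # return the body and attempts used
        except RETRYABLE as exc:
            last_exc = exc
            if attempt < max_retries:
                time.sleep(retry_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise last_exc


calls = {"n": 0}

def flaky_fetch(url):
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection refused")
    return "<html>ok</html>"

body, attempts = fetch_with_retry(flaky_fetch, "https://example.com", retry_delay=0.01)
print(attempts)  # 3
```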
Strong typing using Pydantic v2:

- `ScrapeResult` (dataclass in `providers/base.py`): Provider return type
- `ScrapeResponse` (Pydantic model): Single scrape tool response
- `LinksResponse` (Pydantic model): Single link extraction response
- `ScrapeResultItem` / `LinkResultItem`: Individual batch operation results with a success flag and optional error
- `BatchScrapeResponse` / `BatchLinksResponse`: Batch operation responses with totals and a results array
Tests use pytest-asyncio with pytest-mock for mocking. Key patterns:

- Fixtures (`tests/conftest.py`): Provide sample HTML with various features (links, metadata, scripts)
- Mocking pattern: Mock `get_provider()` to return a provider with a mocked `scrape()` method
- Batch test pattern: Test both fully successful batch operations and partial failures
- Backward compatibility: Ensure single URL mode still works after adding batch support
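The provider-level mock reduces to this shape (a minimal sketch; in the real tests the object below would be returned by a patched `get_provider()`):

```python
import asyncio
from unittest.mock import AsyncMock

# Build a provider whose async scrape() returns canned HTML.
mock_provider = AsyncMock()
mock_provider.scrape.return_value = "<html><body>hi</body></html>"

# The tool under test would receive this provider and await scrape().
html = asyncio.run(mock_provider.scrape("https://example.com"))
print(html)  # <html><body>hi</body></html>

# AsyncMock records awaits, so call arguments can be asserted afterwards.
mock_provider.scrape.assert_awaited_once_with("https://example.com")
```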
When adding tests for new tools:

- Create fixtures for test HTML in `conftest.py`
- Add a test class following the `Test<ToolName>Tool` naming pattern
- Test single URL mode, batch mode, error cases, and parameter variations
- Mock at the provider level, not the requests level

To add a new tool:

- Define a Pydantic response model in `server.py`
- Add a utility function to `utils.py` if needed
- Create an `@mcp.tool()`-decorated function with dual-mode support (single/batch)
- Add a batch operation helper function following the existing patterns
- Add comprehensive tests in `tests/test_server.py`
To add a new provider:

- Create a new file in `providers/` (e.g., `playwright_provider.py`)
- Subclass `ScraperProvider` and implement `scrape()` and `supports_url()`
- Update `get_provider()` in `server.py` to route the relevant URL patterns
- Add provider-specific tests in `tests/test_providers.py`
- Update `pyproject.toml` dependencies if needed
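A skeleton of those steps (the abstract base mirrors the documented `ScraperProvider` interface; the subclass bodies and routing rule are assumptions):

```python
import asyncio
from abc import ABC, abstractmethod


class ScraperProvider(ABC):
    """Mirrors the interface documented in providers/base.py."""

    @abstractmethod
    async def scrape(self, url: str) -> str: ...

    @abstractmethod
    def supports_url(self, url: str) -> bool: ...


class PlaywrightProvider(ScraperProvider):
    """Hypothetical JS-capable provider (no real Playwright calls here)."""

    async def scrape(self, url: str) -> str:
        return f"<html>rendered {url}</html>"  # placeholder for a browser render

    def supports_url(self, url: str) -> bool:
        # e.g. route known JS-heavy hosts to this provider
        return url.startswith("https://app.")


provider = PlaywrightProvider()
html = asyncio.run(provider.scrape("https://app.example.com"))
```

`get_provider()` would then check `supports_url()` on this provider before falling back to the default.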
Retry logic is centralized in `RequestsProvider.scrape()` at `providers/requests_provider.py:78-127`. Key parameters:

- `max_retries`: Maximum attempts (default: 3)
- `retry_delay`: Initial backoff delay (default: 1.0s)
- Backoff calculation: `delay = self.retry_delay * (2 ** (attempt - 1))`
To modify retry behavior, adjust the retry loop or add retry parameters to tool signatures.
Project conventions:

- Python version: 3.12+ (uses modern type hints like `str | None`)
- Package manager: `uv` for dependency management
- Build system: Hatchling
- Line length: 100 characters (ruff)
- Pytest config: Async mode auto, coverage enabled by default, testpaths: `tests/`
- Mypy: Strict mode enabled

Docker configuration:

- Base image: Python 3.12-slim
- Default port: 8000
- Transport: streamable-http (configurable via environment variables)
- Environment variables: `TRANSPORT`, `HOST`, `PORT`
- Restart policy: unless-stopped (docker-compose)
To connect from Claude Desktop, add to the MCP settings:

```json
{
  "mcpServers": {
    "scraper": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

The server uses FastMCP, which automatically handles transport negotiation and tool registration.