Complete documentation for all Scraper MCP tools.
## scrape_url

Scrape raw HTML content from a URL.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | - | Single URL or list of URLs (`http://` or `https://`) |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts on failure |
| `css_selector` | string | No | - | CSS selector to filter HTML elements |
Returns:
- `url`: Final URL after redirects
- `content`: Raw HTML content (filtered if `css_selector` provided)
- `status_code`: HTTP status code
- `content_type`: Content-Type header value
- `metadata`: Object containing `headers`, `encoding`, `elapsed_ms`, `attempts`, `retries`, `css_selector_applied`, `elements_matched`
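A sketch of a successful response, with illustrative values (the exact `headers` contents and the `elements_matched` value when no selector is applied are assumptions):

```json
{
  "url": "https://example.com/",
  "content": "<!doctype html><html>...</html>",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "metadata": {
    "headers": { "content-type": "text/html; charset=utf-8" },
    "encoding": "utf-8",
    "elapsed_ms": 234.5,
    "attempts": 1,
    "retries": 0,
    "css_selector_applied": false,
    "elements_matched": null
  }
}
```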
## scrape_url_markdown

Scrape a URL and convert the content to markdown format. Best for LLM consumption.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | - | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `strip_tags` | array | No | - | HTML tags to strip (e.g., `['script', 'style']`) |
| `css_selector` | string | No | - | CSS selector to filter HTML before conversion |
Returns:
- Same as `scrape_url` but with markdown-formatted content
- `metadata.page_metadata`: Extracted page metadata (title, description, etc.)
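Reading the extracted page metadata off a result might look like this (a sketch assuming the result arrives as a plain dict):

```python
result = scrape_url_markdown("https://blog.com/article")

# page_metadata sits under the response metadata object.
page_meta = result["metadata"]["page_metadata"]
print(page_meta.get("title"))
print(page_meta.get("description"))
```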
## scrape_url_text

Scrape a URL and extract plain text content.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | - | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `strip_tags` | array | No | `script`, `style`, `meta`, `link`, `noscript` | HTML tags to strip |
| `css_selector` | string | No | - | CSS selector to filter HTML before extraction |
Returns:
- Same as `scrape_url` but with plain text content
## scrape_extract_links

Scrape a URL and extract all links.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | - | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `css_selector` | string | No | - | CSS selector to scope link extraction |
Returns:
- `url`: The URL that was scraped
- `links`: Array of link objects with `url`, `text`, and `title`
- `count`: Total number of links found
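A minimal usage sketch (assuming the result arrives as a plain dict):

```python
result = scrape_extract_links("https://example.com")

print(f"Found {result['count']} links")
for link in result["links"]:
    # Each link object carries the resolved URL, anchor text, and title.
    print(link["url"], "-", link["text"] or link["title"])
```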
## perplexity

Search the web using Perplexity AI. Requires the `PERPLEXITY_API_KEY` environment variable.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `messages` | array | Yes | - | Conversation messages with `role` and `content` |
| `model` | string | No | sonar | Model: `"sonar"` or `"sonar-pro"` |
| `temperature` | number | No | 0.3 | Creativity 0-2 (lower = focused) |
| `max_tokens` | integer | No | 4000 | Maximum response length |
Returns:
- `content`: AI-generated response with citation markers
- `model`: Model used
- `citations`: Array of source URLs
- `usage`: Token usage statistics
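A sketch of a typical call; the `messages` format is the usual chat shape of role/content pairs, and the dict-style result access is an assumption:

```python
result = perplexity(
    messages=[{"role": "user", "content": "What changed in Python 3.13?"}],
    model="sonar",
)

print(result["content"])    # answer text with citation markers like [1]
print(result["citations"])  # source URLs backing those markers
```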
## Perplexity Reasoning

Complex reasoning tasks using Perplexity's reasoning model. Requires `PERPLEXITY_API_KEY`.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | Yes | - | The query or problem to reason about |
| `temperature` | number | No | 0.3 | Creativity 0-2 |
| `max_tokens` | integer | No | 4000 | Maximum response length |
Returns:
- Same as the `perplexity` tool
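A sketch; the tool's identifier is not stated above, so `perplexity_reason` here is a hypothetical name:

```python
# Hypothetical tool name -- substitute the server's actual identifier.
result = perplexity_reason(
    query="Compare the trade-offs of retrying on 4xx versus 5xx responses.",
    max_tokens=1000,
)
print(result["content"])
```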
## CSS Selector Filtering

All scraping tools support CSS selector filtering to extract specific elements before processing.
The server uses BeautifulSoup4's `.select()` method (Soup Sieve), supporting:
| Selector Type | Example | Description |
|---|---|---|
| Tag | `meta`, `img`, `a` | Select by tag name |
| Multiple | `img, video` | Comma-separated |
| Class | `.article-content` | Select by class |
| ID | `#main-content` | Select by ID |
| Attribute | `a[href]`, `meta[property="og:image"]` | Select by attribute |
| Descendant | `article p`, `div.content a` | Nested selectors |
| Pseudo-class | `p:nth-of-type(3)`, `a:not([rel])` | Advanced filtering |
Examples:

```python
# Extract only meta tags
scrape_url("https://example.com", css_selector="meta")

# Get article content as markdown
scrape_url_markdown("https://blog.com/article", css_selector="article.main-content")

# Extract text from a specific section
scrape_url_text("https://example.com", css_selector="#main-content")

# Get only navigation links
scrape_extract_links("https://example.com", css_selector="nav.primary")

# Get Open Graph meta tags
scrape_url("https://example.com", css_selector='meta[property^="og:"]')

# Combine with strip_tags
scrape_url_markdown(
    "https://example.com",
    css_selector="article",
    strip_tags=["script", "style"],
)
```

When a `css_selector` is provided, each tool follows the same pipeline:

- Scrape: Fetch HTML from the URL
- Filter: Apply CSS selector to keep only matching elements
- Process: Convert to markdown/text or extract links
- Return: Include `elements_matched` count in metadata
## Retry Behavior

The scraper includes intelligent retry logic with exponential backoff.
| Setting | Default | Description |
|---|---|---|
| `max_retries` | 3 | Maximum retry attempts |
| `timeout` | 30s | Request timeout |
| Retry delay | 1s initial | Exponential backoff |
For the default configuration (`max_retries=3`):
- First attempt: Immediate
- Retry 1: Wait 1 second
- Retry 2: Wait 2 seconds
- Retry 3: Wait 4 seconds
Total maximum wait: ~7 seconds before final failure.
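The schedule matches the standard doubling formula; a minimal sketch, assuming a 1-second initial delay as listed in the defaults:

```python
def backoff_delay(retry: int, initial: float = 1.0) -> float:
    """Delay in seconds before the given retry (1-indexed), doubling each time."""
    return initial * 2 ** (retry - 1)

# Reproduces the schedule above: 1s, 2s, 4s -> ~7s total.
print([backoff_delay(r) for r in (1, 2, 3)])  # [1.0, 2.0, 4.0]
```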
The following failures trigger a retry:

- Network timeouts
- Connection failures
- HTTP errors (4xx, 5xx status codes)
All responses include retry information:
```json
{
  "attempts": 2,
  "retries": 1,
  "elapsed_ms": 234.5
}
```

Configuration examples:

```python
# Disable retries
scrape_url("https://example.com", max_retries=0)

# More aggressive retries
scrape_url("https://example.com", max_retries=5, timeout=60)

# Quick fail
scrape_url("https://example.com", max_retries=1, timeout=10)
```

## Batch Operations

All tools support batch operations by passing a list of URLs:
```python
# Single URL
scrape_url("https://example.com")

# Batch operation
scrape_url(["https://example.com", "https://example.org", "https://example.net"])
```

Batch operations:
- Execute concurrently (default: 5 parallel requests)
- Return results for all URLs with individual success/failure status
- Include totals: `total`, `successful`, `failed`
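A sketch of the shape a batch result might take; only the three totals are documented above, so the `results` array and its `error` field are illustrative assumptions:

```json
{
  "total": 3,
  "successful": 2,
  "failed": 1,
  "results": [
    { "url": "https://example.com", "status_code": 200, "content": "..." },
    { "url": "https://example.org", "status_code": 200, "content": "..." },
    { "url": "https://example.net", "error": "connection timeout" }
  ]
}
```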
## MCP Resources

MCP resources provide read-only data access via URI-based addressing. Access resources via `resources/list` and `resources/read`.

### Cache
| URI | Description |
|---|---|
| `cache://stats` | Cache statistics (hit rate, size, entries) |
| `cache://requests` | List of recent request IDs with metadata |
| `cache://request/{id}` | Full cached result by request ID |
| `cache://request/{id}/content` | Just the content from a cached request |
| `cache://request/{id}/metadata` | Just the metadata from a cached request |
### Configuration

| URI | Description |
|---|---|
| `config://current` | Current runtime configuration |
| `config://defaults` | Default configuration values |
| `config://scraping` | Scraping settings (timeout, retries, concurrency) |
| `config://cache` | Cache settings (TTLs, directory) |
### Server

| URI | Description |
|---|---|
| `server://info` | Server info (version, uptime, capabilities) |
| `server://metrics` | Request metrics (counts, success rates) |
| `server://tools` | List of available tools with descriptions |
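Reading a resource is a standard MCP `resources/read` call; the JSON-RPC framing below is the generic MCP wire shape rather than anything server-specific:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "resources/read",
  "params": { "uri": "cache://stats" }
}
```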
## MCP Prompts

MCP prompts provide reusable, parameterized workflow templates. Access prompts via `prompts/list` and `prompts/get`.

### Content Analysis
| Prompt | Parameters | Purpose |
|---|---|---|
| `analyze_webpage` | `url`, `focus` | Structured webpage analysis |
| `summarize_content` | `url`, `length`, `style` | Generate content summaries |
| `extract_data` | `url`, `data_type`, `selector` | Extract specific data types |
| `compare_pages` | `urls` | Compare multiple pages |
### Site Audits

| Prompt | Parameters | Purpose |
|---|---|---|
| `seo_audit` | `url` | Comprehensive SEO check |
| `link_audit` | `url` | Analyze internal/external links |
| `metadata_check` | `url` | Review meta tags and OG data |
| `accessibility_check` | `url` | Basic accessibility analysis |
### Research

| Prompt | Parameters | Purpose |
|---|---|---|
| `research_topic` | `topic`, `depth` | Multi-source research |
| `fact_check` | `claim`, `sources` | Verify claims |
| `competitive_analysis` | `urls` | Compare competitors |
| `news_roundup` | `topic`, `timeframe` | Gather recent news |
## Disabling Resources and Prompts

To reduce context overhead, resources and prompts can be disabled:
```bash
# Environment variables
DISABLE_RESOURCES=true
DISABLE_PROMPTS=true

# CLI flags
python -m scraper_mcp --disable-resources --disable-prompts
```