
# API Reference

Complete documentation for all Scraper MCP tools.

## Available Tools

### 1. scrape_url

Scrape raw HTML content from a URL.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | – | Single URL or list of URLs (`http://` or `https://`) |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts on failure |
| `css_selector` | string | No | – | CSS selector to filter HTML elements |

**Returns:**

- `url`: Final URL after redirects
- `content`: Raw HTML content (filtered if `css_selector` is provided)
- `status_code`: HTTP status code
- `content_type`: `Content-Type` header value
- `metadata`: Object containing `headers`, `encoding`, `elapsed_ms`, `attempts`, `retries`, `css_selector_applied`, and `elements_matched`
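Putting those fields together, a successful result might look like the following sketch (the values are illustrative, and `metadata` also carries the response `headers`):

```python
# Illustrative shape of a scrape_url result; all values are made up.
result = {
    "url": "https://example.com/",                 # final URL after redirects
    "content": "<!doctype html><html>...</html>",  # raw or filtered HTML
    "status_code": 200,
    "content_type": "text/html; charset=utf-8",
    "metadata": {
        "encoding": "utf-8",
        "elapsed_ms": 234.5,
        "attempts": 1,   # total attempts made
        "retries": 0,    # attempts beyond the first
        "css_selector_applied": None,
        "elements_matched": None,
    },
}
```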

### 2. scrape_url_markdown

Scrape a URL and convert the content to markdown format. Best for LLM consumption.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | – | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `strip_tags` | array | No | – | HTML tags to strip (e.g., `['script', 'style']`) |
| `css_selector` | string | No | – | CSS selector to filter HTML before conversion |

**Returns:**

- Same as `scrape_url` but with markdown-formatted content
- `metadata.page_metadata`: Extracted page metadata (title, description, etc.)

### 3. scrape_url_text

Scrape a URL and extract plain text content.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | – | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `strip_tags` | array | No | `script`, `style`, `meta`, `link`, `noscript` | HTML tags to strip |
| `css_selector` | string | No | – | CSS selector to filter HTML before extraction |

**Returns:**

- Same as `scrape_url` but with plain text content

### 4. scrape_extract_links

Scrape a URL and extract all links.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string or list | Yes | – | Single URL or list of URLs |
| `timeout` | integer | No | 30 | Request timeout in seconds |
| `max_retries` | integer | No | 3 | Maximum retry attempts |
| `css_selector` | string | No | – | CSS selector to scope link extraction |

**Returns:**

- `url`: The URL that was scraped
- `links`: Array of link objects with `url`, `text`, and `title`
- `count`: Total number of links found

### 5. perplexity

Search the web using Perplexity AI. Requires the `PERPLEXITY_API_KEY` environment variable.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `messages` | array | Yes | – | Conversation messages with `role` and `content` |
| `model` | string | No | `sonar` | Model: `"sonar"` or `"sonar-pro"` |
| `temperature` | number | No | 0.3 | Creativity, 0–2 (lower = more focused) |
| `max_tokens` | integer | No | 4000 | Maximum response length |

**Returns:**

- `content`: AI-generated response with citation markers
- `model`: Model used
- `citations`: Array of source URLs
- `usage`: Token usage statistics
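As a sketch, the arguments for a `perplexity` call might be assembled like this (the exact invocation depends on your MCP client; the message contents are made up):

```python
# Hypothetical argument payload for the perplexity tool.
args = {
    "messages": [
        {"role": "system", "content": "Answer concisely and cite sources."},
        {"role": "user", "content": "What is new in WebAssembly this year?"},
    ],
    "model": "sonar",    # or "sonar-pro"
    "temperature": 0.3,  # 0-2; lower = more focused
    "max_tokens": 4000,
}
```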

### 6. perplexity_reason

Perform complex reasoning tasks using Perplexity's reasoning model. Requires `PERPLEXITY_API_KEY`.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | Yes | – | The query or problem to reason about |
| `temperature` | number | No | 0.3 | Creativity, 0–2 |
| `max_tokens` | integer | No | 4000 | Maximum response length |

**Returns:**

- Same as the `perplexity` tool

## CSS Selector Filtering

All scraping tools support CSS selector filtering to extract specific elements before processing.

### Supported Selectors

The server uses BeautifulSoup4's `.select()` method (Soup Sieve), supporting:

| Selector Type | Example | Description |
|---|---|---|
| Tag | `meta`, `img`, `a` | Select by tag name |
| Multiple | `img, video` | Comma-separated selectors |
| Class | `.article-content` | Select by class |
| ID | `#main-content` | Select by ID |
| Attribute | `a[href]`, `meta[property="og:image"]` | Select by attribute |
| Descendant | `article p`, `div.content a` | Nested selectors |
| Pseudo-class | `p:nth-of-type(3)`, `a:not([rel])` | Advanced filtering |

### Examples

```python
# Extract only meta tags
scrape_url("https://example.com", css_selector="meta")

# Get article content as markdown
scrape_url_markdown("https://blog.com/article", css_selector="article.main-content")

# Extract text from specific section
scrape_url_text("https://example.com", css_selector="#main-content")

# Get only navigation links
scrape_extract_links("https://example.com", css_selector="nav.primary")

# Get Open Graph meta tags
scrape_url("https://example.com", css_selector='meta[property^="og:"]')

# Combine with strip_tags
scrape_url_markdown(
    "https://example.com",
    css_selector="article",
    strip_tags=["script", "style"]
)
```

### How It Works

1. **Scrape**: Fetch HTML from the URL
2. **Filter**: Apply the CSS selector to keep only matching elements
3. **Process**: Convert to markdown/text or extract links
4. **Return**: Include the `elements_matched` count in metadata
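Since the filtering is documented as BeautifulSoup4's `.select()`, the filter and count steps can be sketched roughly as follows (a simplified illustration, not the server's actual code):

```python
from bs4 import BeautifulSoup  # the library the server's selector filtering is built on

def filter_html(html: str, css_selector: str) -> tuple[str, int]:
    """Keep only elements matching the selector; return (filtered_html, match_count)."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(css_selector)
    return "".join(str(m) for m in matches), len(matches)

filtered, count = filter_html(
    "<article><p>Keep me</p></article><footer>Drop me</footer>", "article p"
)
# filtered == "<p>Keep me</p>", count == 1
```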

## Retry Behavior

The scraper includes intelligent retry logic with exponential backoff.

### Configuration

| Setting | Default | Description |
|---|---|---|
| `max_retries` | 3 | Maximum retry attempts |
| `timeout` | 30 s | Request timeout |
| Retry delay | 1 s initial | Exponential backoff |

### Retry Schedule

For the default configuration (`max_retries=3`):

1. First attempt: Immediate
2. Retry 1: Wait 1 second
3. Retry 2: Wait 2 seconds
4. Retry 3: Wait 4 seconds

Total maximum wait: ~7 seconds before final failure.
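The schedule doubles the initial one-second delay on each retry; a tiny sketch of that arithmetic (my reconstruction, not the server's code):

```python
def retry_delays(max_retries: int, initial: float = 1.0) -> list[float]:
    """Exponential backoff: the i-th retry waits initial * 2**i seconds."""
    return [initial * 2**i for i in range(max_retries)]

retry_delays(3)  # [1.0, 2.0, 4.0] -> ~7 seconds of waiting before the final failure
```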

### What Triggers Retries

- Network timeouts
- Connection failures
- HTTP error responses (4xx and 5xx status codes)

### Response Metadata

All responses include retry information:

```json
{
  "attempts": 2,
  "retries": 1,
  "elapsed_ms": 234.5
}
```

### Customizing Retries

```python
# Disable retries
scrape_url("https://example.com", max_retries=0)

# More aggressive retries
scrape_url("https://example.com", max_retries=5, timeout=60)

# Quick fail
scrape_url("https://example.com", max_retries=1, timeout=10)
```

## Batch Operations

All tools support batch operations by passing a list of URLs:

```python
# Single URL
scrape_url("https://example.com")

# Batch operation
scrape_url(["https://example.com", "https://example.org", "https://example.net"])
```

Batch operations:

- Execute concurrently (default: 5 parallel requests)
- Return results for all URLs with individual success/failure status
- Include totals: `total`, `successful`, `failed`
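A concurrency-limited batch like this can be sketched with an `asyncio.Semaphore` (an illustration assuming five parallel workers; `fetch` is a stand-in for the real request logic, not part of the server's API):

```python
import asyncio

async def scrape_batch(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(url):
        async with sem:
            try:
                return {"url": url, "success": True, "result": await fetch(url)}
            except Exception as exc:
                return {"url": url, "success": False, "error": str(exc)}

    results = await asyncio.gather(*(one(u) for u in urls))
    successful = sum(r["success"] for r in results)
    return {
        "results": results,
        "total": len(results),
        "successful": successful,
        "failed": len(results) - successful,
    }
```

Failures are captured per URL rather than raised, so one bad URL never aborts the rest of the batch.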

## Resources

MCP resources provide read-only data access via URI-based addressing. Access them via `resources/list` and `resources/read`.

Cache Resources

URI Description
cache://stats Cache statistics (hit rate, size, entries)
cache://requests List of recent request IDs with metadata
cache://request/{id} Full cached result by request ID
cache://request/{id}/content Just the content from a cached request
cache://request/{id}/metadata Just the metadata from a cached request

### Configuration Resources

| URI | Description |
|---|---|
| `config://current` | Current runtime configuration |
| `config://defaults` | Default configuration values |
| `config://scraping` | Scraping settings (timeout, retries, concurrency) |
| `config://cache` | Cache settings (TTLs, directory) |

### Server Resources

| URI | Description |
|---|---|
| `server://info` | Server info (version, uptime, capabilities) |
| `server://metrics` | Request metrics (counts, success rates) |
| `server://tools` | List of available tools with descriptions |
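In MCP's JSON-RPC framing, reading one of these resources is a `resources/read` request with the URI as a parameter; a minimal payload might look like:

```python
import json

# Minimal JSON-RPC 2.0 request for the cache-stats resource.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "resources/read",
    "params": {"uri": "cache://stats"},
}
payload = json.dumps(request)
```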

## Prompts

MCP prompts provide reusable, parameterized workflow templates. Access them via `prompts/list` and `prompts/get`.

### Content Analysis Prompts

| Prompt | Parameters | Purpose |
|---|---|---|
| `analyze_webpage` | `url`, `focus` | Structured webpage analysis |
| `summarize_content` | `url`, `length`, `style` | Generate content summaries |
| `extract_data` | `url`, `data_type`, `selector` | Extract specific data types |
| `compare_pages` | `urls` | Compare multiple pages |

### SEO/Technical Prompts

| Prompt | Parameters | Purpose |
|---|---|---|
| `seo_audit` | `url` | Comprehensive SEO check |
| `link_audit` | `url` | Analyze internal/external links |
| `metadata_check` | `url` | Review meta tags and OG data |
| `accessibility_check` | `url` | Basic accessibility analysis |

### Research Prompts

| Prompt | Parameters | Purpose |
|---|---|---|
| `research_topic` | `topic`, `depth` | Multi-source research |
| `fact_check` | `claim`, `sources` | Verify claims |
| `competitive_analysis` | `urls` | Compare competitors |
| `news_roundup` | `topic`, `timeframe` | Gather recent news |
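Fetching a prompt works the same way via `prompts/get`, passing the prompt name and its parameters as arguments (the parameter names come from the tables above; the values here are hypothetical):

```python
import json

# JSON-RPC 2.0 request rendering the research_topic prompt.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "prompts/get",
    "params": {
        "name": "research_topic",
        "arguments": {"topic": "rust async runtimes", "depth": "deep"},  # example values
    },
}
payload = json.dumps(request)
```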

## Disabling Resources/Prompts

To reduce context overhead:

```bash
# Environment variables
DISABLE_RESOURCES=true
DISABLE_PROMPTS=true

# CLI flags
python -m scraper_mcp --disable-resources --disable-prompts
```