Awesome Scrapers

A curated list of scrapers, crawlers, and data extraction tools. 150+ tools across 17 categories.

⚠️ = aging (6-12 months since last commit) — may still work but watch for staleness.
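
The marker above, together with the "unmaintained" threshold in the Contributing section (last commit more than 1 year ago), can be sketched as a small classifier. This is only an illustration: the function name, the "active" label, and the day counts used to approximate "6 months" and "1 year" are assumptions, not part of this list's tooling.

```python
from datetime import date

# Assumptions: "6 months" is approximated as 183 days, "1 year" as 365 days.
AGING_AFTER_DAYS = 183
UNMAINTAINED_AFTER_DAYS = 365


def staleness(last_commit: date, today: date) -> str:
    """Classify a tool by the age of its last commit."""
    age_days = (today - last_commit).days
    if age_days < 0:
        raise ValueError("last_commit is in the future")
    if age_days > UNMAINTAINED_AFTER_DAYS:
        return "unmaintained"  # candidate for the graveyard section
    if age_days >= AGING_AFTER_DAYS:
        return "aging"  # would carry the warning marker
    return "active"


print(staleness(date(2024, 6, 1), date(2025, 3, 1)))  # aging (273 days)
```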

How to Choose

I need to...	Start here
Extract data with AI / natural language	AI-Powered Scraping
Bypass Cloudflare / bot detection	Stealth & Anti-Detection
Give my LLM agent web access	MCP Servers
Scrape JavaScript-heavy sites	Browser Automation
Build a production crawler	Web Scraping Frameworks
Parse HTML / extract text	HTML & XML Parsing or Content & Text Extraction
Download videos / images	Media Downloaders
Extract tables from PDFs	Document & PDF Extraction
Read text from images	OCR & Screen Scraping
Just pay someone to handle it	Managed Scraping APIs

🤖 AI-Powered Scraping

These tools use LLMs to understand page structure, extract data from natural-language prompts, and output LLM-ready formats.

Tool	Stars	Language	Description
Firecrawl	87k	TypeScript	Websites → LLM-ready markdown or structured data via API.
browser-use	79k	Python	AI agents that control a browser to complete tasks autonomously.
Crawl4AI	61k	Python	LLM-friendly web crawler with structured extraction.
Docling	54k	Python	IBM — parse PDFs, DOCX into AI-ready output.
ScrapeGraphAI	23k	Python	Graph pipelines + LLMs to extract data via plain English.
Stagehand	21k	TypeScript	Browser automation combining natural language with code precision.
Skyvern	21k	Python	Browser workflows with computer vision + LLMs, no selectors needed.
Jina Reader	10k	TypeScript	Any URL → LLM-friendly markdown with vision model support. ⚠️
llm-scraper	6k	TypeScript	Structured data from any webpage using LLMs with Zod schemas.
Spider	2k	Rust	Async web crawler, claimed to be 100-1000x faster than Python alternatives.

(⬆ back to top)

🥷 Stealth & Anti-Detection

The cat-and-mouse game of modern scraping.

Tool	Stars	Language	Description
Scrapling	19k	Python	Adaptive scraping with built-in anti-detection and auto-matching.
SeleniumBase	12k	Python	Browser automation with UC (Undetected Chrome) mode.
Camoufox	6k	Python	Firefox fork patched at the engine level; claims a 0% bot detection rate.
curl_cffi	5k	Python	HTTP client with browser TLS/JA3/HTTP2 fingerprint impersonation.
Nodriver	4k	Python	Successor to undetected-chromedriver — direct CDP, no WebDriver.
Botasaurus	4k	Python	Scraping framework with anti-detection, parallelism, and caching.
Patchright	2k	JavaScript	Undetected Playwright fork that passes bot detection.

(⬆ back to top)

🔌 MCP Servers (Model Context Protocol)

Connect LLM agents (Claude, GPT, etc.) directly to scraping tools.

Server	Stars	Description
Playwright MCP	28k	Browser automation via accessibility snapshots (by Microsoft).
Firecrawl MCP	6k	Web scraping and search in Claude/Cursor via Firecrawl API.
Browserbase MCP	3k	Cloud browser control with Stagehand AI.
Bright Data MCP	2k	Web access with geo-unblocking and bot evasion.

(⬆ back to top)

🌐 Browser Automation

The foundation for dynamic/JS-heavy scraping.

Tool	Stars	Language	Description
Puppeteer	94k	JavaScript	Google's Chrome/Firefox control via DevTools Protocol.
Playwright	83k	Multi	Cross-browser automation (Chromium, Firefox, WebKit) by Microsoft.
Selenium	34k	Multi	The OG browser automation (W3C WebDriver standard).
Crawlee	22k	TypeScript	Scraping/automation library with proxy rotation by Apify.

(⬆ back to top)

🕷️ Web Scraping Frameworks

Python

Tool	Stars	Description
Scrapy	60k	Python scraping framework — middleware, pipelines, extensions.
MechanicalSoup	5k	Stateful browser-like interaction for simple scraping.
scrapy-playwright	1k	Playwright integration for Scrapy — JS rendering with full pipeline.

Go

Tool	Stars	Description
Colly	25k	Fast scraping framework for Go.
Katana	16k	Crawling and spidering framework by ProjectDiscovery.
Ferret	6k	Declarative scraping with FQL query language.

Ruby

Tool	Stars	Description
Nokogiri	6k	Standard HTML/XML parser for Ruby.

(⬆ back to top)

📡 HTTP Clients

The network layer — making requests that look human.

Tool	Stars	Language	Description
aiohttp	16k	Python	Async HTTP client/server for high-concurrency scraping.
httpx	15k	Python	Async/sync HTTP client with HTTP/2 support.
curl_cffi	5k	Python	HTTP client impersonating browser TLS fingerprints (also in Stealth).
got-scraping	736	Node.js	HTTP client with header/TLS mimicry by Apify.
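
As a rough illustration of what "looking human" means at the HTTP layer, the sketch below builds a request with browser-like headers using only Python's standard library. The header values are hypothetical examples, and nothing is sent over the network. Header mimicry alone does not change the TLS fingerprint, which is the part that curl_cffi and got-scraping address according to the table above.

```python
import urllib.request

# Hypothetical header set for illustration; real browsers send more headers.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> urllib.request.Request:
    """Return a Request carrying browser-like headers (it is not sent here)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


req = build_request("https://example.com/")
# urllib stores header names via str.capitalize(), so "User-Agent"
# must be looked up as "User-agent".
print(req.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen(req)` would send it. Without custom headers, urllib identifies itself with a `Python-urllib/x.y` User-Agent, which is trivial for a server to filter.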

(⬆ back to top)

🧩 HTML & XML Parsing

Tool	Stars	Language	Description
Cheerio	30k	JavaScript	jQuery-like HTML manipulation for Node.js.
goquery	15k	Go	jQuery-like HTML selector for Go.
jsoup	11k	Java	HTML parser with CSS selectors and XSS sanitization.
AngleSharp	5k	C#	W3C-compliant HTML5 parser for .NET.
Beautiful Soup	-	Python	Most popular Python HTML/XML parser.
lxml	3k	Python	Fast XML/HTML parser with XPath and XSLT.
html5ever	3k	Rust	Browser-grade HTML5 parser from Mozilla Servo.
selectolax	2k	Python	5-30x faster than Beautiful Soup using Lexbor engine.
parsel	1k	Python	CSS/XPath selectors for HTML+JSON (powers Scrapy).
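
For readers new to HTML parsing, here is a dependency-free sketch of the most common task these libraries handle: collecting links and resolving them against the page URL. It uses Python's standard-library `html.parser` rather than any tool in the table, and the class and function names are made up for this example. The libraries above add CSS/XPath selectors, better error recovery, and speed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<a href="/docs">Docs</a> <a href="https://example.org/x">X</a> <a name="top">Top</a>'
print(extract_links(page, "https://example.com/start"))
```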

(⬆ back to top)

📝 Content & Text Extraction

Pull clean text out of messy HTML — essential for LLM/RAG pipelines.

Tool	Stars	Language	Description
Readability.js	11k	JavaScript	Mozilla's article extractor (powers Firefox Reader View).
Trafilatura	5k	Python	Web text extraction with metadata and language detection.
html2text	2k	Python	HTML → clean Markdown.
Markdownify	2k	Python	Flexible HTML-to-Markdown with customizable options.
newspaper4k	1k	Python	News article extraction with NLP and multilingual support.
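
To show what "clean text" means in practice, the following sketch strips tags and drops script/style content using only the standard library. It is a deliberately naive baseline with hypothetical names, not how any listed tool works internally. The tools above go further by removing navigation, ads, and other boilerplate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text, skipping the contents of non-visible elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Entities such as &amp; arrive already decoded (convert_charrefs=True).
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Hello</h1><script>var x = 1;</script><p>World &amp; more</p>"
    "</body></html>"
)
print(extract_text(sample))
```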

(⬆ back to top)

📱 Social Media Scrapers

Platforms frequently change APIs and block scrapers. Check issue trackers for current status.

Tool	Stars	Platform	Description
Instaloader	12k	Instagram	Posts, stories, reels, highlights with metadata.
TikTok-Api	6k	TikTok	Unofficial API wrapper for Python.
PRAW	4k	Reddit	Python Reddit API Wrapper, a third-party client for Reddit's official API.

(⬆ back to top)

🎬 Media Downloaders

Tool	Stars	Description
yt-dlp	149k	YouTube and 1000+ sites (fork of youtube-dl).
lux	31k	Go video downloader — 40+ sites (formerly annie).
spotdl	24k	Spotify tracks/playlists with metadata and album art.
gallery-dl	17k	Image galleries from 100+ sites (Pixiv, Twitter, Reddit).

(⬆ back to top)

📄 Document & PDF Extraction

Tool	Stars	Language	Description
Docling	54k	Python	IBM — PDFs, DOCX, PPTX into AI-ready output.
Unstructured	14k	Python	ETL pipeline for documents → structured data for LLMs.
pdfplumber	10k	Python	Text, tables, and layout from PDFs with precision.
PyMuPDF	9k	Python	Fast PDF/XPS/EPUB extraction and rendering.
Tabula	7k	Java	Data tables from PDFs. ⚠️
pdfminer.six	7k	Python	PDF text extraction with layout analysis.
Camelot	4k	Python	PDF table extraction — lattice and stream modes.
tabula-py	2k	Python	Python wrapper for Tabula. ⚠️

(⬆ back to top)

👁️ OCR & Screen Scraping

Tool	Stars	Language	Description
Tesseract	73k	C++	Open-source OCR engine (originally from HP, later sponsored by Google); 100+ languages.
PaddleOCR	71k	Python	Lightweight OCR — 100+ languages with LLM integration.
EasyOCR	29k	Python	Ready-to-use OCR — 80+ languages, PyTorch.
pytesseract	6k	Python	Python wrapper for Tesseract.

(⬆ back to top)

⛓️ Blockchain & On-Chain

Tool	Type	Description
The Graph	Open Source (3k)	Blockchain indexing via GraphQL subgraphs.
Subsquid	Open Source	Blockchain indexer — 50k+ blocks/sec.
Dune Analytics	SaaS	SQL-based blockchain analytics.
Etherscan APIs	Freemium API	REST APIs for Ethereum data.

(⬆ back to top)

☁️ Managed Scraping APIs

Pay-per-request services that handle proxies, browsers, and anti-bot for you.

Service	Best For	Key Feature
Apify	Full-stack platform	20,000+ pre-built scrapers, works with Crawlee/Scrapy/Playwright.
ScrapingBee	Simple API access	JS rendering, screenshots, Google Search API, 1k free calls.
ZenRows	Anti-bot bypass	>99% success vs Cloudflare, Puppeteer/Playwright support.
ScrapFly	Multi-API	Scraping + screenshots + crawler APIs, 130M+ proxy IPs.
Browserless	Headless browsers	Headless Chrome in Docker, BrowserQL, self-hostable.
Browserbase	AI browser agents	Cloud browsers for AI, session persistence, Stagehand integration.
Oxylabs	Enterprise	ML-driven proxy rotation, e-commerce specialized.
Bright Data	Scale	Web Unlocker with CAPTCHA solving, geo-routing, mobile UA.
SerpApi	SERP data	Structured results from Google, Bing, Yahoo.
ScraperAPI	Getting started	40M+ proxies, 5k free credits.

(⬆ back to top)

🧪 CAPTCHA Solving

Service	Method	Starting Price	Supports
2Captcha	Human workers	$1/1k solves	reCAPTCHA, Turnstile, FunCaptcha, GeeTest, image.
Anti-Captcha	Human workers	$0.0005/token	reCAPTCHA, hCaptcha, FunCaptcha, Turnstile.
CapSolver	AI/ML	$0.65/1k	reCAPTCHA, AWS WAF, Cloudflare, GeeTest.
CapMonster Cloud	AI/ML	$0.02/1k images	reCAPTCHA, hCaptcha, Turnstile, <2s solve.
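
The starting prices above are quoted in different units, so a quick normalization to USD per 1,000 solves can help when comparing them. The sketch below assumes one token or one image counts as one solve, which is a simplification: providers price each CAPTCHA type differently, so the result is only a rough guide.

```python
# (listed price in USD, number of solves that price covers), from the table above.
LISTED = {
    "2Captcha": (1.00, 1000),
    "Anti-Captcha": (0.0005, 1),      # quoted per token
    "CapSolver": (0.65, 1000),
    "CapMonster Cloud": (0.02, 1000),  # quoted per 1k images
}


def price_per_thousand(price: float, solves: int) -> float:
    """Convert a listed price to USD per 1,000 solves."""
    return round(price / solves * 1000, 4)


normalized = {name: price_per_thousand(*entry) for name, entry in LISTED.items()}
for name, usd in sorted(normalized.items(), key=lambda item: item[1]):
    print(f"{name}: ${usd:.2f} per 1k")
```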

(⬆ back to top)

🌍 Proxy Providers

Provider	Network Size	Starting Price	Highlights
Bright Data	150M+ IPs	Pay-as-you-go	Largest network, 195 countries.
Oxylabs	100M+ IPs	Contact sales	Fastest latency, city/ZIP targeting.
Decodo	125M+ IPs	$8.50/GB	Good value, 50 US states.
IPRoyal	34M+ IPs	$1.75/GB	Cheapest residential, ethically sourced.
NetNut	85M+ IPs	Contact sales	One-hop ISP connectivity (faster).
Webshare	10M+ IPs	$2.99/mo	Budget-friendly, free tier.
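
For the providers that list a per-GB price, a back-of-the-envelope estimate of bandwidth cost looks like the sketch below. The page count and average page size are made-up inputs, and it assumes billing in decimal gigabytes at the listed starting price. Actual plans have tiers and minimums.

```python
# Per-GB starting prices in USD, copied from the table above.
PRICE_PER_GB = {"Decodo": 8.50, "IPRoyal": 1.75}


def bandwidth_cost(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimate proxy traffic cost, assuming 1 GB = 1,000,000 KB."""
    gigabytes = pages * avg_page_kb / 1_000_000
    return round(gigabytes * price_per_gb, 2)


# Hypothetical workload: 100,000 pages at 500 KB each is 50 GB of traffic.
for provider, price in PRICE_PER_GB.items():
    print(provider, bandwidth_cost(100_000, 500, price))
```

Blocking images, fonts, and media in the browser or client reduces the average page size and therefore the bill.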

(⬆ back to top)

🆓 Try Free

Service	Free Tier	Try It
ScraperAPI	5,000 free API credits	Start free →
ScrapingBee	1,000 free API calls	Start free →
Webshare	Free tier with 10 proxies	Start free →
2Captcha	$1 starting balance	Start free →
NetNut	7-day free trial	Start free →

(⬆ back to top)

🪦 Deprecated Tools Graveyard

Dead Tool	Why	Use Instead
PhantomJS	Archived 2018	Playwright, Puppeteer.
CasperJS	Depended on PhantomJS	Playwright, Puppeteer.
Nightmare	Unmaintained since 2020	Playwright.
Zombie.js	Unmaintained	Playwright, Puppeteer.
SlimerJS	Unmaintained, Gecko-based	Playwright (Firefox).
Splash	Scrapinghub, deprecated	scrapy-playwright.
twint	Archived Mar 2023, blocked by Twitter	Official API.
Goutte (PHP)	Deprecated by Symfony	Symfony BrowserKit + DomCrawler.
snscrape	Unmaintained since Nov 2023	Official APIs.
undetected-chromedriver	Aging, last push Jul 2025	Nodriver, Camoufox.
puppeteer-extra-plugin-stealth	Unmaintained since Jul 2024	Patchright, Camoufox.
playwright-stealth	Unmaintained since Nov 2023	Patchright, Camoufox.
curl-impersonate	Unmaintained since Jul 2024	curl_cffi.
GoogleScraper	Unmaintained since Jul 2021	SerpApi.
pyautogui	Unmaintained since Aug 2024	pytesseract + Playwright.
SikuliX	Stale, niche	Playwright, pytesseract.

(⬆ back to top)

Disclosure

Some links in the Managed Scraping APIs, CAPTCHA Solving, Proxy Providers, and Try Free sections are affiliate/referral links. These help support the maintenance of this list. All tools are included based on merit — affiliate status does not influence placement or rankings.

(⬆ back to top)

🔗 Related Awesome Lists

List	Description
awesome-ai	400+ AI APIs, tools, frameworks, and platforms.
awesome-robotics	Robotics frameworks, simulators, and platforms.
awesome-web3-ai	Web3 x AI tools, agent frameworks, and protocols.

Contributing

Contributions welcome! Please read the contribution guidelines first.

Add tools you've actually used or evaluated
Include star count and language where applicable
Note if a tool is unmaintained (last commit >1 year ago)
Commercial tools/services are fine but must be clearly labeled

License

To the extent possible under law, Edward Tay has waived all copyright and related or neighboring rights to this work.
