Awesome Scrapers
A curated list of scrapers, crawlers, and data extraction tools. 150+ tools across 17 categories.
⚠️ = aging (6-12 months since last commit); may still work, but watch for staleness.
These tools use LLMs to understand page structure, extract data from natural-language prompts, and output LLM-ready formats.

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Firecrawl | 87k | TypeScript | Turns websites into LLM-ready markdown or structured data via API. |
| browser-use | 79k | Python | AI agents that control a browser to complete tasks autonomously. |
| Crawl4AI | 61k | Python | LLM-friendly web crawler with structured extraction. |
| Docling | 54k | Python | By IBM; parses PDFs and DOCX into AI-ready output. |
| ScrapeGraphAI | 23k | Python | Graph pipelines + LLMs to extract data via plain English. |
| Stagehand | 21k | TypeScript | Browser automation combining natural language with code precision. |
| Skyvern | 21k | Python | Browser workflows with computer vision + LLMs, no selectors needed. |
| Jina Reader | 10k | TypeScript | Converts any URL to LLM-friendly markdown with vision model support. ⚠️ |
| llm-scraper | 6k | TypeScript | Structured data from any webpage using LLMs with Zod schemas. |
| Spider | 2k | Rust | Async web crawler; claimed to be 100-1000x faster than Python alternatives. |

🥷 Stealth & Anti-Detection
The cat-and-mouse game of modern scraping.

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Scrapling | 19k | Python | Adaptive scraping with built-in anti-detection and auto-matching. |
| SeleniumBase | 12k | Python | Browser automation with UC (Undetected Chrome) mode. |
| Camoufox | 6k | Python | Firefox fork patched at the engine level; claims a 0% bot detection rate. |
| curl_cffi | 5k | Python | HTTP client with browser TLS/JA3/HTTP2 fingerprint impersonation. |
| Nodriver | 4k | Python | Successor to undetected-chromedriver: direct CDP, no WebDriver. |
| Botasaurus | 4k | Python | Scraping framework with anti-detection, parallelism, and caching. |
| Patchright | 2k | JavaScript | Undetected Playwright fork that passes bot detection. |

MCP Servers (Model Context Protocol)
Connect LLM agents (Claude, GPT, etc.) directly to scraping tools.

| Server | Stars | Description |
|--------|-------|-------------|
| Playwright MCP | 28k | Browser automation via accessibility snapshots (by Microsoft). |
| Firecrawl MCP | 6k | Web scraping and search in Claude/Cursor via Firecrawl API. |
| Browserbase MCP | 3k | Cloud browser control with Stagehand AI. |
| Bright Data MCP | 2k | Web access with geo-unblocking and bot evasion. |

The foundation for dynamic/JS-heavy scraping.

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Puppeteer | 94k | JavaScript | Google's Chrome/Firefox control via DevTools Protocol. |
| Playwright | 83k | Multi | Cross-browser automation (Chromium, Firefox, WebKit) by Microsoft. |
| Selenium | 34k | Multi | The OG browser automation (W3C WebDriver standard). |
| Crawlee | 22k | TypeScript | Scraping/automation library with proxy rotation by Apify. |

🕷️ Web Scraping Frameworks

| Tool | Stars | Description |
|------|-------|-------------|
| Scrapy | 60k | Python scraping framework with middleware, pipelines, and extensions. |
| MechanicalSoup | 5k | Stateful browser-like interaction for simple scraping. |
| scrapy-playwright | 1k | Playwright integration for Scrapy: JS rendering with the full pipeline. |

| Tool | Stars | Description |
|------|-------|-------------|
| Colly | 25k | Fast scraping framework for Go. |
| Katana | 16k | Crawling and spidering framework by ProjectDiscovery. |
| Ferret | 6k | Declarative scraping with FQL query language. |

| Tool | Stars | Description |
|------|-------|-------------|
| Nokogiri | 6k | Standard HTML/XML parser for Ruby. |

The network layer: making requests that look human.

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| aiohttp | 16k | Python | Async HTTP client/server for high-concurrency scraping. |
| httpx | 15k | Python | Async/sync HTTP client with HTTP/2 support. |
| curl_cffi | 5k | Python | HTTP client impersonating browser TLS fingerprints (also in Stealth). |
| got-scraping | 736 | Node.js | HTTP client with header/TLS mimicry by Apify. |

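
As a rough illustration of what "looking human" means at the header level, the sketch below builds a request with browser-like headers using only Python's standard library. The header values and the `build_request` helper are assumptions made for illustration; real browsers send more headers, and their values change with each release. Header mimicry alone does not cover TLS fingerprinting, which is what the impersonating clients in the table address.

```python
from typing import Dict, Optional
from urllib.request import Request

# Hypothetical header set for illustration only; not copied from any tool above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str, extra_headers: Optional[Dict[str, str]] = None) -> Request:
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    return Request(url, headers=headers)


req = build_request("https://example.com/", {"Referer": "https://example.com/start"})
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops before any network call so it can be run offline.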

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Cheerio | 30k | JavaScript | jQuery-like HTML manipulation for Node.js. |
| goquery | 15k | Go | jQuery-like HTML selector for Go. |
| jsoup | 11k | Java | HTML parser with CSS selectors and XSS sanitization. |
| AngleSharp | 5k | C# | W3C-compliant HTML5 parser for .NET. |
| Beautiful Soup | - | Python | Most popular Python HTML/XML parser. |
| lxml | 3k | Python | Fast XML/HTML parser with XPath and XSLT. |
| html5ever | 3k | Rust | Browser-grade HTML5 parser from the Servo project (started at Mozilla). |
| selectolax | 2k | Python | 5-30x faster than Beautiful Soup using the Lexbor engine. |
| parsel | 1k | Python | CSS/XPath selectors for HTML+JSON (powers Scrapy). |

Content & Text Extraction
Pull clean text out of messy HTML, which is essential for LLM/RAG pipelines.

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Readability.js | 11k | JavaScript | Mozilla's article extractor (powers Firefox Reader View). |
| Trafilatura | 5k | Python | Web text extraction with metadata and language detection. |
| html2text | 2k | Python | Converts HTML to clean Markdown. |
| Markdownify | 2k | Python | Flexible HTML-to-Markdown with customizable options. |
| newspaper4k | 1k | Python | News article extraction with NLP and multilingual support. |

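
To make "clean text out of messy HTML" concrete, here is a minimal standard-library sketch that keeps visible text and drops script and style content. It is not how any of the listed tools work internally; the `TextExtractor` class and `extract_text` helper are hypothetical names, and real extractors also handle boilerplate removal, metadata, and malformed markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style>, and similar tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><script>var x = 1;</script>"
    "<p>Hello &amp; welcome.</p></body></html>"
)
print(extract_text(sample))
```

For production pipelines, one of the dedicated extractors in the table is likely a better fit than a hand-rolled parser like this.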
📱 Social Media Scrapers
Platforms frequently change APIs and block scrapers. Check issue trackers for current status.

| Tool | Stars | Platform | Description |
|------|-------|----------|-------------|
| Instaloader | 12k | Instagram | Posts, stories, reels, highlights with metadata. |
| TikTok-Api | 6k | TikTok | Unofficial API wrapper for Python. |
| PRAW | 4k | Reddit | Python Reddit API Wrapper, built on Reddit's official API. |


| Tool | Stars | Description |
|------|-------|-------------|
| yt-dlp | 149k | YouTube and 1000+ sites (fork of youtube-dl). |
| lux | 31k | Go video downloader for 40+ sites (formerly annie). |
| spotdl | 24k | Spotify tracks/playlists with metadata and album art. |
| gallery-dl | 17k | Image galleries from 100+ sites (Pixiv, Twitter, Reddit). |

Document & PDF Extraction

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Docling | 54k | Python | By IBM; converts PDFs, DOCX, and PPTX into AI-ready output. |
| Unstructured | 14k | Python | ETL pipeline that turns documents into structured data for LLMs. |
| pdfplumber | 10k | Python | Text, tables, and layout from PDFs with precision. |
| PyMuPDF | 9k | Python | Fast PDF/XPS/EPUB extraction and rendering. |
| Tabula | 7k | Java | Data tables from PDFs. ⚠️ |
| pdfminer.six | 7k | Python | PDF text extraction with layout analysis. |
| Camelot | 4k | Python | PDF table extraction with lattice and stream modes. |
| tabula-py | 2k | Python | Python wrapper for Tabula. ⚠️ |

OCR & Screen Scraping

| Tool | Stars | Language | Description |
|------|-------|----------|-------------|
| Tesseract | 73k | C++ | OCR engine long sponsored by Google; 100+ languages. |
| PaddleOCR | 71k | Python | Lightweight OCR: 100+ languages with LLM integration. |
| EasyOCR | 29k | Python | Ready-to-use OCR: 80+ languages, PyTorch. |
| pytesseract | 6k | Python | Python wrapper for Tesseract. |

Blockchain & On-Chain

| Tool | Type | Description |
|------|------|-------------|
| The Graph | Open Source (3k) | Blockchain indexing via GraphQL subgraphs. |
| Subsquid | Open Source | Blockchain indexer, 50k+ blocks/sec. |
| Dune Analytics | SaaS | SQL-based blockchain analytics. |
| Etherscan APIs | Freemium API | REST APIs for Ethereum data. |

Managed Scraping APIs
Pay-per-request services that handle proxies, browsers, and anti-bot for you.

| Service | Best For | Key Feature |
|---------|----------|-------------|
| Apify | Full-stack platform | 20,000+ pre-built scrapers, works with Crawlee/Scrapy/Playwright. |
| ScrapingBee | Simple API access | JS rendering, screenshots, Google Search API, 1k free calls. |
| ZenRows | Anti-bot bypass | >99% success vs Cloudflare, Puppeteer/Playwright support. |
| ScrapFly | Multi-API | Scraping + screenshots + crawler APIs, 130M+ proxy IPs. |
| Browserless | Headless browsers | Headless Chrome in Docker, BrowserQL, self-hostable. |
| Browserbase | AI browser agents | Cloud browsers for AI, session persistence, Stagehand integration. |
| Oxylabs | Enterprise | ML-driven proxy rotation, e-commerce specialized. |
| Bright Data | Scale | Web Unlocker with CAPTCHA solving, geo-routing, mobile UA. |
| SerpApi | SERP data | Structured results from Google, Bing, Yahoo. |
| ScraperAPI | Getting started | 40M+ proxies, 5k free credits. |


| Service | Method | Starting Price | Supports |
|---------|--------|----------------|----------|
| 2Captcha | Human workers | $1/1k solves | reCAPTCHA, Turnstile, FunCaptcha, GeeTest, image. |
| Anti-Captcha | Human workers | $0.0005/token | reCAPTCHA, hCaptcha, FunCaptcha, Turnstile. |
| CapSolver | AI/ML | $0.65/1k | reCAPTCHA, AWS WAF, Cloudflare, GeeTest. |
| CapMonster Cloud | AI/ML | $0.02/1k images | reCAPTCHA, hCaptcha, Turnstile, <2s solve. |


| Provider | Network Size | Starting Price | Highlights |
|----------|--------------|----------------|------------|
| Bright Data | 150M+ IPs | Pay-as-you-go | Largest network, 195 countries. |
| Oxylabs | 100M+ IPs | Contact sales | Fastest latency, city/ZIP targeting. |
| Decodo | 125M+ IPs | $8.50/GB | Good value, 50 US states. |
| IPRoyal | 34M+ IPs | $1.75/GB | Cheapest residential, ethically sourced. |
| NetNut | 85M+ IPs | Contact sales | One-hop ISP connectivity (faster). |
| Webshare | 10M+ IPs | $2.99/mo | Budget-friendly, free tier. |

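
Per-GB pricing is easier to compare once it is turned into a job estimate. The sketch below is a back-of-the-envelope calculation using the two per-GB starting prices listed above. The page count, the average page size, and the choice of 1 GB = 1024 x 1024 KB are assumptions for illustration; providers may meter traffic differently, and listed prices will drift.

```python
# Per-GB starting prices copied from the table above; treat them as a snapshot.
PRICE_PER_GB = {"Decodo": 8.50, "IPRoyal": 1.75}


def estimate_cost(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimate proxy bandwidth cost in dollars for a scraping job.

    Assumes 1 GB = 1024 * 1024 KB and ignores retries, redirects, and
    assets fetched beyond the page itself.
    """
    gigabytes = pages * avg_page_kb / (1024 * 1024)
    return round(gigabytes * price_per_gb, 2)


# Hypothetical job: 100,000 pages averaging 512 KB each (about 48.8 GB).
for provider, price in PRICE_PER_GB.items():
    print(provider, estimate_cost(100_000, 512, price))
```

Blocking heavy assets such as images and fonts usually lowers the average page size, which is often the cheapest lever on a per-GB plan.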
🪦 Deprecated Tools Graveyard

| Dead Tool | Why | Use Instead |
|-----------|-----|-------------|
| PhantomJS | Archived 2018 | Playwright, Puppeteer. |
| CasperJS | Depended on PhantomJS | Playwright, Puppeteer. |
| Nightmare | Unmaintained since 2020 | Playwright. |
| Zombie.js | Unmaintained | Playwright, Puppeteer. |
| SlimerJS | Unmaintained, Gecko-based | Playwright (Firefox). |
| Splash | Scrapinghub, deprecated | Scrapy-Playwright. |
| twint | Archived Mar 2023, blocked by Twitter | Official API. |
| Goutte (PHP) | Deprecated by Symfony | Symfony BrowserKit + DomCrawler. |
| snscrape | Unmaintained since Nov 2023 | Official APIs. |
| undetected-chromedriver | Aging, last push Jul 2025 | Nodriver, Camoufox. |
| puppeteer-extra-stealth | Unmaintained since Jul 2024 | Patchright, Camoufox. |
| playwright-stealth | Unmaintained since Nov 2023 | Patchright, Camoufox. |
| curl-impersonate | Unmaintained since Jul 2024 | curl_cffi. |
| GoogleScraper | Unmaintained since Jul 2021 | SerpApi. |
| pyautogui | Unmaintained since Aug 2024 | pytesseract + Playwright. |
| SikuliX | Stale, niche | Playwright, pytesseract. |

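
One way to put the graveyard to use is to scan a project's requirements for entries that appear in it. The sketch below does that for a handful of the Python entries; the `find_deprecated` helper, the sample requirements text, and the mapping of table rows to package names are assumptions for illustration, and the parsing is deliberately simpler than a full requirements-file parser.

```python
import re

# Python entries from the graveyard table, keyed by normalized package name.
REPLACEMENTS = {
    "twint": "Official API",
    "snscrape": "Official APIs",
    "undetected-chromedriver": "Nodriver, Camoufox",
    "playwright-stealth": "Patchright, Camoufox",
    "pyautogui": "pytesseract + Playwright",
}


def find_deprecated(requirements: str) -> dict:
    """Map each graveyard package found in a requirements text to its suggestion."""
    found = {}
    for line in requirements.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Keep the name; cut version specifiers, extras, and markers.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        key = re.sub(r"[-_.]+", "-", name).lower()  # normalize separators and case
        if key in REPLACEMENTS:
            found[name] = REPLACEMENTS[key]
    return found


# Hypothetical requirements file contents.
sample_requirements = """
scrapy
snscrape>=0.7  # social scraping
PyAutoGUI
curl_cffi
"""
for package, suggestion in find_deprecated(sample_requirements).items():
    print(f"{package}: consider {suggestion}")
```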
Some links in the Managed Scraping APIs, CAPTCHA Solving, Proxy Providers, and Try Free sections are affiliate/referral links. These help support the maintenance of this list. All tools are included based on merit; affiliate status does not influence placement or rankings.
Related Awesome Lists

| List | Description |
|------|-------------|
| awesome-ai | 400+ AI APIs, tools, frameworks, and platforms. |
| awesome-robotics | Robotics frameworks, simulators, and platforms. |
| awesome-web3-ai | Web3 x AI tools, agent frameworks, and protocols. |

Contributions welcome! Please read the contribution guidelines first.
- Add tools you've actually used or evaluated.
- Include star count and language where applicable.
- Note if a tool is unmaintained (last commit >1 year ago).
- Commercial tools/services are fine but must be clearly labeled.
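
The two staleness thresholds used in this list (aging at 6-12 months since the last commit, unmaintained beyond a year) can be checked with a few lines of Python before submitting an entry. This is a sketch under stated assumptions: the `staleness` function, the "active" label, and the 182-day approximation of six months are illustrative choices, not part of the contribution guidelines.

```python
from datetime import date


def staleness(last_commit: date, today: date) -> str:
    """Classify a tool by the age of its last commit.

    Assumes 6 months is roughly 182 days and 1 year is 365 days.
    """
    age_days = (today - last_commit).days
    if age_days > 365:
        return "unmaintained"  # note this in the entry
    if age_days >= 182:
        return "aging"  # marked with the warning sign in the tables
    return "active"  # hypothetical label for everything newer


# Hypothetical dates for illustration.
print(staleness(date(2024, 6, 1), date(2025, 3, 1)))
```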
To the extent possible under law, Edward Tay has waived all copyright and related or neighboring rights to this work.