edwardtay/awesome-scrapers

Awesome Scrapers


A curated list of scrapers, crawlers, and data extraction tools. 150+ tools across 17 categories.

⚠️ = aging (6-12 months since last commit): may still work, but watch for staleness.

How to Choose

| I need to... | Start here |
| --- | --- |
| Extract data with AI / natural language | AI-Powered Scraping |
| Bypass Cloudflare / bot detection | Stealth & Anti-Detection |
| Give my LLM agent web access | MCP Servers |
| Scrape JavaScript-heavy sites | Browser Automation |
| Build a production crawler | Web Scraping Frameworks |
| Parse HTML / extract text | HTML & XML Parsing or Content Extraction |
| Download videos / images | Media Downloaders |
| Extract tables from PDFs | Document & PDF Extraction |
| Read text from images | OCR & Screen Scraping |
| Just pay someone to handle it | Managed Scraping APIs |

🤖 AI-Powered Scraping

LLMs understand page structure, extract via natural language, and output LLM-ready formats.

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Firecrawl | 87k | TypeScript | Websites → LLM-ready markdown or structured data via API. |
| browser-use | 79k | Python | AI agents that control a browser to complete tasks autonomously. |
| Crawl4AI | 61k | Python | LLM-friendly web crawler with structured extraction. |
| Docling | 54k | Python | IBM: parse PDFs, DOCX into AI-ready output. |
| ScrapeGraphAI | 23k | Python | Graph pipelines + LLMs to extract data via plain English. |
| Stagehand | 21k | TypeScript | Browser automation combining natural language with code precision. |
| Skyvern | 21k | Python | Browser workflows with computer vision + LLMs, no selectors needed. |
| Jina Reader | 10k | TypeScript | Any URL → LLM-friendly markdown with vision model support. ⚠️ |
| llm-scraper | 6k | TypeScript | Structured data from any webpage using LLMs with Zod schemas. |
| Spider | 2k | Rust | Async web crawler, 100-1000x faster than Python alternatives. |

(⬆ back to top)

🥷 Stealth & Anti-Detection

The cat-and-mouse game of modern scraping.

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Scrapling | 19k | Python | Adaptive scraping with built-in anti-detection and auto-matching. |
| SeleniumBase | 12k | Python | Browser automation with UC (Undetected Chrome) mode. |
| Camoufox | 6k | Python | Firefox fork patched at engine level, 0% bot detection rate. |
| curl_cffi | 5k | Python | HTTP client with browser TLS/JA3/HTTP2 fingerprint impersonation. |
| Nodriver | 4k | Python | Successor to undetected-chromedriver: direct CDP, no WebDriver. |
| Botasaurus | 4k | Python | Scraping framework with anti-detection, parallelism, and caching. |
| Patchright | 2k | JavaScript | Undetected Playwright fork that passes bot detection. |

(⬆ back to top)

🔌 MCP Servers (Model Context Protocol)

Connect LLM agents (Claude, GPT, etc.) directly to scraping tools.

| Server | Stars | Description |
| --- | --- | --- |
| Playwright MCP | 28k | Browser automation via accessibility snapshots (by Microsoft). |
| Firecrawl MCP | 6k | Web scraping and search in Claude/Cursor via Firecrawl API. |
| Browserbase MCP | 3k | Cloud browser control with Stagehand AI. |
| Bright Data MCP | 2k | Web access with geo-unblocking and bot evasion. |

(⬆ back to top)

🌐 Browser Automation

The foundation for dynamic/JS-heavy scraping.

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Puppeteer | 94k | JavaScript | Google's Chrome/Firefox control via DevTools Protocol. |
| Playwright | 83k | Multi | Cross-browser automation (Chromium, Firefox, WebKit) by Microsoft. |
| Selenium | 34k | Multi | The OG browser automation (W3C WebDriver standard). |
| Crawlee | 22k | TypeScript | Scraping/automation library with proxy rotation by Apify. |

(⬆ back to top)

🕷️ Web Scraping Frameworks

Python

| Tool | Stars | Description |
| --- | --- | --- |
| Scrapy | 60k | Python scraping framework: middleware, pipelines, extensions. |
| MechanicalSoup | 5k | Stateful browser-like interaction for simple scraping. |
| scrapy-playwright | 1k | Playwright integration for Scrapy: JS rendering with full pipeline. |

Go

| Tool | Stars | Description |
| --- | --- | --- |
| Colly | 25k | Fast scraping framework for Go. |
| Katana | 16k | Crawling and spidering framework by ProjectDiscovery. |
| Ferret | 6k | Declarative scraping with FQL query language. |

Ruby

| Tool | Stars | Description |
| --- | --- | --- |
| Nokogiri | 6k | Standard HTML/XML parser for Ruby. |

(⬆ back to top)

📡 HTTP Clients

The network layer: making requests that look human.

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| aiohttp | 16k | Python | Async HTTP client/server for high-concurrency scraping. |
| httpx | 15k | Python | Async/sync HTTP client with HTTP/2 support. |
| curl_cffi | 5k | Python | HTTP client impersonating browser TLS fingerprints (also in Stealth). |
| got-scraping | 736 | Node.js | HTTP client with header/TLS mimicry by Apify. |
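
As a rough illustration of what "looking human" means at the header level, here is a minimal sketch that builds a request with browser-like headers using only Python's standard library. The URL, the header values, and the `build_request` helper are placeholders made up for this example, not taken from any tool above. Header mimicry alone leaves the TLS fingerprint unchanged, which is the gap that clients such as curl_cffi aim to cover.

```python
from urllib.request import Request

# Hypothetical header values for illustration; real browsers send more headers.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> Request:
    """Return a GET request carrying browser-like headers (nothing is sent)."""
    return Request(url, headers=BROWSER_HEADERS, method="GET")


# Build the request object only; pass it to urllib.request.urlopen to send it.
req = build_request("https://example.com/")
```

The request is only constructed here, so the sketch can be inspected without touching the network.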

(⬆ back to top)

🧩 HTML & XML Parsing

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Cheerio | 30k | JavaScript | jQuery-like HTML manipulation for Node.js. |
| goquery | 15k | Go | jQuery-like HTML selector for Go. |
| jsoup | 11k | Java | HTML parser with CSS selectors and XSS sanitization. |
| AngleSharp | 5k | C# | W3C-compliant HTML5 parser for .NET. |
| Beautiful Soup | - | Python | Most popular Python HTML/XML parser. |
| lxml | 3k | Python | Fast XML/HTML parser with XPath and XSLT. |
| html5ever | 3k | Rust | Browser-grade HTML5 parser from Mozilla Servo. |
| selectolax | 2k | Python | 5-30x faster than Beautiful Soup using Lexbor engine. |
| parsel | 1k | Python | CSS/XPath selectors for HTML+JSON (powers Scrapy). |

(⬆ back to top)

📝 Content & Text Extraction

Pull clean text out of messy HTML, essential for LLM/RAG pipelines.

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Readability.js | 11k | JavaScript | Mozilla's article extractor (powers Firefox Reader View). |
| Trafilatura | 5k | Python | Web text extraction with metadata and language detection. |
| html2text | 2k | Python | HTML → clean Markdown. |
| Markdownify | 2k | Python | Flexible HTML-to-Markdown with customizable options. |
| newspaper4k | 1k | Python | News article extraction with NLP and multilingual support. |
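
To show the core idea behind these tools, here is a minimal standard-library sketch that keeps visible text and drops `<script>` and `<style>` contents. It is not how any listed tool works internally; the `TextExtractor` class, the `extract_text` helper, and the sample HTML are made up for this example. The dedicated extractors above also handle boilerplate removal, metadata, and malformed markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the whitespace-joined visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page for illustration.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><script>var x=1;</script>"
    "<p>Hello <b>world</b>.</p></body></html>"
)
result = extract_text(sample)
```

Joining chunks with single spaces is a simplification: it loses paragraph boundaries, which matters when the output feeds a chunking step in a RAG pipeline.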

(⬆ back to top)

📱 Social Media Scrapers

Platforms frequently change APIs and block scrapers. Check issue trackers for current status.

| Tool | Stars | Platform | Description |
| --- | --- | --- | --- |
| Instaloader | 12k | Instagram | Posts, stories, reels, highlights with metadata. |
| TikTok-Api | 6k | TikTok | Unofficial API wrapper for Python. |
| PRAW | 4k | Reddit | Official Python Reddit API Wrapper. |

(⬆ back to top)

🎬 Media Downloaders

| Tool | Stars | Description |
| --- | --- | --- |
| yt-dlp | 149k | YouTube and 1000+ sites (fork of youtube-dl). |
| lux | 31k | Go video downloader, 40+ sites (formerly annie). |
| spotdl | 24k | Spotify tracks/playlists with metadata and album art. |
| gallery-dl | 17k | Image galleries from 100+ sites (Pixiv, Twitter, Reddit). |

(⬆ back to top)

📄 Document & PDF Extraction

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Docling | 54k | Python | IBM: PDFs, DOCX, PPTX into AI-ready output. |
| Unstructured | 14k | Python | ETL pipeline for documents → structured data for LLMs. |
| pdfplumber | 10k | Python | Text, tables, and layout from PDFs with precision. |
| PyMuPDF | 9k | Python | Fast PDF/XPS/EPUB extraction and rendering. |
| Tabula | 7k | Java | Data tables from PDFs. ⚠️ |
| pdfminer.six | 7k | Python | PDF text extraction with layout analysis. |
| Camelot | 4k | Python | PDF table extraction: lattice and stream modes. |
| tabula-py | 2k | Python | Python wrapper for Tabula. ⚠️ |

(⬆ back to top)

👁️ OCR & Screen Scraping

| Tool | Stars | Language | Description |
| --- | --- | --- | --- |
| Tesseract | 73k | C++ | Google's OCR engine, 100+ languages. |
| PaddleOCR | 71k | Python | Lightweight OCR, 100+ languages with LLM integration. |
| EasyOCR | 29k | Python | Ready-to-use OCR, 80+ languages, PyTorch. |
| pytesseract | 6k | Python | Python wrapper for Tesseract. |

(⬆ back to top)

⛓️ Blockchain & On-Chain

| Tool | Type | Description |
| --- | --- | --- |
| The Graph | Open Source (3k) | Blockchain indexing via GraphQL subgraphs. |
| Subsquid | Open Source | Blockchain indexer, 50k+ blocks/sec. |
| Dune Analytics | SaaS | SQL-based blockchain analytics. |
| Etherscan APIs | Freemium API | REST APIs for Ethereum data. |

(⬆ back to top)

☁️ Managed Scraping APIs

Pay-per-request services that handle proxies, browsers, and anti-bot for you.

| Service | Best For | Key Feature |
| --- | --- | --- |
| Apify | Full-stack platform | 20,000+ pre-built scrapers, works with Crawlee/Scrapy/Playwright. |
| ScrapingBee | Simple API access | JS rendering, screenshots, Google Search API, 1k free calls. |
| ZenRows | Anti-bot bypass | >99% success vs Cloudflare, Puppeteer/Playwright support. |
| ScrapFly | Multi-API | Scraping + screenshots + crawler APIs, 130M+ proxy IPs. |
| Browserless | Headless browsers | Headless Chrome in Docker, BrowserQL, self-hostable. |
| Browserbase | AI browser agents | Cloud browsers for AI, session persistence, Stagehand integration. |
| Oxylabs | Enterprise | ML-driven proxy rotation, e-commerce specialized. |
| Bright Data | Scale | Web Unlocker with CAPTCHA solving, geo-routing, mobile UA. |
| SerpApi | SERP data | Structured results from Google, Bing, Yahoo. |
| ScraperAPI | Getting started | 40M+ proxies, 5k free credits. |

(⬆ back to top)

🧪 CAPTCHA Solving

| Service | Method | Starting Price | Supports |
| --- | --- | --- | --- |
| 2Captcha | Human workers | $1/1k solves | reCAPTCHA, Turnstile, FunCaptcha, GeeTest, image. |
| Anti-Captcha | Human workers | $0.0005/token | reCAPTCHA, hCaptcha, FunCaptcha, Turnstile. |
| CapSolver | AI/ML | $0.65/1k | reCAPTCHA, AWS WAF, Cloudflare, GeeTest. |
| CapMonster Cloud | AI/ML | $0.02/1k images | reCAPTCHA, hCaptcha, Turnstile, <2s solve. |

(⬆ back to top)

🌍 Proxy Providers

| Provider | Network Size | Starting Price | Highlights |
| --- | --- | --- | --- |
| Bright Data | 150M+ IPs | Pay-as-you-go | Largest network, 195 countries. |
| Oxylabs | 100M+ IPs | Contact sales | Fastest latency, city/ZIP targeting. |
| Decodo | 125M+ IPs | $8.50/GB | Good value, 50 US states. |
| IPRoyal | 34M+ IPs | $1.75/GB | Cheapest residential, ethically sourced. |
| NetNut | 85M+ IPs | Contact sales | One-hop ISP connectivity (faster). |
| Webshare | 10M+ IPs | $2.99/mo | Budget-friendly, free tier. |

(⬆ back to top)

🆓 Try Free

| Service | Free Tier | Try It |
| --- | --- | --- |
| ScraperAPI | 5,000 free API credits | Start free → |
| ScrapingBee | 1,000 free API calls | Start free → |
| Webshare | Free tier with 10 proxies | Start free → |
| 2Captcha | $1 starting balance | Start free → |
| NetNut | 7-day free trial | Start free → |

(⬆ back to top)

🪦 Deprecated Tools Graveyard

| Dead Tool | Why | Use Instead |
| --- | --- | --- |
| PhantomJS | Archived 2018 | Playwright, Puppeteer. |
| CasperJS | Depended on PhantomJS | Playwright, Puppeteer. |
| Nightmare | Unmaintained since 2020 | Playwright. |
| Zombie.js | Unmaintained | Playwright, Puppeteer. |
| SlimerJS | Unmaintained, Gecko-based | Playwright (Firefox). |
| Splash | Scrapinghub, deprecated | Scrapy-Playwright. |
| twint | Archived Mar 2023, blocked by Twitter | Official API. |
| Goutte (PHP) | Deprecated by Symfony | Symfony BrowserKit + DomCrawler. |
| snscrape | Unmaintained since Nov 2023 | Official APIs. |
| undetected-chromedriver | Aging, last push Jul 2025 | Nodriver, Camoufox. |
| puppeteer-extra-stealth | Unmaintained since Jul 2024 | Patchright, Camoufox. |
| playwright-stealth | Unmaintained since Nov 2023 | Patchright, Camoufox. |
| curl-impersonate | Unmaintained since Jul 2024 | curl_cffi. |
| GoogleScraper | Unmaintained since Jul 2021 | SerpApi. |
| pyautogui | Unmaintained since Aug 2024 | pytesseract + Playwright. |
| SikuliX | Stale, niche | Playwright, pytesseract. |

(⬆ back to top)


Disclosure

Some links in the Managed Scraping APIs, CAPTCHA Solving, Proxy Providers, and Try Free sections are affiliate/referral links. These help support the maintenance of this list. All tools are included based on merit; affiliate status does not influence placement or rankings.

(⬆ back to top)

🔗 Related Awesome Lists

| List | Description |
| --- | --- |
| awesome-ai | 400+ AI APIs, tools, frameworks, and platforms. |
| awesome-robotics | Robotics frameworks, simulators, and platforms. |
| awesome-web3-ai | Web3 x AI tools, agent frameworks, and protocols. |

Contributing

Contributions welcome! Please read the contribution guidelines first.

  • Add tools you've actually used or evaluated
  • Include star count and language where applicable
  • Note if a tool is unmaintained (last commit >1 year ago)
  • Commercial tools/services are fine but must be clearly labeled
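
The two staleness rules in this list (⚠️ aging at 6-12 months since the last commit, unmaintained beyond 1 year) can be sketched as a small helper for contributors checking an entry. The function name, the "active" label, and the day-based cutoffs (182 and 365 days) are assumptions made for this example; the list states its thresholds only in months and years.

```python
from datetime import date


def maintenance_status(last_commit: date, today: date) -> str:
    """Classify a tool by the number of days since its last commit.

    Assumption: 6 months is treated as 182 days and 1 year as 365 days.
    """
    age_days = (today - last_commit).days
    if age_days > 365:
        return "unmaintained"  # last commit more than 1 year ago
    if age_days >= 182:
        return "aging"  # 6-12 months: mark the entry with the warning sign
    return "active"  # hypothetical label for everything newer


# Example check against a fixed reference date.
status = maintenance_status(date(2024, 6, 1), today=date(2025, 3, 1))
```

Passing `today` explicitly keeps the check reproducible; a caller would normally supply `date.today()`.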

License

CC0

To the extent possible under law, Edward Tay has waived all copyright and related or neighboring rights to this work.
