The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications
Convert any web content into clean, structured Markdown format with advanced bot protection and intelligent AI processing
🙏 A humble request I am currently on the free plan of Cloudflare Workers, so there is a daily limit on browser rendering, which means it might not work for website URLs all the time. You can deploy your own Markminion easily. Everything else will work all the time.
- Web Pages: Convert any webpage to clean Markdown with advanced readability extraction
- Documents: Extract text from PDFs, DOCX, DOC, TXT, MD, HTML, and XLSX files
- Videos: Extract metadata, descriptions, and transcripts from YouTube, Vimeo, and other platforms
- Social Media: Parse Twitter/X tweets with full metadata and engagement metrics
- Google Docs: Direct extraction from Google Docs, Sheets, and Presentations
- Smart Content Filtering: AI-powered removal of ads, navigation, and irrelevant content
- Chunked Processing: Intelligent text chunking for large documents
- Content Optimization: Tailored output for LLM consumption
- Stealth Browsing: Advanced bot detection bypass with randomized user agents
- Human-like Behavior: Simulated human browsing patterns and delays
- Multiple Retry Logic: Robust navigation with intelligent fallbacks
- Intelligent Caching: Multi-layer caching with KV storage for optimal performance
- Rate Limiting: Built-in protection against abuse
- Subpage Crawling: Extract content from multiple related pages (up to 10)
- Metadata Extraction: Comprehensive metadata including OpenGraph, Twitter Cards, and structured data
- Plain Text: Clean Markdown output
- JSON: Structured data with metadata
- Detailed Mode: Enhanced extraction with additional content
- Metadata-Only: Extract just the page metadata
Mark Minion leverages modern serverless architecture for scalability and reliability:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Client App │────│ Cloudflare CDN │────│ Cloudflare Edge │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐
│ Durable Objects │
│ (Browser) │
└─────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Content Proc. │ │ Document Proc. │ │ Video Proc. │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└──────────────────────────┼──────────────────────────┘
│
▼
┌─────────────────┐
│ AI Filtering │
└─────────────────┘
- Durable Objects Browser: Persistent browser instances with advanced bot protection
- Content Processors: Specialized handlers for different content types
- AI Utils: Intelligent content filtering and optimization
- Metadata Extractors: Comprehensive page metadata extraction
- Caching Layer: Multi-tier caching with Cloudflare KV
# Simple webpage conversion
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com"
# With authentication
curl -H "Authorization: Bearer YOUR_TOKEN" \
"https://markminion.dhruvvakharwala.dev/?url=https://example.com"# Detailed extraction with AI filtering
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com&detailed=true&unnecessaryfilter=true"
# Extract from Google Docs
curl "https://markminion.dhruvvakharwala.dev/?url=https://docs.google.com/document/d/YOUR_DOC_ID/edit"
# YouTube video with transcript
curl "https://markminion.dhruvvakharwala.dev/?url=https://www.youtube.com/watch?v=VIDEO_ID&detailed=true"
# Twitter thread extraction
curl "https://markminion.dhruvvakharwala.dev/?url=https://x.com/username/status/TWEET_ID"
# PDF document processing
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com/document.pdf&unnecessaryfilter=true"
# Subpage crawling
curl "https://markminion.dhruvvakharwala.dev/?url=https://blog.example.com&subpage=true&detailed=true"
# Metadata extraction only
curl "https://markminion.dhruvvakharwala.dev/metadata?url=https://example.com"
# JSON output format
curl -H "Content-Type: application/json" \
"https://markminion.dhruvvakharwala.dev/?url=https://example.com"| Parameter | Type | Required | Description |
|---|---|---|---|
url |
string | ✅ | The URL to extract content from |
detailed |
boolean | ❌ | Enhanced extraction with more content (default: false) |
subpage |
boolean | ❌ | Crawl and extract from linked subpages (default: false) |
unnecessaryfilter |
boolean | ❌ | Apply AI filtering to remove ads/irrelevant content (default: false) |
| Header | Required | Description |
|---|---|---|
Authorization |
❌ | Bearer token for authenticated requests |
Content-Type |
❌ | application/json for JSON response format |
Extract only metadata without content processing.
curl "https://markminion.dhruvvakharwala.dev/metadata?url=https://example.com"# Page Title
Page content in clean Markdown format...
**Metadata:**
- Extracted at: 2025-01-01T00:00:00.000Z
- Content length: 1234 characters
- Source: https://example.com
{
"url": "https://example.com",
"content": "# Page Title\n\nPage content...",
"metadata": {
"title": "Page Title",
"description": "Page description",
"extractedAt": "2025-01-01T00:00:00.000Z",
"contentLength": 1234,
"cached": false,
"openGraph": { ... },
"twitter": { ... }
}
}- PDF: Full text extraction with metadata
- Microsoft Word: DOCX and DOC formats
- Text Files: TXT, Markdown (MD)
- Spreadsheets: XLSX with sheet separation
- HTML: Clean text extraction
- YouTube: Metadata, descriptions, and transcripts (Transcript extraction is future work)
- Vimeo: Video information and descriptions
- Direct Video URLs: Basic metadata extraction
- TikTok, Twitch: Platform-specific extraction (future work)
- Twitter/X: Full tweet data with engagement metrics
- Thread Support: Multi-tweet thread extraction
- Media Handling: Images and video metadata
- Readability Extraction: Clean content using Mozilla Readability
- Fallback Processing: Multiple extraction strategies
- Rich Metadata: OpenGraph, Twitter Cards, JSON-LD
# Required
BACKEND_SECURITY_TOKEN=your_secret_token- Authenticated requests: No rate limits
- Public requests: Limited to prevent abuse
- Per-IP tracking: Using Cloudflare's connecting IP
- Clone the repository
git clone https://github.com/vakharwalad23/mark-minion.git
cd mark-minion- Install dependencies
bun install- Type Generation
bun run cf-typegen- Deploy
bun run deploy# Start development server
bun run dev
# Run tests (Currently just a placeholder test for the example)
bun run test- Training Data: Clean, structured content for model training
- RAG Systems: High-quality documents for retrieval augmentation
- Content Analysis: Preprocessed data for NLP tasks
- Documentation: Convert web content to documentation formats
- Research: Bulk extraction from multiple sources
- Archival: Clean content storage and preservation
- APIs: Programmatic content extraction
- Workflows: Part of larger data processing pipelines
- Automation: Scheduled content updates
# AI-powered content cleaning
curl "https://markminion.dhruvvakharwala.dev/?url=https://news-site.com&unnecessaryfilter=true"# Process multiple related pages
curl "https://markminion.dhruvvakharwala.dev/?url=https://docs.example.com&subpage=true"# Specify output format
curl -H "Content-Type: application/json" \
"https://markminion.dhruvvakharwala.dev/?url=https://example.com"- Cold Start: ~2-3 seconds for new browser instances
- Warm Requests: ~500ms-2s depending on content
- Caching: Subsequent requests ~50-200ms
- Concurrent Users: Scales automatically with Cloudflare
- Rate Limiting: Protection against abuse
- Input Validation: URL and parameter sanitization
- Bot Protection: Advanced anti-detection measures
- Token Authentication: Optional API key protection
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Mozilla Readability for content extraction
- Turndown for HTML to Markdown conversion
- Cloudflare for the serverless platform
- Puppeteer for browser automation
Built with ❤️ using Cloudflare Workers