Mark Minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications

Convert any web content into clean, structured Markdown format with advanced bot protection and intelligent AI processing

🙏 A humble request: the hosted instance runs on the free plan of Cloudflare Workers, which has a daily limit on browser rendering, so requests for website URLs may sometimes fail. You can easily deploy your own Mark Minion to avoid this. Everything that does not need browser rendering keeps working regardless of the limit.

✨ Features

🌐 Universal Content Support

Web Pages: Convert any webpage to clean Markdown with advanced readability extraction
Documents: Extract text from PDFs, DOCX, DOC, TXT, MD, HTML, and XLSX files
Videos: Extract metadata and descriptions from YouTube, Vimeo, and other platforms (transcript extraction is planned)
Social Media: Parse Twitter/X tweets with full metadata and engagement metrics
Google Docs: Direct extraction from Google Docs, Sheets, and Presentations

🤖 AI-Powered Processing

Smart Content Filtering: AI-powered removal of ads, navigation, and irrelevant content
Chunked Processing: Intelligent text chunking for large documents
Content Optimization: Tailored output for LLM consumption
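
The chunking strategy is not described in this README. Purely as an illustration of the idea, the sketch below splits text on blank lines and packs paragraphs into chunks that stay under a character budget. `chunkText` and `maxChars` are hypothetical names and are not taken from the Mark Minion source.

```typescript
// Hypothetical sketch: paragraph-aware chunking under a character budget.
// Not Mark Minion's actual implementation.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // Hard-split any paragraph that is longer than the budget on its own.
    const pieces: string[] = [];
    for (let i = 0; i < para.length; i += maxChars) {
      pieces.push(para.slice(i, i + maxChars));
    }
    for (const piece of pieces) {
      const candidate = current === "" ? piece : current + "\n\n" + piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits, keep packing
      } else {
        if (current !== "") chunks.push(current);
        current = piece; // start a new chunk
      }
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Every returned chunk is at most `maxChars` characters long, and paragraph boundaries are preserved whenever a paragraph fits within the budget.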

🛡️ Advanced Bot Protection (Still Work In Progress)

Stealth Browsing: Advanced bot detection bypass with randomized user agents
Human-like Behavior: Simulated human browsing patterns and delays
Multiple Retry Logic: Robust navigation with intelligent fallbacks

🚀 Performance & Reliability

Intelligent Caching: Multi-layer caching with KV storage for optimal performance
Rate Limiting: Built-in protection against abuse
Subpage Crawling: Extract content from multiple related pages (up to 10)
Metadata Extraction: Comprehensive metadata including OpenGraph, Twitter Cards, and structured data
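
How cache entries are keyed is not documented here. One plausible approach, shown as a sketch only, is to derive the key from the target URL plus the request options, so that a `detailed=true` result never overwrites a default one. The `cacheKey` function and the `md:` prefix are assumptions made for illustration.

```typescript
// Hypothetical sketch: one cache key per (URL, option set).
// The key layout is an assumption, not Mark Minion's actual scheme.
function cacheKey(
  targetUrl: string,
  options: { detailed?: boolean; subpage?: boolean; unnecessaryfilter?: boolean } = {},
): string {
  const u = new URL(targetUrl);
  // Fragments are never sent to the server, so they should not split the cache.
  u.hash = "";
  const flags = [
    options.detailed ? "d1" : "d0",
    options.subpage ? "s1" : "s0",
    options.unnecessaryfilter ? "f1" : "f0",
  ].join("");
  return `md:${flags}:${u.toString()}`;
}
```

A key built this way could be used with any key-value store. Very long URLs may need hashing before they are used as keys.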

📊 Output Formats

Plain Text: Clean Markdown output
JSON: Structured data with metadata
Detailed Mode: Enhanced extraction with additional content
Metadata-Only: Extract just the page metadata

🏗️ Architecture

Mark Minion leverages modern serverless architecture for scalability and reliability:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Client App    │────│  Cloudflare CDN  │────│ Cloudflare Edge │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
                                               ┌─────────────────┐
                                               │ Durable Objects │
                                               │   (Browser)     │
                                               └─────────────────┘
                                                         │
                              ┌──────────────────────────┼──────────────────────────┐
                              │                          │                          │
                              ▼                          ▼                          ▼
                    ┌─────────────────┐        ┌─────────────────┐        ┌─────────────────┐
                    │  Content Proc.  │        │  Document Proc. │        │   Video Proc.   │
                    └─────────────────┘        └─────────────────┘        └─────────────────┘
                              │                          │                          │
                              └──────────────────────────┼──────────────────────────┘
                                                         │
                                                         ▼
                                               ┌─────────────────┐
                                               │   AI Filtering  │
                                               └─────────────────┘

Core Components

Durable Objects Browser: Persistent browser instances with advanced bot protection
Content Processors: Specialized handlers for different content types
AI Utils: Intelligent content filtering and optimization
Metadata Extractors: Comprehensive page metadata extraction
Caching Layer: Multi-tier caching with Cloudflare KV
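
To make the role of the content processors more concrete, here is a minimal sketch of how a request might be routed to one of them based on the URL alone. The `pickProcessor` function, the `ProcessorKind` names, and the matching rules are all assumptions. The real routing may also inspect response headers or page content.

```typescript
// Hypothetical sketch: choose a processor from the URL alone.
// Names and rules are illustrative, not taken from the Mark Minion source.
type ProcessorKind = "google-docs" | "video" | "social" | "document" | "web";

const DOCUMENT_EXTENSIONS = [".pdf", ".docx", ".doc", ".txt", ".md", ".xlsx"];

function pickProcessor(targetUrl: string): ProcessorKind {
  const u = new URL(targetUrl);
  const host = u.hostname.replace(/^www\./, "");
  const path = u.pathname.toLowerCase();
  if (host === "docs.google.com") return "google-docs";
  if (host === "youtube.com" || host === "vimeo.com") return "video";
  if (host === "x.com" || host === "twitter.com") return "social";
  if (DOCUMENT_EXTENSIONS.some((ext) => path.endsWith(ext))) return "document";
  // Anything else is treated as an ordinary web page.
  return "web";
}
```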

🚀 Quick Start

Basic Usage

# Simple webpage conversion
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com"

# With authentication
curl -H "Authorization: Bearer YOUR_TOKEN" \
     "https://markminion.dhruvvakharwala.dev/?url=https://example.com"

Advanced Examples

# Detailed extraction with AI filtering
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com&detailed=true&unnecessaryfilter=true"

# Extract from Google Docs
curl "https://markminion.dhruvvakharwala.dev/?url=https://docs.google.com/document/d/YOUR_DOC_ID/edit"

# YouTube video (metadata and description; transcript extraction is future work)
curl "https://markminion.dhruvvakharwala.dev/?url=https://www.youtube.com/watch?v=VIDEO_ID&detailed=true"

# Twitter thread extraction
curl "https://markminion.dhruvvakharwala.dev/?url=https://x.com/username/status/TWEET_ID"

# PDF document processing
curl "https://markminion.dhruvvakharwala.dev/?url=https://example.com/document.pdf&unnecessaryfilter=true"

# Subpage crawling
curl "https://markminion.dhruvvakharwala.dev/?url=https://blog.example.com&subpage=true&detailed=true"

# Metadata extraction only
curl "https://markminion.dhruvvakharwala.dev/metadata?url=https://example.com"

# JSON output format
curl -H "Content-Type: application/json" \
     "https://markminion.dhruvvakharwala.dev/?url=https://example.com"

📖 API Reference

Base Endpoint: `GET /`

Query Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ | The URL to extract content from |
| `detailed` | boolean | ❌ | Enhanced extraction with more content (default: false) |
| `subpage` | boolean | ❌ | Crawl and extract from linked subpages (default: false) |
| `unnecessaryfilter` | boolean | ❌ | Apply AI filtering to remove ads/irrelevant content (default: false) |
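
When the target URL has its own query string, any unencoded `&` in it would be read as the start of a new Mark Minion parameter. The sketch below builds a request URL with the target percent-encoded, assuming the service decodes the `url` parameter in the standard way. The base URL and parameter names come from the examples in this README. The `buildRequestUrl` helper is a hypothetical client-side function.

```typescript
// Client-side sketch; buildRequestUrl is a hypothetical helper name.
const BASE_URL = "https://markminion.dhruvvakharwala.dev/";

function buildRequestUrl(
  targetUrl: string,
  options: { detailed?: boolean; subpage?: boolean; unnecessaryfilter?: boolean } = {},
): string {
  // URLSearchParams percent-encodes the target, including any "&" and "?".
  const params = new URLSearchParams({ url: targetUrl });
  if (options.detailed) params.set("detailed", "true");
  if (options.subpage) params.set("subpage", "true");
  if (options.unnecessaryfilter) params.set("unnecessaryfilter", "true");
  return `${BASE_URL}?${params.toString()}`;
}
```

For example, `buildRequestUrl("https://example.com/search?q=a&page=2", { detailed: true })` keeps `page=2` inside the `url` value instead of leaking it into the outer query string.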

Headers

| Header | Required | Description |
|---|---|---|
| `Authorization` | ❌ | Bearer token for authenticated requests |
| `Content-Type` | ❌ | `application/json` for JSON response format |
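
As a small client-side illustration, the two optional headers can be assembled like this. `buildHeaders` is a hypothetical helper name. The header names and values are the ones listed in the table above.

```typescript
// Hypothetical helper: assemble the optional request headers.
function buildHeaders(token?: string, wantJson: boolean = false): Record<string, string> {
  const headers: Record<string, string> = {};
  // Only send Authorization when a token is available.
  if (token) headers["Authorization"] = `Bearer ${token}`;
  // Per this README, this request header selects the JSON response format.
  if (wantJson) headers["Content-Type"] = "application/json";
  return headers;
}
```

The returned object can be passed as the `headers` option of `fetch`.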

Metadata Endpoint: `GET /metadata`

Extract only metadata without content processing.

curl "https://markminion.dhruvvakharwala.dev/metadata?url=https://example.com"

Response Formats

Plain Text Response

# Page Title

Page content in clean Markdown format...

**Metadata:**
- Extracted at: 2025-01-01T00:00:00.000Z
- Content length: 1234 characters
- Source: https://example.com

JSON Response

{
  "url": "https://example.com",
  "content": "# Page Title\n\nPage content...",
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "extractedAt": "2025-01-01T00:00:00.000Z",
    "contentLength": 1234,
    "cached": false,
    "openGraph": { ... },
    "twitter": { ... }
  }
}
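
For callers written in TypeScript, the JSON shape above could be described with types along these lines. This is a sketch based only on the example response: the interface names are assumptions, and because the contents of `openGraph` and `twitter` are elided in the example, they are typed loosely here.

```typescript
// Sketch of client-side types inferred from the example response above.
interface MarkMinionMetadata {
  title: string;
  description: string;
  extractedAt: string; // ISO 8601 timestamp in the example
  contentLength: number;
  cached: boolean;
  openGraph?: Record<string, unknown>; // shape not shown in the example
  twitter?: Record<string, unknown>; // shape not shown in the example
}

interface MarkMinionResponse {
  url: string;
  content: string;
  metadata: MarkMinionMetadata;
}

// Runtime check of the fields this sketch relies on; it is not exhaustive.
function isMarkMinionResponse(value: unknown): value is MarkMinionResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const m = v["metadata"];
  if (typeof m !== "object" || m === null) return false;
  const meta = m as Record<string, unknown>;
  return (
    typeof v["url"] === "string" &&
    typeof v["content"] === "string" &&
    typeof meta["title"] === "string" &&
    typeof meta["contentLength"] === "number" &&
    typeof meta["cached"] === "boolean"
  );
}
```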

🔧 Supported Content Types

📄 Documents

PDF: Full text extraction with metadata
Microsoft Word: DOCX and DOC formats
Text Files: TXT, Markdown (MD)
Spreadsheets: XLSX with sheet separation
HTML: Clean text extraction

🎥 Videos

YouTube: Metadata and descriptions (transcript extraction is future work)
Vimeo: Video information and descriptions
Direct Video URLs: Basic metadata extraction
TikTok, Twitch: Platform-specific extraction (future work)

🐦 Social Media

Twitter/X: Full tweet data with engagement metrics
Thread Support: Multi-tweet thread extraction
Media Handling: Images and video metadata

🌐 Web Pages

Readability Extraction: Clean content using Mozilla Readability
Fallback Processing: Multiple extraction strategies
Rich Metadata: OpenGraph, Twitter Cards, JSON-LD

⚙️ Configuration

Environment Variables

# Required
BACKEND_SECURITY_TOKEN=your_secret_token
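
The README does not show how the token is checked. A minimal sketch of validating an `Authorization: Bearer ...` header against `BACKEND_SECURITY_TOKEN` might look like the following. `isAuthorized` is a hypothetical function, and the comparison logic is an assumption rather than the project's actual code.

```typescript
// Hypothetical sketch: validate a Bearer token against the configured secret.
function isAuthorized(authorizationHeader: string | null, expectedToken: string): boolean {
  if (!authorizationHeader || expectedToken === "") return false;
  const match = /^Bearer\s+(.+)$/.exec(authorizationHeader.trim());
  if (!match) return false;
  const provided = match[1] ?? "";
  if (provided.length !== expectedToken.length) return false;
  // Accumulate differences over the whole string instead of
  // returning at the first mismatching character.
  let diff = 0;
  for (let i = 0; i < provided.length; i++) {
    diff |= provided.charCodeAt(i) ^ expectedToken.charCodeAt(i);
  }
  return diff === 0;
}
```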

Rate Limiting

Authenticated requests: No rate limits
Public requests: Limited to prevent abuse
Per-IP tracking: Using Cloudflare's connecting IP
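
The three rules above can be sketched as a fixed-window counter keyed by client IP. This is an illustration under stated assumptions: the `FixedWindowLimiter` class, its limit, and its window size are hypothetical, and the actual limits are not documented here. In a Worker, the IP would typically be read from the `CF-Connecting-IP` request header. An in-memory map like this one is not shared between Worker instances, so a real deployment would need shared storage.

```typescript
// Hypothetical sketch: fixed-window rate limiting per client IP.
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is a millisecond timestamp, passed in so the logic is easy to test.
  allow(ip: string, authenticated: boolean, now: number): boolean {
    if (authenticated) return true; // authenticated requests are not limited
    const entry = this.hits.get(ip);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.hits.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```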

🚀 Deployment

Cloudflare Workers

Clone the repository

git clone https://github.com/vakharwalad23/mark-minion.git
cd mark-minion

Install dependencies

bun install

Type Generation

bun run cf-typegen

Deploy

bun run deploy

Local Development

# Start development server
bun run dev

# Run tests (currently only a placeholder test)
bun run test

🎯 Use Cases

🤖 AI/LLM Applications

Training Data: Clean, structured content for model training
RAG Systems: High-quality documents for retrieval-augmented generation
Content Analysis: Preprocessed data for NLP tasks

📊 Content Management

Documentation: Convert web content to documentation formats
Research: Bulk extraction from multiple sources
Archival: Clean content storage and preservation

🔄 Integration

APIs: Programmatic content extraction
Workflows: Part of larger data processing pipelines
Automation: Scheduled content updates

🛠️ Advanced Features

Content Filtering

# AI-powered content cleaning
curl "https://markminion.dhruvvakharwala.dev/?url=https://news-site.com&unnecessaryfilter=true"

Batch Processing

# Process multiple related pages
curl "https://markminion.dhruvvakharwala.dev/?url=https://docs.example.com&subpage=true"
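
How subpages are chosen is not specified beyond the limit of 10 mentioned under Features. One way to picture it, as a sketch only, is to resolve every link on the page, keep those on the same origin, drop duplicates, and stop at the limit. `selectSubpages` and its same-origin rule are assumptions made for illustration.

```typescript
// Hypothetical sketch: choose up to `max` same-origin links to crawl.
function selectSubpages(pageUrl: string, hrefs: string[], max: number = 10): string[] {
  const base = new URL(pageUrl);
  base.hash = "";
  const selected = new Set<string>();
  for (const href of hrefs) {
    if (selected.size >= max) break;
    let link: URL;
    try {
      link = new URL(href, base.href); // resolves relative links
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    link.hash = ""; // "#section" variants point at the same page
    if (link.origin !== base.origin) continue; // stay on the same site
    if (link.href === base.href) continue; // skip the starting page itself
    selected.add(link.href);
  }
  return Array.from(selected);
}
```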

Custom Headers

# Specify output format
curl -H "Content-Type: application/json" \
     "https://markminion.dhruvvakharwala.dev/?url=https://example.com"

📈 Performance

Cold Start: ~2-3 seconds for new browser instances
Warm Requests: ~500ms-2s depending on content
Caching: Subsequent requests ~50-200ms
Concurrent Users: Scales automatically with Cloudflare

🔒 Security

Rate Limiting: Protection against abuse
Input Validation: URL and parameter sanitization
Bot Protection: Advanced anti-detection measures
Token Authentication: Optional API key protection

🤝 Contributing

Fork the repository
Create a feature branch (`git checkout -b feature/amazing-feature`)
Commit your changes (`git commit -m 'Add amazing feature'`)
Push to the branch (`git push origin feature/amazing-feature`)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Issues: GitHub Issues
Discussions: GitHub Discussions

🙏 Acknowledgments

Mozilla Readability for content extraction
Turndown for HTML to Markdown conversion
Cloudflare for the serverless platform
Puppeteer for browser automation

Built with ❤️ using Cloudflare Workers
