A streamlined tool for converting web pages to markdown format, designed specifically for collecting documentation for RAG (Retrieval-Augmented Generation) systems and AI agents.
- Process multiple URLs simultaneously
- Convert HTML content to clean markdown format
- Real-time markdown preview
- Flexible content filtering options (class-based and element-based)
- Export options (clipboard copy and file download)
- Simple and intuitive two-column interface
- Python 3.x
- pip package manager
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
- On macOS/Linux:
source venv/bin/activate
- On Windows:
.\venv\Scripts\activate
- Install required packages:
pip install streamlit beautifulsoup4 requests markdownify pyperclip
- Start the application:
streamlit run app.py
- Using the Application:
- Add URLs through the left sidebar interface
- Manage URL list (add, remove, clear all)
- Use filtering options to customize content extraction
- Preview the generated markdown in real time
- Export by copying to the clipboard or downloading a timestamped file
- Class-based filtering with multiselect options
- Element-based filtering with multiselect options
- Customizable content extraction
- Copy entire content to clipboard
- Download as markdown file with timestamp
- Clean markdown export (removes URL headers)
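
The export options above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the helper names `timestamped_filename` and `strip_url_headers`, the filename pattern, and the assumed URL header format (a level-2 heading containing only the source URL) are all hypothetical.

```python
import re
from datetime import datetime


def timestamped_filename(prefix="docs", now=None):
    """Build a download name such as docs_20240102_030405.md (assumed pattern)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.md"


# Assumption: each page is introduced by a heading line like
# "## https://example.com/page". The real header format may differ.
URL_HEADER = re.compile(r"^## https?://\S+\s*$")


def strip_url_headers(markdown):
    """Drop the assumed per-page URL header lines for a clean export."""
    kept = [line for line in markdown.splitlines() if not URL_HEADER.match(line)]
    return "\n".join(kept).strip() + "\n"


combined = "## https://example.com/a\n\n# Title\n\nBody\n"
clean = strip_url_headers(combined)
filename = timestamped_filename("docs", datetime(2024, 1, 2, 3, 4, 5))
print(filename)
print(clean)
```

In a Streamlit app, the cleaned text and generated name could then be passed to `st.download_button` (as `data` and `file_name`), or the text handed to `pyperclip.copy` for the clipboard option.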
The project maintains several key documentation files:
- Tracks current implementation status and progress
- Lists prioritized goals and upcoming tasks
- Documents completed features with timestamps
- Requires explicit approval for modifying developer tasks
- Must be updated before and after any development work
- Contains project vision and technical architecture
- Documents feature specifications
- Tracks architectural decisions and rationale
- Technical documentation
- API references
- Library dependencies
- Setup instructions
- Frontend: Streamlit web interface
- Backend: Python-based processing pipeline
- Core Libraries:
- streamlit: Web application framework
- beautifulsoup4: HTML parsing
- requests: HTTP requests
- markdownify: HTML to markdown conversion
- pyperclip: Cross-platform clipboard operations
- Document Collection:
- URL validation and management
- Duplicate prevention
- Content Processing:
- HTML content fetching
- Content filtering
- Markdown conversion
- Output Generation:
- Real-time preview
- Export options