Scrape2Markdown

A streamlined tool that converts web pages to markdown, designed for collecting documentation for RAG (Retrieval-Augmented Generation) systems and AI agents.

Features

Process multiple URLs simultaneously
Convert HTML content to clean markdown format
Real-time markdown preview
Flexible content filtering options (class-based and element-based)
Export options (clipboard copy and file download)
Simple and intuitive two-column interface

Installation

Prerequisites

Python 3.x
pip package manager

Setup

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

On macOS/Linux:

source venv/bin/activate

On Windows:

.\venv\Scripts\activate

Install required packages:

pip install streamlit beautifulsoup4 requests markdownify pyperclip

Usage

Start the application:

streamlit run app.py

Using the Application:
- Add URLs through the left sidebar interface
- Manage URL list (add, remove, clear all)
- Use filtering options to customize content extraction
- Preview generated markdown in real-time
- Export by copying to the clipboard or downloading a timestamped file

Content Filtering

Class-based filtering with multiselect options
Element-based filtering with multiselect options
Customizable content extraction
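
The two filter types can be pictured with a minimal BeautifulSoup sketch. It assumes that filtering means removing the selected elements and classes before conversion; the README does not say whether the app removes or keeps matches, and `filter_html`, `drop_elements`, and `drop_classes` are hypothetical names, not taken from app.py.

```python
from bs4 import BeautifulSoup


def filter_html(html, drop_elements=(), drop_classes=()):
    """Return HTML with the selected tag names and CSS classes removed.

    Assumption: "filtering" means dropping matches. The real app may
    instead keep only the matching content.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Element-based filtering: remove every tag with a matching name.
    for name in drop_elements:
        tag = soup.find(name)
        while tag is not None:
            tag.decompose()
            tag = soup.find(name)

    # Class-based filtering: remove every tag carrying a matching class.
    for css_class in drop_classes:
        tag = soup.find(class_=css_class)
        while tag is not None:
            tag.decompose()
            tag = soup.find(class_=css_class)

    return str(soup)


sample = '<div><nav>menu</nav><p class="ad">buy</p><p>keep</p></div>'
cleaned = filter_html(sample, drop_elements=["nav"], drop_classes=["ad"])
```

Searching again after each removal avoids touching a nested match that has already been destroyed along with its parent.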

Export Options

Copy entire content to clipboard
Download as markdown file with timestamp
Clean markdown export (removes URL headers)
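
As a rough illustration of the timestamped download and the clean export, the sketch below uses only the standard library. The file name pattern and the `## Source: <url>` header format are assumptions made for this example; the README does not specify either, and `export_filename` and `strip_url_headers` are hypothetical helpers.

```python
import re
from datetime import datetime


def export_filename(now=None, prefix="scrape2markdown"):
    """Build a file name that embeds the export time (assumed pattern)."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.md"


# Hypothetical header format: a heading line such as
# "## Source: https://example.com/docs" placed before each page's content.
URL_HEADER = re.compile(r"^#{1,6} Source: https?://\S+[ \t]*$", re.MULTILINE)


def strip_url_headers(markdown):
    """Remove per-URL header lines and collapse the blank lines left behind."""
    without_headers = URL_HEADER.sub("", markdown)
    collapsed = re.sub(r"\n{3,}", "\n\n", without_headers)
    return collapsed.strip() + "\n"
```

Passing `now` explicitly keeps the file name deterministic, which is convenient when testing the export path.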

Project Documentation

Development Workflow

The project maintains several key documentation files:

todo.md

Tracks current implementation status and progress
Lists prioritized goals and upcoming tasks
Documents completed features with timestamps
Requires explicit approval for modifying developer tasks
Must be updated before and after any development work

design.md

Contains project vision and technical architecture
Documents feature specifications
Tracks architectural decisions and rationale

documentation.md

Technical documentation
API references
Library dependencies
Setup instructions

Technical Stack

Frontend: Streamlit web interface
Backend: Python-based processing pipeline
Core Libraries:
- streamlit: Web application framework
- beautifulsoup4: HTML parsing
- requests: HTTP requests
- markdownify: HTML to markdown conversion
- pyperclip: Cross-platform clipboard operations

Data Flow

Document Collection:
- URL validation and management
- Duplicate prevention
Content Processing:
- HTML content fetching
- Content filtering
- Markdown conversion
Output Generation:
- Real-time preview
- Export options
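
The document collection step can be sketched with the standard library alone. The validation rule (an `http` or `https` scheme plus a host) and the exact-match duplicate check are assumptions about what "URL validation" and "duplicate prevention" involve; `is_valid_url` and `add_url` are hypothetical names.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def add_url(urls, candidate):
    """Append candidate to urls if it is valid and not already present.

    Returns True when the URL was added, False otherwise.
    """
    candidate = candidate.strip()
    if not is_valid_url(candidate) or candidate in urls:
        return False
    urls.append(candidate)
    return True


urls = []
add_url(urls, "https://example.com/docs")  # added
add_url(urls, "https://example.com/docs")  # rejected as a duplicate
add_url(urls, "not a url")                 # rejected as invalid
```

A list preserves the order in which URLs were added, which matters if the output concatenates pages in that order. Exact string comparison treats `https://example.com/docs` and `https://example.com/docs/` as different entries, so a stricter app might normalize URLs first.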
