B2 SiteFox

A smart and efficient website content archiving tool, created in the Fox Cities.

Features

Text Content Extraction
- Downloads and processes webpage content
- Removes headers, footers, and navigation elements
- Creates clean HTML and Markdown versions
- Generates a table of contents

Smart Image Downloading
- Optimized for WordPress sites
- Automatically finds and downloads full-size images
- Organizes images by page
- Handles scaled image variants
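
To illustrate what handling scaled variants can involve, here is a minimal sketch, not SiteFox's actual code. It relies on the WordPress naming convention in which resized copies carry a `-<width>x<height>` suffix and downsized copies of very large uploads carry a `-scaled` suffix. The function name `full_size_candidates` is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# WordPress convention: "photo-300x200.jpg" is a resized copy of "photo.jpg",
# and "photo-scaled.jpg" is a downsized copy of a very large upload.
_SIZE_SUFFIX = re.compile(r"-\d+x\d+(?=\.[A-Za-z0-9]+$)")
_SCALED_SUFFIX = re.compile(r"-scaled(?=\.[A-Za-z0-9]+$)")


def full_size_candidates(image_url: str) -> list[str]:
    """Return URLs to try, most likely original first, input URL last.

    Hypothetical helper for illustration; the real scraper may work differently.
    """
    parts = urlsplit(image_url)
    path = parts.path

    without_size = _SIZE_SUFFIX.sub("", path)
    without_scaled = _SCALED_SUFFIX.sub("", without_size)

    candidates = []
    for candidate_path in (without_scaled, without_size, path):
        url = urlunsplit(parts._replace(path=candidate_path))
        if url not in candidates:  # skip duplicates when no suffix was present
            candidates.append(url)
    return candidates
```

A caller would request each candidate in order and keep the first one that downloads successfully. Falling back to the input URL matters because a file whose real name happens to end in something like `-1920x1080` would otherwise be lost.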

Installation

Clone the repository:

git clone https://github.com/yourusername/sitefox.git
cd sitefox

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Unix/Mac
# or
venv\Scripts\activate     # On Windows

Install dependencies:

pip install -r requirements.txt

Usage

Run the main script:

python sitefox.py

Choose your desired operation:

- Download page text and create HTML/Markdown versions
- Download images (optimized for WordPress sites)
- Download both text and images

Enter the website domain when prompted (e.g., example.com).
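
As a rough illustration of how a bare domain such as `example.com` could be turned into a crawl starting point, consider the sketch below. It is an assumption, not SiteFox's actual input handling: the function name `normalize_domain` and the choice of `https` as the default scheme are both hypothetical.

```python
from urllib.parse import urlsplit


def normalize_domain(user_input: str) -> tuple[str, str]:
    """Return (domain, start_url) for input such as 'example.com'.

    Hypothetical helper; assumes https when the user gives no scheme.
    """
    text = user_input.strip()
    if not text:
        raise ValueError("a domain is required")

    # urlsplit only recognizes the host when a scheme separator is present.
    if "://" not in text:
        text = "https://" + text

    parts = urlsplit(text)
    domain = parts.netloc.lower()
    if not domain:
        raise ValueError(f"could not find a domain in {user_input!r}")

    return domain, f"{parts.scheme}://{domain}/"
```

With this approach, pasting a full page URL at the prompt still resolves to the site root, and the lowercased domain can double as the folder name under `downloads/`.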

Project Structure

sitefox/
├── sitefox.py           # Main entry point
├── sitefox_text/        # Text scraper package
│   ├── __init__.py
│   └── scraper.py
├── sitefox_images/      # Image scraper package
│   ├── __init__.py
│   └── scraper.py
├── downloads/           # All downloaded content is stored here
│   └── domain.com/      # Separate folder for each domain
├── venv/                # Virtual environment (not in repo)
├── requirements.txt     # Dependencies
├── README.md            # Documentation
└── LICENSE              # MIT License

Output Structure

sitefox/downloads/domain.com/
├── html/                # Text content
│   ├── index.html
│   ├── page1.html
│   └── toc.html
├── page1/               # Images by page
│   ├── image1.jpg
│   └── image2.png
├── page2/
│   └── image3.jpg
├── domain.com_content.md
└── sitefox_report.txt
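
The layout above can be expressed as a small path-mapping sketch. This is illustrative only: the helper names `page_slug` and `output_paths`, and the rule for turning a URL path into a file or folder name, are assumptions rather than SiteFox's actual implementation.

```python
from pathlib import Path
from urllib.parse import urlsplit

DOWNLOADS = Path("downloads")  # matches the downloads/ folder shown above


def page_slug(page_url: str) -> str:
    """Flatten a URL path into one name: '/' -> 'index', '/about/team/' -> 'about-team'.

    Assumed naming rule, for illustration only.
    """
    path = urlsplit(page_url).path.strip("/")
    return path.replace("/", "-") or "index"


def output_paths(domain: str, page_url: str) -> dict[str, Path]:
    """Map one page to the locations in the output tree shown above."""
    root = DOWNLOADS / domain
    slug = page_slug(page_url)
    return {
        "html": root / "html" / f"{slug}.html",     # cleaned page text
        "images": root / slug,                      # one image folder per page
        "markdown": root / f"{domain}_content.md",  # combined Markdown file
        "report": root / "sitefox_report.txt",      # run report
    }
```

For example, `output_paths("domain.com", "https://domain.com/page1/")` yields `downloads/domain.com/html/page1.html` for the text and `downloads/domain.com/page1` for that page's images.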

License

MIT License. See the LICENSE file for details.

Author

Created by Brett Belau in the Fox Cities.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
