Convert web-hosted EPUB-style books into a single, clean PDF — right from your terminal.
| Feature | Description |
|---|---|
| 🔗 Smart URL handling | Automatically normalises URLs — strips index.html, ensures trailing slashes, and validates reachability |
| 📖 EPUB metadata parsing | Reads the book's package.xml (OPF manifest) to discover the title and every XHTML page |
| 🖨️ Headless PDF rendering | Renders each page to A4 PDF via Playwright's Chromium engine — no visible browser window |
| 📑 Single-file merge | Combines all page PDFs into one output file using PyPDF2, named after the book title |
| 🧹 Auto cleanup | Temporary per-page PDFs are deleted automatically after the final merge |
| 🎨 Coloured CLI output | ANSI-coloured progress messages with emoji indicators for a clear, friendly experience |
| ⚡ CLI & interactive modes | Pass a URL via --url flag or enter it interactively when prompted |
┌──────────────┐     ┌────────────────┐     ┌──────────────────┐     ┌──────────────┐
│  User enters │────▶│  Validate URL  │────▶│  Fetch metadata  │────▶│   Discover   │
│   book URL   │     │  & normalise   │     │  (package.xml)   │     │ XHTML pages  │
└──────────────┘     └────────────────┘     └──────────────────┘     └──────┬───────┘
                                                                            │
                     ┌────────────────┐     ┌──────────────────┐            │
                     │  Output final  │◀────│  Merge all PDFs  │◀───────────┘
                     │  <title>.pdf   │     │     (PyPDF2)     │   Render each page
                     └────────────────┘     └──────────────────┘    via Playwright
- URL validation: The scraper normalises the input URL, confirms the host is reachable (HTTP 200), and checks that the EPUB structure (epub/EPUB/package.xml) exists.
- Metadata extraction: Parses the OPF package.xml using xml.etree.ElementTree to pull the <dc:title> (used as the output filename) and the full list of application/xhtml+xml items from the <manifest>.
- Page rendering: For every discovered XHTML page, Playwright launches headless Chromium, navigates to the page URL, waits for networkidle, and prints it to a temporary A4 PDF.
- Merge & cleanup: All temporary PDFs are appended in order with PdfMerger, written as <BookTitle>.pdf, and the temp files are deleted.
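
The metadata-extraction step can be sketched with the standard library alone. This is a minimal illustration rather than the tool's actual code: the function name parse_package and the SAMPLE document are hypothetical, and the sketch assumes the package file uses the standard OPF and Dublin Core namespace URIs.

```python
import xml.etree.ElementTree as ET

# Standard OPF / Dublin Core namespace URIs (assumed to match the target book).
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_package(xml_text: str) -> tuple[str, list[str]]:
    """Return (title, xhtml_hrefs) from an OPF package document."""
    root = ET.fromstring(xml_text)

    # <dc:title> lives under <metadata>; fall back to a placeholder if absent.
    title_el = root.find(".//dc:title", NS)
    title = (title_el.text or "").strip() if title_el is not None else "untitled"

    # Keep only manifest items declared as XHTML, in manifest order.
    pages = [
        item.attrib["href"]
        for item in root.findall("./opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml" and "href" in item.attrib
    ]
    return title, pages


# Hypothetical sample document, used only to exercise the function.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Introduction to Computer Science</dc:title>
  </metadata>
  <manifest>
    <item id="c1" href="chapter01.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="c2" href="chapter02.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""

if __name__ == "__main__":
    print(parse_package(SAMPLE))
```

Stylesheets, images, and fonts are skipped because only items with the application/xhtml+xml media type are collected.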
| Dependency | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Runtime |
| requests | ≥ 2.31.0 | HTTP requests for URL validation & metadata fetch |
| playwright | ≥ 1.40.0 | Headless Chromium browser for page-to-PDF rendering |
| PyPDF2 | ≥ 3.0.0 | Merging individual page PDFs into a single file |
Standard-library modules used: asyncio, argparse, urllib.parse, xml.etree.ElementTree, os, time, subprocess, sys.
python setup.py
This performs an editable install (pip install -e .) using requirements.txt, then installs Playwright's Chromium build, all in one command.
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium
pip install -e .
Note
If you skip python setup.py, you must run playwright install chromium once yourself before first use.
python main.py
You will be prompted to enter the book URL, then confirm to start the conversion.
python main.py --url "https://example.com/path/to/book/"
| Flag | Short | Description |
|---|---|---|
| --url | -u | URL of the book to extract (skips the interactive prompt) |
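
A flag like this is commonly wired up with argparse, falling back to a prompt when it is absent. The sketch below is an assumption about how such handling might look, not a copy of main.py; build_parser and resolve_url are hypothetical names.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    """Create a parser exposing the --url / -u flag."""
    parser = argparse.ArgumentParser(description="Convert a web-hosted book to PDF")
    parser.add_argument(
        "-u", "--url",
        help="URL of the book to extract (skips the interactive prompt)",
    )
    return parser


def resolve_url(argv: list[str], prompt: Callable[[str], str] = input) -> str:
    """Use the flag value when given; otherwise ask the user interactively."""
    args = build_parser().parse_args(argv)
    if args.url:
        return args.url.strip()
    return prompt("Enter Book URL: ").strip()


if __name__ == "__main__":
    import sys

    print(resolve_url(sys.argv[1:]))
```

Passing the prompt function as a parameter keeps the interactive path easy to exercise without a real terminal.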
╔══════════════════════════════════════╗
║ 📚 HU Book Extractor ║
╚══════════════════════════════════════╝
🔗 Enter Book URL:
> https://example.com/books/my-textbook/
⏳ Validating URL...
✅ URL validated https://example.com/books/my-textbook/
📦 Fetching metadata...
📄 Book Title: Introduction to Computer Science
📚 Found 12 pages
🚀 Start conversion? (y/n): y
🚀 Starting conversion...
📥 Converting page 1/12: chapter01.xhtml
📥 Converting page 2/12: chapter02.xhtml
...
🎉 Done successfully! Saved as: Introduction to Computer Science.pdf
HUBook/
├── main.py                 # CLI entry point (interactive & argparse modes)
├── setup.py                # Automated setup: editable install + Playwright Chromium
├── requirements.txt        # Third-party pip dependencies
├── lib/
│   ├── __init__.py         # Package marker
│   └── ebooks_scraper.py   # Core EbookScraper class (validate, parse, render, merge)
└── src/
    ├── logo.svg            # Square project logo
    └── banner.svg          # README banner graphic
| Module | Class / Function | Responsibility |
|---|---|---|
| main.py | Colors | ANSI colour codes for terminal output |
| main.py | run(base_url) | Async orchestration: validate → extract metadata → convert → save |
| main.py | argparse setup | Parses --url / -u flag for non-interactive use |
| lib/ebooks_scraper.py | EbookScraper | All scraping logic encapsulated in one class |
| ↳ | .normalize_url() | Strips index.html, ensures trailing /, cleans query/fragment |
| ↳ | .validate_url() | Confirms HTTP 200 for both the root URL and the package.xml |
| ↳ | .PDFnameExtractor() | Parses OPF metadata to get <dc:title> |
| ↳ | .PagesExtractor() | Reads <manifest> items with media-type="application/xhtml+xml" |
| ↳ | .html_to_pdf() | Launches headless Chromium, renders a single page to A4 PDF |
| ↳ | .convert_all() | Iterates all pages, renders each, merges with PdfMerger, cleans up |
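
The normalisation described for .normalize_url() can be approximated with urllib.parse. This is a standalone sketch under the assumptions in the table (drop a trailing index.html, force a trailing slash, discard query and fragment); the real method may handle further cases.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a directory-style base URL without query string or fragment."""
    parts = urlsplit(url.strip())
    path = parts.path

    # Drop a final "index.html" segment, keeping the directory it lives in.
    head, _, last = path.rpartition("/")
    if last == "index.html":
        path = head + "/"

    # Ensure the path ends with a slash so relative hrefs resolve against it.
    if not path.endswith("/"):
        path += "/"

    # Rebuild the URL with empty query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(normalize_url("https://example.com/books/my-textbook/index.html?x=1#top"))
```

Comparing the whole last path segment, rather than using a plain suffix check, avoids truncating names that merely end in index.html.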
- The target site must serve the book in the EPUB-over-HTTP structure this tool expects (epub/EPUB/package.xml with an OPF manifest). Unsupported layouts will fail during validation.
- Temporary PDFs (temp_1.pdf, temp_2.pdf, …) are created in the current working directory during conversion and are automatically removed after the merge completes.
- On Windows, the CLI uses ANSI escape codes for colour; use Windows Terminal, VS Code terminal, or another modern terminal for proper display.
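
The temporary-file lifecycle can be illustrated with a small standard-library sketch. It is an assumption about the pattern, not the tool's code: temp_pdf_paths and merge_with_cleanup are hypothetical helpers, and the merge step is passed in as a callable so the example runs without PyPDF2. This sketch also removes the files when the merge fails, which the tool itself may or may not do.

```python
import os
from typing import Callable, Iterable


def temp_pdf_paths(count: int, directory: str = ".") -> list[str]:
    """Build the temp_1.pdf, temp_2.pdf, ... names for `count` pages."""
    return [os.path.join(directory, f"temp_{i}.pdf") for i in range(1, count + 1)]


def merge_with_cleanup(paths: Iterable[str], merge: Callable[[list[str]], None]) -> None:
    """Run `merge` over the temp files, then delete them even if it raises."""
    paths = list(paths)
    try:
        merge(paths)
    finally:
        for path in paths:
            if os.path.exists(path):
                os.remove(path)


if __name__ == "__main__":
    names = temp_pdf_paths(2)
    for name in names:
        open(name, "wb").close()  # stand-ins for rendered pages
    merge_with_cleanup(names, lambda ps: print("merging", ps))
```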
I accept no responsibility for misuse of this tool in any form or by any means. You alone are responsible for how you use it.
- Use this software only in ways that comply with applicable laws, website terms of service, and the copyright or licensing rules of the sites you fetch content from.
- The authors and contributors are not liable for any direct or indirect damage, loss, legal claims, or other consequences arising from use or misuse of this project.
- This tool is provided "as is", without warranty of any kind (including fitness for a particular purpose or non-infringement).
- Nothing in this README grants permission to copy, distribute, or convert material you are not allowed to access or reproduce.
Caution
If you are unsure whether your use is allowed, do not use this tool for that purpose.
AkiraOmran — built for HU students.