HU Book Extractor

Convert web-hosted EPUB-style books into a single, clean PDF — right from your terminal.

✨ Features

| Feature | Description |
| --- | --- |
| 🔗 Smart URL handling | Automatically normalises URLs: strips `index.html`, ensures a trailing slash, and validates reachability |
| 📖 EPUB metadata parsing | Reads the book's `package.xml` (OPF manifest) to discover the title and every XHTML page |
| 🖨️ Headless PDF rendering | Renders each page to A4 PDF via Playwright's Chromium engine, with no visible browser window |
| 📑 Single-file merge | Combines all page PDFs into one output file using PyPDF2, named after the book title |
| 🧹 Auto cleanup | Temporary per-page PDFs are deleted automatically after the final merge |
| 🎨 Coloured CLI output | ANSI-coloured progress messages with emoji indicators for a clear, friendly experience |
| ⚡ CLI & interactive modes | Pass a URL via the `--url` flag or enter it interactively when prompted |

🏗️ How It Works

┌──────────────┐     ┌────────────────┐     ┌──────────────────┐     ┌──────────────┐
│  User enters │────▶│  Validate URL  │────▶│  Fetch metadata  │────▶│  Discover    │
│  book URL    │     │  & normalise   │     │  (package.xml)   │     │  XHTML pages │
└──────────────┘     └────────────────┘     └──────────────────┘     └──────┬───────┘
                                                                           │
                     ┌────────────────┐     ┌──────────────────┐           │
                     │  Output final  │◀────│  Merge all PDFs  │◀──────────┘
                     │  <title>.pdf   │     │  (PyPDF2)        │  Render each page
                     └────────────────┘     └──────────────────┘  via Playwright

1. URL validation: The scraper normalises the input URL, confirms the host is reachable (HTTP 200), and checks that the EPUB structure (`epub/EPUB/package.xml`) exists.
2. Metadata extraction: Parses the OPF `package.xml` using `xml.etree.ElementTree` to pull the `<dc:title>` (used as the output filename) and the full list of `application/xhtml+xml` items from the `<manifest>`.
3. Page rendering: For every discovered XHTML page, Playwright launches headless Chromium, navigates to the page URL, waits for `networkidle`, and prints it to a temporary A4 PDF.
4. Merge & cleanup: All temporary PDFs are appended in order with `PdfMerger`, written as `<BookTitle>.pdf`, and the temp files are deleted.
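
The metadata extraction step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `parse_package`, the `"book"` fallback title, and the `SAMPLE` document are assumptions made for the example. Only the use of `xml.etree.ElementTree`, the `<dc:title>` lookup, and the `application/xhtml+xml` filter come from the description above.

```python
import xml.etree.ElementTree as ET

# Standard namespaces used by EPUB package (OPF) documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_package(xml_text: str) -> tuple[str, list[str]]:
    """Return (title, xhtml_hrefs) from an OPF package document (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    # <dc:title> lives inside <metadata>; fall back to a placeholder if it is missing.
    title_el = root.find("opf:metadata/dc:title", NS)
    title = (title_el.text or "").strip() if title_el is not None else "book"

    # Keep only manifest items that are XHTML pages, in manifest order.
    pages = [
        item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml" and item.get("href")
    ]
    return title, pages


# Made-up package document used only for this demonstration.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Introduction to Computer Science</dc:title>
  </metadata>
  <manifest>
    <item id="c1" href="chapter01.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="c2" href="chapter02.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""

title, pages = parse_package(SAMPLE)
print(title, pages)
```

In the EPUB format the reading order is formally defined by the `<spine>` element, so manifest order matches page order only when the book's publisher lists items that way.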

📋 Requirements

| Dependency | Version | Purpose |
| --- | --- | --- |
| Python | 3.10+ | Runtime |
| requests | ≥ 2.31.0 | HTTP requests for URL validation & metadata fetch |
| playwright | ≥ 1.40.0 | Headless Chromium browser for page-to-PDF rendering |
| PyPDF2 | ≥ 3.0.0 | Merging individual page PDFs into a single file |

Standard-library modules used: asyncio, argparse, urllib.parse, xml.etree.ElementTree, os, time, subprocess, sys.

🚀 Setup

Option A — Automated (`setup.py`)

python setup.py

This performs an editable install (pip install -e .) using requirements.txt, then installs Playwright's Chromium build — everything in one command.

Option B — Manual

python -m venv .venv

# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt
playwright install chromium

Option C — Editable package only

pip install -e .

Note

If you skip python setup.py, you must run playwright install chromium once yourself before first use.

💻 Usage

Interactive mode

python main.py

You will be prompted to enter the book URL, then confirm to start the conversion.

Direct URL mode

python main.py --url "https://example.com/path/to/book/"

| Flag | Short | Description |
| --- | --- | --- |
| `--url` | `-u` | URL of the book to extract (skips the interactive prompt) |
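
For illustration, the flag handling described above could be wired up with `argparse` along these lines. This is only a sketch: the helper names `build_parser` and `resolve_url`, and the injectable `prompt` parameter, are assumptions for the example and may not match what `main.py` actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented --url / -u flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Convert a web-hosted EPUB-style book into a single PDF.",
    )
    parser.add_argument(
        "-u", "--url",
        help="URL of the book to extract (skips the interactive prompt)",
    )
    return parser


def resolve_url(argv: list[str], prompt=input) -> str:
    """Use --url when given, otherwise fall back to an interactive prompt."""
    args = build_parser().parse_args(argv)
    if args.url:
        return args.url
    return prompt("🔗 Enter Book URL:\n> ").strip()


# Direct URL mode: no prompt is shown because --url is present.
print(resolve_url(["--url", "https://example.com/path/to/book/"]))
```

Passing the prompt function as a parameter keeps the interactive path easy to exercise in tests without touching standard input.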

Example session

╔══════════════════════════════════════╗
║        📚 HU Book Extractor          ║
╚══════════════════════════════════════╝

🔗 Enter Book URL:
> https://example.com/books/my-textbook/

⏳ Validating URL...
✅ URL validated https://example.com/books/my-textbook/
📦 Fetching metadata...
📄 Book Title: Introduction to Computer Science
📚 Found 12 pages

🚀 Start conversion? (y/n): y

🚀 Starting conversion...
📥 Converting page 1/12: chapter01.xhtml
📥 Converting page 2/12: chapter02.xhtml
...
🎉 Done successfully! Saved as: Introduction to Computer Science.pdf

📂 Project Layout

HUBook/
├── main.py                  # CLI entry point (interactive & argparse modes)
├── setup.py                 # Automated setup: editable install + Playwright Chromium
├── requirements.txt         # Third-party pip dependencies
├── lib/
│   ├── __init__.py          # Package marker
│   └── ebooks_scraper.py    # Core EbookScraper class (validate, parse, render, merge)
└── src/
    ├── logo.svg             # Square project logo
    └── banner.svg           # README banner graphic

Key components

| Module | Class / Function | Responsibility |
| --- | --- | --- |
| `main.py` | `Colors` | ANSI colour codes for terminal output |
| `main.py` | `run(base_url)` | Async orchestration: validate → extract metadata → convert → save |
| `main.py` | `argparse` setup | Parses `--url` / `-u` flag for non-interactive use |
| `lib/ebooks_scraper.py` | `EbookScraper` | All scraping logic encapsulated in one class |
| ↳ | `.normalize_url()` | Strips `index.html`, ensures trailing `/`, cleans query/fragment |
| ↳ | `.validate_url()` | Confirms HTTP 200 for both the root URL and the `package.xml` |
| ↳ | `.PDFnameExtractor()` | Parses OPF metadata to get `<dc:title>` |
| ↳ | `.PagesExtractor()` | Reads `<manifest>` items with `media-type="application/xhtml+xml"` |
| ↳ | `.html_to_pdf()` | Launches headless Chromium, renders a single page to A4 PDF |
| ↳ | `.convert_all()` | Iterates all pages, renders each, merges with `PdfMerger`, cleans up |
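
To make the behaviour listed for `.normalize_url()` more concrete, here is a minimal sketch built on `urllib.parse`. It is written as a standalone function for the example and is not the real method body. The exact rules are also assumptions, for instance that only a final `index.html` path segment is stripped.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Strip a final index.html segment, drop query/fragment, ensure a trailing slash."""
    parts = urlsplit(url.strip())

    # Split the path into "everything before the last slash" and the last segment.
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last == "index.html" else parts.path

    if not path.endswith("/"):
        path += "/"

    # Rebuild the URL with an empty query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(normalize_url("https://example.com/books/my-textbook/index.html?x=1#top"))
```

Normalising to a directory-style URL means that a relative path such as `epub/EPUB/package.xml` can be appended or joined without worrying about a missing slash.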

⚠️ Notes

The target site must serve the book in the EPUB-over-HTTP structure this tool expects (epub/EPUB/package.xml with an OPF manifest). Unsupported layouts will fail during validation.
Temporary PDFs (temp_1.pdf, temp_2.pdf, …) are created in the current working directory during conversion and are automatically removed after the merge completes.
On Windows, the CLI uses ANSI escape codes for colour — use Windows Terminal, VS Code terminal, or another modern terminal for proper display.
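
As an illustration of the temporary-file note above, the naming and cleanup could be handled roughly like this. The helper names `temp_pdf_paths` and `remove_temp_files` are hypothetical, and wrapping the work in `try`/`finally` is a design suggestion for this sketch, not a description of how `.convert_all()` is written.

```python
import os
import tempfile


def temp_pdf_paths(page_count: int, directory: str = ".") -> list[str]:
    """Return temp_1.pdf ... temp_N.pdf inside `directory` (hypothetical helper)."""
    return [os.path.join(directory, f"temp_{i}.pdf") for i in range(1, page_count + 1)]


def remove_temp_files(paths: list[str]) -> int:
    """Delete every path that exists and return how many files were removed."""
    removed = 0
    for path in paths:
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            pass  # this page was never rendered, so there is nothing to clean up
    return removed


# Demo in a throwaway directory so the current working directory is untouched.
with tempfile.TemporaryDirectory() as workdir:
    paths = temp_pdf_paths(3, workdir)
    try:
        for path in paths[:2]:  # pretend page 3 failed to render
            with open(path, "wb") as fh:
                fh.write(b"%PDF-1.4\n")
        # ... the merge step would run here ...
    finally:
        removed = remove_temp_files(paths)  # runs even if rendering or merging raised
    leftover = os.listdir(workdir)

print(removed, leftover)
```

Running cleanup in a `finally` block means an interrupted or failed conversion does not leave stray `temp_*.pdf` files behind.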

⚖️ Disclaimer

I accept no responsibility for misuse of this tool in any form or by any means. You alone are responsible for how you use it.

Use this software only in ways that comply with applicable laws, the terms of service of the sites you fetch content from, and any copyright or licensing rules that apply to that content.
The authors and contributors are not liable for any direct or indirect damage, loss, legal claims, or other consequences arising from use or misuse of this project.
This tool is provided "as is", without warranty of any kind (including fitness for a particular purpose or non-infringement).
Nothing in this README grants permission to copy, distribute, or convert material you are not allowed to access or reproduce.

Caution

If you are unsure whether your use is allowed, do not use this tool for that purpose.

👤 Author

AkiraOmran — built for HU students.

Made with ❤️ for HU students
