Abdo-omran2206/HU_Book_Extractor


HU Book Extractor

Convert web-hosted EPUB-style books into a single, clean PDF — right from your terminal.


✨ Features

| Feature | Description |
| --- | --- |
| 🔗 Smart URL handling | Automatically normalises URLs: strips index.html, ensures trailing slashes, and validates reachability |
| 📖 EPUB metadata parsing | Reads the book's package.xml (OPF manifest) to discover the title and every XHTML page |
| 🖨️ Headless PDF rendering | Renders each page to A4 PDF via Playwright's Chromium engine, with no visible browser window |
| 📑 Single-file merge | Combines all page PDFs into one output file using PyPDF2, named after the book title |
| 🧹 Auto cleanup | Temporary per-page PDFs are deleted automatically after the final merge |
| 🎨 Coloured CLI output | ANSI-coloured progress messages with emoji indicators for a clear, friendly experience |
| CLI & interactive modes | Pass a URL via the --url flag or enter it interactively when prompted |

🏗️ How It Works

┌──────────────┐     ┌────────────────┐     ┌──────────────────┐     ┌──────────────┐
│  User enters │────▶│  Validate URL  │────▶│  Fetch metadata  │────▶│  Discover    │
│  book URL    │     │  & normalise   │     │  (package.xml)   │     │  XHTML pages │
└──────────────┘     └────────────────┘     └──────────────────┘     └──────┬───────┘
                                                                           │
                     ┌────────────────┐     ┌──────────────────┐           │
                     │  Output final  │◀────│  Merge all PDFs  │◀──────────┘
                     │  <title>.pdf   │     │  (PyPDF2)        │  Render each page
                     └────────────────┘     └──────────────────┘  via Playwright
  1. URL validation — The scraper normalises the input URL, confirms the host is reachable (HTTP 200), and checks that the EPUB structure (epub/EPUB/package.xml) exists.
  2. Metadata extraction — Parses the OPF package.xml using xml.etree.ElementTree to pull the <dc:title> (used as the output filename) and the full list of application/xhtml+xml items from the <manifest>.
  3. Page rendering — For every discovered XHTML page, Playwright launches headless Chromium, navigates to the page URL, waits for networkidle, and prints it to a temporary A4 PDF.
  4. Merge & cleanup — All temporary PDFs are appended in order with PdfMerger, written as <BookTitle>.pdf, and the temp files are deleted.
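
The metadata step above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `parse_package`, the `"book"` fallback title, and the sample document are assumptions made for the example. It follows the README's description of reading page order from the `<manifest>`; note that the EPUB specification formally defines reading order in the `<spine>`, so a stricter parser might consult that instead.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OPF package documents and Dublin Core metadata.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_package(xml_text: str) -> tuple[str, list[str]]:
    """Return (title, xhtml_hrefs) from an OPF package document.

    Hypothetical helper: the real tool splits this work between
    PDFnameExtractor() and PagesExtractor().
    """
    root = ET.fromstring(xml_text)

    # <dc:title> becomes the output filename; "book" is an assumed fallback.
    title_el = root.find(".//dc:title", NS)
    title = (title_el.text or "").strip() if title_el is not None else ""
    title = title or "book"

    # Keep only manifest items that are XHTML pages, in document order.
    pages = [
        item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml" and item.get("href")
    ]
    return title, pages


# A tiny made-up package document for demonstration.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Introduction to Computer Science</dc:title>
  </metadata>
  <manifest>
    <item id="c1" href="chapter01.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="c2" href="chapter02.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""

title, pages = parse_package(SAMPLE)
print(title, pages)
```

The stylesheet entry is skipped because its media type is not `application/xhtml+xml`, so only the two chapter files are returned.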

📋 Requirements

| Dependency | Version | Purpose |
| --- | --- | --- |
| Python | 3.10+ | Runtime |
| requests | ≥ 2.31.0 | HTTP requests for URL validation & metadata fetch |
| playwright | ≥ 1.40.0 | Headless Chromium browser for page-to-PDF rendering |
| PyPDF2 | ≥ 3.0.0 | Merging individual page PDFs into a single file |

Standard-library modules used: asyncio, argparse, urllib.parse, xml.etree.ElementTree, os, time, subprocess, sys.


🚀 Setup

Option A — Automated (setup.py)

python setup.py

This performs an editable install (pip install -e .) using requirements.txt, then installs Playwright's Chromium build — everything in one command.

Option B — Manual

python -m venv .venv

# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt
playwright install chromium

Option C — Editable package only

pip install -e .

Note

If you skip python setup.py, you must run playwright install chromium once yourself before first use.


💻 Usage

Interactive mode

python main.py

You will be prompted to enter the book URL, then confirm to start the conversion.

Direct URL mode

python main.py --url "https://example.com/path/to/book/"

| Flag | Short | Description |
| --- | --- | --- |
| --url | -u | URL of the book to extract (skips the interactive prompt) |
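
For readers curious how the two modes can share one entry point, here is a small sketch using `argparse`. The helper names `build_parser` and `resolve_url`, the description string, and the prompt text are assumptions for illustration; only the `--url` / `-u` flag itself comes from this README.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented --url / -u flag."""
    parser = argparse.ArgumentParser(
        description="Convert a web-hosted book into a single PDF."
    )
    parser.add_argument(
        "-u", "--url",
        help="URL of the book to extract (skips the interactive prompt)",
    )
    return parser


def resolve_url(argv: list[str], prompt: Callable[[str], str] = input) -> str:
    """Return the URL from the flag if given, otherwise ask the user.

    `prompt` is injectable so the interactive path can be exercised
    without a real terminal.
    """
    args = build_parser().parse_args(argv)
    if args.url:
        return args.url.strip()
    return prompt("Enter Book URL: ").strip()


# Direct URL mode: the flag wins and no prompt is shown.
print(resolve_url(["--url", "https://example.com/path/to/book/"]))
```

In a real script, `resolve_url(sys.argv[1:])` would fall back to `input()` when the flag is absent.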

Example session

╔══════════════════════════════════════╗
║        📚 HU Book Extractor          ║
╚══════════════════════════════════════╝

🔗 Enter Book URL:
> https://example.com/books/my-textbook/

⏳ Validating URL...
✅ URL validated https://example.com/books/my-textbook/
📦 Fetching metadata...
📄 Book Title: Introduction to Computer Science
📚 Found 12 pages

🚀 Start conversion? (y/n): y

🚀 Starting conversion...
📥 Converting page 1/12: chapter01.xhtml
📥 Converting page 2/12: chapter02.xhtml
...
🎉 Done successfully! Saved as: Introduction to Computer Science.pdf

📂 Project Layout

HUBook/
├── main.py                  # CLI entry point (interactive & argparse modes)
├── setup.py                 # Automated setup: editable install + Playwright Chromium
├── requirements.txt         # Third-party pip dependencies
├── lib/
│   ├── __init__.py          # Package marker
│   └── ebooks_scraper.py    # Core EbookScraper class (validate, parse, render, merge)
└── src/
    ├── logo.svg             # Square project logo
    └── banner.svg           # README banner graphic

Key components

| Module | Class / Function | Responsibility |
| --- | --- | --- |
| main.py | Colors | ANSI colour codes for terminal output |
| main.py | run(base_url) | Async orchestration: validate → extract metadata → convert → save |
| main.py | argparse setup | Parses --url / -u flag for non-interactive use |
| lib/ebooks_scraper.py | EbookScraper | All scraping logic encapsulated in one class |
| lib/ebooks_scraper.py | .normalize_url() | Strips index.html, ensures trailing /, cleans query/fragment |
| lib/ebooks_scraper.py | .validate_url() | Confirms HTTP 200 for both the root URL and the package.xml |
| lib/ebooks_scraper.py | .PDFnameExtractor() | Parses OPF metadata to get <dc:title> |
| lib/ebooks_scraper.py | .PagesExtractor() | Reads <manifest> items with media-type="application/xhtml+xml" |
| lib/ebooks_scraper.py | .html_to_pdf() | Launches headless Chromium, renders a single page to A4 PDF |
| lib/ebooks_scraper.py | .convert_all() | Iterates all pages, renders each, merges with PdfMerger, cleans up |
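
As an illustration of the URL handling listed above, the following sketch shows one way to normalise a book URL and derive the package.xml location with `urllib.parse`. The function names `normalize_book_url` and `package_url` are hypothetical, and the exact rules in `EbookScraper.normalize_url()` may differ; only the behaviour described in the table (strip index.html, ensure a trailing slash, drop query and fragment) and the `epub/EPUB/package.xml` path are taken from this README.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_book_url(url: str) -> str:
    """Strip a trailing index.html, drop query/fragment, ensure a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path
    if path.endswith("index.html"):
        path = path[: -len("index.html")]
    if not path.endswith("/"):
        path += "/"
    # Empty strings discard the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def package_url(base_url: str) -> str:
    """Build the OPF location the tool expects under the book root."""
    return urljoin(normalize_book_url(base_url), "epub/EPUB/package.xml")


print(normalize_book_url("https://example.com/books/my-textbook/index.html?page=3#top"))
print(package_url("https://example.com/books/my-textbook"))
```

Because the base always ends in a slash after normalisation, `urljoin` appends the relative path instead of replacing the last path segment.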

⚠️ Notes

  • The target site must serve the book in the EPUB-over-HTTP structure this tool expects (epub/EPUB/package.xml with an OPF manifest). Unsupported layouts will fail during validation.
  • Temporary PDFs (temp_1.pdf, temp_2.pdf, …) are created in the current working directory during conversion and are automatically removed after the merge completes.
  • On Windows, the CLI uses ANSI escape codes for colour — use Windows Terminal, VS Code terminal, or another modern terminal for proper display.
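
The render, merge, and temporary-file behaviour described in this README could be sketched roughly as follows. This is an assumption-laden outline rather than the project's implementation: the function names are hypothetical, Playwright and PyPDF2 are imported lazily inside the functions that need them, and the `try`/`finally` cleanup (which also removes temp files when a page fails) is a design choice of this sketch, whereas the README only states that files are removed after the merge completes.

```python
import os


def temp_pdf_names(count: int) -> list[str]:
    """temp_1.pdf, temp_2.pdf, ... in the current working directory."""
    return [f"temp_{i}.pdf" for i in range(1, count + 1)]


async def render_page(url: str, out_path: str) -> None:
    """Print one page to an A4 PDF with headless Chromium."""
    from playwright.async_api import async_playwright  # third-party

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Wait until network activity settles before printing.
            await page.goto(url, wait_until="networkidle")
            await page.pdf(path=out_path, format="A4")
        finally:
            await browser.close()


def merge_pdfs(paths: list[str], output: str) -> None:
    """Append PDFs in order and write a single output file."""
    from PyPDF2 import PdfMerger  # third-party

    merger = PdfMerger()
    try:
        for path in paths:
            merger.append(path)
        merger.write(output)
    finally:
        merger.close()


def remove_temp_files(paths: list[str]) -> int:
    """Delete the temp files that exist; return how many were removed."""
    removed = 0
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
            removed += 1
    return removed


async def convert_pages(page_urls: list[str], output: str) -> None:
    """Render every page, merge the results, then clean up."""
    temp_paths = temp_pdf_names(len(page_urls))
    try:
        for url, path in zip(page_urls, temp_paths):
            await render_page(url, path)
        merge_pdfs(temp_paths, output)
    finally:
        remove_temp_files(temp_paths)
```

With the dependencies installed, a call such as `asyncio.run(convert_pages(urls, "Book.pdf"))` would drive the whole pipeline. Launching one browser and reusing it across pages would likely be faster; the per-page launch here mirrors the description in How It Works.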

⚖️ Disclaimer

I accept no responsibility for misuse of this tool in any form or by any means. You alone are responsible for how you use it.

  • Use this software only in ways that comply with applicable laws, website terms of service, and copyright or licensing rules where you fetch content from.
  • The authors and contributors are not liable for any direct or indirect damage, loss, legal claims, or other consequences arising from use or misuse of this project.
  • This tool is provided "as is", without warranty of any kind (including fitness for a particular purpose or non-infringement).
  • Nothing in this README grants permission to copy, distribute, or convert material you are not allowed to access or reproduce.

Caution

If you are unsure whether your use is allowed, do not use this tool for that purpose.


👤 Author

AkiraOmran — built for HU students.

Made with ❤️ for HU students

About

HU Book Extractor is a Python tool that converts online books hosted as HTML pages into a single PDF file. The tool fetches all pages in order and merges them into a ready-to-read PDF, making it ideal for educational or reference materials.
