akaTatago/BPFetcher

BPFetcher

Built with: Python, curl-cffi, Playwright, pandas

A web scraping CLI tool for price comparison across Portuguese bookstores.

Scrapes book prices and availability from Wook, Bertrand, Fnac, and Almedina using ISBN or text search.

Getting Started

Clone the repository

git clone https://github.com/akaTatago/BPFetcher.git
cd BPFetcher

Install dependencies

pip install -r requirements.txt

Install Playwright browsers (required for Fnac scraping)

playwright install chromium

Run the scraper

python -m src.main input.csv --output results.csv --stores all

Features

  • Multiple Search Modes: Search by ISBN-13 or by Title + Author
  • Multi-Store Support: Scrapes Wook, Bertrand, Fnac, and Almedina
  • Concurrent Scraping: Fast parallel processing for non-browser scrapers
  • Smart Matching: Validates book matches using normalized titles and authors
  • Price Tracking: Detects sale prices and availability status
  • CSV Export: Clean, structured output with all store data
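
As an illustration of the Smart Matching feature, here is a minimal sketch of how titles and authors could be normalized before comparison. The function names and the matching rule (title containment plus one shared author name token) are assumptions made for this example, not the tool's actual logic.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Turn every run of non-alphanumeric characters into a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return cleaned.strip()


def is_match(wanted_title, wanted_author, found_title, found_author) -> bool:
    """Hypothetical rule: one title contains the other, and the authors
    share at least one name token longer than two characters."""
    t1, t2 = normalize(wanted_title), normalize(found_title)
    if not t1 or not t2:
        return False
    title_ok = t1 in t2 or t2 in t1
    # Ignore initials such as "j" and "d" when comparing author names.
    a1 = {tok for tok in normalize(wanted_author).split() if len(tok) > 2}
    a2 = {tok for tok in normalize(found_author).split() if len(tok) > 2}
    return title_ok and bool(a1 & a2)
```

A rule like this tolerates differences in accents, punctuation, and name order ("Salinger, J. D." versus "J.D. Salinger"), at the cost of occasional false positives on very short titles.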

Usage

Basic Usage

python -m src.main input.csv

Search by ISBN (default)

python -m src.main books.csv --mode isbn --output results.csv

Your CSV should contain an ISBN13 column:

ISBN13         Title               Author
9780316769174  The Catcher in...   J.D. Salinger

Search by Text

python -m src.main books.csv --mode text --output results.csv

Your CSV should contain Title and Author columns:

Title                   Author
The Catcher in the Rye  J.D. Salinger
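
The column requirements for the two modes can be checked up front. The sketch below uses only the standard library and assumes a comma-delimited file with a header row; `load_books` and `REQUIRED_COLUMNS` are hypothetical names, and the real loader in csv_helper.py may work differently.

```python
import csv
import io

# Columns each search mode needs, as described above.
REQUIRED_COLUMNS = {
    "isbn": ["ISBN13"],
    "text": ["Title", "Author"],
}


def load_books(handle, mode="isbn"):
    """Read book rows from an open CSV file and verify the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS[mode] if col not in header]
    if missing:
        raise ValueError(
            f"input CSV is missing column(s) for mode '{mode}': "
            + ", ".join(missing)
        )
    return list(reader)


# Usage with an in-memory file standing in for books.csv:
sample = io.StringIO(
    "ISBN13,Title,Author\n"
    "9780316769174,The Catcher in the Rye,J.D. Salinger\n"
)
books = load_books(sample, mode="isbn")
```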

Select Specific Stores

python -m src.main input.csv --stores wook fnac

Available stores: wook, bertrand, fnac, almedina, or all (default)
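
For reference, the command-line options shown in this section could be declared with argparse roughly as follows. This is a sketch, not the project's actual parser: the flag names and the data/results.csv default come from this README, while `build_parser`, `resolve_stores`, and the program name are assumptions.

```python
import argparse

STORES = ["wook", "bertrand", "fnac", "almedina"]


def build_parser():
    parser = argparse.ArgumentParser(
        prog="bpfetcher",  # hypothetical program name
        description="Compare book prices across Portuguese bookstores.",
    )
    parser.add_argument("input", help="CSV file listing the books to look up")
    parser.add_argument("--mode", choices=["isbn", "text"], default="isbn")
    parser.add_argument("--output", default="data/results.csv")
    parser.add_argument(
        "--stores", nargs="+", choices=STORES + ["all"], default=["all"]
    )
    return parser


def resolve_stores(selected):
    """Expand 'all' and drop duplicates while keeping the given order."""
    if "all" in selected:
        return list(STORES)
    return list(dict.fromkeys(selected))


args = build_parser().parse_args(["input.csv", "--stores", "wook", "fnac"])
stores = resolve_stores(args.stores)
```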


Scrapers

Request-Based Scrapers

Wook, Bertrand, and Almedina use curl-cffi for fast HTTP requests with browser impersonation.

  • Randomized delays to avoid rate limiting
  • Concurrent execution via ThreadPoolExecutor
  • Efficient parsing with BeautifulSoup
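
The combination of randomized delays and ThreadPoolExecutor described above can be sketched with the standard library alone. Here `fetch` is a stand-in callable supplied by the caller; the function names, delay bounds, and worker count are assumptions for illustration.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def polite_fetch(fetch, query, min_delay=0.5, max_delay=2.0):
    """Sleep for a random interval, then run the request."""
    time.sleep(random.uniform(min_delay, max_delay))
    return fetch(query)


def scrape_all(fetch, queries, max_workers=4, min_delay=0.5, max_delay=2.0):
    """Run fetch(query) for every query in parallel.

    Returns a dict mapping each query to its result, or to the exception
    it raised, so one failed lookup does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(polite_fetch, fetch, q, min_delay, max_delay): q
            for q in queries
        }
        for future in as_completed(futures):
            query = futures[future]
            try:
                results[query] = future.result()
            except Exception as exc:
                results[query] = exc
    return results
```

In the real tool, `fetch` would presumably wrap a curl-cffi request (for example `curl_cffi.requests.get(url, impersonate="chrome")`) followed by BeautifulSoup parsing; that part is left out so the sketch has no third-party dependencies.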

Browser-Based Scraper

Fnac uses Playwright due to anti-bot protection:

  • Headless Chromium browser
  • Automatic CAPTCHA detection (manual solve required)
  • Cookie consent handling
  • Stealth mode with webdriver detection disabled
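
A browser-based fetch along these lines might look like the sketch below. It uses Playwright's synchronous API; the CAPTCHA marker string, the function names, and the specific stealth tweaks are assumptions rather than the project's actual code, and cookie consent handling is not covered.

```python
# Hypothetical marker: real detection would depend on the page's markup.
CAPTCHA_MARKERS = ("captcha",)


def looks_like_captcha(html: str) -> bool:
    """Return True if the page source contains a known CAPTCHA marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_page(url: str, headless: bool = True) -> str:
    """Load a page in Chromium and return its HTML."""
    # Imported lazily so looks_like_captcha works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=headless,
            # Ask Chromium not to advertise that it is automated.
            args=["--disable-blink-features=AutomationControlled"],
        )
        try:
            context = browser.new_context()
            # Hide navigator.webdriver before any page script runs.
            context.add_init_script(
                "Object.defineProperty(navigator, 'webdriver',"
                " {get: () => undefined})"
            )
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            html = page.content()
        finally:
            browser.close()
    if looks_like_captcha(html):
        raise RuntimeError("CAPTCHA page detected; manual solve required")
    return html
```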

Output Format

ISBN Mode

Creates one row per book with columns for each store:

Title, Author, Wook Status, Wook Price, Wook On Sale, Wook Link, Bertrand Status, ...

Text Mode

Creates multiple rows per book (one per matching result):

Title, Author, Store, Title Found, Author Found, Status, Price, On Sale, Link
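
The difference between the two layouts can be shown with plain dictionaries. The column names below follow the headers listed above; the helper names and the sample values are made up for illustration.

```python
STORE_FIELDS = ["Status", "Price", "On Sale", "Link"]


def isbn_mode_row(book, results_by_store):
    """Build one wide row: a group of columns per store."""
    row = {"Title": book["Title"], "Author": book["Author"]}
    for store, result in results_by_store.items():
        for field in STORE_FIELDS:
            row[f"{store} {field}"] = result.get(field, "")
    return row


def text_mode_rows(book, matches):
    """Build one row per matching result, repeating the searched book."""
    return [
        {"Title": book["Title"], "Author": book["Author"], **match}
        for match in matches
    ]


book = {"Title": "The Catcher in the Rye", "Author": "J.D. Salinger"}

# Illustrative values only, not real scraped data.
wide = isbn_mode_row(
    book,
    {"Wook": {"Status": "Available", "Price": "10.00",
              "On Sale": False, "Link": "https://example.com/wook"}},
)
long_rows = text_mode_rows(
    book,
    [{"Store": "Wook", "Title Found": "The Catcher in the Rye",
      "Author Found": "J. D. Salinger", "Status": "Available",
      "Price": "10.00", "On Sale": False,
      "Link": "https://example.com/wook"}],
)
```

Either list of dictionaries can then be written with `csv.DictWriter` or passed to `pandas.DataFrame`.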

Architecture

BPFetcher/
├── main.py                 # CLI entry point
├── requirements.txt
├── src/
│   ├── scrapers/
│   │   ├── base_scraper.py    # Abstract base class
│   │   ├── wook.py            # Wook scraper
│   │   ├── bertrand.py        # Bertrand scraper
│   │   ├── fnac.py            # Fnac scraper (Playwright)
│   │   └── almedina.py        # Almedina scraper
│   └── utils/
│       ├── scraping_helper.py # Shared scraping utilities
│       └── csv_helper.py      # CSV loading/saving
└── data/
    └── results.csv         # Default output location
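
To show how an abstract base class lets each store plug into the same pipeline, here is a minimal sketch. The method names `search_isbn` and `search_text` and the result shapes are assumptions; base_scraper.py may define a different interface.

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Hypothetical common interface for all store scrapers."""

    name = "base"

    @abstractmethod
    def search_isbn(self, isbn: str) -> dict:
        """Return status, price, and link for a single ISBN."""

    @abstractmethod
    def search_text(self, title: str, author: str) -> list:
        """Return a list of candidate results for a title and author."""


class FakeScraper(BaseScraper):
    """Stand-in store used only to demonstrate the interface."""

    name = "fake"

    def search_isbn(self, isbn: str) -> dict:
        return {"Status": "Not found", "Price": "", "On Sale": False, "Link": ""}

    def search_text(self, title: str, author: str) -> list:
        return []


scraper = FakeScraper()
result = scraper.search_isbn("9780316769174")
```

Because `BaseScraper` cannot be instantiated directly, a store class that forgets to implement one of the methods fails when it is constructed instead of partway through a run.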

About

Python-based web scraper for automated book price extraction. Processes book lists from .csv files, scrapes data from bookstores, and generates price reports using Pandas.
