Web Scrapers

Programs to scrape websites using the Scrapy package in Python.

Requirements

Python with the Scrapy package installed (pip install scrapy).

Goodreads Reviews Wayback Spider

The spider located in the goodreadsreviewswayback folder scrapes the Internet Archive Wayback Machine for archived Goodreads book reviews.

Using the spider

  1. To start, you need a list of Wayback Machine URLs to scrape. You can use the included example wayback_goodreads_urls.txt, or use goodreads_urls_to_wayback_urls.py to generate your own list of Wayback Machine URLs from a list of Goodreads book URLs like the one provided in goodreads_urls_sample.txt (a sketch of this kind of conversion appears after this list).
  2. Clone or download this repo. If desired, replace wayback_goodreads_urls.txt with your own URLs.
  3. Navigate to the goodreadsreviewswayback folder in a command prompt, then enter scrapy crawl goodreadsreviewswayback -o goodreadsreviewswayback.json to run the spider. You can replace the .json filename with any descriptive name; the spider will start crawling as soon as you enter the command.
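
If you are curious what the conversion in step 1 involves, the following is a minimal sketch that uses the Wayback Machine availability API. It is not the bundled goodreads_urls_to_wayback_urls.py, and the repo's script may work differently; only the input and output file names are taken from the steps above.

# Hedged sketch (not the bundled script): look up the closest Wayback Machine
# snapshot for each Goodreads book URL via the public availability API and
# write the snapshot URLs to a text file, one per line.
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available?url={}"

def closest_snapshot(book_url):
    """Return the closest archived snapshot URL for book_url, or None."""
    query = AVAILABILITY_API.format(urllib.parse.quote(book_url, safe=""))
    with urllib.request.urlopen(query) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

if __name__ == "__main__":
    with open("goodreads_urls_sample.txt") as infile, \
         open("wayback_goodreads_urls.txt", "w") as outfile:
        for line in infile:
            url = line.strip()
            if not url:
                continue
            snapshot_url = closest_snapshot(url)
            if snapshot_url:
                outfile.write(snapshot_url + "\n")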

Output

When the spider is finished scraping, you will have two files: a log file (log.txt) and a JSON file containing a list of JSON objects in this format:

{
  "book_title": ["The Girl with the Dragon Tattoo (Millennium, #1)"],
  "book_author": ["Stieg Larsson", "Reg Keeland"],
  "book_author_url": ["https://www.goodreads.com/author/show/706255.Stieg_Larsson", "https://www.goodreads.com/author/show/2565625.Reg_Keeland"],
  "review_pub_date": ["Nov 09, 2015"],
  "book_genre": ["mystery"],
  "rating": [],
  "body": ["<div class=\"reviewText mediumText description readable\" itemprop=\"reviewBody\">\n"],
  "reading_progress": [],
  "reviewer_name": ["Luc\u00eda"],
  "reviewer_url": ["https://www.goodreads.com/user/show/4260544-luc-a"],
  "reviewer_bookshelves": ["dnf"],
  "reviewer_bookshelf_urls": ["https://www.goodreads.com/review/list/4260544-luc-a?shelf=dnf"],
  "review_likes": [],
  "review_likes_url": [],
  "url": "https://www.goodreads.com/review/show/1438215969?book_show_action=false&from_review_page=2",
  "site": "goodreadsreviews",
  "review_commenters": ["Gerardo", "Luc\u00eda"],
  "review_comments": ["<div class=\"comment u-anchorTarget\" id=\"comment_142601968\">\n", "<div class=\"comment u-anchorTarget\" id=\"comment_142607052\">\n"],
  "review_comment_urls": [
    "https://www.goodreads.com/review/show/1438215969?book_show_action=false&from_review_page=2#comment_142601968", 
    "https://www.goodreads.com/review/show/1438215969?book_show_action=false&from_review_page=2#comment_142607052"],
  "review_comment_dates": [
    "Nov 09, 2015 03:06PM",
    "Nov 09, 2015 05:12PM"],
  "review_commenter_urls": ["https://www.goodreads.com/user/show/40991072-gerardo", "https://www.goodreads.com/user/show/4260544-luc-a"],
  "review_comment_ids": ["comment_142601968", "comment_142607052"]
}

You can read this JSON file into a Python program for analysis.
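
For example, a minimal loader (assuming the output file name from step 3 and the field names shown in the sample record above) might look like this:

import json

# Load the spider output; the file name matches the command in step 3.
with open("goodreadsreviewswayback.json", encoding="utf-8") as f:
    reviews = json.load(f)  # a list of review dicts, one per scraped review

print(f"Loaded {len(reviews)} reviews")

# Most fields are lists (possibly empty), so guard before indexing into them.
for review in reviews[:5]:
    title = review["book_title"][0] if review["book_title"] else "unknown title"
    reviewer = review["reviewer_name"][0] if review["reviewer_name"] else "unknown reviewer"
    shelves = ", ".join(review["reviewer_bookshelves"]) or "no shelves"
    print(f"{reviewer} shelved '{title}' as: {shelves}")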

Goodreads Shelves Wayback Spider

The spider located in the goodreadsshelveswayback folder scrapes the Internet Archive Wayback Machine for archived Goodreads shelf counts like the ones you see here:

[Image: Goodreads shelf counts]

Using the spider

  1. To start, you need a list of Wayback Machine URLs to scrape. You can use the included example wayback_goodreads_urls.txt, or use goodreads_urls_to_wayback_urls.py to generate your own list of Wayback Machine URLs from a list of Goodreads book URLs like the one provided in goodreads_urls_sample.txt (the conversion sketch shown for the Reviews spider above applies here as well).
  2. Clone or download this repo. If desired, replace wayback_goodreads_urls.txt with your own URLs.
  3. Navigate to the goodreadsshelveswayback folder in a command prompt, then enter scrapy crawl goodreadsshelveswayback -o goodreadsshelveswayback.json to run the spider. You can replace the .json filename with any descriptive name; the spider will start crawling as soon as you enter the command.

Output

When the spider is finished scraping, you will have two files: a log file (log.txt) and a JSON file containing a list of JSON objects in this format:

{
  "book_title": [
    "Life Code: The New Rules For Winning in the Real World"
  ],
  "people": [
    "22 users",
    "19 users",
    "15 users"
  ],
  "url": "https://web.archive.org/web/20160826202815/http://www.goodreads.com/book/show/17155775-life-code",
  "shelf": [
    "Self Help",
    "Psychology",
    "Nonfiction"
  ],
  "book_url": [
  ],
  "site": "goodreadsshelveswayback",
  "total_shelves": 1,
  "people_url": [
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/shelf/users/17155775-life-code?shelf=self-help",
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/shelf/users/17155775-life-code?shelf=psychology",
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/shelf/users/17155775-life-code?shelf=non-fiction"
  ],
  "shelf_url": [
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/genres/self-help",
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/genres/psychology",
    "https://web.archive.org/web/20160826202815/http://www.goodreads.com/genres/non-fiction"
  ]
}

You can read this JSON file into a Python program like genre_clustering.py for analysis.
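
genre_clustering.py is the analysis script included with this repo; its implementation is not reproduced here. As a simpler stand-in, the sketch below just totals the per-shelf user counts across all scraped books, using the field names from the sample record above and the output file name from step 3:

import json
from collections import Counter

# Load the spider output; the file name matches the command in step 3.
with open("goodreadsshelveswayback.json", encoding="utf-8") as f:
    books = json.load(f)  # a list of shelf-count records, one per book

shelf_totals = Counter()
for book in books:
    # "shelf" holds shelf names and "people" holds strings like "22 users";
    # the two lists are parallel, so zip them together.
    for shelf, people in zip(book["shelf"], book["people"]):
        count = int(people.split()[0].replace(",", ""))
        shelf_totals[shelf] += count

# Print the ten most popular shelves across the whole scrape.
for shelf, total in shelf_totals.most_common(10):
    print(f"{shelf}: {total} users")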
