Feature Additions and Improvements: Goodreads Scraper #43

GrimmXoXo · 2024-05-14T12:19:48Z

Description

Hi there, I was working on a book recommendation system and needed data for my models. I came across your repository and liked the work, so I decided to improve on it and use it in my project. If it helps, I would also like to contribute these changes back to Goodreads Scraper.

Changes Made

List out the key changes made in this pull request, including any new features added, bugs fixed, or improvements made.

Added functionality to retrieve book IDs from Goodreads lists.
Implemented a feature to write the scraped book IDs to a database.
Added an option to export the extracted IDs to a text file, so the output can be fed to the existing scripts.
Added argparse to simplify running the script.
Updated get_books.py to ensure compatibility with the current layout of the Goodreads website.
Added instructions on how to use the script to the README.
Added argparser to requirements.txt as a precaution against errors on the user's side, even though argparse is part of the Python standard library.
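
For readers who want a feel for the database and text-file steps listed above, here is a minimal sketch. It is not the code from get_book_ids.py: the table name `book_ids`, the function names, and the `--db` and `--out` flags are assumptions made for illustration only.

```python
import argparse
import sqlite3
from contextlib import closing


def save_book_ids(db_path, book_ids):
    """Store book IDs in a SQLite table, silently skipping duplicates."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS book_ids (book_id TEXT PRIMARY KEY)"
        )
        conn.executemany(
            "INSERT OR IGNORE INTO book_ids (book_id) VALUES (?)",
            [(book_id,) for book_id in book_ids],
        )


def export_book_ids(db_path, txt_path):
    """Write one book ID per line and return how many were written."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT book_id FROM book_ids ORDER BY rowid"
        ).fetchall()
    with open(txt_path, "w", encoding="utf-8") as handle:
        for (book_id,) in rows:
            handle.write(book_id + "\n")
    return len(rows)


def main(argv=None):
    """Parse the (hypothetical) command-line flags and run the export."""
    parser = argparse.ArgumentParser(
        description="Export stored book IDs to a text file."
    )
    parser.add_argument("--db", required=True, help="path to the SQLite database")
    parser.add_argument("--out", required=True, help="text file to write")
    args = parser.parse_args(argv)
    count = export_book_ids(args.db, args.out)
    print(f"Wrote {count} book IDs to {args.out}")
    return count
```

Calling `main(["--db", "ids.db", "--out", "ids.txt"])` would export whatever `save_book_ids` stored earlier. Writing one ID per line is what lets the text file serve as input for a script that reads IDs line by line.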

Files Modified

List the files modified/added in this pull request.

get_books.py
get_book_ids.py
README.md
requirements.txt

Checklist

Ensure that the following tasks have been completed:

Updated the documentation to reflect the changes.
Added appropriate comments and docstrings in the code.
Ensured that the code passes all tests and errors are handled gracefully.
Tested the new functionality locally.
Verified that the README file includes usage instructions for the new features.

Preview

This is the output for a particular category/collection on Goodreads.

This is the output of the JSON files produced by get_books.py:

{
    "book_id": "320",
    "cover_image_uri": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1327881361i/320.jpg",
    "book_title": "One Hundred Years of Solitude",
    "top_5_other_editions": [],
    "format": ["417 pages, Mass Market Paperback"],
    "publication_info": ["First published January 1, 1967"],
    "authorlink": "https://www.goodreads.com/author/show/13450.Gabriel_Garc_a_M_rquez",
    "author": "Gabriel García Márquez",
    "num_pages": ["417"],
    "genres": ["Fiction", "Magical Realism", "Literature", "Fantasy", "Novels", "Historical Fiction", "Spanish Literature"],
    "num_ratings": "976523",
    "num_reviews": "46268",
    "average_rating": "4.11",
    "rating_distribution": {
        "5": "480,125",
        "4": "260,614",
        "3": "140,529",
        "2": "57,483",
        "1": "37,772"
    }
}
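
Note that the record above stores every count as a string, and the rating distribution uses thousands separators. As a hedged illustration of how a consumer might turn those fields into numbers, here is a small sketch; `to_int` and `normalize_book` are hypothetical helpers, not part of the scraper, and the sample is an excerpt of the record above.

```python
import json


def to_int(value):
    """Convert a count such as "480,125" or "976523" to an int."""
    return int(value.replace(",", ""))


def normalize_book(record):
    """Return a copy of a scraped book record with numeric fields parsed."""
    book = dict(record)
    book["num_ratings"] = to_int(record["num_ratings"])
    book["num_reviews"] = to_int(record["num_reviews"])
    book["average_rating"] = float(record["average_rating"])
    book["num_pages"] = [to_int(pages) for pages in record["num_pages"]]
    book["rating_distribution"] = {
        int(stars): to_int(count)
        for stars, count in record["rating_distribution"].items()
    }
    return book


# Excerpt of the record shown above.
raw = json.loads("""
{
    "book_id": "320",
    "num_pages": ["417"],
    "num_ratings": "976523",
    "num_reviews": "46268",
    "average_rating": "4.11",
    "rating_distribution": {
        "5": "480,125",
        "4": "260,614",
        "3": "140,529",
        "2": "57,483",
        "1": "37,772"
    }
}
""")

book = normalize_book(raw)
total_from_distribution = sum(book["rating_distribution"].values())
```

For this record the five star counts add up to 976,523, which matches `num_ratings`, so the same comparison could serve as a quick sanity check on scraped files.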

Functions added:
- Custom Scrape: Implemented custom scraping for specific Goodreads lists.
- Added arguments for easy script use. See ReadMe.md for detailed information on available arguments.
- Implemented functionality to export book IDs from the database to a text format.

The new features enhance the flexibility and usability of the script, allowing users to specify custom scraping parameters and export data in a more accessible format.

maria-antoniak · 2024-05-14T15:57:01Z

Hi there! This looks wonderful; thank you for all your work! I haven't had time to test yet but will try to do this ASAP. If we integrate your changes, would you like to be credited in the README? If so, how would you like to be credited (username, name, something else)? These are significant changes that we haven't had time to make ourselves, and I want to make sure you get the credit you deserve.

GrimmXoXo · 2024-05-14T18:27:00Z

> Hi there! This looks wonderful; thank you for all your work! I haven't had time to test yet but will try to do this ASAP. If we integrate your changes, would you like to be credited in the README? If so, how would you like to be credited (username, name, something else)? These are significant changes that we haven't had time to make ourselves, and I want to make sure you get the credit you deserve.

Hi! If this goes well, this will be my first contribution ^_^ I would love for my username (GrimmXoXo or GM) to appear in the credits, but please make sure that everything works well first.

- Implemented get_reviews.py, a script that extracts reviews for given book IDs from Goodreads.
- The script reads input from a text file containing book IDs and outputs the reviews in a SQLite database.
- Added a log file for easier debugging of the script.
- Updated requirements.txt to include langdetect, which filters out non-English reviews.
- Included an example on how to run the script in the README.md file.
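
The flow described for get_reviews.py (book IDs in from a text file, English-only filtering, reviews out to SQLite, problems written to a log) can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: the `reviews` table, every function name, and the logger name are hypothetical, and the language detector is passed in as a callable so the sketch runs without third-party packages.

```python
import logging
import sqlite3
from contextlib import closing

logger = logging.getLogger("get_reviews_sketch")  # hypothetical logger name


def configure_logging(log_path):
    """Send log records to a file so failed book IDs can be inspected later."""
    logging.basicConfig(filename=log_path, level=logging.INFO)


def read_book_ids(path):
    """Return the non-empty lines of a text file as book IDs."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def keep_english(reviews, detect):
    """Keep reviews whose detected language code is "en".

    `detect` is any callable mapping text to a language code. Reviews that
    cannot be classified are logged and dropped.
    """
    kept = []
    for text in reviews:
        try:
            if detect(text) == "en":
                kept.append(text)
        except Exception as exc:  # detectors may raise on empty or symbol-only text
            logger.warning("Could not detect language: %s", exc)
    return kept


def save_reviews(db_path, book_id, reviews):
    """Append the reviews for one book to a SQLite table."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reviews (book_id TEXT, review_text TEXT)"
        )
        conn.executemany(
            "INSERT INTO reviews (book_id, review_text) VALUES (?, ?)",
            [(book_id, text) for text in reviews],
        )
```

With langdetect installed, its `detect` function could be passed as the `detect` argument, since it takes a string and returns a language code such as "en".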

GrimmXoXo · 2024-05-15T17:06:54Z

That would be all, I think. I added a new script to fetch reviews, added a log file for reviews to help debug problems (the same could be added for book_id and book_details), and added a README inside the folder explaining how to use the new script, with a working example.

GrimmXoXo added 6 commits May 14, 2024 10:44

Added book_id retriever, This converts the data extracted to database (.db)

db609fd

Added function main() instead of manually using the script for the original file.

67055b9

Fixed non-working functions and added new features

4998af3

Modified the file to fix non-working functions and added new functionalities such as format info and publication info.

Updated Readme to include get_book_ids.py

c21db2f

Added Argparser for convenience sake, user wouldn't need to download extra library as it's a standard library module in Python

01aee79

Removed: top_5_editions (inside DOM). Handled: if a book ID doesn't exist (i.e. the book/page doesn't exist), the script skips to the next ID in the text file.

3ae2013

GrimmXoXo added 3 commits May 15, 2024 00:28

Syntax Error fixed

62cd18e

Added book_details(summary of books) and indent for json for readability

ef622ab

yuetongwu7 mentioned this pull request Jan 22, 2025

Datasets theinvisiblelab/invisible-books#2

Open
