Feature Additions and Improvements: Goodreads Scraper #43

Open: GrimmXoXo wants to merge 10 commits into master


GrimmXoXo

Description

Hi there, I was working on a book recommendation system project and needed data for my models. I came across your repository and liked the work, so I decided to improve on it and use it in my project. If it helps, I would also like to contribute these changes back to Goodreads Scraper.

Changes Made

List out the key changes made in this pull request, including any new features added, bugs fixed, or improvements made.

  • Added functionality to retrieve book IDs from Goodreads lists.
  • Implemented a feature to write the scraped book IDs to a database.
  • Added an option to export the extracted IDs to a text file, so the output can be used as input for the existing scripts.
  • Added argparse to simplify usage of the script.
  • Updated get_books.py so that it works with the current structure of the Goodreads website.
  • Added instructions on how to use the script to the README.
  • Added argparse to the requirements as a precaution against errors on the user's side, even though it is a standard Python library.
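The database-to-text export described in the list above could look roughly like the following. This is a minimal sketch, not the code in get_book_ids.py: it assumes the database is SQLite, and the table name `book_ids`, the column name `book_id`, and the `--db` / `--out` flags are all hypothetical.

```python
import argparse
import sqlite3


def export_book_ids(db_path, txt_path, table="book_ids", column="book_id"):
    """Write every stored book ID to a text file, one ID per line."""
    conn = sqlite3.connect(db_path)
    try:
        # Table and column names are assumptions; adjust them to the real schema.
        rows = conn.execute(f"SELECT {column} FROM {table} ORDER BY rowid").fetchall()
    finally:
        conn.close()
    with open(txt_path, "w", encoding="utf-8") as fh:
        for (book_id,) in rows:
            fh.write(f"{book_id}\n")
    return len(rows)


def build_parser():
    """Command-line interface; the flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="Export scraped book IDs to a text file")
    parser.add_argument("--db", required=True, help="path to the SQLite database")
    parser.add_argument("--out", required=True, help="path of the text file to write")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    count = export_book_ids(args.db, args.out)
    print(f"Exported {count} book IDs to {args.out}")
```

Calling `main()` from an `if __name__ == "__main__":` guard would turn this into a command-line script. A text file with one ID per line is a simple format for other scripts to read line by line.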

Files Modified

List the files modified/added in this pull request.

  • get_books.py
  • get_book_ids.py
  • README.md
  • requirements.txt

Checklist

Ensure that the following tasks have been completed:

  • Updated the documentation to reflect the changes.
  • Added appropriate comments and docstrings in the code.
  • Ensured that the code passes all tests and errors are handled gracefully.
  • Tested the new functionality locally.
  • Verified that the README file includes usage instructions for the new features.

Preview

This is the output for a particular Goodreads category/list:
[Image: database_scraper]

This is the JSON output produced by get_books.py:

{
    "book_id": "320",
    "cover_image_uri": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1327881361i/320.jpg",
    "book_title": "One Hundred Years of Solitude",
    "top_5_other_editions": [],
    "format": ["417 pages, Mass Market Paperback"],
    "publication_info": ["First published January 1, 1967"],
    "authorlink": "https://www.goodreads.com/author/show/13450.Gabriel_Garc_a_M_rquez",
    "author": "Gabriel García Márquez",
    "num_pages": ["417"],
    "genres": ["Fiction", "Magical Realism", "Literature", "Fantasy", "Novels", "Historical Fiction", "Spanish Literature"],
    "num_ratings": "976523",
    "num_reviews": "46268",
    "average_rating": "4.11",
    "rating_distribution": {
        "5": "480,125",
        "4": "260,614",
        "3": "140,529",
        "2": "57,483",
        "1": "37,772"
    }
}
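Every count in the sample above is stored as a string, and the rating distribution uses thousands separators. Downstream code, such as a recommendation model, will probably want numbers. The sketch below is one way to normalize a record. The helper names `to_int` and `normalize_book` are hypothetical and not part of the PR, and only the numeric fields of the sample are reproduced.

```python
import json


def to_int(value):
    """Convert a count such as "480,125" or "976523" to an int."""
    return int(str(value).replace(",", ""))


def normalize_book(record):
    """Return a copy of a scraped record with its numeric fields converted."""
    book = dict(record)
    book["num_ratings"] = to_int(record["num_ratings"])
    book["num_reviews"] = to_int(record["num_reviews"])
    book["average_rating"] = float(record["average_rating"])
    book["num_pages"] = [to_int(p) for p in record.get("num_pages", [])]
    book["rating_distribution"] = {
        int(stars): to_int(count)
        for stars, count in record.get("rating_distribution", {}).items()
    }
    return book


# Numeric fields copied from the sample record shown in the PR description.
sample = json.loads("""
{
    "book_id": "320",
    "num_pages": ["417"],
    "num_ratings": "976523",
    "num_reviews": "46268",
    "average_rating": "4.11",
    "rating_distribution": {
        "5": "480,125", "4": "260,614", "3": "140,529", "2": "57,483", "1": "37,772"
    }
}
""")
book = normalize_book(sample)

# For this sample, the per-star counts add up exactly to num_ratings.
assert sum(book["rating_distribution"].values()) == book["num_ratings"]
```

The final assertion doubles as a sanity check that could be run on each scraped record to detect a partially parsed rating histogram.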

GrimmXoXo added 6 commits May 14, 2024 10:44
Functions added:
- Custom Scrape: Implemented custom scraping for specific Goodreads lists.
- Added arguments for easy script use. See ReadMe.md for detailed information on available arguments.
- Implemented functionality to export book IDs from the database to a text format.

The new features enhance the flexibility and usability of the script, allowing users to specify custom scraping parameters and export data in a more accessible format.
Modified the file to fix non-working functions and added new functionalities such as format info and publication info.
…extra library as its a standard library module in Python
@maria-antoniak
Owner

Hi there! This looks wonderful; thank you for all your work! I haven't had time to test yet but will try to do this ASAP. If we integrate your changes, would you like to be credited in the README? If so, how would you like to be credited (username, name, something else)? These are significant changes that we haven't had time to make ourselves, and I want to make sure you get the credit you deserve.

top_5_editions (Inside DOM)
Handled:
If a book ID doesn't exist (i.e. the book/page doesn't exist), the script skips to the next ID in the text file.
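The skip-on-missing behaviour described in this commit message can be sketched as follows. This is an illustration under assumptions, not the code in get_books.py: the `BookNotFound` exception, the `fetch_book` callable, and the logger name are hypothetical stand-ins for however the script detects a missing page.

```python
import logging

logger = logging.getLogger("get_books")


class BookNotFound(Exception):
    """Raised by the (hypothetical) fetch function when a book page does not exist."""


def read_book_ids(path):
    """Read one book ID per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def scrape_all(book_ids, fetch_book):
    """Scrape each ID in order; IDs whose page is missing are skipped, not fatal."""
    results, skipped = [], []
    for book_id in book_ids:
        try:
            results.append(fetch_book(book_id))
        except BookNotFound:
            logger.warning("Book %s not found, skipping", book_id)
            skipped.append(book_id)
    return results, skipped
```

Returning the skipped IDs alongside the results makes it possible to retry or report them after the run.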
@GrimmXoXo
Author

GrimmXoXo commented May 14, 2024

> Hi there! This looks wonderful; thank you for all your work! I haven't had time to test yet but will try to do this ASAP. If we integrate your changes, would you like to be credited in the README? If so, how would you like to be credited (username, name, something else)? These are significant changes that we haven't had time to make ourselves, and I want to make sure you get the credit you deserve.

Hi! If this goes well, this will be my first contribution ^_^ I would love for my username (GrimmXoXo or GM) to appear in the credits, but please make sure that everything works well first.

GrimmXoXo added 3 commits May 15, 2024 00:28
- Implemented get_reviews.py, a script that extracts reviews for given book IDs from Goodreads.
- The script reads input from a text file containing book IDs and outputs the reviews in a SQLite database.
- Added a log file for easier debugging of the script.
- Updated requirements.txt to include langdetect, which filters out non-English reviews.
- Included an example on how to run the script in the README.md file.
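The review pipeline described in this commit (filter by language, store in SQLite, log progress) might look roughly like the sketch below. It is not the code in get_reviews.py: the `reviews` table layout, the `save_reviews` helper, and the pluggable `is_english` predicate are assumptions made for illustration.

```python
import logging
import sqlite3

logger = logging.getLogger("get_reviews")


def save_reviews(conn, book_id, reviews, is_english=lambda text: True):
    """Store the reviews accepted by `is_english` and return how many were kept."""
    # Hypothetical schema: one row per review, keyed by the book ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "book_id TEXT NOT NULL, review_text TEXT NOT NULL)"
    )
    kept = [text for text in reviews if is_english(text)]
    conn.executemany(
        "INSERT INTO reviews (book_id, review_text) VALUES (?, ?)",
        [(book_id, text) for text in kept],
    )
    conn.commit()
    logger.info("book %s: stored %d of %d reviews", book_id, len(kept), len(reviews))
    return len(kept)
```

With langdetect installed, the predicate could be something like `lambda text: detect(text) == "en"`, using `from langdetect import detect`. Note that `detect` can raise an exception on empty or non-linguistic text, so a real filter would need to handle that case. Calling `logging.basicConfig(filename="get_reviews.log", level=logging.INFO)` once at startup would send the log messages to a file; the file name here is only an example.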
@GrimmXoXo
Author

That would be all, I think. I added a new script to fetch reviews, added a log file for reviews to help debug problems (logging can also be added to the book ID and book details scripts), and added a README inside the folder that shows how to use the new script with a working example.

2 participants