book-recommendation-system

A book recommendation system built with PySpark that suggests books matching a user's interests and predicts the rating a user would give a particular book.

This project was implemented as a part of the Big Data Analytics course taken at Concordia University. It uses content-based and collaborative filtering approaches.

Code Structure

model/
- dataExploration.ipynb
- data_preprocessing.ipynb
- content_based.ipynb - model 1
- collaborative_filtering.ipynb - model 2

Data Preprocessing

Data Cleaning

The ratings dataset does not contain any null or invalid values, so it requires no cleaning. However, the book and book tag datasets contain multiple null and invalid values.
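As a rough illustration of this kind of check, the sketch below counts missing values per field and drops rows that fail a simple validity rule. It is written in plain Python so it runs without a Spark session; the records, the field names, and the year range are hypothetical and are not taken from the notebooks.

```python
# Hypothetical book records; the field names are illustrative assumptions.
books = [
    {"book_id": 1, "title": "Book A", "year": 2008, "language": "eng"},
    {"book_id": 2, "title": None, "year": 1997, "language": "eng"},
    {"book_id": 3, "title": "Book C", "year": -50000, "language": None},
    {"book_id": 4, "title": "Book D", "year": 1925, "language": "eng"},
]


def count_nulls(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def is_valid(row, min_year=-3000, max_year=2100):
    """Keep rows with no missing fields and a plausible year (assumed rule)."""
    if any(value is None for value in row.values()):
        return False
    return min_year <= row["year"] <= max_year


null_counts = count_nulls(books)
clean_books = [row for row in books if is_valid(row)]
```

Dropping rows is only one option; depending on the field, filling in a default value may lose less data.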

Data Imbalance

As illustrated in Fig 1, the ratings are imbalanced: low rating values (e.g., 1, 2) occur much less often than high ones. We therefore look at several techniques (resampling, k-fold cross-validation, etc.) for dealing with data imbalance.

Fig 1: Data imbalance in our dataset
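One of the resampling options can be sketched as random oversampling: rows from the rarer rating classes are duplicated until every class is as large as the biggest one. This is a minimal plain-Python sketch under assumed data; the `oversample` helper and the rating triples are hypothetical, and the notebooks may resample differently.

```python
import random
from collections import Counter


def oversample(rows, label_of, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical (user_id, book_id, rating) triples skewed toward high ratings.
ratings = (
    [(u, u % 7, 5) for u in range(6)]
    + [(u, u % 7, 4) for u in range(6, 10)]
    + [(10, 3, 1), (11, 2, 2)]
)

balanced = oversample(ratings, label_of=lambda row: row[2])
before = Counter(row[2] for row in ratings)
after = Counter(row[2] for row in balanced)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.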

Model 1: Content-based

Advantage: No cold start for items; able to recommend new and unpopular books
Disadvantage: Difficult to recommend to new users, who have no rating history to build a profile from

One-Hot Vector Encoding

To deal with categorical data, we perform one-hot vector encoding to build our item and user profiles.
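For illustration, one-hot encoding maps each categorical value to a 0/1 vector with a single 1 at that category's position. The sketch below is plain Python with hypothetical genre tags; the `one_hot_encode` helper is not from the notebooks.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # single 1 at this value's category position
        vectors.append(vector)
    return categories, vectors


# Hypothetical genre tags, one per book.
genres = ["fantasy", "romance", "fantasy", "mystery"]
categories, vectors = one_hot_encode(genres)
```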

Algorithm

Using cosine distance, we recommend to each user the books most similar to those the user has previously rated highly.

Item profile: The item profile has one column for each distinct category of the book features, with values ranging from 0 to 1.
User profile: The user profile is the aggregate of the item profiles of books that the user has highly rated, normalized to the range 0-1.
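The profile-and-similarity steps can be sketched in plain Python as follows. The sketch ranks candidates by cosine similarity, which is equivalent to ranking by smallest cosine distance (distance = 1 - similarity). The item profiles, the helper names, and the choice to normalize by dividing by the largest entry are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_profile(item_profiles, liked_books):
    """Sum the profiles of liked books, then scale into 0-1 by the largest entry."""
    size = len(next(iter(item_profiles.values())))
    total = [0.0] * size
    for book in liked_books:
        for i, value in enumerate(item_profiles[book]):
            total[i] += value
    peak = max(total)
    return [value / peak for value in total] if peak > 0 else total


def recommend(item_profiles, liked_books, top_n=2):
    """Rank books the user has not rated highly by similarity to the user profile."""
    user_profile = build_user_profile(item_profiles, liked_books)
    candidates = [book for book in item_profiles if book not in liked_books]
    candidates.sort(
        key=lambda book: cosine_similarity(user_profile, item_profiles[book]),
        reverse=True,
    )
    return candidates[:top_n]


# Hypothetical item profiles over the features [fantasy, mystery, romance].
item_profiles = {
    "book_a": [1, 0, 0],
    "book_b": [1, 1, 0],
    "book_c": [0, 0, 1],
    "book_d": [1, 0, 0],
    "book_e": [0, 1, 1],
}
recommendations = recommend(item_profiles, liked_books={"book_a", "book_b"})
```

Here the user's liked books are mostly fantasy, so the pure fantasy title ranks first among the candidates.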

Model 2: Collaborative Filtering

Advantage: Feature selection is not needed, and it helps users discover new interests.
Disadvantage: Cold start problem, popularity bias

Algorithms

Matrix Factorization - Generate latent features and identify relationships between users and items based on other users’ interactions.
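To show the idea, the sketch below learns a small latent vector for each user and each book so that their dot product approximates the known ratings, then uses the same dot product to predict an unseen rating. It uses stochastic gradient descent in plain Python so it runs without Spark; it is an illustration only and not necessarily the solver used in the notebook. The toy ratings and all hyperparameters are assumptions.

```python
import random


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user's and item's latent vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def factorize(ratings, n_users, n_items, n_factors=2, epochs=500,
              lr=0.02, reg=0.02, seed=0):
    """Learn latent factors by SGD on regularized squared error."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            error = rating - predict(P, Q, user, item)
            for f in range(n_factors):
                p, q = P[user][f], Q[item][f]
                P[user][f] += lr * (error * q - reg * p)
                Q[item][f] += lr * (error * p - reg * q)
    return P, Q


# Hypothetical (user, item, rating) triples: users 0-1 prefer items 0-1,
# users 2-3 prefer items 2-3. The pair (user 0, item 1) is left out.
ratings = [
    (0, 0, 5), (0, 2, 1), (0, 3, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 5), (2, 3, 5),
    (3, 0, 1), (3, 1, 1), (3, 3, 5),
]

P, Q = factorize(ratings, n_users=4, n_items=4)
held_out_prediction = predict(P, Q, 0, 1)
```

User 0 never rated item 1, but shares tastes with user 1, who rated it highly; the learned factors carry that relationship over to the held-out pair.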

Evaluation metrics

Root mean squared error
Confusion Matrix
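Both metrics can be computed as in the sketch below. RMSE compares the predicted ratings with the actual ones directly, while the confusion matrix needs discrete labels, so the predictions are rounded to the nearest whole rating first. That rounding step, the helper names, and the sample values are assumptions for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def confusion_matrix(actual, predicted, labels):
    """Count matrix: rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical actual ratings and model predictions.
actual = [5, 4, 1, 3, 5, 2]
predicted = [4.6, 4.2, 2.4, 3.1, 3.9, 2.0]

error = rmse(actual, predicted)

# Round to the nearest rating and clamp to 1-5 to obtain class labels (assumed step).
rounded = [min(5, max(1, round(p))) for p in predicted]
matrix = confusion_matrix(actual, rounded, labels=[1, 2, 3, 4, 5])
```

Off-diagonal counts in the matrix show which rating values the model confuses, which RMSE alone does not reveal.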

Dataset Used: goodbooks-10k

Reference: https://github.com/zygmuntz/goodbooks-10k/blob/master/quick_look.ipynb
