This repository contains solutions to a set of collaborative filtering tasks, implemented with Python's Surprise library. The project covers:

- tuning user-based K-Nearest Neighbors (K-NN) by selecting the number of neighbors K that minimizes the Mean Absolute Error (MAE) under different levels of sparsity,
- mitigating the sparsity problem with the Funk variant of SVD (Singular Value Decomposition),
- comparing K-NN and SVD on Top-N recommendation quality with varying proportions of missing ratings.
- Clone the repository:

```sh
git clone git@github.com:ikajdan/collaborative_filtering_benchmarking.git
cd collaborative_filtering_benchmarking
```

- Create and activate a virtual environment:

```sh
python -m venv .venv
source .venv/bin/activate
```

- Install the required packages:

```sh
pip install -r requirements.txt
```

- Download the data:

```sh
curl -O https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
```
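With the archive extracted, the ratings can be loaded into Surprise straight from `ml-100k/u.data`. A minimal sketch; the path and the tab-separated `user item rating timestamp` layout are the standard MovieLens 100K format:

```python
from surprise import Dataset, Reader

# MovieLens 100K stores one rating per line: user, item, rating, timestamp,
# separated by tabs, with ratings on a 1-5 scale.
reader = Reader(line_format="user item rating timestamp", sep="\t", rating_scale=(1, 5))
data = Dataset.load_from_file("ml-100k/u.data", reader=reader)
```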
Task 1 Description
Given the dataset and the K-NN (K-Nearest Neighbors) algorithm, for user-based CF (Collaborative Filtering):
- Find the value of K that minimizes the MAE (Mean Absolute Error) with 25% of the ratings missing.
- Sparsity problem: find the value of K that minimizes the MAE with 75% of the ratings missing (see the sketch below).
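A minimal sketch of such a sweep, assuming a plain `KNNBasic` user-based model and a random hold-out split; the K range, Pearson similarity, and split seed here are illustrative, not necessarily the repository's exact protocol:

```python
from surprise import Dataset, KNNBasic, Reader, accuracy
from surprise.model_selection import train_test_split

reader = Reader(line_format="user item rating timestamp", sep="\t")
data = Dataset.load_from_file("ml-100k/u.data", reader=reader)

# test_size=0.25 hides 25% of the ratings; use 0.75 for the high-sparsity case.
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

best_k, best_mae = None, float("inf")
for k in range(10, 101, 2):
    # Pearson user-user similarity is an assumption; the project may use another measure.
    algo = KNNBasic(k=k, sim_options={"name": "pearson", "user_based": True}, verbose=False)
    algo.fit(trainset)
    mae = accuracy.mae(algo.test(testset), verbose=False)
    if mae < best_mae:
        best_k, best_mae = k, mae

print(f"Best K: {best_k}, MAE: {best_mae:.4f}")
```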
The MAE is consistently lower in the 25% missing case across all values of K. This is expected: more training data generally yields better predictive accuracy. With 75% of ratings missing, the algorithm struggles with sparsity, since fewer neighbors have ratings that overlap with the target user's, making predictions less reliable. Higher sparsity can therefore benefit from larger neighborhoods, which aggregate more data points and reduce error.
For both sparsity levels, the MAE decreases as K increases, reaching a minimum before leveling off. For this dataset the optimal values are:
- 25% missing ratings: K = 58 (MAE = 0.7504).
- 75% missing ratings: K = 55 (MAE = 0.7885).
Task 2 Description
Mitigation of the sparsity problem: show how the Funk variant of SVD (Singular Value Decomposition) can achieve a better MAE than user-based K-NN on the provided dataset.
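Surprise's `SVD` class implements this Funk-style, SGD-trained matrix factorization, so the comparison only needs both models evaluated on the same split. A sketch under the same hold-out assumptions as above; the hyperparameters are Surprise defaults, not tuned values:

```python
from surprise import SVD, Dataset, KNNBasic, Reader, accuracy
from surprise.model_selection import train_test_split

reader = Reader(line_format="user item rating timestamp", sep="\t")
data = Dataset.load_from_file("ml-100k/u.data", reader=reader)

# 75% hidden ratings: the high-sparsity setting where SVD's advantage shows most.
trainset, testset = train_test_split(data, test_size=0.75, random_state=42)

for name, algo in [
    # K = 55 is the best K reported above for the 75% missing case.
    ("user-based K-NN", KNNBasic(k=55, sim_options={"user_based": True}, verbose=False)),
    ("Funk SVD", SVD(n_factors=100, n_epochs=20, random_state=42)),
]:
    algo.fit(trainset)
    print(name, accuracy.mae(algo.test(testset), verbose=False))
```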
In both the 25% and 75% missing-ratings cases, Funk SVD outperforms K-NN. This suggests that SVD handles high sparsity more effectively: by factorizing the rating matrix into latent user and item factors, it can generalize from the available ratings rather than depending on neighbors with overlapping rating profiles. The precision, recall, and F1 scores likewise favor SVD in both cases, indicating that it is better at surfacing relevant items and producing accurate recommendations.

MAE, Precision, Recall, and F1 Score comparison between SVD and K-NN for 25% and 75% missing ratings.
Task 3 Description
Top-N recommendations: calculate precision, recall, and F1 for different values of N (10 to 100) using user-based K-NN (with the best K values found above) and SVD. Assume that the relevant items for a given user are those rated 4 or 5 stars in the dataset. Perform the calculations for both 25% and 75% missing ratings.
Explain why you think that the results reported in the three tasks make sense.
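A sketch of the metric computation, adapted from the precision/recall-at-k recipe in the Surprise documentation; relevance follows the task's 4-star threshold, and averaging per-user scores at the end is one common convention:

```python
from collections import defaultdict

def precision_recall_at_n(predictions, n=10, threshold=4.0):
    """Per-user precision@N and recall@N, with relevance = true rating >= threshold."""
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, ratings in user_est_true.items():
        # Rank this user's test items by estimated rating and take the top N.
        ratings.sort(key=lambda x: x[0], reverse=True)
        top_n = ratings[:n]
        n_rel = sum(true_r >= threshold for _, true_r in ratings)
        n_rec_rel = sum(true_r >= threshold for _, true_r in top_n)
        precisions[uid] = n_rec_rel / len(top_n) if top_n else 0
        recalls[uid] = n_rec_rel / n_rel if n_rel else 0
    return precisions, recalls

# Usage, given `predictions` from algo.test(testset):
# precisions, recalls = precision_recall_at_n(predictions, n=10)
# p = sum(precisions.values()) / len(precisions)
# r = sum(recalls.values()) / len(recalls)
# f1 = 2 * p * r / (p + r) if p + r else 0
```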
As N increases, precision decreases while recall stays high. With a larger N the model recommends more items, which captures more of the relevant ones (raising recall) but inevitably admits more irrelevant ones (lowering precision). This pattern holds for both 25% and 75% missing ratings. The F1 score reflects the trade-off, peaking at smaller N and declining as N grows, particularly under higher sparsity. K-NN and SVD show similar curves, suggesting that their ranking abilities are comparable in this scenario.