An algorithm performance report on the itemKNN algorithm, which uses the collaborative filtering paradigm as its learning method. I created this report during my summer internship at B2BPlanet.com; it marks the start of the Smart Search Algorithm Project, which will be implemented on the live server soon. The model was built with open-source code provided by Arthur Fortes under the MIT License (see LICENSE.txt in the root directory).
The Book-Crossing dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems.
This data has been organized and cleaned up by Arthur Fortes [1], following the treatment applied to MovieLens 100k [2]: users with fewer than 20 interactions and items with fewer than 10 interactions were removed, as were items that have no information, and the explicit and implicit interactions were separated into different files.
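The interaction-count filter described above can be sketched as follows. This is a minimal illustration, not the code used to produce the dataset: the function name `filter_sparse` and the single-pass behaviour (counts taken once on the unfiltered data) are assumptions, and the original cleanup may have worked differently.

```python
from collections import Counter


def filter_sparse(interactions, min_user=20, min_item=10):
    """Keep (user, item, value) triples whose user and item meet the thresholds.

    Hypothetical helper: counts are computed once on the input, so a user may
    end up below `min_user` after sparse items are dropped.
    """
    user_counts = Counter(user for user, _, _ in interactions)
    item_counts = Counter(item for _, item, _ in interactions)
    return [
        (user, item, value)
        for user, item, value in interactions
        if user_counts[user] >= min_user and item_counts[item] >= min_item
    ]


# Toy data with small thresholds so the effect is visible.
toy = [(1, 10, 1), (1, 11, 1), (2, 10, 1), (3, 12, 1)]
kept = filter_sparse(toy, min_user=2, min_item=2)
print(kept)
```

If every remaining user must still meet the threshold after items are removed, the function could be applied repeatedly until the output stops changing.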
Detailed descriptions of the data files can be found at the end of this file.
This dataset consists of:
- 272,679 interactions (explicit / implicit) from 2,946 users on 17,384 books.
- Ratings: 1,295 users and 14,684 books (62,657 ratings applied)
- History: 2,946 users and 17,384 books (272,679 accesses)
- Ratings range from 1 to 10. Implicit feedback is represented by 1.
- Simple demographic info for the users (location, age)
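The counts listed above imply how sparse each user-item matrix is, which matters for neighbourhood methods such as itemKNN. The short calculation below is only a sketch using the figures from this list; the helper name `density` is an assumption made for illustration.

```python
def density(n_interactions, n_users, n_items):
    """Fraction of the user-item matrix that holds an observed interaction."""
    return n_interactions / (n_users * n_items)


# Figures taken from the dataset summary.
ratings_density = density(62_657, 1_295, 14_684)
history_density = density(272_679, 2_946, 17_384)

print(f"ratings density: {ratings_density:.4%}")
print(f"history density: {history_density:.4%}")
```

Both matrices are well under 1% full: roughly 0.33% for the ratings and 0.53% for the history.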
If you have any further questions or comments, please contact me [email protected].
Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication):
Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan.
Here are brief descriptions of the data.
items_info.dat -- Information about the items (books); this is a tab separated list of Book_ID | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
The item ids are the ones used in the book_history.dat and book_ratings.dat files.
users_info.dat -- Demographic information about the users; this is a tab separated list of User-ID | Location | Age
The user ids are the ones used in the book_history.dat and book_ratings.dat files.
book_history.dat -- The full history set, 272,679 accesses by 2,946 users on 17,384 books. Each user has accessed at least 20 books. Users and items are numbered consecutively from 1. The data is ordered by user id. This is a tab separated list of user id | item id | accessed
book_ratings.dat -- The full ratings set, 62,657 ratings by 1,295 users on 14,684 books. Users and items are numbered consecutively from 1. The data is ordered by user id. This is a tab separated list of user id | item id | rating
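As a rough sketch of how the two interaction files could be read, the snippet below parses tab-separated rows of user id, item id and value. It assumes the files have no header row and that all three fields are numeric; the function name `read_interactions` is hypothetical, and an in-memory sample stands in for the real file.

```python
import csv
import io


def read_interactions(handle):
    """Parse tab-separated 'user id, item id, value' rows into tuples."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:  # skip blank lines
            continue
        user_id, item_id, value = record[:3]
        rows.append((int(user_id), int(item_id), float(value)))
    return rows


# In-memory sample in the same layout as book_ratings.dat (assumed).
sample = io.StringIO("1\t5\t8\n1\t9\t10\n2\t5\t3\n")
ratings = read_interactions(sample)
print(ratings)
```

For the real data, the same function could be given a file object, for example one opened with `open("book_ratings.dat", newline="")`.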
[1] Da Costa, Arthur Fortes. PhD candidate at the Institute of Mathematical and Computational Sciences, University of São Paulo. URL: https://arthurfortes.github.io/
[2] MovieLens 100K Dataset. Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998. URL: https://grouplens.org/datasets/movielens/100k/ Generated by GroupLens [Department of Computer Science and Engineering at the University of Minnesota].