A lightweight content-based movie recommendation system built using TMDB metadata. It extracts key movie features, processes them into vectors, computes similarity scores, and serves the recommendations through an interactive Streamlit UI.
- Content-based filtering using cosine similarity
- Metadata extraction:
  - Genres
  - Keywords
  - Top 3 cast members
  - Director
  - Overview
- Stemming of normalized text (NLTK)
- Vectorization via CountVectorizer
- Fast, precomputed similarity matrix
- Supports downloading large models from Hugging Face
- Clean, interactive Streamlit interface
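To make the vectorize-then-compare idea concrete, here is a minimal, dependency-free sketch of bag-of-words counting, cosine similarity, and top-k ranking. It is an illustration of the underlying computation, not the project's code: the project uses scikit-learn's CountVectorizer, whose tokenization differs from the simple regex used here. The function names and the toy tags are hypothetical.

```python
import math
import re
from collections import Counter


def count_vector(text: str) -> Counter:
    """Bag-of-words counts for one document (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(tags: list[str]) -> list[list[float]]:
    """Precompute every pairwise similarity once, so lookups are cheap later."""
    vectors = [count_vector(t) for t in tags]
    return [[cosine(u, v) for v in vectors] for u in vectors]


def top_similar(matrix: list[list[float]], index: int, k: int = 5) -> list[int]:
    """Indices of the k most similar items, excluding the item itself."""
    ranked = sorted(range(len(matrix)), key=lambda j: matrix[index][j], reverse=True)
    return [j for j in ranked if j != index][:k]


# Hypothetical tags, invented for illustration only.
toy_tags = [
    "action space alien marine",
    "action space alien war",
    "romance paris love letter",
]
matrix = similarity_matrix(toy_tags)
print(top_similar(matrix, 0, k=2))
```

The first two toy entries share three of four tokens, so they score 0.75 against each other, while the third shares none and scores 0.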
.
├── app.py              # Streamlit UI application
├── main.ipynb          # Data preprocessing + similarity computation
├── movies.pkl          # Cleaned movie metadata
├── requirements.txt    # Dependencies for app & notebook
├── runtime.txt         # Python version (for Streamlit Cloud)
├── README.md           # Documentation
└── data/
    ├── tmdb_5000_movies.csv
    └── tmdb_5000_credits.csv
TMDB CSVs
│
▼
Data Cleaning & Feature Extraction (main.ipynb)
│
├── create movies.pkl
└── compute similarity matrix → similarity.pkl
│
▼
Streamlit App (app.py)
│
├── load movies.pkl
├── load similarity from local OR Hugging Face
└── recommend top similar movies
python -m venv venv
source venv/Scripts/activate   # Windows (Git Bash)
source venv/bin/activate       # macOS/Linux
pip install -r requirements.txt
streamlit run app.py
Your app will open at the local URL that Streamlit prints in the terminal.
Open and run:
main.ipynb
This notebook:
- Cleans the TMDB dataset
- Creates a tags column
- Computes cosine similarity
- Saves artifacts:
  - movies.pkl
  - similarity.pkl or similarity.npz (depending on your notebook code)
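As a rough sketch of how a tags string might be assembled: the TMDB 5000 CSVs store genres, keywords, cast, and crew as JSON-encoded lists of objects, so each field has to be parsed before it can be combined. The helper names below are hypothetical, the sample row is invented, and removing spaces inside multi-word names is an assumption about the notebook rather than something confirmed by it. Stemming (for example with NLTK's PorterStemmer) would be applied to the result afterwards and is left out here.

```python
import json


def names(json_text: str, limit=None) -> list[str]:
    """Pull the "name" field out of a JSON-encoded list of objects."""
    items = json.loads(json_text)
    return [item["name"] for item in items[:limit]]


def directors(crew_json: str) -> list[str]:
    """Names of crew members whose job is "Director"."""
    return [m["name"] for m in json.loads(crew_json) if m.get("job") == "Director"]


def build_tags(row: dict) -> str:
    """Combine overview, genres, keywords, top cast, and director into one string."""
    entities = (
        names(row["genres"])
        + names(row["keywords"])
        + names(row["cast"], limit=3)  # top 3 cast members only
        + directors(row["crew"])
    )
    # Assumption: drop spaces inside multi-word names so each stays one token.
    tokens = row["overview"].split() + [e.replace(" ", "") for e in entities]
    return " ".join(tokens).lower()


# Hypothetical row, shaped like a merged TMDB movies/credits record.
row = {
    "overview": "A marine is sent to a distant moon.",
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "keywords": '[{"id": 1, "name": "space war"}]',
    "cast": '[{"name": "Actor One"}, {"name": "Actor Two"}, '
            '{"name": "Actor Three"}, {"name": "Actor Four"}]',
    "crew": '[{"job": "Director", "name": "Jane Doe"}, '
            '{"job": "Editor", "name": "John Roe"}]',
}
print(build_tags(row))
```

Joining "Science Fiction" into "sciencefiction" keeps the vectorizer from matching unrelated titles on the word "science" alone.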
(Do not commit large model files such as similarity.pkl to the repository; host them on Hugging Face instead.)
runtime.txt
python-3.10.12
requirements.txt (example)
streamlit
numpy==1.25.2
pandas
scikit-learn
joblib
requests
HF_RAW_URL = "https://huggingface.co/<username>/<repo>/resolve/main/similarity.pkl"
Go to: https://share.streamlit.io
- New App
- Choose your GitHub repo
- Branch: main
- Entry point: app.py
- Deploy
Streamlit will download the model from Hugging Face on first run.
Upload similarity.pkl to your HF repo.
Use the raw URL:
https://huggingface.co/<username>/<repo>/resolve/main/similarity.pkl
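The URL should contain /resolve/main/ or /raw/main/. For files stored with Git LFS, which is typical for large .pkl files, prefer /resolve/main/, since /raw/main/ returns the LFS pointer rather than the file contents.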
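One way the "load from local OR Hugging Face" step could look is sketched below. This is not the code from app.py: it uses urllib from the standard library instead of requests, and the function name load_similarity, the token parameter, and the .part suffix are all hypothetical. The URL is the placeholder from this README and must be filled in before it will download anything.

```python
import pickle
import urllib.request
from pathlib import Path

# Placeholder URL from the README; fill in your own username and repo.
HF_RAW_URL = "https://huggingface.co/<username>/<repo>/resolve/main/similarity.pkl"


def load_similarity(local_path="similarity.pkl", url=HF_RAW_URL, token=None):
    """Load the similarity matrix, downloading it first if it is missing."""
    path = Path(local_path)
    if not path.exists():
        request = urllib.request.Request(url)
        if token:  # only needed when the repo requires authentication
            request.add_header("Authorization", f"Bearer {token}")
        # Write to a temporary name so a failed download never looks complete.
        partial = path.with_name(path.name + ".part")
        with urllib.request.urlopen(request) as response, open(partial, "wb") as out:
            while chunk := response.read(1 << 20):  # stream in 1 MiB chunks
                out.write(chunk)
        partial.replace(path)
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

In a Streamlit app, a token kept in Streamlit Secrets could be passed as `token=st.secrets["HF_TOKEN"]`, where HF_TOKEN is an assumed key name. Because the file is cached on disk after the first download, later runs skip the network entirely.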
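def recommend(movie):
    # Row index of the selected title in the movies DataFrame
    movie_index = movies[movies['title'] == movie].index[0]
    distances = similarity[movie_index]
    # Sort by similarity, highest first; skip position 0 (the movie itself)
    movie_list = sorted(
        list(enumerate(distances)),
        reverse=True,
        key=lambda x: x[1]
    )[1:6]
    return [movies.iloc[i[0]].title for i in movie_list]

GitHub doesn't allow files larger than 100 MB. Solution: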
- Upload the large similarity file to Hugging Face
- Let app.py download it at runtime
Pin a NumPy version that ships prebuilt wheels:
numpy==1.25.2
Add runtime.txt:
python-3.10.12
Add a Hugging Face access token to Streamlit Secrets.
- TMDB poster integration
- Movie detail pages
- Hybrid recommender (Content + Collaborative Filtering)
- Semantic similarity with Sentence Transformers
- Compressed sparse similarity matrix
- TMDB for the dataset
- Streamlit for the UI
- Hugging Face for large file hosting
- Scikit-learn & Pandas for preprocessing