This project compares apps from the Google Play Store (proprietary) with those from F-Droid (open source) by analyzing various features and building predictive models to gauge app popularity (using ratings and installs). The project covers all stages from data acquisition, cleaning, combining, and feature derivation to exploratory analysis and model building. The modeling phase includes both a regression task (predicting app rating) and a classification task (predicting platform membership using Random Forest variants).
Research Questions
• Popularity Prediction: Can we accurately predict app popularity (e.g., ratings) using a set of derived features?
• Platform Differences: Are there systematic differences between open-source (F-Droid) and proprietary (Google Play) apps based on attributes such as reviews, app age, size, and install counts?
• Feature Importance & Insights: Which features most strongly influence app popularity metrics?
CMPT-353-PROJECT/
├── data/
│   ├── cleaned/
│   │   ├── fdroid_cleaned.csv                 # Cleaned F-Droid data (converted from JSON)
│   │   └── googleplay_cleaned.csv             # Cleaned Google Play data (from Kaggle)
│   ├── combined/
│   │   ├── combined_apps.csv                  # Combined dataset (raw version)
│   │   └── combined_apps_enhanced.csv         # Combined dataset with additional derived features
│   └── uncleaned/
│       ├── fdroid.json                        # Raw F-Droid JSON data scraped from F-Droid repository
│       └── googleplaystore.csv                # Raw Google Play data downloaded from Kaggle
├── images/
│   ├── feature_importances.png                # Barplot for top 10 feature importances (modeling)
│   ├── platform_confusion_matrix.png          # Confusion matrix for platform classification
│   └── platform_confusion_matrix_fair.png     # Alternative confusion matrix (if used)
├── notebooks/
│   ├── Feature_Derivation.ipynb               # Notebook for deriving additional features
│   └── Visualize_Trends.ipynb                 # Notebook for exploratory visualizations
├── src/
│   ├── acquire_fdroid.py                      # Script to download F-Droid data via API
│   ├── acquire_googleplay.py                  # Script to download Google Play data from Kaggle
│   ├── clean_fdroid.py                        # Script to clean and process F-Droid data
│   ├── clean_googleplay.py                    # Script to clean and process Google Play data
│   ├── combine_datasets.py                    # Script to merge F-Droid and Google Play datasets
│   ├── build_model.py                         # Script to build and evaluate the regression model
│   └── platform_classifier_RandomForests.py   # Script for platform classification using Random Forests
├── README.md                                  # This file
├── .gitignore                                 # Files and directories to ignore in Git
└── requirements.txt                           # Python package dependencies
The project is implemented in Python 3.x. The main packages required include:
• pandas: Data manipulation and cleaning
• numpy: Numerical operations
• matplotlib and seaborn: Data visualization
• scikit-learn: Model building, preprocessing, and evaluation
• requests: HTTP requests for data acquisition
• kaggle: To interact with the Kaggle API for data downloads
• jupyter: For running the notebooks
Install the libraries using:
pip install -r requirements.txt
Because all files required to run the scripts are included in the GitHub repo, you will have them locally once you clone it, so you do not need to acquire the data or supply any inputs. The instructions below are nevertheless written as though you are creating the project from scratch. If you want to do further testing, you can use your own Kaggle API key to try different datasets.
- Data Acquisition:
• F-Droid Data:
Run the script to download F-Droid data:
python3 src/acquire_fdroid.py
This downloads the raw fdroid.json file into the data/uncleaned/ directory.
• Google Play Data:
Run the script to download the Google Play Store dataset from Kaggle:
python3 src/acquire_googleplay.py
Make sure you have a valid kaggle.json in the ~/.kaggle/ directory, or set the Kaggle environment variables accordingly.
The raw Google Play data will be saved in data/uncleaned/.
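Before running the Google Play acquisition script, it can help to confirm that Kaggle credentials are actually visible. The following is a minimal sketch, not part of the repository: the function name `has_kaggle_credentials` is hypothetical, and it assumes the two standard ways the Kaggle API looks up credentials (the `~/.kaggle/kaggle.json` token file, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables).

```python
import os
from pathlib import Path


def has_kaggle_credentials(home=None, env=None):
    """Return True if Kaggle API credentials appear to be configured.

    `home` and `env` are injectable only to make the check easy to test;
    by default the real home directory and environment are used.
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    # Option 1: the token file downloaded from the Kaggle account page.
    if (home / ".kaggle" / "kaggle.json").is_file():
        return True
    # Option 2: credentials supplied through environment variables.
    return bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
```

Calling `has_kaggle_credentials()` with no arguments checks the current user's setup. It only tests for presence, so an expired or invalid token will still fail at download time.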
- Data Cleaning:
• F-Droid Cleaning:
Process the raw JSON to produce cleaned CSV data:
python3 src/clean_fdroid.py
This outputs fdroid_cleaned.csv in data/cleaned/.
• Google Play Cleaning:
Process the Kaggle CSV dataset:
python3 src/clean_googleplay.py
This outputs googleplay_cleaned.csv in data/cleaned/.
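To illustrate the kind of work the Google Play cleaning step involves, here is a small sketch of turning text fields into numbers. It is not the code in clean_googleplay.py. It assumes that raw install counts look like "10,000+" and that sizes look like "19M", "201k", or "Varies with device"; the helper names `parse_installs` and `parse_size_mb` are hypothetical.

```python
def parse_installs(raw):
    """Convert an install-count string such as '10,000+' to an int, or None."""
    digits = raw.strip().rstrip("+").replace(",", "")
    # Anything that is not purely numeric (e.g. a misplaced text value) is missing.
    return int(digits) if digits.isdigit() else None


def parse_size_mb(raw):
    """Convert a size string such as '19M' or '201k' to megabytes, or None."""
    text = raw.strip()
    if not text:
        return None
    # Assumed unit suffixes: M for megabytes, k for kilobytes.
    factors = {"m": 1.0, "k": 1.0 / 1024}
    unit = text[-1].lower()
    if unit not in factors:
        return None  # e.g. 'Varies with device'
    try:
        return float(text[:-1]) * factors[unit]
    except ValueError:
        return None
```

Returning `None` for unparseable values keeps the decision about dropping or imputing those rows separate from the parsing itself.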
- Combining Datasets:
Merge both cleaned datasets into one combined dataset with derived features:
python3 src/combine_datasets.py
This creates one file:
• combined_apps.csv (merged with basic comparative organization)
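The idea behind the combining step can be sketched as stacking the rows of both sources, tagging each row with its platform, and aligning the columns. This is an illustrative standard-library version under assumed column names, not the implementation in combine_datasets.py; the function `combine_rows` and the `platform` column are hypothetical.

```python
def combine_rows(fdroid_rows, googleplay_rows):
    """Stack two lists of dict rows, tagging each row with its platform."""
    combined = []
    for platform, rows in (("F-Droid", fdroid_rows), ("Google Play", googleplay_rows)):
        for row in rows:
            combined.append({**row, "platform": platform})
    # Build the union of all keys, in first-seen order, so both sources share one schema.
    columns = []
    for row in combined:
        for key in row:
            if key not in columns:
                columns.append(key)
    # Fields that one source lacks are filled with None.
    return [{col: row.get(col) for col in columns} for row in combined]
```

For example, `combine_rows([{"name": "A", "license": "GPL"}], [{"name": "B", "rating": 4.5}])` yields two rows that both carry `name`, `license`, `platform`, and `rating`, with `None` where a source has no value.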
- Exploratory Data Analysis (EDA):
Open and run the following Jupyter notebooks for additional exploratory analysis and further feature derivation:
• Visualization Notebook:
jupyter notebook notebooks/Visualize_Trends.ipynb
This notebook generates visualizations (e.g., distributions, correlations, trends) for further data insights.
- Feature Engineering:
• Feature Derivation Notebook:
jupyter notebook notebooks/Feature_Derivation.ipynb
This notebook loads combined_apps_enhanced.csv, derives new features (such as app age, binned installs, and flags), and saves the enhanced dataset.
• The resulting dataset with new features will be saved for use in modeling.
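The three kinds of derived features mentioned above (app age, binned installs, and flags) could look roughly like the sketch below. The cut-offs, the choice of reference date, and the function names are all assumptions made for illustration; the notebook's actual definitions may differ.

```python
from datetime import date


def app_age_days(last_updated, today):
    """Days elapsed between an app's last update and a reference date."""
    return (today - last_updated).days


def install_bin(installs):
    """Map an install count to a coarse bucket (hypothetical cut-offs)."""
    if installs is None:
        return "unknown"
    if installs < 1_000:
        return "low"
    if installs < 1_000_000:
        return "medium"
    return "high"


def is_free(price):
    """Flag apps whose price is zero."""
    return price == 0


if __name__ == "__main__":
    print(app_age_days(date(2024, 1, 1), date(2024, 1, 31)))  # 30
    print(install_bin(50_000))  # medium
```

Passing the reference date explicitly, rather than calling `date.today()` inside the function, keeps the derived feature reproducible between runs.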
- Model Building & Evaluation:
• Regression Model (Predicting Ratings):
Run the regression script that uses a Random Forest pipeline with GridSearchCV:
python3 src/build_model.py
The script prints the best hyperparameters and the test set R² and RMSE metrics, and saves a feature importance barplot in the images/ directory.
• Platform Classification Model:
Run the script to build a Random Forest classifier for predicting the platform (F-Droid vs. Google Play):
python3 src/platform_classifier_RandomForests.py
This outputs a classification report and saves the confusion matrix plot in the images/ directory.
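For readers who want to interpret the two regression metrics the script reports, the sketch below computes them from first principles using their standard definitions. The scripts themselves presumably rely on scikit-learn's metric functions; `rmse` and `r_squared` here are illustrative stand-ins, not names from the repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean rating scores an R² of 0, so a value near 0 on the test set means the features add little beyond that baseline. RMSE is expressed in the same units as the rating itself.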
Expected Outputs:
• Data Files:
- Cleaned data stored in data/cleaned/ and data/combined/ directories.
• Notebooks:
- Visualizations and derived feature datasets are produced from the notebooks in the notebooks/ folder.
• Model Metrics and Artifacts:
- Regression model evaluation metrics (R², RMSE) printed to console.
- Feature importances plot saved as images/feature_importances.png.
- Classification report and confusion matrix plot saved as image files in the images/ directory.
Additional Notes:
• Make sure to update the kaggle.json file in your ~/.kaggle folder if you experience authentication issues with the Kaggle API.
• You may need to adjust column names or feature lists in the code if the format of the input data changes.
• All scripts assume that the project is run from the project root directory (i.e., CMPT-353-PROJECT/).
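Since the scripts rely on paths relative to the project root, a quick sanity check of the working directory can save a confusing "file not found" error. The helper below is a hypothetical sketch, not part of the repository; it only assumes the top-level directory names shown in the project layout.

```python
from pathlib import Path

# Top-level directories taken from the project layout in this README.
EXPECTED_DIRS = ("data", "src", "notebooks", "images")


def missing_project_dirs(root="."):
    """Return the expected top-level directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_project_dirs()
    if missing:
        print("Not in the project root? Missing:", ", ".join(missing))
    else:
        print("Working directory looks like the project root.")
```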
This README documents the code, the instructions for running the scripts, the dependencies, and the expected file outputs. It supports reproducibility and helps users and evaluators understand the flow of the project and the methods used to meet the assignment requirements.