arieltyson/CMPT-353-Project


Comparing Proprietary vs. Open-Source Apps: Predicting App Popularity 📲

Overview

This project compares apps from the Google Play Store (proprietary) with those from F-Droid (open source) by analyzing various features and building predictive models to gauge app popularity (using ratings and installs). The project covers all stages from data acquisition, cleaning, combining, and feature derivation to exploratory analysis and model building. The modeling phase includes both a regression task (predicting app rating) and a classification task (predicting platform membership using Random Forest variants).

Research Questions

• Popularity Prediction: Can we accurately predict app popularity (e.g., ratings) using a set of derived features?

• Platform Differences: Are there systematic differences between open-source (F-Droid) and proprietary (Google Play) apps based on attributes such as reviews, app age, size, and install counts?

• Feature Importance & Insights: Which features most strongly influence app popularity metrics?

Project Structure

CMPT-353-PROJECT/
├── data/
│   ├── cleaned/
│   │   ├── fdroid_cleaned.csv         # Cleaned F-Droid data (converted from JSON)
│   │   └── googleplay_cleaned.csv     # Cleaned Google Play data (from Kaggle)
│   ├── combined/
│   │   ├── combined_apps.csv          # Combined dataset (raw version)
│   │   └── combined_apps_enhanced.csv # Combined dataset with additional derived features
│   └── uncleaned/
│       ├── fdroid.json                # Raw F-Droid JSON data scraped from F-Droid repository
│       └── googleplaystore.csv        # Raw Google Play data downloaded from Kaggle
├── images/
│   ├── feature_importances.png        # Barplot for top 10 feature importances (modeling)
│   ├── platform_confusion_matrix.png  # Confusion matrix for platform classification
│   └── platform_confusion_matrix_fair.png # Alternative confusion matrix (if used)
├── notebooks/
│   ├── Feature_Derivation.ipynb       # Notebook for deriving additional features
│   └── Visualize_Trends.ipynb         # Notebook for exploratory visualizations
├── src/
│   ├── acquire_fdroid.py              # Script to download F-Droid data via API
│   ├── acquire_googleplay.py          # Script to download Google Play data from Kaggle
│   ├── clean_fdroid.py                # Script to clean and process F-Droid data
│   ├── clean_googleplay.py            # Script to clean and process Google Play data
│   ├── combine_datasets.py            # Script to merge F-Droid and Google Play datasets
│   ├── build_model.py                 # Script to build and evaluate the regression model
│   └── platform_classifier_RandomForests.py  # Script for platform classification using Random Forests
├── README.md                          # This file
├── .gitignore                         # Files and directories to ignore in Git
└── requirements.txt                   # Python package dependencies

Required Libraries

The project is implemented in Python 3.x. The main packages required include:

• pandas – Data manipulation and cleaning

• numpy – Numerical operations

• matplotlib and seaborn – Data visualization

• scikit-learn – Model building, preprocessing, and evaluation

• requests – HTTP requests for data acquisition

• kaggle – To interact with the Kaggle API for data downloads

• jupyter – For running the notebooks

Install the libraries using:

pip install -r requirements.txt

Setup and Execution

All files required to run the scripts are included in the GitHub repository, so once you have cloned it you can skip the data acquisition step. The instructions below are nonetheless written as though you are building the project from scratch. For further testing, you can use your own Kaggle API key to try different datasets.

  1. Data Acquisition:

• F-Droid Data:

Run the script to download F-Droid data:

python3 src/acquire_fdroid.py

This will download the raw fdroid.json file into the data/uncleaned/ directory.

• Google Play Data:

Run the script to download the Google Play Store dataset from Kaggle:

python3 src/acquire_googleplay.py

Make sure you have a valid kaggle.json in the ~/.kaggle/ directory or set the environment variables accordingly.

The raw Google Play data will be saved in data/uncleaned/.
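Before running the Google Play download, it can help to confirm that credentials are visible to the Kaggle client. The following is a minimal sketch, not part of the repository: the helper name kaggle_credentials_available is hypothetical, and it assumes the two standard Kaggle environment variables KAGGLE_USERNAME and KAGGLE_KEY as the alternative to the kaggle.json file.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: `env` and `home` are injectable so the check can be
    exercised without touching the real environment or home directory.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: both environment variables are set and non-empty.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: the token file exists at ~/.kaggle/kaggle.json.
    return (home / ".kaggle" / "kaggle.json").is_file()
```

Calling kaggle_credentials_available() with no arguments checks the current user's environment; a False result suggests the download script would fail at authentication.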

  2. Data Cleaning:

• F-Droid Cleaning:

Process the raw JSON to produce cleaned CSV data:

python3 src/clean_fdroid.py

This outputs fdroid_cleaned.csv in data/cleaned/.

• Google Play Cleaning:

Process the Kaggle CSV dataset:

python3 src/clean_googleplay.py

This outputs googleplay_cleaned.csv in data/cleaned/.
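To illustrate the JSON-to-CSV conversion in the F-Droid cleaning step, here is a rough sketch using only the standard library. It is not the code in clean_fdroid.py: the top-level "apps" key, the field names (packageName, name, categories, license, added), the millisecond timestamp, and the output columns are all assumptions made for the example.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical column list; the real clean_fdroid.py may keep different fields.
COLUMNS = ["package", "name", "category", "license", "added"]


def flatten_app(app):
    """Turn one raw app record (a dict) into a flat row for the CSV."""
    categories = app.get("categories") or []
    added_ms = app.get("added")  # assumed to be milliseconds since the epoch
    added = (
        datetime.fromtimestamp(added_ms / 1000, tz=timezone.utc).date().isoformat()
        if added_ms is not None
        else ""
    )
    return {
        "package": app.get("packageName", ""),
        "name": (app.get("name") or "").strip(),
        "category": categories[0] if categories else "",  # keep first category only
        "license": app.get("license", ""),
        "added": added,
    }


def json_to_csv(raw_json, out):
    """Write one CSV row per app under the assumed "apps" key; return the row count."""
    apps = json.loads(raw_json).get("apps", [])
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for app in apps:
        writer.writerow(flatten_app(app))
    return len(apps)
```

Missing fields become empty strings rather than raising, which keeps one malformed record from aborting the whole conversion.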

  3. Combining Datasets:

Merge both cleaned datasets into one combined dataset with derived features:

python3 src/combine_datasets.py

This creates one file:

• combined_apps.csv (merged with basic comparative organization)
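One plausible way to build such a combined file is to align both cleaned tables on a shared set of columns and stack them with a platform label. The sketch below assumes that approach; the column names in SHARED_COLUMNS and the function combine are hypothetical, and combine_datasets.py may do this differently.

```python
import pandas as pd

# Hypothetical shared schema; the real column names live in combine_datasets.py.
SHARED_COLUMNS = ["name", "category", "size_mb"]


def combine(fdroid: pd.DataFrame, googleplay: pd.DataFrame) -> pd.DataFrame:
    """Stack both cleaned tables and record which platform each row came from."""
    frames = [
        fdroid[SHARED_COLUMNS].assign(platform="F-Droid"),
        googleplay[SHARED_COLUMNS].assign(platform="Google Play"),
    ]
    # ignore_index gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

The platform column added here is what a later classification step would use as its target.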

  4. Exploratory Data Analysis (EDA):

Open and run the following Jupyter notebooks for additional exploratory analysis and further feature derivation:

• Visualization Notebook:

jupyter notebook notebooks/Visualize_Trends.ipynb

This notebook generates visualizations (e.g., distributions, correlations, trends) for further data insights.

  5. Feature Engineering:

• Feature Derivation Notebook:

jupyter notebook notebooks/Feature_Derivation.ipynb

This notebook loads combined_apps_enhanced.csv, derives new features (such as app age, binned installs, and flags), and saves the enhanced dataset.

• The resulting dataset with new features will be saved for use in modeling.
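The kinds of derived features mentioned above (app age, binned installs, flags) can be sketched in pandas as follows. This is an illustration under assumptions, not the notebook's code: the input columns released, installs, and price, the bin edges, and the function name derive_features are all hypothetical.

```python
import pandas as pd


def derive_features(apps: pd.DataFrame, today: str) -> pd.DataFrame:
    """Add illustrative derived columns to a copy of the combined table."""
    out = apps.copy()
    # App age in days, measured from the (assumed) release date column.
    released = pd.to_datetime(out["released"])
    out["app_age_days"] = (pd.Timestamp(today) - released).dt.days
    # Coarse install buckets; the bin edges are illustrative only.
    out["install_bin"] = pd.cut(
        out["installs"],
        bins=[-1, 1_000, 100_000, float("inf")],
        labels=["low", "medium", "high"],
    )
    # Simple boolean flag derived from the (assumed) price column.
    out["is_free"] = out["price"] == 0
    return out
```

Passing the reference date explicitly, rather than reading the system clock, keeps the derived ages reproducible between runs.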

  6. Model Building & Evaluation:

• Regression Model (Predicting Ratings):

Run the regression script that uses a Random Forest pipeline with GridSearchCV:

python3 src/build_model.py

The script prints best hyperparameters, test set R² and RMSE metrics, and saves a feature importance barplot in the images/ directory.

• Platform Classification Model:

Run the script to build a Random Forest classifier for predicting the platform (F-Droid vs. Google Play):

python3 src/platform_classifier_RandomForests.py

This outputs a classification report and saves the confusion matrix plot in the images/ directory.
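For the regression step, a Random Forest pipeline tuned with GridSearchCV might look roughly like the sketch below. The scaling step, the hyperparameter grid, the 75/25 split, and the function name fit_rating_model are assumptions chosen for illustration; build_model.py may use different preprocessing and search settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_rating_model(X, y, seed=0):
    """Tune a Random Forest regressor and report held-out R² and RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    pipeline = Pipeline([
        ("scale", StandardScaler()),  # assumed step; forests do not require scaling
        ("forest", RandomForestRegressor(random_state=seed)),
    ])
    # Illustrative grid; parameter names are prefixed with the pipeline step name.
    grid = {"forest__n_estimators": [25, 50], "forest__max_depth": [3, None]}
    search = GridSearchCV(pipeline, grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return search, r2_score(y_test, pred), rmse
```

After fitting, search.best_params_ holds the selected hyperparameters, and search.best_estimator_.named_steps["forest"].feature_importances_ gives the per-feature importances that a barplot would be drawn from.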

Expected Outputs:

• Data Files:

  • Cleaned data stored in data/cleaned/ and data/combined/ directories.

• Notebooks:

  • Visualizations and derived feature datasets are produced from the notebooks in the notebooks/ folder.

• Model Metrics and Artifacts:

  • Regression model evaluation metrics (R², RMSE) printed to console.
  • Feature importances plot saved as images/feature_importances.png.
  • Classification report and confusion matrix plot saved as image files in the images/ directory.
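As an illustration of where the classification report and confusion matrix come from, here is a minimal sketch of a Random Forest platform classifier. It is not the code in platform_classifier_RandomForests.py: the stratified split, the class_weight="balanced" setting, and the function name evaluate_platform_classifier are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate_platform_classifier(X, y, seed=0):
    """Fit a Random Forest on platform labels; return the report and matrix."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # class_weight="balanced" is one common way to handle unequal class sizes;
    # whether the project uses it is an assumption.
    clf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=seed
    )
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    labels = sorted(set(y))  # fixed label order for both outputs
    report = classification_report(y_test, pred, labels=labels)
    matrix = confusion_matrix(y_test, pred, labels=labels)
    return report, matrix
```

The returned matrix has true labels on the rows and predicted labels on the columns, in the sorted label order; scikit-learn's ConfusionMatrixDisplay can render it as a plot.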

Additional Notes:

• Make sure to update the kaggle.json file in your ~/.kaggle folder if you experience authentication issues with the Kaggle API.

• You may need to adjust column names or feature lists in the code if the format of the input data changes.

• All scripts assume that the project is run from the project root directory (i.e., CMPT-353-PROJECT/).

This README documents the code, the instructions for running the scripts, the dependencies, and the expected file outputs. It supports reproducibility and helps users and evaluators understand the flow of the project and the methods used to meet the assignment requirements.

About

A data science/machine learning project applying the concepts and techniques covered in CMPT 353 - Computational Data Science at Simon Fraser University 🧬
