Youth Problematic Internet Usage Prediction

Overview

This project focuses on predicting the level of problematic internet usage exhibited by children and adolescents based on their physical activity and fitness data. Given the increasing prevalence of internet usage among youth, understanding and predicting problematic patterns is crucial for intervention and support strategies.

Problem Statement

The rise of the internet has brought significant benefits, but it has also led to concerning trends in problematic internet usage, particularly among children and adolescents. This project seeks to identify and predict these problematic usage patterns through machine learning techniques, allowing for proactive measures to be implemented by parents, educators, and health professionals.

Objective

The primary objective of this project is to build a predictive model that accurately assesses the severity of problematic internet usage based on various physical activity and fitness metrics. This involves:

Data collection and preprocessing
Feature engineering
Training multiple machine learning models
Evaluating model performance using appropriate metrics
Hyperparameter tuning to optimize model accuracy

Technologies

This project is developed using the following technologies:

Python: Programming language used for data manipulation and machine learning.
Pandas: Library for data manipulation and analysis.
NumPy: Library for numerical computations.
Scikit-learn: Machine learning library for model training and evaluation.
XGBoost: Optimized gradient boosting library.
LightGBM: Fast gradient boosting framework that uses tree-based learning algorithms.
CatBoost: Gradient boosting library that handles categorical features well.
Optuna: Hyperparameter optimization framework.
Seaborn: Data visualization library based on Matplotlib.
Matplotlib: Library for creating static, animated, and interactive visualizations in Python.
tqdm: Library for displaying progress bars.

Data Sources

The dataset is sourced from the Child Mind Institute Problematic Internet Use Dataset. The following files are utilized:

train.csv: Contains training data for model building.
test.csv: Contains test data for model evaluation.
sample_submission.csv: Format for submitting predictions.
Time series data in parquet format.

Installation

To run this project, you need to install the required packages. You can do this using pip:

pip install pandas numpy scikit-learn xgboost lightgbm catboost seaborn matplotlib optuna tqdm

Usage

Load the Data: The dataset is loaded using pandas.
Data Preprocessing: Missing values are handled, and relevant features are created and encoded.
Model Training: Different models are trained, and their performance is evaluated using the Quadratic Weighted Kappa metric.
Hyperparameter Tuning: Optuna is used for hyperparameter optimization.

Data Preprocessing

Data preprocessing is crucial for ensuring the quality and effectiveness of machine learning models. The following steps were taken to prepare the data:

Data Cleaning
- Missing values were identified and handled appropriately. For instance, specific physical attributes such as Physical-BMI, Physical-Height, Physical-Weight, Physical-Diastolic_BP, Physical-HeartRate, and Physical-Systolic_BP were imputed using the Basic_Demos-Age and Basic_Demos-Sex columns.
Handling Missing Values
- Identifying Missing Values: Assess missing values across different features.
- Dropping Columns: Features with more than 50% missing values are dropped.
- Missing values in the specified physical attributes were imputed based on age and sex groups, creating categories such as ‘Children (5-12)’, ‘Adolescents (13-18)’, and ‘Adults (19-22)’.
- The PreInt_EduHx-computerinternet_hoursday feature was imputed using the Severity Impairment Index (SII).
Feature Engineering
- Creating New Features: Aggregate scores derived from existing features enhance dataset informativeness.
Encoding Categorical Variables
- Categorical features were transformed into numeric values using one-hot encoding to ensure compatibility with machine learning algorithms.
Data Normalization
- Feature scaling was performed to standardize the range of independent variables, which is especially important for distance-based algorithms.
Splitting Data
- The dataset is divided into features (X) and target variable (y), further splitting into training and validation sets.

Key Functions

load_time_series(dirname): Loads time series data from specified directories.
dropMissingValueFeatures(train, test): Drops features with excessive missing values from datasets.
imputing(df): Fills in missing values using appropriate methods.

Model Training

The following models are implemented:

Decision Tree Classifier
Random Forest Classifier
XGBoost
LightGBM
CatBoost
Stacked Ensemble Techniques

Evaluation

Model performance is evaluated using the Quadratic Weighted Kappa metric, measuring agreement between predicted and actual values while accounting for chance agreements.

Results

Upon training the models, the following average scores are obtained:

Model	Average Quadratic Weighted Kappa
Decision Tree	0.5020
Random Forest	0.6656
XGBoost	0.6402
LightGBM	0.6486
CatBoost	0.6530
Stacked Ensemble Techniques (Best)	0.6624

Random Forest is choosen as the final model.

Hyperparameter Tuning

Optuna is used for hyperparameter optimization involving:

Defining an objective function for model training and evaluation.
Specifying search space for hyperparameters.
Running optimization to find best parameter values.

Stacked Ensemble Techniques

Stacked ensemble techniques involve:

Training diverse base models (e.g., Decision Trees, Random Forests).
Training a meta-model on predictions of base models to enhance accuracy.
Using cross-validation to prevent overfitting during meta-model training.
Evaluating stacked model performance against individual base models.

Conclusion

This project successfully developed a machine learning model to predict problematic internet usage levels based on physical activity and fitness data. The results demonstrate the effectiveness of the selected algorithms. Random forest after hyperparameter tuning performed the best and is selected for the final prediction.

Contributing

If you’d like to contribute to this project, please fork the repository and submit a pull request with your changes. Contributions are welcome :)

License

This project is licensed under the Apache-2.0 license. See the LICENSE file for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
LICENSE		LICENSE
README.md		README.md
problematic-internet-use-prediction.ipynb		problematic-internet-use-prediction.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Youth Problematic Internet Usage Prediction

Overview

Problem Statement

Objective

Table of Contents

Technologies

Data Sources

Installation

Usage

Data Preprocessing

Key Functions

Model Training

Evaluation

Results

Hyperparameter Tuning

Stacked Ensemble Techniques

Conclusion

Contributing

License

About

Releases

Packages

Languages

License

YashsTiwari/Youth-Problematic-Internet-Usage-Prediction

Folders and files

Latest commit

History

Repository files navigation

Youth Problematic Internet Usage Prediction

Overview

Problem Statement

Objective

Table of Contents

Technologies

Data Sources

Installation

Usage

Data Preprocessing

Key Functions

Model Training

Evaluation

Results

Hyperparameter Tuning

Stacked Ensemble Techniques

Conclusion

Contributing

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages