Cardiovascular Diseases Risk Prediction

Heart disease is a major global health concern. This project uses data science and machine learning to predict cardiovascular disease risk from attributes such as exercise habits, diet, medical history, and lifestyle factors. The goal is to support earlier detection and to give actionable insight into the main risk factors.

Introduction

This project predicts cardiovascular disease risk from a dataset of health-related attributes. We analyze risk factors such as BMI, exercise habits, smoking history, and dietary patterns, train several predictive models, and examine which features drive their predictions.

Dataset

The dataset is sourced from Kaggle. It contains 308,774 rows and 19 attributes, including:

General health
Checkup frequency
Exercise habits
Medical history (diabetes, skin cancer, etc.)
BMI, height, and weight
Dietary habits

Data Preprocessing

Removed duplicates (80 rows).
Encoded categorical data (e.g., yes/no replaced with 1/0).
Applied outlier detection and removal using the IQR method.
Balanced the dataset using SMOTE for minority oversampling.
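
The first three steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the helper `preprocess`, the column names `Exercise` and `BMI`, and the conventional 1.5 x IQR fence are assumptions. The SMOTE step is not shown here.

```python
import pandas as pd


def preprocess(df, binary_cols, numeric_cols, k=1.5):
    """Drop duplicate rows, encode yes/no as 1/0, and remove IQR outliers.

    Hypothetical helper; `k` is the IQR multiplier (1.5 is the usual default).
    """
    out = df.drop_duplicates().copy()

    # Encode yes/no answers; any other value becomes NaN and needs separate handling.
    for col in binary_cols:
        out[col] = out[col].str.strip().str.lower().map({"yes": 1, "no": 0})

    # Keep rows whose numeric values all fall inside [Q1 - k*IQR, Q3 + k*IQR].
    keep = pd.Series(True, index=out.index)
    for col in numeric_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= out[col].between(q1 - k * iqr, q3 + k * iqr)

    return out[keep].reset_index(drop=True)


# Toy data with assumed column names: row 5 duplicates row 1, row 6 is a BMI outlier.
toy = pd.DataFrame({
    "Exercise": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
    "BMI": [22.0, 24.0, 26.0, 28.0, 30.0, 24.0, 95.0],
})
clean = preprocess(toy, binary_cols=["Exercise"], numeric_cols=["BMI"])
print(clean)
```

On the toy frame, the duplicate row and the BMI value of 95 are dropped, leaving five rows with `Exercise` encoded as integers.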

Project Workflow

The project follows this structured workflow:

Dataset Exploration: Univariate, bivariate, and correlation analysis.
Data Preprocessing: Cleaning, feature encoding, outlier handling, and oversampling.
Feature Engineering: Identification of the most important features for heart disease prediction.
Model Development:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Artificial Neural Network
Evaluation: Performance metrics (accuracy, AUC-ROC) and model interpretability.
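
The workflow above can be sketched end to end with scikit-learn. Several things here are assumptions made for illustration: the data is synthetic (`make_classification`), `smote_like` is a simplified hand-rolled version of the SMOTE idea rather than the `imblearn` implementation the project lists, and oversampling only the training split is a choice made for this sketch, since the README does not say where the project applies SMOTE. The neural network is left out to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def smote_like(X, y, minority=1, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustration only).

    Adds synthetic minority points by interpolating between a minority sample
    and one of its k nearest minority neighbours until the classes are balanced.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    # Querying the fitted points returns each point itself first, so skip column 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in for the real data: roughly 10% positive class.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the training split only, so the test set stays untouched.
X_bal, y_bal = smote_like(X_train, y_train)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"AUC={results[name]['auc']:.3f}")
```

The numbers printed here come from synthetic data and are not comparable to the accuracies reported in this README.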

Models and Techniques Used

Baseline Model: Logistic Regression
- Accuracy: 82%
- Advantages: Transparent, directly interpretable coefficients.
Decision Tree Classifier
- Accuracy: 88%
- Advantages: Captures non-linear relationships and exposes feature importances.
Random Forest Classifier
- Accuracy: 93%
- Advantages: Less prone to overfitting than a single tree, with robust performance.
Artificial Neural Network
- Accuracy: 83%
- Advantages: Can model complex feature interactions and latent patterns.
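
For the neural network, the README does not describe the architecture or training setup. The sketch below is a stand-in that uses scikit-learn's `MLPClassifier` instead of TensorFlow; the layer sizes, the synthetic data, and the variable names are all assumptions chosen only to show the general shape of such a model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real features and network architecture are not given.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scale inputs first: gradient-based training is sensitive to feature scale.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
ann.fit(X_train, y_train)
ann_accuracy = ann.score(X_test, y_test)
print(f"ANN test accuracy: {ann_accuracy:.3f}")
```

Wrapping the scaler and the network in one pipeline ensures the scaling statistics are learned from the training split only.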

Key Visualizations

Correlation heatmaps
Top 10 features by importance
AUC-ROC curves
Distribution plots for key variables like exercise, BMI, and smoking history
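
Two of the plots listed above, the correlation heatmap and the AUC-ROC curve, could be produced along these lines. This is a sketch on synthetic data with placeholder feature names, and it uses plain matplotlib rather than the seaborn or plotly calls the project may actually rely on.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_redundant=2, random_state=0)
cols = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
df = pd.DataFrame(X, columns=cols)

X_train, X_test, y_train, y_test = train_test_split(df, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

fig, (ax_corr, ax_roc) = plt.subplots(1, 2, figsize=(11, 4.5))

# Correlation heatmap of the feature columns.
corr = df.corr()
im = ax_corr.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_corr.set_xticks(range(len(cols)))
ax_corr.set_xticklabels(cols, rotation=90)
ax_corr.set_yticks(range(len(cols)))
ax_corr.set_yticklabels(cols)
ax_corr.set_title("Correlation heatmap")
fig.colorbar(im, ax=ax_corr)

# ROC curve with the chance diagonal for reference.
ax_roc.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.set_title("ROC curve")
ax_roc.legend()

fig.tight_layout()
```

Calling `fig.savefig("plots.png")` afterwards (the file name is arbitrary) writes both panels to disk.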

Results

Best Performing Model: Random Forest Classifier with an accuracy of 93%.
Interpretability: Decision Tree and Logistic Regression provided clear insights into key risk factors.
Key Risk Factors Identified:
- Age
- Diabetes
- Smoking history
- BMI
- Exercise habits
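
A ranking like the one above is typically read off the fitted models. The sketch below shows one way to do that with a random forest's impurity-based importances and logistic regression coefficients. The data is synthetic and the feature names are placeholders, so the output says nothing about the actual risk factors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder names on synthetic data; with shuffle=False the three
# informative columns come first and the remaining three are pure noise.
names = ["f0", "f1", "f2", "noise0", "noise1", "noise2"]
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X = StandardScaler().fit_transform(X)  # puts coefficients on a comparable scale

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

ranking = pd.DataFrame({
    "feature": names,
    "rf_importance": forest.feature_importances_,  # non-negative, sums to 1
    "logreg_coef": logreg.coef_[0],                 # sign gives the direction of the effect
}).sort_values("rf_importance", ascending=False).reset_index(drop=True)

print(ranking.head(10))  # at most the top 10 features
```

Forest importances only rank features, while the sign of a logistic regression coefficient also indicates whether a feature raises or lowers the predicted risk.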

Getting Started

Prerequisites

Python 3.7 or later
Required libraries: pandas, numpy, scikit-learn (imported as sklearn), seaborn, matplotlib, plotly, imbalanced-learn (imported as imblearn), tensorflow, sqlalchemy, pandasql

Installation

Clone the repository:
```
git clone https://github.com/yourusername/cvd-risk-prediction.git
cd cvd-risk-prediction
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the project:
```
python big_data_final_project.py
```

How to Contribute

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a feature branch:
```
git checkout -b feature-name
```
Commit your changes:
```
git commit -m "Add feature-name"
```
Push to the branch:
```
git push origin feature-name
```
Create a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
