Heart disease is a major global health concern. This project uses data science and machine learning to predict cardiovascular disease risk from attributes such as exercise habits, diet, medical history, and lifestyle factors. The goal is to support earlier detection and to surface actionable insights into the main risk factors.
- Introduction
- Dataset
- Project Workflow
- Models and Techniques Used
- Results
- Getting Started
- How to Contribute
- License
This project explores the prediction of cardiovascular disease risk using a dataset of health-related attributes. By analyzing risk factors such as BMI, exercise habits, smoking history, and dietary patterns, we aim to build predictive models that are interpretable enough to yield actionable insights.
The dataset is sourced from Kaggle. It contains 308,774 rows and 19 attributes, including:
- General health
- Checkup frequency
- Exercise habits
- Medical history (diabetes, skin cancer, etc.)
- BMI, height, and weight
- Dietary habits
The following preprocessing steps were applied:
- Removed duplicates (80 rows).
- Encoded categorical data (e.g., yes/no replaced with 1/0).
- Applied outlier detection and removal using the IQR method.
- Balanced the dataset using SMOTE for minority oversampling.
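The first three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `preprocess` helper and the column names `Exercise` and `BMI` are hypothetical, the sample values are made up, and the 1.5 × IQR fence is the conventional choice, assumed here because the source does not state the multiplier. SMOTE is left out; it would typically be applied afterwards and only to the training split, for example with `imblearn.over_sampling.SMOTE`.

```python
import pandas as pd


def preprocess(df, yes_no_cols, numeric_cols):
    """Deduplicate, encode yes/no columns as 1/0, and drop IQR outliers."""
    # 1. Remove exact duplicate rows.
    out = df.drop_duplicates().copy()

    # 2. Encode the given yes/no columns as 1/0.
    for col in yes_no_cols:
        out[col] = out[col].map({"Yes": 1, "No": 0}).astype(int)

    # 3. Keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every numeric column.
    keep = pd.Series(True, index=out.index)
    for col in numeric_cols:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        keep &= out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].reset_index(drop=True)


# Made-up sample: one duplicated row and one implausible BMI value.
raw = pd.DataFrame({
    "Exercise": ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"],
    "BMI": [22.0, 27.5, 24.1, 24.1, 30.2, 95.0, 25.3],
})
clean = preprocess(raw, yes_no_cols=["Exercise"], numeric_cols=["BMI"])
print(clean)
```

Computing the outlier mask across all numeric columns before filtering keeps the quartiles based on the same set of rows, rather than letting each column's filter shift the next column's statistics.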
The project follows this structured workflow:
- Dataset Exploration: Univariate, bivariate, and correlation analysis.
- Data Preprocessing: Cleaning, feature encoding, outlier handling, and oversampling.
- Feature Engineering: Identified important features for heart disease prediction.
- Model Development:
  - Logistic Regression
  - Decision Tree Classifier
  - Random Forest Classifier
  - Artificial Neural Network
- Evaluation: Performance metrics (accuracy, AUC-ROC) and model interpretability.
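The model development and evaluation steps of this workflow might look something like the sketch below. It assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for the preprocessed dataset, and picks hyperparameters arbitrarily, so the numbers it prints are unrelated to the project's reported results. The neural network is omitted to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (not the Kaggle dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are illustrative assumptions.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC-ROC needs scores for the positive class, not hard labels.
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc_roc": roc_auc_score(y_test, proba),
    }

for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, AUC-ROC={s['auc_roc']:.3f}")
```

Holding out a stratified test split before fitting keeps the class ratio comparable between the training and evaluation sets, which matters when the positive class is rare.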
- Baseline Model: Logistic Regression
  - Accuracy: 82%
  - Key Insights: Transparent and interpretable coefficients.
- Decision Tree Classifier
  - Accuracy: 88%
  - Advantages: Captures non-linear relationships and exposes feature importance.
- Random Forest Classifier
  - Accuracy: 93%
  - Advantages: Reduced overfitting and robust performance.
- Artificial Neural Network
  - Accuracy: 83%
  - Advantages: Handles complex relationships and latent patterns.
- Correlation heatmaps
- Top 10 features by importance
- AUC-ROC curves
- Distribution plots for key variables like exercise, BMI, and smoking history
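As one illustration of the items above, a top-10 feature ranking could be derived from a fitted random forest along these lines. This is a sketch under assumptions: the data is synthetic, the `feature_N` names are placeholders, and the source does not say which model or importance measure produced the project's ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (not the project's features).
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=6, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds impurity-based importances, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

With matplotlib installed, `top10.sort_values().plot.barh()` would turn the ranking into a horizontal bar chart.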
- Best Performing Model: Random Forest Classifier with an accuracy of 93%.
- Interpretability: Decision Tree and Logistic Regression provided clear insights into key risk factors.
- Key Risk Factors Identified:
  - Age
  - Diabetes
  - Smoking history
  - BMI
  - Exercise habits
- Python 3.7 or later
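- Required libraries: pandas, numpy, sklearn, seaborn, matplotlib, plotly, imblearn, tensorflow, sqlalchemy, pandasql (sklearn and imblearn are import names; on PyPI the packages are scikit-learn and imbalanced-learn)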
- Clone the repository:
git clone https://github.com/yourusername/cvd-risk-prediction.git
cd cvd-risk-prediction
- Install dependencies:
pip install -r requirements.txt
- Run the project:
python big_data_final_project.py
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add feature-name"
- Push to the branch:
git push origin feature-name
- Create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.