This project predicts diabetes using several classical machine learning classification algorithms. The dataset is explored, visualized, cleaned, and then modeled.
- Source: Pima Indians Diabetes Dataset
- Features include:
  - Pregnancies
  - Glucose
  - BloodPressure
  - SkinThickness
  - Insulin
  - BMI
  - DiabetesPedigreeFunction
  - Age
- Target variable: Outcome (0: Non-diabetic, 1: Diabetic)
- Dataset overview (`info`, `describe`)
- Pairwise feature relationships (pairplot)
- Correlation heatmap
- Outlier detection using the Interquartile Range (IQR) method
- Outlier removal based on the IQR bounds
- Train-test split (75% train / 25% test)
- Feature scaling using StandardScaler
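
The preprocessing steps above (IQR-based outlier removal, a 75/25 split, and standard scaling) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper `remove_outliers_iqr`, the multiplier `k=1.5`, the `random_state` value, and the small synthetic DataFrame are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def remove_outliers_iqr(df, columns, k=1.5):
    """Drop rows with any value outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    within = ((df[columns] >= q1 - k * iqr) & (df[columns] <= q3 + k * iqr)).all(axis=1)
    return df[within]


# Synthetic stand-in for the real dataset (placeholder values, not real patient data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI": rng.normal(32, 7, 200),
    "Outcome": rng.integers(0, 2, 200),
})
df.loc[0, "Glucose"] = 900.0  # an obvious outlier, added for illustration

clean = remove_outliers_iqr(df, ["Glucose", "BMI"])
X = clean.drop(columns="Outcome")
y = clean["Outcome"]

# 75% train / 25% test, as described above; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model inputs.
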
- Logistic Regression
- Decision Tree
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Support Vector Machine (SVM)
- AdaBoost
- Gradient Boosting
- Random Forest
Model performance is evaluated using 10-fold cross-validation.
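
As a rough sketch of how the eight classifiers listed above might be compared with 10-fold cross-validation: the example below uses `make_classification` to generate placeholder data with eight features (it is not the Pima dataset), default hyperparameters, and a `Pipeline` that scales inside each fold. The pipeline is an assumption of this sketch; the project itself may scale once before cross-validating.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with 8 features, standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Scaling is refit on the training part of each fold.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {cv_scores[name].mean():.3f} (+/- {cv_scores[name].std():.3f})")
```

Each entry of `cv_scores` holds ten per-fold accuracies, which is the shape a per-model boxplot comparison would need.
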
- GridSearchCV applied to the Decision Tree classifier
- Best parameters selected using 5-fold cross-validation
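
A grid search over a Decision Tree with 5-fold cross-validation might look like the sketch below. The parameter grid is purely illustrative (the source does not state which hyperparameters were searched), and the data is again a synthetic placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with 8 features, standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: these values are assumptions, not the project's settings.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a tree refit on all of the supplied data with the selected parameters.
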
- Confusion Matrix
- Classification Report (Precision, Recall, F1-score)
- Accuracy comparison using boxplots
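
The confusion matrix and classification report listed above can be produced with `sklearn.metrics`. The following is a minimal sketch on synthetic placeholder data; Logistic Regression is used only as a stand-in model, and the class labels follow the Outcome encoding described earlier (0: Non-diabetic, 1: Diabetic).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Placeholder data with 8 features, standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred, target_names=["Non-diabetic", "Diabetic"]))
```
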
The trained model can predict diabetes for new patient data:
new_data = [[6,149,72,35,0,34.6,0.627,51]]
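
A minimal sketch of how that sample could be passed through a fitted scaler and model is shown below. The training data here is random placeholder data with arbitrary value ranges, and Logistic Regression is an assumed stand-in, since the source does not say which trained model is used for prediction. The eight values in `new_data` follow the feature order listed above (Pregnancies through Age).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder training data: random values in arbitrary ranges, NOT the Pima dataset.
rng = np.random.default_rng(0)
lows = [0, 50, 40, 0, 0, 18, 0.1, 21]
highs = [15, 200, 110, 60, 300, 50, 2.0, 80]
X_train = rng.uniform(lows, highs, size=(200, 8))
y_train = rng.integers(0, 2, size=200)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# New patient record, in the same feature order used for training.
new_data = [[6, 149, 72, 35, 0, 34.6, 0.627, 51]]

# Apply the scaler fitted on the training data before predicting.
new_scaled = scaler.transform(new_data)
prediction = model.predict(new_scaled)[0]
print("Diabetic" if prediction == 1 else "Non-diabetic")
```

The new sample must be transformed with the same scaler that was fitted on the training data; fitting a fresh scaler on a single row would zero out every feature.
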