This project predicts whether a breast tumor is malignant or benign using machine learning models trained on the Wisconsin Breast Cancer Diagnostic Dataset. The workflow includes data preprocessing, statistical analysis, feature selection, model training, and evaluation using metrics such as accuracy and ROC-AUC.
- Source: Kaggle – Breast Cancer Wisconsin (Diagnostic) Data Set
- Samples: 569
- Target Variable: `diagnosis` (Malignant = M, Benign = B)
### Import Libraries
### Data Loading
- Loaded the dataset.
### Data Exploration
- Explored dataset shape, data types, column names, descriptive statistics, and missing values.
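The exploration step can be sketched as below. The Kaggle CSV filename is not stated in this README, so scikit-learn's bundled copy of the same Wisconsin Diagnostic dataset stands in for `pd.read_csv(...)`:

```python
from sklearn.datasets import load_breast_cancer

# Stand-in for pd.read_csv("<kaggle file>.csv"): scikit-learn ships the
# same Wisconsin Diagnostic dataset (569 samples, 30 numeric features).
df = load_breast_cancer(as_frame=True).frame

print(df.shape)                  # (569, 31) including the target column
print(df.dtypes.value_counts())  # data types
print(df.describe().T.head())    # descriptive statistics
print(df.isnull().sum().sum())   # total missing values
```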
### Data Cleaning
- Dropped irrelevant columns (`id`, `Unnamed: 32`).
- Plotted boxplots to detect outliers across numeric variables.
- Detected and capped outliers using the IQR method.
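A minimal sketch of the IQR capping step. The 1.5 multiplier is the conventional choice, assumed here since the README does not state it; scikit-learn's copy of the dataset stands in for the cleaned Kaggle frame:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
# With the Kaggle CSV, the irrelevant columns would be dropped first:
# df = df.drop(columns=["id", "Unnamed: 32"])

def cap_outliers_iqr(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = frame.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - k * iqr, q3 + k * iqr)
    return out

numeric_cols = df.columns.drop("target")
capped = cap_outliers_iqr(df, numeric_cols)
```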
### Data Visualization
- Visualized diagnosis label distribution (Malignant vs. Benign).
- Plotted bar charts comparing the mean value of each feature by diagnosis category.
- Plotted histograms with KDE overlays to assess feature distributions grouped by diagnosis.
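A sketch of the label-distribution plot using matplotlib (the README does not name the plotting library; seaborn's `countplot` would be equivalent):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
# scikit-learn codes malignant as 0 and benign as 1
counts = df["target"].map({0: "Malignant", 1: "Benign"}).value_counts()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax, color=["steelblue", "salmon"])
ax.set_title("Diagnosis distribution")
ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("diagnosis_distribution.png")
```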
### Data Preprocessing (Encoding)
- Encoded the target variable: `'Malignant': 1, 'Benign': 0`.
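The encoding step, shown on a tiny hypothetical slice of the `diagnosis` column (the Kaggle CSV stores the labels as `M`/`B`):

```python
import pandas as pd

# Hypothetical slice of the Kaggle 'diagnosis' column (raw labels M/B)
diagnosis = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# Malignant -> 1, Benign -> 0, matching the encoding described above
encoded = diagnosis.map({"M": 1, "B": 0})
print(encoded.tolist())  # [1, 0, 0, 1]
```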
### Statistical Tests
- Conducted the Shapiro-Wilk test for normality.
- Performed the Mann-Whitney U test to assess distribution differences between the diagnosis groups.
- Computed Spearman correlations to examine the relationship between each feature and the target variable (diagnosis).
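The three tests can be sketched with `scipy.stats`. Here `mean radius` is an arbitrary example feature, and scikit-learn's copy of the dataset (which codes malignant as 0) stands in for the encoded Kaggle frame:

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
feature = "mean radius"  # example feature
malignant = df.loc[df["target"] == 0, feature]  # sklearn codes malignant as 0
benign = df.loc[df["target"] == 1, feature]

# Shapiro-Wilk: small p-value -> reject normality
w_stat, p_normal = stats.shapiro(df[feature])

# Mann-Whitney U: small p-value -> the two groups differ in distribution
u_stat, p_group = stats.mannwhitneyu(malignant, benign)

# Spearman rank correlation between the feature and the target
rho, p_rho = stats.spearmanr(df[feature], df["target"])
print(f"Shapiro p={p_normal:.2e}, Mann-Whitney p={p_group:.2e}, rho={rho:.2f}")
```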
### Feature Selection
- Selected the top 20 features based on Spearman correlation values.
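The selection can be sketched as ranking features by the absolute Spearman correlation with the target. Using the absolute value is an assumption; the README says only "based on Spearman correlation values":

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

# Spearman correlation of every feature with the target, ranked by magnitude
corr = df.corr(method="spearman")["target"].drop("target").abs()
top20 = corr.sort_values(ascending=False).head(20).index.tolist()
print(len(top20), top20[:3])
```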
### Train-Test Split
- Split the dataset into 80% training and 20% testing sets.
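The split itself, with `stratify` and a fixed `random_state` added as sensible assumptions (the README states only the 80/20 ratio):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```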
### Machine Learning Models
- Applied StandardScaler to standardize features for the SVM and Logistic Regression models.
- Trained the following models:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- Decision Tree
- XGBoost
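A sketch of the training loop. Scaling is applied inside pipelines only for SVM and Logistic Regression, matching the step above; hyperparameters are left at their defaults (an assumption), and XGBoost is omitted so the snippet runs with scikit-learn alone:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scale only where it matters: Logistic Regression and SVM
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # "XGBoost": xgboost.XGBClassifier(...)  # omitted: extra dependency
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {results[name]:.4f}")
```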
### Model Accuracy Comparison
The following models were trained and evaluated using the selected top 20 features:
| Model | Accuracy |
|---|---|
| Logistic Regression | 98.25% |
| Random Forest | 96.49% |
| Support Vector Machine | 96.49% |
| XGBoost | 95.61% |
| Decision Tree | 94.74% |
Logistic Regression achieved the highest accuracy.
### Receiver Operating Characteristic (ROC) Curve Comparison
The ROC curve evaluates classification performance across thresholds. The AUC (Area Under the Curve) provides a single score indicating model separability.
| Model | AUC Score |
|---|---|
| Logistic Regression | 1.00 |
| Random Forest | 1.00 |
| Support Vector Machine | 1.00 |
| XGBoost | 0.99 |
| Decision Tree | 0.95 |
All models demonstrated strong classification performance, with Logistic Regression, Random Forest, and SVM achieving AUC scores of 1.00.
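The AUC for a single model can be reproduced along these lines (Logistic Regression shown; the exact split and seed behind the tables above are not stated, so the number here may differ slightly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # scores for the positive class

auc = roc_auc_score(y_test, proba)
fpr, tpr, _ = roc_curve(y_test, proba)  # points for plotting the ROC curve
print(f"AUC = {auc:.3f}")
```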
Logistic Regression was selected as the final model. Although Logistic Regression, Random Forest, and SVM all attained perfect AUC scores of 1.00, Logistic Regression achieved the highest accuracy (98.25%), ahead of Random Forest and SVM (both 96.49%), XGBoost (95.61%), and Decision Tree (94.74%). Its combination of accuracy and discriminative power makes it well suited to this task.

