This project predicts whether a breast tumor is malignant or benign using machine learning models trained on the Wisconsin Breast Cancer Diagnostic Dataset. The workflow includes data preprocessing, statistical analysis, feature selection, model training, and evaluation using metrics such as accuracy and ROC-AUC.
- Source: Kaggle – Breast Cancer Wisconsin (Diagnostic) Data Set
- Samples: 569
- Target Variable: `diagnosis` (Malignant = M, Benign = B)
### Import Libraries
### Data Loading
- Loaded the dataset.
### Data Exploration
- Explored dataset shape, data types, column names, descriptive statistics, and missing values.
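The exploration step can be sketched as below. The Kaggle CSV filename is not stated in this README, so scikit-learn's bundled copy of the same Wisconsin Diagnostic dataset stands in for `pd.read_csv(...)`:

```python
from sklearn.datasets import load_breast_cancer

# Stand-in for pd.read_csv("<kaggle file>.csv"): scikit-learn ships the
# same Wisconsin Diagnostic dataset (569 samples, 30 numeric features).
df = load_breast_cancer(as_frame=True).frame

print(df.shape)                  # (569, 31) including the target column
print(df.dtypes.value_counts())  # data types
print(df.describe().T.head())    # descriptive statistics
print(df.isnull().sum().sum())   # total missing values
```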
### Data Cleaning
- Dropped irrelevant columns (`id`, `Unnamed: 32`).
- Plotted boxplots to detect outliers across numeric variables.
- Detected and capped outliers using the IQR method.
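A minimal sketch of the IQR capping step. The 1.5 multiplier is the conventional choice, assumed here since the README does not state it; scikit-learn's copy of the dataset stands in for the cleaned Kaggle frame:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
# With the Kaggle CSV, the irrelevant columns would be dropped first:
# df = df.drop(columns=["id", "Unnamed: 32"])

def cap_outliers_iqr(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = frame.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - k * iqr, q3 + k * iqr)
    return out

numeric_cols = df.columns.drop("target")
capped = cap_outliers_iqr(df, numeric_cols)
```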
### Data Visualization
- Visualized diagnosis label distribution (Malignant vs. Benign).
- Plotted bar charts comparing the mean value of each feature by diagnosis category.
- Plotted histograms with KDE overlays to assess feature distributions grouped by diagnosis.
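A sketch of the label-distribution plot using matplotlib (the README does not name the plotting library; seaborn's `countplot` would be equivalent):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
# scikit-learn codes malignant as 0 and benign as 1
counts = df["target"].map({0: "Malignant", 1: "Benign"}).value_counts()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax, color=["steelblue", "salmon"])
ax.set_title("Diagnosis distribution")
ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("diagnosis_distribution.png")
```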
### Data Preprocessing (Encoding)
- Encoded the target variable: `'Malignant': 1, 'Benign': 0`.
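The encoding step, shown on a tiny hypothetical slice of the `diagnosis` column (the Kaggle CSV stores the labels as `M`/`B`):

```python
import pandas as pd

# Hypothetical slice of the Kaggle 'diagnosis' column (raw labels M/B)
diagnosis = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# Malignant -> 1, Benign -> 0, matching the encoding described above
encoded = diagnosis.map({"M": 1, "B": 0})
print(encoded.tolist())  # [1, 0, 0, 1]
```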
### Statistical Tests
- Conducted the Shapiro-Wilk test for normality.
- Performed the Mann-Whitney U test to assess distribution differences between the diagnosis groups.
- Computed Spearman correlations to examine the relationship between each feature and the target variable (diagnosis).
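The three tests can be sketched with `scipy.stats`. Here `mean radius` is an arbitrary example feature, and scikit-learn's copy of the dataset (which codes malignant as 0) stands in for the encoded Kaggle frame:

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
feature = "mean radius"  # example feature
malignant = df.loc[df["target"] == 0, feature]  # sklearn codes malignant as 0
benign = df.loc[df["target"] == 1, feature]

# Shapiro-Wilk: small p-value -> reject normality
w_stat, p_normal = stats.shapiro(df[feature])

# Mann-Whitney U: small p-value -> the two groups differ in distribution
u_stat, p_group = stats.mannwhitneyu(malignant, benign)

# Spearman rank correlation between the feature and the target
rho, p_rho = stats.spearmanr(df[feature], df["target"])
print(f"Shapiro p={p_normal:.2e}, Mann-Whitney p={p_group:.2e}, rho={rho:.2f}")
```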
### Feature Selection
- Selected the top 20 features based on Spearman correlation values.
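The selection can be sketched as ranking features by the absolute Spearman correlation with the target. Using the absolute value is an assumption; the README says only "based on Spearman correlation values":

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

# Spearman correlation of every feature with the target, ranked by magnitude
corr = df.corr(method="spearman")["target"].drop("target").abs()
top20 = corr.sort_values(ascending=False).head(20).index.tolist()
print(len(top20), top20[:3])
```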
### Train-Test Split
- Split the dataset into 80% training and 20% testing sets.
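The split itself, with `stratify` and a fixed `random_state` added as sensible assumptions (the README states only the 80/20 ratio):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```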
### Machine Learning Models
- Applied StandardScaler to standardize features for the SVM and Logistic Regression models.
- Trained the following models:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- Decision Tree
- XGBoost
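A sketch of the training loop. Scaling is applied inside pipelines only for SVM and Logistic Regression, matching the step above; hyperparameters are left at their defaults (an assumption), and XGBoost is omitted so the snippet runs with scikit-learn alone:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scale only where it matters: Logistic Regression and SVM
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # "XGBoost": xgboost.XGBClassifier(...)  # omitted: extra dependency
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {results[name]:.4f}")
```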
### Model Accuracy Comparison
The following models were trained and evaluated using the selected top 20 features:
| Model | Accuracy |
|---|---|
| Logistic Regression | 98.25% |
| Random Forest | 96.49% |
| Support Vector Machine | 96.49% |
| XGBoost | 95.61% |
| Decision Tree | 94.74% |
Logistic Regression achieved the highest accuracy.
### Receiver Operating Characteristic (ROC) Curve Comparison
The ROC curve evaluates classification performance across thresholds. The AUC (Area Under the Curve) provides a single score indicating model separability.
| Model | AUC Score |
|---|---|
| Logistic Regression | 1.00 |
| Random Forest | 1.00 |
| Support Vector Machine | 1.00 |
| XGBoost | 0.99 |
| Decision Tree | 0.95 |
All models demonstrated strong classification performance, with Logistic Regression, Random Forest, and SVM achieving AUC scores of 1.00.
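The AUC for a single model can be reproduced along these lines (Logistic Regression shown; the exact split and seed behind the tables above are not stated, so the number here may differ slightly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # scores for the positive class

auc = roc_auc_score(y_test, proba)
fpr, tpr, _ = roc_curve(y_test, proba)  # points for plotting the ROC curve
print(f"AUC = {auc:.3f}")
```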
Logistic Regression was selected as the final model. Although Logistic Regression, Random Forest, and SVM all attained perfect AUC scores of 1.00, Logistic Regression achieved the highest accuracy (98.25%), ahead of Random Forest and SVM (both 96.49%), XGBoost (95.61%), and Decision Tree (94.74%). Its combination of accuracy and discriminative power makes it well suited to this task.

