minotikedare/Breast-Cancer-Prediction-Using-Machine-Learning

Breast-Cancer-Prediction-Using-Machine-Learning

This project predicts whether a breast tumor is malignant or benign using machine learning models trained on the Breast Cancer Wisconsin (Diagnostic) dataset. The workflow covers data preprocessing, statistical analysis, feature selection, model training, and evaluation with accuracy and ROC-AUC.


Dataset Overview


Project Workflow

  1. Import Libraries

  2. Data Loading

    • Loaded the dataset.
  3. Data Exploration

    • Explored dataset shape, data types, column names, descriptive statistics, and missing values.
  4. Data Cleaning

    • Dropped irrelevant columns (id, Unnamed: 32).
    • Plotted boxplots to detect outliers across numeric variables.
    • Detected and capped outliers using the IQR method.
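The cleaning step above can be sketched as follows. This is a minimal illustration rather than the project's own code: the helper name `cap_outliers_iqr`, the conventional 1.5 × IQR multiplier, and the tiny made-up frame are all assumptions.

```python
import pandas as pd


def cap_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip every numeric column to [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = df.copy()
    numeric_cols = capped.select_dtypes(include="number").columns
    q1 = capped[numeric_cols].quantile(0.25)
    q3 = capped[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    # Per-column bounds; axis=1 aligns the bound Series with the column labels.
    capped[numeric_cols] = capped[numeric_cols].clip(
        lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1
    )
    return capped


# Made-up values for illustration only (not taken from the dataset).
demo = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "radius_mean": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0],
    "Unnamed: 32": [None] * 6,
})

# Drop the columns the workflow treats as irrelevant, then cap outliers.
demo = demo.drop(columns=["id", "Unnamed: 32"], errors="ignore")
capped = cap_outliers_iqr(demo)
print(capped)
```

Capping (rather than dropping) keeps every row, which matters for a dataset of only a few hundred samples; the extreme value is pulled back to the upper fence instead of being discarded.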
  5. Data Visualization

    • Visualized diagnosis label distribution (Malignant vs. Benign).
    • Plotted a bar chart comparing the mean value of each feature by diagnosis category.
    • Plotted histograms with KDE overlays to assess feature distributions grouped by diagnosis.
  6. Data Preprocessing (Encoding)

    • Encoded target variable: 'Malignant': 1, 'Benign': 0.
  7. Statistical Tests

    • Conducted Shapiro-Wilk test for normality.
    • Performed Mann-Whitney U test to assess distribution differences between diagnosis groups.
    • Computed Spearman correlations to examine relationships between each feature and the target variable (diagnosis).
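As a rough sketch of how these three tests can be run with SciPy, the snippet below uses synthetic data for a single feature. The group sizes, means, and spreads are invented for illustration and do not come from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one feature, split by diagnosis (assumed values).
rng = np.random.default_rng(0)
benign = rng.normal(loc=12.0, scale=1.5, size=60)
malignant = rng.normal(loc=17.0, scale=3.0, size=40)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed.
shapiro_stat, shapiro_p = stats.shapiro(benign)

# Mann-Whitney U: non-parametric test for a difference between the two groups.
u_stat, u_p = stats.mannwhitneyu(malignant, benign, alternative="two-sided")

# Spearman: rank correlation between the feature and the 0/1 target.
feature = np.concatenate([benign, malignant])
target = np.concatenate([np.zeros(benign.size), np.ones(malignant.size)])
rho, rho_p = stats.spearmanr(feature, target)

print(f"Shapiro-Wilk p={shapiro_p:.3f}")
print(f"Mann-Whitney U p={u_p:.3g}")
print(f"Spearman rho={rho:.2f}")
```

Mann-Whitney U and Spearman are both rank-based, so they remain valid choices when the Shapiro-Wilk test indicates that a feature is not normally distributed.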
  8. Feature Selection

    • Selected the top 20 features based on Spearman correlation values.
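One way to rank features by Spearman correlation is sketched below. The README does not say whether the ranking used signed or absolute values; this sketch assumes absolute values. The helper name `top_k_by_spearman`, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def top_k_by_spearman(df: pd.DataFrame, target: str, k: int) -> list[str]:
    """Return the k feature names with the largest |Spearman rho| against target."""
    corr = df.corr(method="spearman")[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Toy frame with made-up values; the project would use k=20 on the real features.
demo = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "feature_a": [10, 11, 12, 15, 16, 17],  # rises with the target
    "feature_b": [5, 1, 4, 2, 6, 3],        # nearly unrelated to the target
    "feature_c": [4, 6, 2, 5, 3, 1],        # weakly falls as the target rises
})

selected = top_k_by_spearman(demo, target="diagnosis", k=2)
print(selected)
```

Ranking on the absolute value keeps features that are strongly negatively associated with the diagnosis, which a signed ranking would push to the bottom.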
  9. Train-Test Split

    • Split the dataset into 80% training and 20% testing sets.
  10. Machine Learning Models

    • Applied StandardScaler to standardize features (zero mean, unit variance) for the SVM and Logistic Regression models.
    • Trained the following models:
      • Logistic Regression
      • Random Forest
      • Support Vector Machine (SVM)
      • Decision Tree
      • XGBoost
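The split and training steps might look roughly like the sketch below. It makes several assumptions that differ from the project: it loads the copy of the dataset bundled with scikit-learn instead of the project's CSV, uses all 30 features rather than the selected top 20, picks `random_state=42` and a stratified split arbitrarily, keeps default hyperparameters, and omits XGBoost because it needs a separate package. Scores from this sketch should not be expected to match the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Note: scikit-learn encodes this dataset as 0 = malignant, 1 = benign,
# the reverse of the encoding described in this README.
X, y = load_breast_cancer(return_X_y=True)

# 80% training / 20% testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so the scaler is fit on training data only.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out 20%
    print(f"{name}: {scores[name]:.2%}")
```

Tree-based models are left unscaled because their splits depend only on the ordering of feature values, while SVM and Logistic Regression are sensitive to feature scale.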
  11. Model Accuracy Comparison

The following models were trained and evaluated using the selected top 20 features:

Model | Accuracy
--- | ---
Logistic Regression | 98.25%
Random Forest | 96.49%
Support Vector Machine | 96.49%
XGBoost | 95.61%
Decision Tree | 94.74%

Logistic Regression achieved the highest accuracy.

[Figure: Model Accuracy Comparison]

  12. Receiver Operating Characteristic (ROC) Curve Comparison

The ROC curve plots the true positive rate against the false positive rate across all classification thresholds. The AUC (Area Under the Curve) condenses the curve into a single score indicating how well the model separates the two classes.

Model | AUC Score
--- | ---
Logistic Regression | 1.00
Random Forest | 1.00
Support Vector Machine | 1.00
XGBoost | 0.99
Decision Tree | 0.95

All models demonstrated strong classification performance, with Logistic Regression, Random Forest, and SVM achieving AUC scores of 1.00.
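To make the AUC score concrete, the sketch below computes it directly from its probabilistic meaning: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at_threshold(y_true, scores, threshold=0.5):
    """Accuracy when scores at or above the threshold are labeled positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy example: one positive is ranked below one negative.
y_toy = [0, 0, 1, 1]
auc_toy = auc_from_scores(y_toy, [0.1, 0.4, 0.35, 0.8])

# Perfect ranking, yet one positive falls below the 0.5 decision threshold.
perfect_scores = [0.1, 0.2, 0.3, 0.9]
auc_perfect = auc_from_scores(y_toy, perfect_scores)
acc_perfect = accuracy_at_threshold(y_toy, perfect_scores)

print(auc_toy, auc_perfect, acc_perfect)
```

The second case shows why a model can report an AUC of 1.00 while its accuracy stays below 100%: AUC measures ranking quality across all thresholds, whereas accuracy depends on one fixed threshold.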

[Figure: ROC Curve Comparison]

Logistic Regression was selected as the final model. Logistic Regression, Random Forest, and Support Vector Machine (SVM) all reported AUC scores of 1.00, so AUC alone does not separate them. Logistic Regression had the highest accuracy (98.25%), ahead of Random Forest and SVM (both 96.49%), XGBoost (95.61%), and Decision Tree (94.74%).

About

Developed machine learning models in Python to predict whether breast tumors are malignant or benign using the Breast Cancer Wisconsin (Diagnostic) dataset.
