Objective: To move beyond traditional histology and identify the molecular signatures that drive breast cancer subtypes.

Data Source: High-throughput gene expression data (GSE45827, Affymetrix microarray) containing Basal, Her2, Luminal A, and Luminal B subtypes.

This project implements a hybrid Bioinformatics + Machine Learning pipeline to:
- Identify subtype-specific biomarkers using Differential Expression Analysis (Limma).
- Validate these markers using Random Forest Classification.
- Map dysregulated genes to biological pathways (Cell Cycle, p53 Signaling).
The analysis integrates "Wet Lab" biological logic with "Dry Lab" data science:
- Differential Expression: Used `limma` to model gene expression changes between subtypes, applying a stringent FDR threshold (< 0.05).
- Enrichment Analysis: Mapped DEGs to biological functions using `clusterProfiler` (GO & KEGG).
- Machine Learning: Implemented a Random Forest classifier to rank genes by predictive power (`MeanDecreaseAccuracy`), filtering out noise to prioritize candidate driver genes.
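
The differential expression step can be illustrated with a minimal sketch. It runs on simulated data, not on GSE45827, and the gene names, group sizes, and the `BasalVsLumA` contrast label are assumptions made for illustration; the project's actual model lives in the analysis script and may differ.

```r
# Minimal limma sketch on SIMULATED data (not the project's real pipeline).
library(limma)

set.seed(1)
n_genes <- 200
subtype <- factor(rep(c("Basal", "LuminalA"), each = 6))

# Simulated log2 expression matrix (genes x samples); names are hypothetical.
expr <- matrix(rnorm(n_genes * length(subtype)), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_along(subtype))))
# Shift the first 10 genes upward in the Basal samples.
expr[1:10, subtype == "Basal"] <- expr[1:10, subtype == "Basal"] + 3

# One design column per subtype (no intercept), then a pairwise contrast.
design <- model.matrix(~ 0 + subtype)
colnames(design) <- levels(subtype)

fit <- lmFit(expr, design)
contrast <- makeContrasts(BasalVsLumA = Basal - LuminalA, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Benjamini-Hochberg adjusted p-values; keep genes with FDR < 0.05.
res <- topTable(fit2, coef = "BasalVsLumA", number = Inf, adjust.method = "BH")
degs <- res[res$adj.P.Val < 0.05, ]
```

The `adj.P.Val` column holds the FDR-adjusted p-values, so the final filter corresponds to the < 0.05 threshold mentioned above.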
- Distinct Signatures: PCA and Heatmap analysis revealed that the Basal subtype has a unique transcriptomic profile distinct from Luminal types.
- Pathway Dysregulation: Discovered that Cell Cycle (hsa04110) and p53 signaling pathways are heavily altered in aggressive subtypes.
- Biomarker Discovery: The Random Forest model identified `X1553613_s_at` (a probe set ID; the leading `X` is most likely added by R to names that start with a digit) and EPHB3 as top predictive features.
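
Ranking features by `MeanDecreaseAccuracy` can be sketched as follows. The data are simulated, and the feature names, class sizes, and `ntree` value are assumptions chosen for illustration rather than the settings used in this project.

```r
# Minimal randomForest sketch on SIMULATED data (hypothetical feature names).
library(randomForest)

set.seed(42)
n <- 60
subtype <- factor(rep(c("Basal", "Her2", "LuminalA"), each = n / 3))

# Samples x features matrix; gene1 is made informative for the Basal class.
x <- matrix(rnorm(n * 20), nrow = n,
            dimnames = list(NULL, paste0("gene", 1:20)))
x[subtype == "Basal", "gene1"] <- x[subtype == "Basal", "gene1"] + 3

# importance = TRUE is required to obtain permutation-based importance.
rf <- randomForest(x = x, y = subtype, ntree = 500, importance = TRUE)

# Rank features by mean decrease in accuracy, highest first.
imp <- importance(rf)
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
head(ranked[, "MeanDecreaseAccuracy", drop = FALSE], 5)
```

High importance indicates predictive value for the classifier; it does not by itself establish that a gene is biologically causal.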
(Note: These images are stored in the plots/ folder)
| PCA Plot | Feature Importance |
|---|---|
| ![]() | ![]() |
| Distinct clustering of Basal Subtypes | Top predictive genes from Random Forest |

| Heatmap of Top DEGs | Volcano Plot (Luminal A vs Basal) |
|---|---|
| ![]() | ![]() |
| Expression patterns of top biomarkers | Visualizing significance vs. magnitude of DEGs |

- Language: R (v4.4.2)
- Bioinformatics: `limma`, `clusterProfiler`, `GEOquery`
- ML: `randomForest`, `caret`
- Visualization: `EnhancedVolcano`, `ggplot2`

- Full analysis script available in `analysis_pipeline.R`.
- Session information (library versions) available in `results/session_info.txt`.
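
A session information file of this kind can be produced with base R alone. The sketch below writes to a temporary directory so it runs anywhere; the `out_file` name is an assumption, and in this project the target would be the path listed above.

```r
# Capture the R version and loaded package versions into a text file.
# Written to tempdir() here; point out_file at the project path to reuse.
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), out_file)
```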



