Samriddhi-Sen/Transcriptomic-Biomarker-Discovery


Integrated Transcriptomic Profiling & ML Biomarker Discovery for Precision Oncology

1. Project Overview

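Objective: To move beyond traditional histology and identify molecular signatures that drive breast cancer subtypes.

Data Source: High-throughput gene expression data (GEO accession GSE45827) covering the Basal, Her2, Luminal A, and Luminal B subtypes. The probe-set IDs reported in the findings, such as 1553613_s_at, indicate an Affymetrix microarray platform rather than RNA-seq.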

This project implements a hybrid Bioinformatics + Machine Learning pipeline to:

  1. Identify subtype-specific biomarkers using Differential Expression Analysis (Limma).
  2. Validate these markers using Random Forest Classification.
  3. Map dysregulated genes to biological pathways (Cell Cycle, p53 Signaling).

2. Technical Pipeline

The analysis integrates "Wet Lab" biological logic with "Dry Lab" data science:

  • Differential Expression: Used limma to fit per-gene linear models of expression across subtypes, calling a gene differentially expressed at a false discovery rate (FDR) below 0.05.
  • Enrichment Analysis: Mapped the differentially expressed genes (DEGs) to biological functions using clusterProfiler (GO & KEGG).
  • Machine Learning: Trained a Random Forest classifier and ranked genes by predictive power (MeanDecreaseAccuracy) to separate informative candidate biomarkers from noise.
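
The differential expression step can be sketched as follows. This is a minimal illustration on simulated data, not the code from analysis_pipeline.R: the expression matrix, probe names, sample names, and the two-group contrast are all hypothetical, and only standard limma calls (lmFit, makeContrasts, contrasts.fit, eBayes, topTable) are used.

```r
library(limma)

set.seed(1)

# Simulated log2 expression matrix: 200 probes x 12 samples (hypothetical data)
n_genes <- 200
subtype <- factor(rep(c("Basal", "LuminalA"), each = 6))
expr <- matrix(
  rnorm(n_genes * 12, mean = 8, sd = 0.5),
  nrow = n_genes,
  dimnames = list(paste0("probe_", seq_len(n_genes)),
                  paste0("sample_", 1:12))
)

# Spike a true shift into the first 10 probes for the Basal samples
expr[1:10, subtype == "Basal"] <- expr[1:10, subtype == "Basal"] + 3

# Design matrix without an intercept: one column per subtype
design <- model.matrix(~ 0 + subtype)
colnames(design) <- levels(subtype)

# Fit per-gene linear models, then test the Basal vs LuminalA contrast
fit <- lmFit(expr, design)
contrast <- makeContrasts(Basal - LuminalA, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Benjamini-Hochberg adjusted p-values; keep probes with FDR < 0.05
res <- topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")
degs <- res[res$adj.P.Val < 0.05, ]
head(degs)
```

With four subtypes, the same pattern extends by adding one column per subtype to the design and one contrast per pairwise comparison of interest.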

3. Key Findings

  • Distinct Signatures: PCA and Heatmap analysis revealed that the Basal subtype has a unique transcriptomic profile distinct from Luminal types.
  • Pathway Dysregulation: Discovered that Cell Cycle (hsa04110) and p53 signaling pathways are heavily altered in aggressive subtypes.
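  • Biomarker Discovery: The Random Forest model identified X1553613_s_at and EPHB3 as top predictive features. X1553613_s_at is a probe-set ID that has not been mapped to a gene symbol; the leading X is added by R to names that begin with a digit.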
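
As an illustration of how a feature ranking like the one above can be produced, here is a small sketch using the randomForest package on simulated data. The gene names, sample counts, and tree count are assumptions made for the example and are not taken from the project script.

```r
library(randomForest)

set.seed(42)

# Simulated data: 80 samples x 20 genes, two subtypes (hypothetical)
n <- 80
subtype <- factor(rep(c("Basal", "Luminal"), each = n / 2))
x <- matrix(
  rnorm(n * 20),
  nrow = n,
  dimnames = list(NULL, paste0("gene_", 1:20))
)

# Make gene_1 and gene_2 informative by shifting them in Basal samples
x[subtype == "Basal", 1:2] <- x[subtype == "Basal", 1:2] + 2

# importance = TRUE is required to compute permutation importance
rf <- randomForest(x = x, y = subtype, ntree = 500, importance = TRUE)

# type = 1 requests the permutation-based mean decrease in accuracy
imp <- importance(rf, type = 1)
ranked <- sort(imp[, 1], decreasing = TRUE)
head(ranked, 5)
```

Permutation importance measures predictive value only, so highly ranked genes are candidates for follow-up rather than confirmed drivers.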

4. Visualizations

(Note: These images are stored in the plots/ folder)

  • PCA Plot: Distinct clustering of Basal subtypes.
  • Feature Importance: Top predictive genes from Random Forest.
  • Heatmap of Top DEGs: Expression patterns of top biomarkers.
  • Volcano Plot (Luminal A vs Basal): Visualizing significance vs. magnitude of DEGs.
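
A PCA view of subtype separation can be sketched with base R alone. The data below are simulated, the subtype labels and probe names are hypothetical, and base graphics are used here for brevity even though the project lists ggplot2 for visualization.

```r
set.seed(7)

# Simulated expression matrix: 100 probes x 12 samples (hypothetical data)
expr <- matrix(
  rnorm(100 * 12),
  nrow = 100,
  dimnames = list(paste0("probe_", 1:100), paste0("sample_", 1:12))
)
subtype <- factor(rep(c("Basal", "LuminalA", "LuminalB"), each = 4))

# Give the Basal samples a distinct profile in the first 30 probes
expr[1:30, subtype == "Basal"] <- expr[1:30, subtype == "Basal"] + 3

# prcomp expects samples in rows, so transpose the probes x samples matrix
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# First two principal components, labeled by subtype
scores <- data.frame(pca$x[, 1:2], subtype = subtype)

plot(scores$PC1, scores$PC2,
     col = as.integer(scores$subtype), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
legend("topright", legend = levels(scores$subtype),
       col = seq_along(levels(scores$subtype)), pch = 19)
```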

5. Tools & Libraries

  • Language: R (v4.4.2)
  • Bioinformatics: limma, clusterProfiler, GEOquery
  • ML: randomForest, caret
  • Visualization: EnhancedVolcano, ggplot2

6. Reproducibility

  • Full analysis script available in analysis_pipeline.R.
  • Session information (library versions) available in results/session_info.txt.
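
One common way to record library versions is to capture the output of sessionInfo() at the end of a run. The sketch below assumes the results/ path named above; whether analysis_pipeline.R writes the file this way is not stated here.

```r
# Create the output folder if it does not exist yet
out_dir <- "results"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write R version, platform, and attached package versions to a text file
session_file <- file.path(out_dir, "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```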

About

R/Bioconductor pipeline for Breast Cancer biomarker discovery using Limma and Random Forest.
