
36JungKwan/Bio_sequence_Research_AITALAB


🧬 Bioinformatics Sequence Research - AiTA Lab

Multi-task deep learning framework for biosequence analysis, pathogenicity prediction, and protein/nucleotide feature extraction.

Research Focus: Utilize pre-trained language models (Nucleotide Transformer, ESM-2) to build models for predicting biological properties of genetic variants.


📋 Project Overview

This project focuses on three main tasks:

| Task | Description | Data | Model |
|---|---|---|---|
| Task 1: Splicing Prediction | Predict splice site type (donor/acceptor) | Sequence (~200 bp) | NT embeddings |
| Task 2: Protein Prediction | Predict protein properties from sequence | Protein sequence | ESM-2 embeddings |
| Task 3: Variant Pathogenicity | Classify variants (pathogenic/benign) | ClinVar + DNA/protein sequences | Multi-modal Fusion Model |

📁 Directory Structure

Bio_sequence_Research_AITALAB/
│
├── data_processing/                          # 📊 Data preprocessing & preparation
│   ├── dataset1_ClinVar_preprocess_variant_summary.ipynb
│   ├── dataset2_map_csq_hgvsc_aDun.ipynb     # Map CSQ & HGVS-C
│   ├── dataset2_map_ref_alt_sequence_dna.ipynb
│   ├── dataset2_map_ref_alt_sequence_protein.ipynb
│   └── dataset3_sequence_gencode.ipynb       # Extract sequences from GENCODE
│
├── tools/                                     # 🛠️ Supporting tools
│   ├── gnomAD_map_vep.ipynb                  # Map VEP annotations
│   └── test_parse_hgvsc_offset.ipynb         # Parse HGVS-C format
│
├── train/                                     # 🎯 Training pipelines
│   │
│   ├── task1_splicing_prediction/            # Splice site prediction
│   │   ├── data_preparation/
│   │   │   ├── data_prepare.ipynb            # Data preparation
│   │   │   ├── train_test_split.py
│   │   │   ├── ratio_split.py
│   │   │   └── extract_embed.py
│   │   └── training/
│   │       ├── main.ipynb                    # Training notebook
│   │       ├── model.py                      # LSTM model
│   │       ├── dataset.py                    # PyTorch Dataset
│   │       ├── train_set.py
│   │       ├── train_full.py
│   │       ├── metrics.py
│   │       ├── cm_visualize.py
│   │       └── fileio.py
│   │
│   ├── task2_protein_prediction/             # Protein property prediction
│   │
│   └── task3_variant_prediction/             # ⭐ Variant pathogenicity prediction (MAIN)
│       ├── config.py                         # Configuration
│       ├── split_data.py                     # Split by chromosome
│       ├── precompute_embeddings.py          # Extract NT + ESM-2 embeddings
│       ├── dataset.py                        # PyTorch Dataset
│       ├── model.py                          # Multi-modal Fusion model
│       ├── train.py                          # Training with tracking
│       ├── main.ipynb                        # Full pipeline
│       ├── README.md                         # Task-specific guide
│       ├── data/                             # Train/Val/Test splits
│       ├── embeddings/                       # Precomputed embeddings
│       ├── experiments/                      # Experiment configs & results
│       └── runs/                             # TensorBoard logs
│
└── README.md                                  # This file

🚀 Quick Start

Prerequisites

# Python 3.9+
# CUDA 11.8+ (recommended for GPU support)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets numpy pandas scikit-learn matplotlib seaborn tensorboard jupyter
pip install pyarrow biopython pysam  # Bioinformatics tools

Environment Configuration

Create a .env file or set environment variables:

# Task 3 data path
TASK3_PARQUET=<path_to>/variant_protein_sequence_101aa.parquet

# Hugging Face token (if needed)
HUGGING_FACE_HUB_TOKEN=<your_token>
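
As a minimal sketch of how a script might pick up this setting, the helper below reads `TASK3_PARQUET` from the environment and fails early when it is missing. The function name `resolve_task3_parquet` is hypothetical and not part of the repository. Python does not read a `.env` file on its own, so the variable is assumed to be exported in the shell or loaded by a separate tool beforehand.

```python
import os
from pathlib import Path


def resolve_task3_parquet(env=None):
    """Return the Task 3 parquet path from TASK3_PARQUET, or raise if unset.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get("TASK3_PARQUET", "").strip()
    if not raw:
        raise RuntimeError("TASK3_PARQUET is not set; export it before running Task 3")
    return Path(raw).expanduser()


if __name__ == "__main__":
    try:
        print(resolve_task3_parquet())
    except RuntimeError as err:
        print(f"Configuration error: {err}")
```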

📊 Project Pipeline

1️⃣ Data Processing (data_processing/)

Goal: Convert data from different sources (ClinVar, GENCODE, VEP) into a standard format

| Notebook | Purpose |
|---|---|
| dataset1_ClinVar_preprocess_variant_summary.ipynb | Filter & preprocess ClinVar variants |
| dataset2_map_csq_hgvsc_aDun.ipynb | Map CSQ → HGVS-C nomenclature |
| dataset2_map_ref_alt_sequence_dna.ipynb | Extract DNA sequences (ref & alt) |
| dataset2_map_ref_alt_sequence_protein.ipynb | Extract protein sequences |
| dataset3_sequence_gencode.ipynb | Get sequences from the GENCODE reference |

Output: Parquet files with variant + sequences (DNA 601bp, Protein 101aa)
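
The 601 bp and 101 aa lengths are both odd, which fits a window of equal flanks around one central position (300 + 1 + 300 and 50 + 1 + 50). The sketch below illustrates that idea for a single-nucleotide variant. It assumes the window is centred on the variant and padded with `N` at sequence edges; the notebooks may do this differently. The names `centered_window` and `apply_snv` are hypothetical.

```python
def centered_window(sequence, center, flank):
    """Return the 2*flank+1 window around 0-based index `center`, padded with 'N'."""
    start, end = center - flank, center + flank + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


def apply_snv(window, flank, ref, alt):
    """Substitute a single-nucleotide variant at the window centre."""
    if window[flank] != ref:
        raise ValueError(f"reference mismatch: expected {ref}, found {window[flank]}")
    return window[:flank] + alt + window[flank + 1:]


# Toy reference sequence, for illustration only.
toy_reference = "ACGT" * 500
ref_window = centered_window(toy_reference, center=1000, flank=300)
alt_window = apply_snv(ref_window, flank=300, ref="A", alt="T")
print(len(ref_window), len(alt_window))
```

The reference check in `apply_snv` catches off-by-one coordinate errors early, which is a common failure mode when mixing 0-based and 1-based positions.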


2️⃣ Task 1: Splicing Prediction (train/task1_splicing_prediction/)

Goal: Predict splice site type (donor/acceptor) from DNA sequence

Pipeline:

Raw Data (.csv) → Train/Test Split → Val Split → Model Training → Metrics

Run:

cd train/task1_splicing_prediction/data_preparation/
jupyter notebook data_prepare.ipynb

cd ../training/
jupyter notebook main.ipynb

3️⃣ Task 2: Protein Prediction (train/task2_protein_prediction/)

Goal: Predict protein properties/functions (TODO/In Progress)


4️⃣ Task 3: Variant Pathogenicity Prediction ⭐ (train/task3_variant_prediction/)

Goal: Classify genetic variants as Pathogenic or Benign

Model Architecture:

Input: DNA & Protein sequences from variants
       ↓
[DNA Seq] → Nucleotide Transformer (NT) → E_dna_ref, E_dna_alt
[Prot Seq] → ESM-2 → E_prot_ref, E_prot_alt
       ↓
Fusion Layer: [E_ref, E_alt, E_alt - E_ref]
       ↓
Concat DNA + Protein embeddings
       ↓
MLP Classifier → Pathogenic (1) / Benign (0)
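
The fusion step can be illustrated in plain Python with small lists standing in for embedding vectors. This is a sketch of the feature construction only, not the repository's `model.py`: the function names are hypothetical, and any learned projection (for example to `PROJ_DIM`) before concatenation is left out.

```python
def fuse(e_ref, e_alt):
    """Build [E_ref, E_alt, E_alt - E_ref] for one modality as a flat list."""
    if len(e_ref) != len(e_alt):
        raise ValueError("reference and alternate embeddings must have equal length")
    delta = [a - r for r, a in zip(e_ref, e_alt)]
    return list(e_ref) + list(e_alt) + delta


def fusion_input(dna_ref, dna_alt, prot_ref, prot_alt):
    """Concatenate the fused DNA features with the fused protein features."""
    return fuse(dna_ref, dna_alt) + fuse(prot_ref, prot_alt)


# Toy 2-dimensional embeddings, for illustration only.
features = fusion_input([1.0, 2.0], [1.5, 2.0], [0.0, 1.0], [0.0, 3.0])
print(len(features))
```

Including the difference `E_alt - E_ref` gives the classifier a direct view of the change the variant introduces, in addition to both absolute representations.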

Pipeline:

cd train/task3_variant_prediction/

# 1. Split data by chromosome (chr20/21 → test, rest → train/val)
python split_data.py

# 2. Precompute embeddings (NT + ESM-2)
python precompute_embeddings.py

# 3. Run training with experiment tracking
python train.py

# Or run full pipeline from notebook
jupyter notebook main.ipynb

Key Features:

  • ✅ Multi-modal fusion (DNA + Protein)
  • ✅ Automatic experiment tracking (config, results, checkpoints)
  • ✅ TensorBoard logging
  • ✅ Best model selection
  • ✅ Train/Val/Test splits

View Results:

# TensorBoard
tensorboard --logdir=runs/

# Results JSON
cat experiments/experiment_*/results.json

🧠 Models & Pre-trained Embeddings

| Model | Purpose | Source | Input Size |
|---|---|---|---|
| Nucleotide Transformer (NT) | DNA embedding extraction | InstaDeepAI/nucleotide-transformer-500m-human-ref | 601bp |
| ESM-2 | Protein embedding extraction | facebook/esm2_t33_650M_UR50D | 101aa |
| Custom MLP Classifier | Pathogenicity prediction | Fusion model | 1024 (512*2) |

📈 Data Statistics

Task 3 (Variant Prediction)

  • Source: ClinVar variants + mapped sequences
  • Splits:
    • Train: All variants except chr20/21
    • Val: 15% of training (stratified)
    • Test: chr20, chr21
  • Labels: Pathogenic (1), Benign (0)
  • Sequence Length: DNA 601bp, Protein 101aa
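
The split described above can be sketched without any dependencies. This is an illustration under assumptions, not the contents of `split_data.py`: rows are taken to be dicts with `chrom` and `label` keys, and the stratification is done by shuffling each label group separately with a fixed seed.

```python
import random
from collections import defaultdict

TEST_CHROMS = {"chr20", "chr21"}
VAL_RATIO = 0.15
SEED = 42


def split_variants(rows):
    """Split rows (dicts with 'chrom' and 'label') into train, val and test lists."""
    test = [r for r in rows if r["chrom"] in TEST_CHROMS]
    by_label = defaultdict(list)
    for r in rows:
        if r["chrom"] not in TEST_CHROMS:
            by_label[r["label"]].append(r)

    rng = random.Random(SEED)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * VAL_RATIO)  # same ratio within every class
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val, test


# Toy rows spread over chr1..chr22 with alternating labels.
toy_rows = [{"chrom": f"chr{1 + i % 22}", "label": i % 2} for i in range(440)]
train, val, test = split_variants(toy_rows)
print(len(train), len(val), len(test))
```

Holding out whole chromosomes keeps nearby variants from the same genes out of both training and test data, which a purely random split would not guarantee.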

🔧 Configuration

Main Config File: train/task3_variant_prediction/config.py

# Hyperparameters
LR = 1e-3
EPOCHS = 30
BATCH_SIZE = 128
DROPOUT = 0.2
PATIENCE = 5

# Embeddings
PROJ_DIM = 512
FUSION_HIDDEN = [512, 256]

# Data split
TEST_CHROMS = {"chr20", "chr21"}
VAL_RATIO = 0.15
SEED = 42
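
`PATIENCE` is typically used for early stopping. The sketch below shows one common form of that logic, assuming validation loss is the monitored quantity; the README does not say which metric `train.py` monitors, and the `EarlyStopping` class here is hypothetical.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy loss curve: improvement stalls after the second epoch.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
print(decisions)
```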

📊 Results & Monitoring

Experiment Tracking

Each training run saves:

  • args.json: Command-line arguments
  • config.json: Configuration parameters
  • config.py: Copy of config file
  • results.json: Final metrics (accuracy, precision, recall, F1)
  • tensorboard/: TensorBoard events

experiments/
├── experiment_1/
│   ├── args.json
│   ├── config.json
│   ├── results.json
│   └── tensorboard/
└── experiment_N/
    └── ...
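
A numbered run directory like the one above could be created as follows. This is a sketch under assumptions rather than the tracking code in `train.py`: the helper names are hypothetical, and only `config.json` and `results.json` are written.

```python
import json
import tempfile
from pathlib import Path


def next_experiment_dir(root):
    """Create and return <root>/experiment_<N+1>, where N is the highest existing index."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    used = [
        int(p.name.split("_", 1)[1])
        for p in root.glob("experiment_*")
        if p.is_dir() and p.name.split("_", 1)[1].isdigit()
    ]
    run_dir = root / f"experiment_{max(used, default=0) + 1}"
    run_dir.mkdir()
    return run_dir


def save_run(root, config, results):
    """Write config.json and results.json into a fresh experiment directory."""
    run_dir = next_experiment_dir(root)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))
    return run_dir


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    first = save_run(tmp, {"LR": 1e-3, "EPOCHS": 30}, {})
    second = save_run(tmp, {"LR": 1e-3, "EPOCHS": 30}, {})
    created = [first.name, second.name]
print(created)
```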

View Results

# List all experiments
ls train/task3_variant_prediction/experiments/

# View results
cat train/task3_variant_prediction/experiments/experiment_4/results.json
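
To compare runs without opening each file, the results can be collected in a few lines of Python. The sketch assumes every `results.json` is a flat JSON object and that the F1 score is stored under the key `f1`; that key name is a guess and may differ in the actual files.

```python
import json
import tempfile
from pathlib import Path


def collect_results(root, metric="f1"):
    """Return (experiment name, metric value) pairs, best first."""
    rows = []
    for path in Path(root).glob("experiment_*/results.json"):
        data = json.loads(path.read_text())
        if metric in data:
            rows.append((path.parent.name, data[metric]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Demonstration with toy values only; these are not real results.
with tempfile.TemporaryDirectory() as tmp:
    for name, score in [("experiment_1", 0.25), ("experiment_2", 0.75)]:
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        (run_dir / "results.json").write_text(json.dumps({"f1": score}))
    ranking = collect_results(tmp)
print(ranking)
```

In the repository, `root` would be `train/task3_variant_prediction/experiments`.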

💡 Usage Examples

Inference (New Variants)

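import torch
from train.task3_variant_prediction.model import FusionClassifier
from train.task3_variant_prediction.dataset import VariantDataset

# Load the trained model; the embedding sizes must match the precomputed embeddings
model = FusionClassifier(dna_emb_dim=1024, prot_emb_dim=1024)
model.load_state_dict(torch.load('best_fusion_model.pt', map_location='cpu'))
model.eval()  # disable dropout for inference

# dna_embedding and prot_embedding are precomputed tensors for the new variants
with torch.no_grad():
    logits = model(dna_embedding, prot_embedding)
predictions = torch.sigmoid(logits)  # probability of the Pathogenic class (1)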
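
The sigmoid output is a probability, so a threshold is still needed to obtain Pathogenic (1) / Benign (0) labels. The dependency-free sketch below shows that step on plain floats. The 0.5 threshold is an assumption and not a value taken from the repository; in practice it may be tuned on the validation set.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_labels(logits, threshold=0.5):
    """Map logits to 1 (Pathogenic) or 0 (Benign) using a probability threshold."""
    return [1 if sigmoid(x) >= threshold else 0 for x in logits]


# Toy logits, for illustration only.
labels = to_labels([-2.0, 0.0, 3.0])
print(labels)
```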

Add New Dataset

  1. Add preprocessing script to data_processing/
  2. Output parquet format: [variant_id, sequence_dna, sequence_protein, label, chrom]
  3. Update config.py path
  4. Run training pipeline
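
Before running the pipeline on a new dataset, it can help to check that the expected columns are present. The sketch below works on any iterable of column names, for example the `columns` attribute of a DataFrame loaded from the parquet file. The helper name `missing_columns` is hypothetical.

```python
REQUIRED_COLUMNS = ("variant_id", "sequence_dna", "sequence_protein", "label", "chrom")


def missing_columns(columns):
    """Return the required column names that are absent, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Toy column list that lacks the chromosome column.
problems = missing_columns(["variant_id", "sequence_dna", "sequence_protein", "label"])
if problems:
    print(f"Missing columns: {', '.join(problems)}")
```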


🤝 Contributing

To add features or fix bugs:

  1. Create feature branch: git checkout -b feature/your-feature
  2. Commit changes: git commit -m "Add your feature"
  3. Push: git push origin feature/your-feature
  4. Create pull request

📝 Notes

  • All embeddings are precomputed from pre-trained models (not fine-tuned)
  • Test set is fixed as chr20/21 for benchmarking
  • Experiment tracking is automatic - no manual logging needed
  • Stratified train/val split is used to balance classes

📞 Contact & Support

  • Lab: AiTA Lab, FPTU
  • Project: Biosequence Research & Variant Prediction
  • Date: January 2026

Last Updated: 2026-01-07
