Multi-task deep learning framework for biosequence analysis, pathogenicity prediction, and protein/nucleotide feature extraction.
Research Focus: Utilize pre-trained language models (Nucleotide Transformer, ESM-2) to build models for predicting biological properties of genetic variants.
This project focuses on 3 main tasks:
| Task | Description | Data | Model |
|---|---|---|---|
| Task 1: Splicing Prediction | Predict splicing site type (donor/acceptor) | Sequence ~200bp | NT embeddings |
| Task 2: Protein Prediction | Predict protein properties from sequence | Protein sequence | ESM-2 embeddings |
| Task 3: Variant Pathogenicity | Classify variants (pathogenic/benign) | ClinVar + DNA/Protein seq | Multi-modal Fusion Model |
Bio_sequence_Research_AITALAB/
│
├── data_processing/                          # Data preprocessing & preparation
│   ├── dataset1_ClinVar_preprocess_variant_summary.ipynb
│   ├── dataset2_map_csq_hgvsc_aDun.ipynb     # Map CSQ & HGVS-C
│   ├── dataset2_map_ref_alt_sequence_dna.ipynb
│   ├── dataset2_map_ref_alt_sequence_protein.ipynb
│   └── dataset3_sequence_gencode.ipynb       # Extract sequences from GENCODE
│
├── tools/                                    # Supporting tools
│   ├── gnomAD_map_vep.ipynb                  # Map VEP annotations
│   └── test_parse_hgvsc_offset.ipynb         # Parse HGVS-C format
│
├── train/                                    # Training pipelines
│   │
│   ├── task1_splicing_prediction/            # Splicing site prediction
│   │   ├── data_preparation/
│   │   │   ├── data_prepare.ipynb            # Data preparation
│   │   │   ├── train_test_split.py
│   │   │   ├── ratio_split.py
│   │   │   └── extract_embed.py
│   │   └── training/
│   │       ├── main.ipynb                    # Training notebook
│   │       ├── model.py                      # LSTM model
│   │       ├── dataset.py                    # PyTorch Dataset
│   │       ├── train_set.py
│   │       ├── train_full.py
│   │       ├── metrics.py
│   │       ├── cm_visualize.py
│   │       └── fileio.py
│   │
│   ├── task2_protein_prediction/             # Protein property prediction
│   │
│   └── task3_variant_prediction/             # Variant pathogenicity prediction (MAIN)
│       ├── config.py                         # Configuration
│       ├── split_data.py                     # Split by chromosome
│       ├── precompute_embeddings.py          # Extract NT + ESM-2 embeddings
│       ├── dataset.py                        # PyTorch Dataset
│       ├── model.py                          # Multi-modal Fusion model
│       ├── train.py                          # Training with tracking
│       ├── main.ipynb                        # Full pipeline
│       ├── README.md                         # Task-specific guide
│       ├── data/                             # Train/Val/Test splits
│       ├── embeddings/                       # Precomputed embeddings
│       ├── experiments/                      # Experiment configs & results
│       └── runs/                             # TensorBoard logs
│
└── README.md                                 # This file
# Python 3.9+
# CUDA 11.8+ (recommended for GPU support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets numpy pandas scikit-learn matplotlib seaborn tensorboard jupyter
pip install pyarrow biopython pysam # Bioinformatics tools
Create a .env file or set environment variables:
# Task 3 data path
TASK3_PARQUET=<path_to>/variant_protein_sequence_101aa.parquet
# Hugging Face token (if needed)
HUGGING_FACE_HUB_TOKEN=<your_token>
Goal: Prepare data from different sources (ClinVar, GENCODE, VEP) into a standard format
| Notebook | Purpose |
|---|---|
| dataset1_ClinVar_preprocess_variant_summary.ipynb | Filter & preprocess ClinVar variants |
| dataset2_map_csq_hgvsc_aDun.ipynb | Map CSQ → HGVS-C nomenclature |
| dataset2_map_ref_alt_sequence_dna.ipynb | Extract DNA sequences (ref & alt) |
| dataset2_map_ref_alt_sequence_protein.ipynb | Extract protein sequences |
| dataset3_sequence_gencode.ipynb | Get sequences from GENCODE reference |
Output: Parquet files with variant + sequences (DNA 601bp, Protein 101aa)
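The 601 bp and 101 aa lengths are consistent with a window of 300 bases (or 50 residues) on each side of the variant position, although the notebooks may build their windows differently. A minimal sketch of that windowing, assuming a 0-based position in an in-memory reference string and a single-base substitution (`extract_window` and `apply_substitution` are hypothetical helpers, not functions from this repo):

```python
def extract_window(seq: str, pos: int, flank: int, pad: str = "N") -> str:
    """Return 2*flank + 1 characters centered on 0-based `pos`, padded at the edges."""
    if not 0 <= pos < len(seq):
        raise ValueError("pos is outside the sequence")
    left = max(0, pos - flank)
    right = min(len(seq), pos + flank + 1)
    window = seq[left:right]
    # Pad so the variant always sits at index `flank` of the window.
    left_pad = pad * (flank - (pos - left))
    right_pad = pad * (flank - (right - pos - 1))
    return left_pad + window + right_pad


def apply_substitution(seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace the single base `ref` at `pos` with `alt`, checking the reference allele."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


reference = "ACGT" * 200   # toy 800 bp reference, not real genome data
pos = 402                  # 0-based variant position; reference[402] is "G"
ref_window = extract_window(reference, pos, flank=300)
alt_window = extract_window(apply_substitution(reference, pos, "G", "A"), pos, flank=300)
```

With `flank=50` and a protein string, the same helper yields a 101-residue window; a pad character such as `"X"` would be the usual choice there.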
Goal: Predict splicing site type from DNA sequence
Pipeline:
Raw Data (.csv) → Train/Test Split → Val Split → Model Training → Metrics
Run:
cd train/task1_splicing_prediction/data_preparation/
jupyter notebook data_prepare.ipynb
cd ../training/
jupyter notebook main.ipynb
Goal: Predict protein properties/functions (TODO/In Progress)
Goal: Classify genetic variants as Pathogenic or Benign
Model Architecture:
Input: DNA & Protein sequences from variants
↓
[DNA Seq] → Nucleotide Transformer (NT) → E_dna_ref, E_dna_alt
[Prot Seq] → ESM-2 → E_prot_ref, E_prot_alt
↓
Fusion Layer: [E_ref, E_alt, E_alt - E_ref]
↓
Concat DNA + Protein embeddings
↓
MLP Classifier → Pathogenic (1) / Benign (0)
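The diagram can be sketched in PyTorch as follows. This is an illustration under assumptions, not the contents of model.py: the class name `FusionSketch`, the per-modality projection layers, and the four-tensor `forward` signature are hypothetical. The projection size (512), hidden sizes ([512, 256]), and dropout (0.2) mirror PROJ_DIM, FUSION_HIDDEN, and DROPOUT from config.py.

```python
import torch
from torch import nn


class FusionSketch(nn.Module):
    """Hypothetical fusion classifier: [ref, alt, alt - ref] per modality, then an MLP."""

    def __init__(self, dna_emb_dim=1024, prot_emb_dim=1024,
                 proj_dim=512, hidden=(512, 256), dropout=0.2):
        super().__init__()
        # Each modality sees ref, alt and their difference, hence 3 * emb_dim inputs.
        self.dna_proj = nn.Sequential(nn.Linear(3 * dna_emb_dim, proj_dim), nn.ReLU())
        self.prot_proj = nn.Sequential(nn.Linear(3 * prot_emb_dim, proj_dim), nn.ReLU())
        layers, in_dim = [], 2 * proj_dim  # concatenated DNA + protein features
        for width in hidden:
            layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = width
        layers.append(nn.Linear(in_dim, 1))  # one logit: pathogenic vs. benign
        self.mlp = nn.Sequential(*layers)

    @staticmethod
    def _fuse(ref, alt):
        # Fusion layer from the diagram: [E_ref, E_alt, E_alt - E_ref]
        return torch.cat([ref, alt, alt - ref], dim=-1)

    def forward(self, dna_ref, dna_alt, prot_ref, prot_alt):
        dna = self.dna_proj(self._fuse(dna_ref, dna_alt))
        prot = self.prot_proj(self._fuse(prot_ref, prot_alt))
        return self.mlp(torch.cat([dna, prot], dim=-1)).squeeze(-1)


model = FusionSketch().eval()
batch = [torch.randn(4, 1024) for _ in range(4)]  # random stand-ins for embeddings
with torch.no_grad():
    probs = torch.sigmoid(model(*batch))  # one probability per variant
```

Feeding the difference `alt - ref` alongside both embeddings gives the classifier a direct view of what the variant changed, which is otherwise a small signal inside two nearly identical vectors.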
Pipeline:
cd train/task3_variant_prediction/
# 1. Split data by chromosome (chr20/21 → test, rest → train/val)
python split_data.py
# 2. Precompute embeddings (NT + ESM-2)
python precompute_embeddings.py
# 3. Run training with experiment tracking
python train.py
# Or run full pipeline from notebook
jupyter notebook main.ipynb
Key Features:
- Multi-modal fusion (DNA + Protein)
- Automatic experiment tracking (config, results, checkpoints)
- TensorBoard logging
- Best model selection
- Train/Val/Test splits
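Best model selection combined with a patience setting (PATIENCE in config.py) is commonly implemented as early stopping on a validation metric. The sketch below shows that pattern in plain Python. `EarlyStopper` is a hypothetical helper, and monitoring validation loss is an assumption, since train.py may track a different metric.

```python
class EarlyStopper:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0  # improvement: a checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
losses = [0.70, 0.55, 0.56, 0.57, 0.40]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

In this toy run the loop stops after two epochs without improvement, and `stopper.best_epoch` identifies the checkpoint to keep.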
View Results:
# TensorBoard
tensorboard --logdir=runs/
# Results JSON
cat experiments/experiment_*/results.json
| Model | Purpose | Source | Input Size |
|---|---|---|---|
| Nucleotide Transformer (NT) | DNA embedding extraction | InstaDeepAI/nucleotide-transformer-500m-human-ref | 601bp |
| ESM-2 | Protein embedding extraction | facebook/esm2_t33_650M_UR50D | 101aa |
| Custom MLP Classifier | Pathogenicity prediction | Fusion model | 1024 (512*2) |
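Transformer encoders such as NT and ESM-2 return one vector per token, so a pooling step is needed to obtain one fixed-size embedding per sequence. How precompute_embeddings.py pools is not stated here; mean pooling over non-padding tokens is a common choice and is assumed in this sketch (`masked_mean_pool` is a hypothetical helper).

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    hidden:         (batch, tokens, dim) last-layer hidden states
    attention_mask: (batch, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)                    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts


# Toy input: one sequence with two real tokens and one padding token.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
```

The padded token's large values do not affect `pooled`, which is the point of applying the mask before averaging.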
- Source: ClinVar variants + mapped sequences
- Splits:
  - Train: All variants except chr20/21
  - Val: 15% of training (stratified)
  - Test: chr20, chr21
- Labels: Pathogenic (1), Benign (0)
- Sequence Length: DNA 601bp, Protein 101aa
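The split described above can be sketched in plain Python. This is an illustration, not the code in split_data.py: `split_variants` is a hypothetical function, rows are assumed to be dicts with `chrom` and `label` keys, and the constants mirror TEST_CHROMS, VAL_RATIO, and SEED from config.py.

```python
import random
from collections import defaultdict

TEST_CHROMS = {"chr20", "chr21"}
VAL_RATIO = 0.15
SEED = 42


def split_variants(rows, test_chroms=TEST_CHROMS, val_ratio=VAL_RATIO, seed=SEED):
    """Split dict rows with `chrom` and `label` keys into train/val/test lists."""
    test = [r for r in rows if r["chrom"] in test_chroms]
    rest = [r for r in rows if r["chrom"] not in test_chroms]

    # Stratify: sample the validation fraction separately within each label.
    by_label = defaultdict(list)
    for r in rest:
        by_label[r["label"]].append(r)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * val_ratio)
        val += group[:n_val]
        train += group[n_val:]
    return train, val, test


# Toy data: 22 autosomes x 20 variants with alternating labels (not real variants).
rows = [
    {"variant_id": f"v{c}_{i}", "chrom": f"chr{c}", "label": i % 2}
    for c in range(1, 23) for i in range(20)
]
train, val, test = split_variants(rows)
```

Holding out whole chromosomes keeps variants from the same gene out of both training and test data, while the per-label sampling keeps the Pathogenic/Benign ratio the same in the train and validation sets.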
Main Config File: train/task3_variant_prediction/config.py
# Hyperparameters
LR = 1e-3
EPOCHS = 30
BATCH_SIZE = 128
DROPOUT = 0.2
PATIENCE = 5
# Embeddings
PROJ_DIM = 512
FUSION_HIDDEN = [512, 256]
# Paths
TEST_CHROMS = {"chr20", "chr21"}
VAL_RATIO = 0.15
SEED = 42
Each training run saves:
- args.json: Command-line arguments
- config.json: Configuration parameters
- config.py: Copy of config file
- results.json: Final metrics (accuracy, precision, recall, F1)
- tensorboard/: TensorBoard events
experiments/
├── experiment_1/
│   ├── args.json
│   ├── config.json
│   ├── results.json
│   └── tensorboard/
└── experiment_N/
    └── ...
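One way to produce a layout like this is to pick the next free experiment index and write the JSON files into it. The sketch below assumes that approach; `next_experiment_dir` and `save_run` are hypothetical helpers, and train.py may number and populate its directories differently.

```python
import json
import tempfile
from pathlib import Path


def next_experiment_dir(root: Path) -> Path:
    """Create and return root/experiment_<N>, one past the highest existing index."""
    root.mkdir(parents=True, exist_ok=True)
    indices = [
        int(p.name.split("_")[1])
        for p in root.glob("experiment_*")
        if p.is_dir() and p.name.split("_")[1].isdigit()
    ]
    run_dir = root / f"experiment_{max(indices, default=0) + 1}"
    run_dir.mkdir()
    return run_dir


def save_run(root: Path, config: dict, results: dict) -> Path:
    """Write config.json and results.json into a fresh experiment directory."""
    run_dir = next_experiment_dir(root)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "experiments"
    config = {"LR": 1e-3, "EPOCHS": 30, "BATCH_SIZE": 128}
    first = save_run(root, config, {"f1": None})   # placeholder, not a real result
    second = save_run(root, config, {"f1": None})
    names = sorted(p.name for p in root.iterdir())
    saved_config = json.loads((first / "config.json").read_text())
```

Storing the full configuration next to the metrics makes each run reproducible from its directory alone.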
# List all experiments
ls train/task3_variant_prediction/experiments/
# View results
cat train/task3_variant_prediction/experiments/experiment_4/results.json
import torch
from train.task3_variant_prediction.model import FusionClassifier
from train.task3_variant_prediction.dataset import VariantDataset
# Load trained model
model = FusionClassifier(dna_emb_dim=1024, prot_emb_dim=1024)
model.load_state_dict(torch.load('best_fusion_model.pt', map_location='cpu'))
model.eval()  # disable dropout for inference
# Make predictions (dna_embedding and prot_embedding are precomputed embedding tensors)
with torch.no_grad():
    logits = model(dna_embedding, prot_embedding)
    predictions = torch.sigmoid(logits)  # probability of the Pathogenic class
- Add preprocessing script to data_processing/
- Output parquet format: [variant_id, sequence_dna, sequence_protein, label, chrom]
- Update the data path in config.py
- Run training pipeline
- ClinVar: https://www.ncbi.nlm.nih.gov/clinvar/
- Nucleotide Transformer: https://github.com/instadeepai/nucleotide-transformer
- ESM-2: https://github.com/facebookresearch/esm
- VEP: https://www.ensembl.org/info/docs/tools/vep/
To add features or fix bugs:
- Create feature branch: git checkout -b feature/your-feature
- Commit changes: git commit -m "Add your feature"
- Push: git push origin feature/your-feature
- Create pull request
- All embeddings are precomputed from pre-trained models (not fine-tuned)
- Test set is fixed as chr20/21 for benchmarking
- Experiment tracking is automatic - no manual logging needed
- The train/val split is stratified, so both sets keep the same Pathogenic/Benign ratio
- Lab: AiTA Lab, FPTU
- Project: Biosequence Research & Variant Prediction
- Date: January 2026
Last Updated: 2026-01-07