Modeling eukaryotic genomes using machine learning to identify functional elements like enhancers, introns, and promoter strength.
This project leverages deep learning to parse and predict various genomic features within eukaryotic DNA sequences. By training on established genomic datasets, these models provide high-accuracy mapping of regulatory and structural elements.
Robust scripts for converting raw FASTA or nucleotide sequences into machine-learning-ready formats using one-hot encoding or k-mer embeddings.
Implementation of specialized neural networks (such as CNNs or Transformers) tailored for long-range sequence analysis.
Dedicated modules for:
- Enhancer Regions: Identifying distal regulatory elements
- Intron/Exon Boundaries: Predicting splice sites and gene structure
- Promoter Strength: Quantifying potential transcriptional activity
To set up the environment and install necessary dependencies, run the following commands in your terminal:
# Clone the repository
git clone https://github.com/helloftroy/eukaryote_neural_net.git
# Navigate to the project directory
cd eukaryote_neural_net
# Install requirements
pip install -r requirements.txtPreprocess your raw genomic data into the required tensor format:
python src/preprocess.py --input data/raw_genome.fastaTrain the model using a configuration file:
python src/train.py --config configs/model_v1.yamlRun predictions on a specific sequence:
python src/predict.py --sequence "ATGCGAT..."- Target Organisms: Eukaryotic genomes (adaptable for model organisms like S. cerevisiae or Arabidopsis)
- Framework: PyTorch
- Evaluation Metrics: Precision-Recall curves, F1-scores, and Mean Squared Error (MSE) for promoter quantification
This project is a work in progress. Future updates may include support for non-model organisms and expanded regulatory element detection.