bioscan-ml/BarcodeMAE

BarcodeMAE

A PyTorch implementation of BarcodeMAE, a model for enhancing DNA foundation models to address masking inefficiencies.

Check out our paper, "Enhancing DNA Foundation Models to Address Masking Inefficiencies" (arXiv:2502.18405).

The model checkpoint is available here: BarcodeMAE

Quick start

Use this Jupyter notebook for a quick start: Quick start

Setup

  1. Clone this repository
  2. Install the required libraries
pip install -r requirements.txt
pip install -e .
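
After the two install commands above, a quick sanity check can confirm that the environment is usable. The following is a minimal sketch, not part of the repository; it assumes the editable install exposes a package named `barcodebert` (inferred from the script paths used later in this README) and that `torch` is among the requirements.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "barcodebert" is an assumed package name, inferred from the script paths.
    for name in ("torch", "barcodebert"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either package is reported as missing, re-running the install commands inside the active environment is the first thing to try.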

Preparing the data

  1. Download the metadata file and copy it into the data folder
  2. Split the metadata file into smaller files, one per partition defined in the BIOSCAN-5M paper
cd data/
python data_split.py BIOSCAN-5M_Dataset_metadata.tsv
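
To illustrate what the splitting step does conceptually, here is a minimal sketch that writes one TSV per partition. It is not the repository's `data_split.py`; the function name `split_metadata` and the column name `split` are assumptions made for illustration, so check the metadata header before relying on them.

```python
import csv
from contextlib import ExitStack
from pathlib import Path


def split_metadata(tsv_path, out_dir, column="split"):
    """Stream a TSV and write one file per distinct value of `column`.

    Returns a dict mapping each partition name to its row count.
    The default column name "split" is an assumption about the metadata.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}
    counts = {}
    with ExitStack() as stack:
        source = stack.enter_context(open(tsv_path, newline="", encoding="utf-8"))
        reader = csv.DictReader(source, delimiter="\t")
        for row in reader:
            partition = row[column]
            if partition not in writers:
                # Open a new output file the first time a partition is seen.
                target = stack.enter_context(
                    open(out_dir / f"{partition}.tsv", "w", newline="", encoding="utf-8")
                )
                writer = csv.DictWriter(
                    target, fieldnames=reader.fieldnames, delimiter="\t"
                )
                writer.writeheader()
                writers[partition] = writer
                counts[partition] = 0
            writers[partition].writerow(row)
            counts[partition] += 1
    return counts
```

Rows are streamed rather than loaded at once, which keeps memory use flat for a metadata file with millions of records.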

Reproducing the results

  1. Download the checkpoint and copy it to the model_checkpoints directory
  2. Run KNN evaluation
python barcodebert/knn_probing.py \
  --run-name knn_evaluation \
  --data-dir ./data/ \
  --pretrained-checkpoint "./model_checkpoints/best_pretraining.pt" \
  --log-wandb \
  --dataset BIOSCAN-5M
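
For readers unfamiliar with kNN probing, the idea is to classify each test embedding by the labels of its nearest training embeddings, without any fine-tuning. The sketch below shows that idea in plain Python. It is not the code in `knn_probing.py`; the use of cosine similarity, the default `k=1`, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train_embeddings, train_labels, query, k=1):
    """Return the majority label among the k training points most similar to query."""
    ranked = sorted(
        zip(train_embeddings, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    # On a tied vote, the label that appears first (nearest neighbour) wins.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_embeddings, train_labels, test_embeddings, test_labels, k=1):
    """Fraction of test points whose kNN prediction matches the true label."""
    correct = sum(
        knn_predict(train_embeddings, train_labels, query, k) == label
        for query, label in zip(test_embeddings, test_labels)
    )
    return correct / len(test_labels)
```

In practice the embeddings would come from the pretrained encoder and the labels from a taxonomic rank in the metadata; a vectorized library implementation is preferable at dataset scale.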

Pretraining from scratch

  1. Run pretraining
python barcodebert/pretraining.py \
  --dataset=BIOSCAN-5M \
  --k_mer=6 \
  --n_layers=6 \
  --n_heads=6 \
  --decoder-n-layers=6 \
  --decoder-n-heads=6 \
  --data_dir=data/ \
  --checkpoint=model_checkpoints/BIOSCAN-5M/6-6-6/model_checkpoint.pt
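
The `--k_mer=6` flag above sets the length of the k-mer tokens fed to the model. As a rough illustration of k-mer tokenization followed by random masking, consider the sketch below. It is not taken from `pretraining.py`: non-overlapping k-mers, padding with `N`, the 50% mask ratio, the `[MASK]` token string, and both function names are assumptions made for this example.

```python
import random


def kmer_tokenize(sequence, k=6, pad_char="N"):
    """Split a DNA string into non-overlapping k-mers, right-padding the last one."""
    sequence = sequence.upper()
    remainder = len(sequence) % k
    if remainder:
        sequence += pad_char * (k - remainder)
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


def mask_tokens(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=None):
    """Replace a random subset of tokens with mask_token.

    Returns the masked token list and the sorted masked positions.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for position in positions:
        masked[position] = mask_token
    return masked, positions


if __name__ == "__main__":
    tokens = kmer_tokenize("acgtacgtacgtacgtacgtacgt", k=6)
    masked, positions = mask_tokens(tokens, mask_ratio=0.5, seed=0)
    print(tokens)
    print(masked, positions)
```

During pretraining the model is asked to recover the original k-mers at the masked positions; the actual masking strategy and ratio used by BarcodeMAE are described in the paper.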

Citation

If you find BarcodeMAE useful in your research, please consider citing:

@article{safari2025barcodemae,
  title={Enhancing DNA Foundation Models to Address Masking Inefficiencies},
  author={Monireh Safari
    and Pablo Millan Arias
    and Scott C. Lowe
    and Lila Kari
    and Angel X. Chang
    and Graham W. Taylor
  },
  journal={arXiv preprint arXiv:2502.18405},
  year={2025},
  eprint={2502.18405},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arXiv.2502.18405},
}
