A PyTorch implementation of BarcodeMAE, a model for enhancing DNA foundation models to address masking inefficiencies.
Check out our paper (arXiv:2502.18405).
Model checkpoint is available here: BarcodeMAE
Use this Jupyter notebook for a quick start: Quick start
- Clone this repository
- Install the required libraries
pip install -r requirements.txt
pip install -e .
- Download the metadata file and copy it into the data folder
- Split the metadata file into smaller files according to the different partitions as presented in the BIOSCAN-5M paper
cd data/
python data_split.py BIOSCAN-5M_Dataset_metadata.tsv
- Download the checkpoint and copy it to the model_checkpoints directory
- Run KNN evaluation
python barcodebert/knn_probing.py \
--run-name knn_evaluation \
--data-dir ./data/ \
--pretrained-checkpoint "./model_checkpoints/best_pretraining.pt" \
--log-wandb \
--dataset BIOSCAN-5M
- Run pretraining
python barcodebert/pretraining.py \
--dataset=BIOSCAN-5M \
--k_mer=6 \
--n_layers=6 \
--n_heads=6 \
--decoder-n-layers=6 \
--decoder-n-heads=6 \
--data_dir=data/ \
--checkpoint=model_checkpoints/BIOSCAN-5M/6-6-6/model_checkpoint.pt
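The `--k_mer=6` option in the pretraining command sets the length of the DNA tokens. As a rough illustration only, the sketch below splits a sequence into non-overlapping 6-mers and pads the tail with `N`. The function name `kmer_tokenize`, the non-overlapping stride, and the padding character are assumptions made for this example; they are not taken from the repository, whose tokenizer may behave differently.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6, pad_char: str = "N") -> List[str]:
    """Split a DNA sequence into non-overlapping k-mers (hypothetical helper).

    Assumption: a stride equal to k and right-padding with `pad_char`
    so that the final token also has length k.
    """
    sequence = sequence.upper()
    remainder = len(sequence) % k
    if remainder:
        # Pad the tail so the sequence length is a multiple of k.
        sequence += pad_char * (k - remainder)
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# A 14-base sequence yields three tokens; the last one is padded.
tokens = kmer_tokenize("ACGTACGTACGTAC", k=6)
print(tokens)  # ['ACGTAC', 'GTACGT', 'ACNNNN']
```

With `k=6`, each token covers six bases, so the token sequence is about one sixth the length of the barcode itself.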
If you find BarcodeMAE useful in your research, please consider citing:
@article{safari2025barcodemae,
title={Enhancing DNA Foundation Models to Address Masking Inefficiencies},
author={Monireh Safari
and Pablo Millan Arias
and Scott C. Lowe
and Lila Kari
and Angel X. Chang
and Graham W. Taylor
},
journal={arXiv preprint arXiv:2502.18405},
year={2025},
eprint={2502.18405},
archivePrefix={arXiv},
primaryClass={cs.LG},
doi={10.48550/arXiv.2502.18405},
}