
Imageomics/biocap



This repository contains the code for BioCAP training, evaluation, caption generation, and the Wikipedia scraper. It builds on BioCLIP and OpenCLIP. BioCAP is trained on the TreeOfLife-10M dataset paired with the new TreeOfLife-10M Captions dataset curated for this model. The BioCAP website is hosted from the gh-pages branch of this repository.

BioCAP is a CLIP model trained on this 10M-image dataset with both taxonomic labels and fine-grained synthetic captions. It achieves strong performance on biology-related tasks, including zero-shot classification and text-image retrieval.

Table of Contents

  1. Model
  2. Commands
  3. Paper, Website, and Data
  4. Citation
  5. License

Model

The main differences between the BioCAP and BioCLIP training implementations are the model architecture and the introduction of captions. First, BioCAP uses two separate visual projectors, one for the taxonomic-label objective and one for the caption objective; this part of the code is in transformer.py. Second, we incorporate synthetic captions as complementary supervision: taxonomic labels alone say little about what an organism looks like, and synthetic captions bridge this gap by providing descriptive, trait-focused supervision. This part of the code is in data.py and train.py. We provide the weights of BioCAP in the BioCAP model repo.
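
For intuition, below is a minimal PyTorch sketch of the dual-projector idea, assuming one projection head per text objective; the class and attribute names are illustrative, not the actual implementation in transformer.py.

import torch.nn as nn
import torch.nn.functional as F

class DualProjectorVisionTower(nn.Module):
    # Illustrative only: a shared vision backbone with two projection heads,
    # one aligned with taxonomic-label text and one with caption text.
    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        self.label_proj = nn.Linear(feat_dim, embed_dim)
        self.caption_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, images):
        feats = self.backbone(images)
        z_label = F.normalize(self.label_proj(feats), dim=-1)
        z_caption = F.normalize(self.caption_proj(feats), dim=-1)
        return z_label, z_caption

Each head can then be paired with its own contrastive loss against the corresponding text embeddings, so the two supervision signals do not have to share a single embedding space.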

Commands

Clone this repository, then install the requirements:

conda env create -f biocap_requirements.yml
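
Then activate the environment (the environment name here is an assumption; check the name field in biocap_requirements.yml):

conda activate biocap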

For more details on the training and evaluation processes and downloading the requisite data, please see the BioCAP Pipeline. A summary for training and evaluating on the different tasks is provided below.

Training

To reproduce the model training, first download the data from TreeOfLife-10M and TreeOfLife-10M Captions.
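
If you prefer to fetch the datasets programmatically, a snapshot download via huggingface_hub along these lines should work; the repo IDs and local paths below are assumptions, so substitute the actual dataset names from the links above.

from huggingface_hub import snapshot_download

# Assumed repo IDs and target directories; adjust to match the real dataset names.
snapshot_download(repo_id="imageomics/TreeOfLife-10M", repo_type="dataset", local_dir="data/TreeOfLife-10M")
snapshot_download(repo_id="imageomics/TreeOfLife-10M-Captions", repo_type="dataset", local_dir="data/TreeOfLife-10M-Captions")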

To train the model, run:

sbatch slurm/train.sh

Evaluation

Species classification

We evaluated BioCAP on zero-shot classification using the same test datasets as BioCLIP 2. The metadata used in zero-shot classification evaluation is provided in data/classification_annotation. Please be sure to update the directories in slurm/eval_zero_shot.sh to reflect the locations of these data and metadata, then run:

sbatch slurm/eval_zero_shot.sh
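
Conceptually, zero-shot classification embeds each image and one text prompt per candidate species, then picks the species whose prompt is most similar to the image. A minimal sketch with the open_clip API follows; the hub ID and prompt template are placeholders, and slurm/eval_zero_shot.sh handles the real configuration.

import torch
import open_clip
from PIL import Image

# Placeholder hub ID; see the BioCAP model repo for the released weights.
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/biocap")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/biocap")

species = ["Danaus plexippus", "Apis mellifera"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
texts = tokenizer([f"a photo of {name}." for name in species])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)  # one score per species

print(dict(zip(species, probs[0].tolist())))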

Image Re-ranking (Query)

For this task, we evaluated on INQUIRE-Rerank, which assesses a model’s ability to reorder 100 initially retrieved images per query so that relevant ones appear higher in the ranking.

This evaluation can be performed by running:

sbatch slurm/eval_inquire.sh
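
The re-ranking itself reduces to sorting each query's candidate images by image-text similarity. A sketch, assuming the embeddings have already been computed and L2-normalized:

import torch

def rerank(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    # text_emb: (d,) normalized query embedding; image_embs: (100, d) normalized
    # candidate embeddings. Returns candidate indices from most to least similar.
    scores = image_embs @ text_emb
    return torch.argsort(scores, descending=True)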

Text-to-Image Retrieval Benchmarks

We also evaluate our model on text-to-image retrieval using datasets collected from the Cornell Lab of Ornithology's Macaulay Library and PlantID.net. The metadata used is provided in data/retrieval_annotations. Please be sure to update the directories in slurm/eval_retrieval.sh to reflect the locations of these data and metadata, then run:

sbatch slurm/eval_retrieval.sh
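
Retrieval performance on such benchmarks is typically summarized as Recall@K over the caption-to-image similarity matrix. A sketch, assuming query i's ground-truth image is index i:

import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim: (num_queries, num_images) text-to-image similarity matrix.
    topk = sim.topk(k, dim=1).indices                 # top-k image indices per query
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth index per query
    return (topk == targets).any(dim=1).float().mean().item()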

Caption generation

We use vLLM with InternVL-3-38B to generate fine-grained captions for images. The caption generation process enriches species images with detailed descriptions of visual traits and characteristics. The domain-specific contexts, including Wikipedia-derived visual information and taxon-tailored format examples, are obtained from the TreeOfLife-10M dataset and should be downloaded and placed under data/wiki_and_format_example/ before running the generation scripts.

To generate captions, configure the paths in slurm/run_caption_gen.sh, then run:

sbatch slurm/run_caption_gen.sh
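
For orientation, here is a minimal sketch of the generation loop using vLLM's offline multimodal API; the model ID, prompt template, and sampling settings are assumptions, and the actual prompting (with Wikipedia-derived context and format examples) lives in the repository's generation scripts.

from vllm import LLM, SamplingParams
from PIL import Image

# Assumed model ID and parallelism; the real configuration is in slurm/run_caption_gen.sh.
llm = LLM(model="OpenGVLab/InternVL3-38B", trust_remote_code=True, tensor_parallel_size=4)
params = SamplingParams(temperature=0.2, max_tokens=256)

image = Image.open("specimen.jpg")
prompt = "<image>\nDescribe the visual traits of this organism."  # placeholder template
outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, params)
print(outputs[0].outputs[0].text)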

Wiki scraper

We provide scripts to scrape species descriptions from Wikipedia. The scraper extracts visual and morphological information for species based on their binomial names. Species lists are provided in data/wiki_species/, which include both unique and ambiguous species names.

To run the Wikipedia scraper:

sbatch slurm/scrape_wiki.sh
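
As a rough illustration of what the scraper does, the Wikipedia REST API can return the lead extract for a binomial name; the repository's scraper goes further (disambiguation handling and extraction of visual/morphological sections).

import requests

def fetch_summary(binomial_name: str) -> str | None:
    # Fetch the lead extract for a species page, if one exists.
    url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + binomial_name.replace(" ", "_")
    resp = requests.get(url, headers={"User-Agent": "biocap-example/0.1"}, timeout=10)
    if resp.status_code != 200:
        return None
    return resp.json().get("extract")

print(fetch_summary("Danaus plexippus"))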

Note that Wikipedia is not versioned, so this process is not perfectly reproducible. For this reason, we provide the results of this web scraping in the TreeOfLife-10M-Captions dataset.

Paper, Website, and Data

We have a preprint on arXiv and a project website.

Our data is published on Hugging Face: TreeOfLife-10M-Captions, alongside the existing TreeOfLife-10M dataset to which the captions apply (the source of the images and their associated taxonomic ranks).

Citation

Please cite our paper and the associated repositories if you use our code or results.

@article{zhang2025biocap,
  title    = {Bio{CAP}: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models},
  author   = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
  year     = {2025},
  eprint   = {2510.20095},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.20095}
}

Our code (this repository):

@software{biocapcode,
  author = {Ziheng Zhang and Xinyue Ma and Elizabeth G. Campolongo and Matthew J. Thompson and Net Zhang and Jianyang Gu},
  doi = {10.5281/zenodo.17437591},
  title = {{B}io{CAP}},
  version = {1.0.0},
  month = {oct},
  year = {2025}
}

Also consider citing OpenCLIP and BioCLIP:

@software{ilharco_gabriel_2021_5143773,
  author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  title={OpenCLIP},
  year={2021},
  doi={10.5281/zenodo.5143773},
}

Original BioCLIP Paper:

@inproceedings{stevens2024bioclip,
 title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life}, 
 author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024},
 pages = {19412-19424}
}

Original Code:

@software{bioclip2023code,
  author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn},
  doi = {10.5281/zenodo.10895871},
  title = {BioCLIP},
  version = {v1.0.0},
  year = {2024}
}

BioCLIP 2 Code:

@software{bioclip2code,
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G. Campolongo and Matthew J. Thompson and Net Zhang and Jiaman Wu and Zheda Mai},
  doi = {10.5281/zenodo.15644363},
  title = {{B}io{CLIP} 2},
  version = {1.0.1},
  month = {sep},
  year = {2025}
}

License

BioCAP is released under the MIT License. Some elements of the code are copyright by others (see LICENSE); detailed provenance information is provided in HISTORY.md.