PowerPlant is a Python package that leverages deep learning to forecast the success of DNA extraction from herbarium samples. This tool is designed to assist botanical researchers in optimizing their selection of herbarium specimens for genomic studies.
PowerPlant employs a deep learning algorithm that integrates multiple data sources to predict ancient DNA extraction success:
- Morphological features from scanned herbarium images
- Sample color information
- Metadata including sample age and locality
- DNA quantity metrics from previously processed samples
Trained on a dataset of approximately 2,000 herbarium specimens from the PAFTOL project, spanning nearly two centuries (1832 to present), PowerPlant aims to revolutionize the approach to working with herbarium-derived DNA.
- Linux or macOS operating system.
- Python 3.11 (later versions may not be fully supported by some dependencies).
- GPU support recommended for optimal performance.
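The operating system and Python version requirements above can be checked before installation. The following is only a minimal sketch using the standard library; `check_environment` is a hypothetical helper written for this illustration, not part of PowerPlant.

```python
import platform
import sys


def check_environment(version_info=None, system=None):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper: both arguments default to the running interpreter
    and host, and can be overridden for testing.
    """
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    problems = []
    # platform.system() reports "Linux" on Linux and "Darwin" on macOS.
    if system not in ("Linux", "Darwin"):
        problems.append(f"unsupported operating system: {system}")
    major, minor = version_info[0], version_info[1]
    if (major, minor) != (3, 11):
        problems.append(f"Python 3.11 is expected, found {major}.{minor}")
    return problems
```

Calling `check_environment()` with no arguments inspects the current interpreter and host. GPU availability is not checked here, since that depends on the PaddlePaddle variant you install.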
PowerPlant integrates several deep learning tools. While most are distributed as Python packages and can be installed via pip or conda, the image segmentation component relies on the PaddleSeg framework, which requires specific installation steps.
- Clone the PowerPlant repository:
git clone https://github.com/sales-lab/powerplant.git
- Create and activate a virtual environment:
cd powerplant
python3 -m venv .venv
source .venv/bin/activate
- Install PowerPlant:
pip install ./
- Install PaddleSeg:
cd vendor
sh install-paddleseg.sh
Note: The default installation is CPU-only. For GPU support, modify the script to install the appropriate PaddlePaddle and PaddleSeg variants for your hardware.
- Download trained weights for segmentation:
cd checkpoints
curl --location --output segmentation-checkpoint.zip 'https://figshare.com/ndownloader/files/52146800'
unzip segmentation-checkpoint.zip
PowerPlant processes herbarium sheet images in JPEG format (with the .jpg extension).
The package automatically performs two key operations on your images:
- Segmentation to isolate plant material and remove extraneous elements such as annotations, labels, stamps, and envelopes.
- Resizing of images so that the longest side is at most 1024 pixels long.
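The resizing rule can be illustrated with a small calculation. This is a sketch of the arithmetic only, not PowerPlant's own code: `fit_longest_side` is a hypothetical function, and it assumes that the aspect ratio is preserved and that images already within the limit are left unchanged.

```python
def fit_longest_side(width, height, max_side=1024):
    """Return (width, height) scaled so that the longest side is at most max_side.

    Assumptions for this sketch: the aspect ratio is preserved, sizes are
    rounded to the nearest pixel, and smaller images are not enlarged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height

    def scale(side):
        # Integer arithmetic with rounding to nearest avoids floating-point drift.
        return max(1, (side * max_side + longest // 2) // longest)

    return scale(width), scale(height)
```

For example, a 4096 x 6144 scan maps to 683 x 1024, while an 800 x 600 image is returned unchanged.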
Copy your original herbarium sheet images to the images/original directory, then run the following command:
powerplant-segment
The processed images (segmented and resized) will be stored in the images/masked directory.
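Because the input images are expected to carry the .jpg extension, it can help to look for files with other names in images/original before running the command. A minimal sketch follows; `find_non_jpg_files` is a hypothetical helper, and it assumes the extension must match `.jpg` exactly (lower case), which the text above does not confirm.

```python
from pathlib import Path


def find_non_jpg_files(directory):
    """Return the files in `directory` whose suffix is not exactly '.jpg'.

    Assumption for this sketch: the match is case-sensitive, so 'scan.JPG'
    and 'scan.jpeg' are both reported.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix != ".jpg"
    )
```

Running `find_non_jpg_files("images/original")` from the repository root lists any files that may need renaming or converting first.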
PowerPlant employs a convolutional neural network (CNN) coupled with metadata analysis to predict DNA yield from herbarium specimens. This dual-input model processes both segmented images and associated specimen data to generate accurate yield estimates.
To use this feature:
- Retrieve the segmented herbarium images from the images/masked directory (generated during the preprocessing step described in the Image Segmentation section) and copy them into the dataset directory. Divide these images into training and test sets by placing them in the corresponding dataset/train and dataset/test subdirectories.
- Prepare your metadata in a CSV file named metadata.csv and place it inside the dataset directory. This file should contain relevant information for each specimen, including:
  - Specimen age;
  - Location of sample collection;
  - Taxonomic information.
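The division into training and test sets described in the first step can be scripted. The sketch below is one possible approach and not part of PowerPlant: `split_dataset` is a hypothetical helper, and the 80/20 ratio and the fixed random seed are assumptions chosen for illustration.

```python
import random
import shutil
from pathlib import Path


def split_dataset(masked_dir, dataset_dir, test_fraction=0.2, seed=0):
    """Copy files from masked_dir into dataset_dir/train and dataset_dir/test.

    Hypothetical helper: the split is random but reproducible for a given seed.
    Returns the number of files copied into each subset.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort first so the shuffle does not depend on directory listing order.
    images = sorted(p for p in Path(masked_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(images)
    n_test = round(len(images) * test_fraction)
    subsets = {"test": images[:n_test], "train": images[n_test:]}
    for name, paths in subsets.items():
        target = Path(dataset_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for path in paths:
            shutil.copy2(path, target / path.name)
    return {name: len(paths) for name, paths in subsets.items()}
```

For instance, `split_dataset("images/masked", "dataset")` would copy roughly one file in five to dataset/test and the rest to dataset/train. A random split is only one option; depending on the study, splitting by taxon or collection may be more appropriate.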
An example file, metadata/samples.csv, is included in this repository to guide you in formatting your metadata correctly.
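Before training, it may be worth checking that metadata.csv has a value in every required field. The following sketch uses only the standard `csv` module. The column names in `REQUIRED_COLUMNS` are hypothetical placeholders; the real names should be taken from metadata/samples.csv.

```python
import csv

# Hypothetical column names, for illustration only. Replace them with the
# headers actually used in metadata/samples.csv.
REQUIRED_COLUMNS = ("age", "locality", "family")


def find_metadata_problems(csv_file, required=REQUIRED_COLUMNS):
    """Return a list of problems found in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [column for column in required if column not in header]
    if missing:
        return [f"missing column: {column}" for column in missing]
    problems = []
    for row in reader:
        for column in required:
            # Short rows yield None for absent fields, hence the `or ""`.
            if not (row[column] or "").strip():
                problems.append(
                    f"line {reader.line_num}: empty value in '{column}'"
                )
    return problems
```

A typical call would open the file with `open("dataset/metadata.csv", newline="")` and pass the resulting handle to `find_metadata_problems`; an empty list means that no missing columns or empty values were found.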
To train the prediction model, run the following command:
powerplant-train
This script processes the images and metadata from the dataset directory, trains the machine learning model, and saves the trained model in the checkpoints/prediction directory.
GNU Affero General Public License, version 3.
For questions and support, please open an issue on our GitHub repository.