This repository documents a pipeline for extracting structured morphological traits from botanical species descriptions using large language models. It follows the Smithsonian AI Model Catalog format. Although this pipeline is designed for Pedicularis, its prompt-based approach makes it easily adaptable to other plant clades or non-plant groups with comparabale documentation.
- Clone or fork this repository to access the AI-based trait extraction pipeline
- Install required dependencies listed in requirements.txt or the user guide
- Prepare the input data: species descriptions in plain text or CSV format
- Run the extraction pipelne (scripts and notebooks in examples)
- Load the structured outputs in CSV or JSON format for subsequent ecological and evolutionary analyses
- Core documentation describing model scope, usage, and metadata (Smithsonian AI Model Catalog Standard)
- MIT License
- Version history and update log
- Clone the repository
- Set up a Python environment and install dependencies
- Format species descriptions (see examples in sample-data)
- Follow instructions in user guide
- Update README.md and metadata fields
- Adjust prompt and schema for target clade and trait set
- Add or modify usage examples
- Log changes in CHANGELOG.md
- Use consistent, structured input for better extraction accuracy
- Document trait definitions and prompt formats
- Update regularly
- Share modified pipeline by linking to this repository or citing it
ai-catalog-binary-trait-extractor/
├── README.md # Main model documentation
├── LICENSE # Model licensing information (MIT)
├── CHANGELOG.md # Version history and update log
├── .gitignore
├── examples/ # Usage examples
│ ├── Trait_extraction_pipeline.ipynb # Trait extraction pipeline and visualizations
│ └── Trait_extraction_pipeline.py # Python script version of pipeline
├── docs/ # Additional documentation
│ ├── user-guide.md # How to use the model
│ └── technical-specs.md # Technical specifications and reproducibility
└── assets/ # Supporting files
├── images/ # Example visualizations (e.g., Dichotomous key, UMAP)
└── Data/ # Datasets
Copy this template and fill in the information to the best of your ability. Keep in mind, certain sections may not be appropriate to your use case.
- Model Name: Binary Trait Extractor for Pedicularis
- Version: v1.0.0
- Description: This tool uses a large language model (e.g., OpenAI GPT-3.5) to extract standardized morphological traits from botanical species descriptions (i.e. taxonomic treatments). It encodes binary, numeric, and categorical traits into structured matrices that subsequently enable trait-based diversity analysis, detection of unique character combinations, and generation of dichotomous keys.
- Type: LLM-based text-to-structure pipeline
- Release Date: 2025-07-24
- Developer/Owner: (Marc-Elie Adaime (Smithsonian Data Science Lab)
- Total Parameters: Not Applicable (uses OpenAI GPT-3.5 via API)
- Training Dataset Size: Not Applicable (model is pre-trained)
- Primary Metric: Manual trait extraction accuracy: XX% (estimated based on validation of output against expert-annotated traits)
- Last Updated: 2025-07-25
- Evaluation Repository: [Link to ai-eval-model_name if exists]
- Model Files: [Link to model weights/files]
- Additional Links: [Link to additional documentation if applicable]
- Extracting binary, numeric, and categorical morphological, life history, and physiological traits from botanical species descriptions using large language models
- Generating structured trait matrices for downstream analyses such as trait uniqueness evaluation, trait-based clustering (using ordination methods), and dichotomous key construction
- Extraction of traits from poorly formatted textual sources
- Use as a substitute for expert taxonomic validation without post-processing
- Reliance on the consistency and certainty of species descriptions; results may vary with ambiguous or uncertain descriptions, as well as with highly unstructured texts
- Trait extraction depends on prompt sensitivity and the behavior of the selected LLM (e.g., OpenAI)
- The pipeline lacks confidence or uncertainty quantification in its current form
This model supports biodiversity and conservation research aligned with the Smithsonian's mission to advance understanding of natural history and evolution. It helps researchers standardize phenotypic data for a broad range of ecological and evolutionary studies.
- Algorithm: Prompt-based extraction using large language models (LLMs)
- Architecture: API calls to GPT-3.5 using structured JSON prompts and validation logic
- Model Size: Not applicable (model is externally hosted by OpenAI)
- Input Requirements: Cleaned species descriptions in plain text or CSV format, consisting of English-language morphological descriptions
- Output Format: Structured JSON or CSV files representing trait matrices, encoding binary/numeric/categorical traits by species
- Hardware Requirements: Minimal for local usage (CPU); requires internet access and OpenAI API key
- Sources: None used locally; the LLM is pre-trained on a wide corpus including scientific literature relevant to botanical descriptions
- Dataset Size: 352 species from the Flora of China; 45 traits
- Data Preprocessing: Standardize parsing and formatting of botanical species descriptions before sending to API
- Methodology: No local training; inference-only pipeline via prompt engineering
- Training Infrastructure: Not applicable
While accuracy, precision, and recall are standard metrics in AI model evaluation, we have not yet formally assessed these in the context of this trait extraction pipeline. In our case, performance should ultimately be evaluated based on the following:
- Accuracy of trait extraction, by comparing LLM-derived traits against expert-annotated descriptions and curated character matrices
- Effectiveness of species discrimination, and the construction of a valid dichotomous key for identifying species (here, within Pedicularis)
- Diversity quantification, by evaluating whether the encoded trait matrix captures meaningful variation within the clade (e.g., via clustering or ordination) and reflects ecological and evolutionary patterns
No formal benchmark comparison yet; accuracy evaluated through manual revision
Bias may result from inconsistencies in species descriptions or biases associated with the language model itself
- Use highly structured prompts and validation schemes
- Manual review of extracted traits and expert consultation when needed
Potential inconsistent performance across clades with highly variable taxonomic descriptions
Not applicable (no personal attributes involved)
- Local Python environment (Jupyter notebook / script)
- Requires access to OpenAI API (e.g., GPT-3.5)
- Python 3.8+
- OpenAI
- Pandas
- Numpy
- Tqdm
- Compute: CPU sufficient; No GPU required
- Memory: ≥4 GB RAM recommended for processing large datasets
- Storage: < 500 MB for output files (JSON/CSV); more if processing additional species
- Can be integrated with downstream ecological and evolutionary analysis pipelines
- Manual run and review; no continuous monitoring currently
- Pipeline does not involve sensitive data
- Minimal risks associated with incorrect trait extraction; downstream analyses should involved expert review
- Users must supply their own OpenAI API key (if using an OpenAI model)
Oversight is provided by the pipeline developer, Marc-Elie Adaime, who is responsible for updating the code and documentation.
This model is released under [LICENSE NAME]. See LICENSE for details.
[Licenses covering training data]
@misc{[binary_trait_extractor],
title={[Binary Trait Extractor for _Pedicularis_]},
author={[Marc-Elie Adaime]},
year={[2025]},
url={[https://github.com/madaime2/ai-catalog-binary-trait-extractor]}
}
The code and prompt are original and developed by the author; no proprietary components included
- OpenAI GPT-3.5
- Python open-source libraries
See User Guide for detailed instructions.
- Occasional LLM hallucinations and inconsistencies in trait extraction
- Some rare traits may be missed without proper prompt engineering
- Technical Support: [email protected]
- Model Maintainer: Marc-Elie Adaime
This pipeline will be maintained through periodic updates to the codebase, prompt templates, and trait selection. Future improvements may include adopting different LLMs, expanding the trait list, and designing and running further analyses.
Marc-Elie Adaime
When necessary, particularly when user feedback is incoporated.
Not applicable - the system relies on pre-trained LLMs (e.g., OpenAI GPT-3.5) and prompt engineering. Retraining is not currently part of the architecture/pipeline.
See CHANGELOG.md for detailed version history.
- Main Repository: https://github.com/madaime2/ai-catalog-binary-trait-extractor
- Training Code: Not applicable - uses OpenAI LLMs via API without fine-tuning
See Reproducibility Guide for detailed instructions, including:
- How to extract and load species descriptions
- How to format trait extraction prompts
- How to run the extraction pipeline and store outputs
- How to perform various analyses, including diversity quantification
N/A - No random seed control available via OpenAI's API. However, deterministic mode (temperature = 0) is used to ensure repeatability.
Manual comparison of extract traits against-expert annotated descriptions for a subset of species.
No formal review yet. This tool is under independent development.