The ESP-AVES2 model collection is available on HuggingFace: EarthSpeciesProject/esp-aves2
| Model Name | Architecture | Training Data | HuggingFace |
|---|---|---|---|
| esp_aves2_sl_beats_all | BEATs | All (AudioSet + Bio) | Link |
| esp_aves2_sl_beats_bio | BEATs | Bioacoustics | Link |
| esp_aves2_naturelm_audio_v1_beats | BEATs + NatureLM | All | Link |
| esp_aves2_eat_all | EAT | All (AudioSet + Bio) | Link |
| esp_aves2_eat_bio | EAT | Bioacoustics | Link |
| esp_aves2_sl_eat_all_ssl_all | EAT (SSL) | All | Link |
| esp_aves2_sl_eat_bio_ssl_all | EAT (SSL) | Bioacoustics | Link |
| esp_aves2_effnetb0_all | EfficientNet-B0 | All (AudioSet + Bio) | Link |
| esp_aves2_effnetb0_bio | EfficientNet-B0 | Bioacoustics | Link |
| esp_aves2_effnetb0_audioset | EfficientNet-B0 | AudioSet | Link |
- BEATs: Bidirectional Encoder representation from Audio Transformers
- EAT: Efficient Audio Transformer models
- EfficientNet: EfficientNet-based models adapted for audio classification
- AVES: AVES model for bioacoustics
- BirdMAE: BirdMAE masked autoencoder for bioacoustic representation learning
| Capability | Description |
|---|---|
| Classification with labels | Model has a trained classifier head and a class mapping (e.g. `label_map.json`). Use `load_model("model_name", device="cpu")` to get logits and use `model.label_mapping` for human-readable class names. |
| Features / embeddings only | Any model can be loaded for embedding extraction by passing `return_features_only=True`. The model then returns feature tensors instead of classification logits. |
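
As a rough illustration of the "classification with labels" row, the sketch below turns one vector of logits into ranked, human-readable predictions. It deliberately does not call the library: `top_k_labels`, the example logits, and the index-to-name layout of the mapping are all assumptions made for illustration, and the softmax assumes a single-label classifier (a multi-label model would more likely use a per-class sigmoid).

```python
import math


def top_k_labels(logits, index_to_label, k=3):
    """Return the k most probable (label, probability) pairs for one clip."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_to_label[i], probs[i]) for i in ranked[:k]]


# Hypothetical stand-ins: in practice the logits would come from the loaded
# model and the mapping from model.label_mapping (its exact layout is assumed).
example_logits = [0.2, 2.5, -1.0, 1.1]
example_mapping = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}

predictions = top_k_labels(example_logits, example_mapping, k=2)
for label, prob in predictions:
    print(f"{label}: {prob:.3f}")
```

If the real mapping turns out to be keyed by label name rather than by index, inverting the dict before the lookup would be enough.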
How to see which models offer what
- At runtime: Call `list_models()`. The printed table has a "Trained Classifier" column (✅ = has checkpoint + class mapping, ❌ = backbone/features only). The returned dict includes `has_trained_classifier` and `num_classes` per model.
- Per model: Call `describe_model("model_name", verbose=True)` to see "Has Trained Classifier", checkpoint path, class mapping path, and number of classes.
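
The dict returned by `list_models()` could be filtered along these lines. This is a minimal sketch that assumes the dict is keyed by model name with one dict of fields per model; only the `has_trained_classifier` and `num_classes` field names come from the text above. The helper `models_with_classifier` and every value in `example_info` are hypothetical placeholders.

```python
def models_with_classifier(model_info):
    """Return sorted names of models whose entry reports a trained classifier."""
    return sorted(
        name
        for name, info in model_info.items()
        if info.get("has_trained_classifier")
    )


# Hypothetical stand-in for the dict returned by list_models(); the model
# names and num_classes values below are placeholders, not real figures.
example_info = {
    "model_a": {"has_trained_classifier": True, "num_classes": 10},
    "model_b": {"has_trained_classifier": False, "num_classes": None},
    "model_c": {"has_trained_classifier": True, "num_classes": 25},
}

usable = models_with_classifier(example_info)
print(usable)
```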
All official ESP-AVES2 models have both a checkpoint and a class mapping, so they support classification with labels. They also support embedding extraction with `load_model(..., return_features_only=True)`.
Models are configured with YAML files that contain a model specification (`model_spec`). The official config files are in the `avex/api/configs/official_models/` directory. These files define the model architecture, audio preprocessing parameters, and optional checkpoint and label mapping paths.
Minimal Model Configuration:

```yaml
# Example: my_model.yml - Minimal configuration for model loading
model_spec:
  name: efficientnet
  pretrained: false
  device: cuda
  audio_config:
    sample_rate: 16000
    representation: mel_spectrogram
    n_mels: 128
  efficientnet_variant: b0
```

Full Model Configuration (with checkpoint):

```yaml
# Example: esp_aves2_effnetb0_all.yml - Complete configuration
# Optional: Default checkpoint path
checkpoint_path: hf://EarthSpeciesProject/esp-aves2-effnetb0-all/esp-aves2-effnetb0-all.safetensors
# Optional: Label mapping for human-readable predictions
class_mapping_path: hf://EarthSpeciesProject/esp-aves2-effnetb0-all/label_map.json
# Required: Model specification
model_spec:
  name: efficientnet
  pretrained: false
  device: cuda
  audio_config:
    sample_rate: 16000
    representation: mel_spectrogram
    n_mels: 128
    target_length_seconds: 10
  efficientnet_variant: b0
```

These configurations can be loaded directly with `load_model("path/to/config.yml")`. See the Custom Model Registration section for usage examples.
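
The `hf://` paths in the full configuration appear to follow an `hf://<org>/<repo>/<file>` layout. The sketch below splits such a path into a repository id and a file name, which can help when checking that a config points at the intended repository. This reading of the scheme is an assumption drawn from the example paths; how the library resolves these paths is not described here, and `split_hf_path` is a hypothetical helper, not part of its API.

```python
def split_hf_path(path):
    """Split 'hf://<org>/<repo>/<file path>' into (repo_id, filename)."""
    prefix = "hf://"
    if not path.startswith(prefix):
        raise ValueError(f"not an hf:// path: {path!r}")
    # Split off the org and repo; anything after them is the file path.
    parts = path[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected hf://<org>/<repo>/<file>, got {path!r}")
    org, repo, filename = parts
    return f"{org}/{repo}", filename


# Checkpoint path copied from the full configuration example.
checkpoint = (
    "hf://EarthSpeciesProject/esp-aves2-effnetb0-all/"
    "esp-aves2-effnetb0-all.safetensors"
)
repo_id, filename = split_hf_path(checkpoint)
print(repo_id)
print(filename)
```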