A comprehensive collection of transformer and foundation models covering audio, vision, multimodal, and NLP use cases.
- Whisper - Multilingual speech recognition
- Moonshine - Automatic speech recognition
- Wav2Vec2 - Keyword spotting
- Moshi - Speech-to-speech generation
- MusicGen - Text-to-audio generation
- Bark - Text-to-speech synthesis
- SuperPoint - Keypoint detection
- SuperGlue - Keypoint matching
- RT-DETRv2 - Object detection
- Qwen2-Audio - Audio and text to text
- LayoutLMv3 - Document understanding
- TAPAS - Table question answering
- Emu3 - Unified multimodal understanding
- MiniCPM-o - Omni multimodal model from OpenBMB
- LLaVA-OneVision - Vision to text
- LLaVA - Visual question answering
- Kosmos-2 - Visual referring expression
- ModernBERT - Masked word completion
- Gemma - Named entity recognition
- Mixtral - Question answering
- BART - Summarization
- T5 - Translation
- Llama - Text generation
- Qwen - Text classification
- Megatron-LM - Large-scale transformer training framework by NVIDIA
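Most of the models listed above can be driven through the `transformers` `pipeline` API. A minimal sketch for speech recognition with Whisper (the checkpoint name and audio path are illustrative; substitute whichever Whisper checkpoint you use):

```python
def transcribe(audio_path: str, checkpoint: str = "openai/whisper-small") -> str:
    """Sketch: transcribe one audio file with a Whisper ASR pipeline.

    Assumes the `transformers` library (and an audio backend) is installed;
    the default checkpoint is only an example.
    """
    from transformers import pipeline  # deferred so the sketch imports cleanly

    asr = pipeline("automatic-speech-recognition", model=checkpoint)
    return asr(audio_path)["text"]
```

Swapping the task string and checkpoint (e.g. `"text-generation"` with a Llama checkpoint) follows the same pattern for most entries in the list.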
| Task Type | Recommended Models | Typical Use Case |
|---|---|---|
| Speech Recognition | Whisper, Moonshine | Multilingual transcription |
| Image Understanding | SAM, DINOv2 | Visual analysis |
| Multimodal Tasks | Qwen-VL, LLaVA, MiniCPM-o | Cross-modal reasoning |
| Text Processing | BART, T5, Qwen | Language tasks |
| Audio Generation | MusicGen, Bark | Audio synthesis |
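The table above amounts to a task-to-models mapping, which can be encoded directly. A minimal sketch (the task keys are illustrative, not a standard taxonomy):

```python
# Lookup table built from the recommendations above.
RECOMMENDED_MODELS = {
    "speech-recognition": ["Whisper", "Moonshine"],
    "image-understanding": ["SAM", "DINOv2"],
    "multimodal": ["Qwen-VL", "LLaVA", "MiniCPM-o"],
    "text": ["BART", "T5", "Qwen"],
    "audio-generation": ["MusicGen", "Bark"],
}


def recommend(task: str) -> list:
    """Return recommended models for a task, or an empty list if unknown."""
    return RECOMMENDED_MODELS.get(task, [])
```

A dict keeps the recommendations in one place, so adding a task or model is a one-line change.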
- STT Models - Speech-to-text recognition
- TTS Models - Text-to-speech synthesis
- Text-to-Image - Image generation
- GenAI APIs - API access to models
- Choose task-specific models first.
- Check resource constraints early.
- Verify licensing for your deployment.
- Prefer models with active maintenance.
- Use quantization (e.g. 8-bit or 4-bit) for lower-cost inference.
- Batch requests for better throughput.
- Cache repeated prompts and embeddings.
- Use GPU acceleration when available.
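The caching and batching tips above can be sketched in plain Python. Here `embed` is a hypothetical stand-in for a real embedding-model call; the caching and chunking logic is what the sketch demonstrates:

```python
from functools import lru_cache


def embed(text: str) -> list:
    # Placeholder embedding; swap in your real model call here.
    return [float(ord(c)) for c in text[:8]]


@lru_cache(maxsize=1024)
def cached_embedding(text: str) -> tuple:
    # Return a tuple: hashable and immutable, so results cache safely.
    return tuple(embed(text))


def batched(items: list, size: int):
    # Yield fixed-size chunks so requests can be sent to the model in batches.
    for i in range(0, len(items), size):
        yield items[i : i + size]
```

Repeated prompts then hit the in-memory cache instead of re-running the model, and `batched` groups requests to improve throughput; for persistence across processes, the same pattern works with an external store such as Redis.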