
Widely Used Transformer Models

A comprehensive collection of transformer and foundation models for audio, vision, multimodal, and NLP use cases.


Table of Contents


Audio Processing

Speech Recognition and Classification

Audio Generation and Synthesis

  • Moshi - Speech-to-speech generation
  • MusicGen - Text-to-audio generation
  • Bark - Text-to-speech synthesis

Computer Vision

Image Understanding

  • SAM - Automatic mask generation
  • DepthPro - Depth estimation
  • DINO v2 - Image classification

Object Detection and Recognition

Pose and Segmentation


Multimodal

Audio-Text Integration

Image-Text Processing

Advanced Multimodal


Natural Language Processing

Text Understanding

Text Generation and Processing

  • BART - Summarization
  • T5 - Translation
  • Llama - Text generation
  • Qwen - Text classification
  • Megatron-LM - Large-scale transformer training framework by NVIDIA

Model Selection Guide

| Task Type | Recommended Models | Typical Use Case |
| --- | --- | --- |
| Speech Recognition | Whisper, Moonshine | Multilingual transcription |
| Image Understanding | SAM, DINO v2 | Visual analysis |
| Multimodal Tasks | Qwen-VL, Llava, MiniCPM-o | Cross-modal reasoning |
| Text Processing | BART, T5, Qwen | Language tasks |
| Audio Generation | MusicGen, Bark | Audio synthesis |
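As a sketch, the selection guide above can be encoded as a small lookup table. The `recommend` helper and task keys below are illustrative, not a real API; always verify each model's license and hardware requirements before deploying.

```python
# Minimal task-to-model lookup mirroring the selection guide table.
# Task keys and the helper are illustrative only.
SELECTION_GUIDE = {
    "speech-recognition": ["Whisper", "Moonshine"],
    "image-understanding": ["SAM", "DINO v2"],
    "multimodal": ["Qwen-VL", "Llava", "MiniCPM-o"],
    "text-processing": ["BART", "T5", "Qwen"],
    "audio-generation": ["MusicGen", "Bark"],
}

def recommend(task: str) -> list[str]:
    """Return candidate models for a task, or raise for unknown tasks."""
    try:
        return SELECTION_GUIDE[task]
    except KeyError:
        known = ", ".join(sorted(SELECTION_GUIDE))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}")

print(recommend("speech-recognition"))  # → ['Whisper', 'Moonshine']
```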

Related Resources


Best Practices

Model Selection

  • Choose task-specific models first.
  • Check resource constraints early.
  • Verify licensing for your deployment.
  • Prefer models with active maintenance.

Performance Optimization

  • Use quantization for lower-cost inference.
  • Batch requests for better throughput.
  • Cache repeated prompts and embeddings.
  • Use GPU acceleration when available.
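To make the quantization and caching bullets concrete, here is a toy sketch in plain Python: symmetric int8 quantization of a weight vector, and a memoized embedding call via `functools.lru_cache`. It is a conceptual demo only; in practice you would use libraries such as bitsandbytes or a serving-layer cache, and the `embed` function here is a stand-in, not a real model call.

```python
from functools import lru_cache

def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

@lru_cache(maxsize=1024)
def embed(prompt: str) -> tuple:
    """Stand-in for an expensive embedding call; lru_cache skips repeats."""
    return tuple(ord(c) / 255 for c in prompt)  # toy embedding, not a model

weights = [0.5, -1.27, 0.0, 1.0]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
# Per-element quantization error is bounded by half the scale.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(weights, restored))

embed("hello"); embed("hello")
print(embed.cache_info().hits)  # repeated prompt served from cache → 1
```

The same pattern generalizes: quantization trades a small, bounded accuracy loss for roughly 4x smaller weights versus float32, and caching pays off whenever the same prompt or input recurs.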