Audio

Melvin Carvalho edited this page Jan 31, 2025 · 1 revision

speech2txt

OpenAI's Whisper

GitHub: https://github.com/openai/whisper
Distil-Whisper: https://github.com/huggingface/distil-whisper/issues/4
Insanely fast whisper: https://github.com/Vaibhavs10/insanely-fast-whisper
WhisperKit for Apple devices: https://www.takeargmax.com/blog/whisperkit
Whisper turbo: https://github.com/openai/whisper/discussions/2363
Whisper Medusa: https://github.com/aiola-lab/whisper-medusa
Tips against hallucinations: https://www.reddit.com/r/LocalLLaMA/comments/1fx7ri8/comment/lql41mk/
Whisper Standalone Win: https://github.com/Purfview/whisper-standalone-win
Whisperfile: https://github.com/Mozilla-Ocho/llamafile/blob/main/whisper.cpp/doc/getting-started.md
WhisperX: https://github.com/m-bain/whisperX

Other

Nvidia's Canary (with translation): https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/
Qwen2-Audio-7B: https://huggingface.co/Qwen/Qwen2-Audio-7B
Speech2Speech pipeline: https://github.com/huggingface/speech-to-speech
Moonshine: https://github.com/usefulsensors/moonshine
Article about speech recognition (comparisons and insights): https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription
DeepFilter for filtering noisy audio: https://github.com/duohub-ai/deepfilter-lambda-container

txt2speech and txt2audio

Fishaudio

Fish Speech 1.4: https://huggingface.co/fishaudio/fish-speech-1.4
Fish Speech 1.5: https://huggingface.co/fishaudio/fish-speech-1.5

XTTS

XTTS v1: https://huggingface.co/coqui/XTTS-v1
XTTS v2: https://huggingface.co/coqui/XTTS-v2

Kokoro

Model: https://huggingface.co/hexgrad/Kokoro-82M
ONNX variant: https://huggingface.co/onnx-community/Kokoro-82M-ONNX
Dockerized: https://github.com/remsky/Kokoro-FastAPI
Kokoros (Rust-based engine): https://github.com/lucasjinreal/Kokoros
KokoDOS (GLaDOS fork): https://github.com/kaminoer/KokoDOS
kokoro-js: https://www.npmjs.com/package/kokoro-js

OuteTTS

v0.2 (ONNX model for Transformers.js WebGPU inference): https://huggingface.co/onnx-community/OuteTTS-0.2-500M
v0.3: https://huggingface.co/collections/OuteAI/outetts-03-6786b1ebc7aeb757bc17a2fa

Other

TTS Arena Leaderboard: https://huggingface.co/spaces/TTS-AGI/TTS-Arena
VoiceCraft: https://github.com/jasonppy/VoiceCraft
AudioLDM2: https://github.com/haoheliu/audioldm2
Bark: https://github.com/suno-ai/bark
Tracker page for open access text2speech models: https://github.com/Vaibhavs10/open-tts-tracker
MetaVoice: https://github.com/metavoiceio/metavoice-src
Pheme TTS framework: https://github.com/PolyAI-LDN/pheme
OpenAI TTS: https://platform.openai.com/docs/guides/text-to-speech
OpenVoice: https://github.com/myshell-ai/OpenVoice
Stable Audio Open: https://huggingface.co/stabilityai/stable-audio-open-1.0
MARS5-TTS: https://github.com/Camb-ai/MARS5-TTS
Alibaba's FunAudioLLM framework (includes CosyVoice & SenseVoice): https://github.com/FunAudioLLM
MeloTTS: https://github.com/myshell-ai/MeloTTS
Parler TTS: https://github.com/huggingface/parler-tts
WhisperSpeech: https://github.com/collabora/WhisperSpeech
ChatTTS: https://huggingface.co/2Noise/ChatTTS
ebook2audiobook: https://github.com/DrewThomasson/ebook2audiobookXTTS
GPT-SoVITS-WebUI: https://github.com/RVC-Boss/GPT-SoVITS
Example script for text to voice: https://github.com/dynamiccreator/voice-text-reader
F5 TTS: https://github.com/SWivid/F5-TTS
MaskGCT: https://huggingface.co/amphion/MaskGCT
Audiocraft Plus: https://github.com/GrandaddyShmax/audiocraft_plus
TTS server: https://github.com/matatonic/openedai-speech
Voqal (voice-native AI agent): https://github.com/voqal/voqal
Piper (local TTS system): https://github.com/rhasspy/piper
Auralis (speed-focussed TTS inference engine): https://github.com/astramind-ai/Auralis
Speaches (server for STT, translation, TTS): https://github.com/speaches-ai/speaches
TTS library: https://github.com/idiap/coqui-ai-TTS
German
- Thorsten voice: https://github.com/thorstenMueller/Thorsten-Voice
- German TTS on Hugging Face: https://huggingface.co/models?search=German%20tts

Music production

LANDR mastering plugin: https://www.gearnews.de/landr-mastering-plugin/
Drumloop.ai: https://www.gearnews.de/drumloop-ai-baut-euch-automatisch-beats-und-drumloops-durch-ki/
Sample generator: https://huggingface.co/adlb/Audialab_EDM_Elements
RC stable audio tools (Gradio app for using audio models): https://github.com/RoyalCities/RC-stable-audio-tools

Other audio models and related tools

LAION AI Voice Assistant BUD-E: https://github.com/LAION-AI/natural_voice_assistant
AI Language Tutor
- https://www.univerbal.app/
- https://yourteacher.ai/
Speech Note (offline STT, TTS, and machine translation): https://github.com/mkiol/dsnote
DenseAV (locates sounds and learns the meaning of words): https://github.com/mhamilton723/DenseAV
Moshi (speech2speech foundation model): https://huggingface.co/collections/kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd
Open VTuber App: https://github.com/t41372/Open-LLM-VTuber
Voicechat implementation: https://github.com/lhl/voicechat2
Podcastfy: https://github.com/souzatharsis/podcastfy
Open ASR Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
Ebook2audiobook: https://github.com/DrewThomasson/ebook2audiobookpiper-tts
Voice Conversion: https://github.com/IAHispano/Applio
TTS comparison: https://tts.x86.st/
Voice cloning tutorial: https://techshinobi.org/posts/voice-vits/
LocalGlaDOS: https://github.com/dnhkng/GlaDOS
ClearerVoice-Studio: https://github.com/modelscope/ClearerVoice-Studio/tree/main
OmniAudio 2.6B (edge-device setup that takes audio input and integrates an LLM): https://huggingface.co/NexaAIDev/OmniAudio-2.6B
BlahST (speech2txt tool based on Whisper for Linux): https://github.com/QuantiusBenignus/BlahST
Weebo (speech-to-speech chatbot using Whisper, Llama, Kokoro): https://github.com/amanvirparhar/weebo