This repository collects links to models, datasets, and tools for Ukrainian Speech-to-Text and Text-to-Speech.
Our datasets, models, and leaderboards are published on Hugging Face.
- Discord: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk
wav2vec2-bert
wav2vec2
- 300M params (with language model based on Wikipedia texts): https://huggingface.co/Yehor/w2v-xls-r-uk
- 300M params: https://huggingface.co/robinhad/wav2vec2-xls-r-300m-uk
- 1B params: https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk
You can check demos out here: https://github.com/egorsmkv/wav2vec2-uk-demo
Citrinet
- NVIDIA Streaming Citrinet 1024 (uk): https://huggingface.co/nvidia/stt_uk_citrinet_1024_gamma_0_25
- NVIDIA Streaming Citrinet 512 (uk): https://huggingface.co/neongeckocom/stt_uk_citrinet_512_gamma_0_25
ContextNet
- NVIDIA Streaming ContextNet 512 (uk): https://huggingface.co/theodotus/stt_uk_contextnet_512
FastConformer
- FastConformer Hybrid Transducer-CTC Large P&C: https://huggingface.co/nvidia/stt_ua_fastconformer_hybrid_large_pc
- FastConformer Hybrid Transducer-CTC Large P&C: https://huggingface.co/theodotus/stt_ua_fastconformer_hybrid_large_pc
Squeezeformer
- Squeezeformer-CTC ML: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_ml
- Squeezeformer-CTC SM: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_sm
- Squeezeformer-CTC XS: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_xs
Whisper
- official whisper: https://github.com/openai/whisper
- whisper (small, fine-tuned for Ukrainian): https://github.com/egorsmkv/whisper-ukrainian
- whisper (large, fine-tuned for Ukrainian): https://huggingface.co/arampacha/whisper-large-uk-2
- https://huggingface.co/mitchelldehaven/whisper-medium-uk
- https://huggingface.co/mitchelldehaven/whisper-large-v2-uk
Quantized variants:
OWSM, OWSM-CTC, and OWLS
DeepSpeech
- DeepSpeech using transfer learning from an English model: https://github.com/robinhad/voice-recognition-ua
- v0.5: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.5 (1230+ hours)
- v0.4: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.4 (1230 hours)
- v0.3: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.3 (751 hours)
This benchmark uses the Common Voice 10 test split.
- WER: Word Error Rate
- CER: Character Error Rate
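As a rough illustration of how these two metrics are usually defined, here is a minimal pure-Python sketch based on Levenshtein (edit) distance. It is not the benchmark's actual evaluation code: the function names are hypothetical, and the text normalization used for the published numbers (lowercasing, punctuation removal, whether spaces count as characters) is not stated here, so this sketch simply splits on whitespace for WER and counts every character, including spaces, for CER. The "Accuracy (words)" column in the tables below appears to equal 100% minus WER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "місто в хмельницькій області"
    hypothesis = "місто в хмельницкій області"  # one word is misspelled
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
    print(f"Accuracy (words): {1 - wer(reference, hypothesis):.2%}")
```

For real evaluations, an established library or the metrics space linked in this repository is likely a better choice than a hand-rolled implementation.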
Model | WER | CER | Accuracy (words) |
---|---|---|---|
Yehor/w2v-bert-uk (FP16) | 6.6% | 1.34% | 93.4% |
Yehor/w2v-bert-uk-v2.1 (FP16) | 17.34% | 3.33% | 82.66% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
Yehor/w2v-xls-r-uk | 20.24% | 3.64% | 79.76% |
robinhad/wav2vec2-xls-r-300m-uk | 27.36% | 5.37% | 72.64% |
arampacha/wav2vec2-xls-r-1b-uk | 16.52% | 2.93% | 83.48% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
Yehor/hubert-uk (FP16) | 37.07% | 6.87% | 62.93% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
nvidia/stt_uk_citrinet_1024_gamma_0_25 | 4.32% | 0.94% | 95.68% |
neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 7.46% | 1.6% | 92.54% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
theodotus/stt_uk_contextnet_512 | 6.69% | 1.45% | 93.31% |
This model supports text punctuation and capitalization
Model | WER | CER | Accuracy (words) |
---|---|---|---|
nvidia/stt_ua_fastconformer_hybrid_large_pc | 4.52% | 1% | 95.48% |
theodotus/stt_ua_fastconformer_hybrid_large_pc | 4% | 1.02% | 96% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
theodotus/stt_uk_squeezeformer_ctc_xs | 10.78% | 2.29% | 89.22% |
theodotus/stt_uk_squeezeformer_ctc_sm | 8.2% | 1.75% | 91.8% |
theodotus/stt_uk_squeezeformer_ctc_ml | 5.91% | 1.26% | 94.09% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
taras-sereda/uk-pods-conformer | 6.75% | 1.41% | 93.25% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
tiny | 63.08% | 18.59% | 36.92% |
base | 52.1% | 14.08% | 47.9% |
small | 30.57% | 7.64% | 69.43% |
medium | 18.73% | 4.4% | 81.27% |
large (v1) | 16.42% | 3.93% | 83.58% |
large (v2) | 13.72% | 3.18% | 86.28% |
large (v3) | 20.53% | 5.28% | 79.47% |
turbo | 22.83% | 7.05% | 77.17% |
Quantized versions:
Model | WER | CER | Accuracy (words) |
---|---|---|---|
Yehor/whisper-large-v2-quantized-uk | 14.95% | 4.23% | 85.05% |
Yehor/whisper-large-v3-turbo-quantized-uk | 12.75% | 3.25% | 87.25% |
If you want to fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian
Model | WER | CER | Accuracy (words) |
---|---|---|---|
Flashlight Conformer | 19.15% | 2.44% | 80.85% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
robinhad/data2vec-large-uk | 31.17% | 7.31% | 68.83% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
v3 | 53.25% | 38.78% | 46.75% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
speechbrain/m-ctc-t-large | 57% | 10.94% | 43% |
Model | WER | CER | Accuracy (words) |
---|---|---|---|
v0.5 | 70.25% | 20.09% | 29.75% |
- How to train your own model using Kaldi
- How to train a KenLM model based on Ukrainian Wikipedia data: https://github.com/egorsmkv/ukwiki-kenlm
- Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit
- Dataset: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN (download it with Wget or the torrent file; downloading in a browser is speed-limited)
- Ukrainian subset: https://huggingface.co/datasets/google/fleurs/viewer/uk_ua/train
- Ukrainian broadcast speech: https://huggingface.co/datasets/Yehor/broadcast-speech-uk
- Ukrainian subsets:
- Transcriptions: https://www.dropbox.com/s/ohj3y2cq8f4207a/transcriptions.zip?dl=0
- Audio files: https://www.dropbox.com/s/v8crgclt9opbrv1/data.zip?dl=0
- ASR Corpus created using a Telegram bot for Ukrainian: https://huggingface.co/datasets/Yehor/tg-voices-uk
- Speech Dataset with Ukrainian: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
- Mozilla Common Voice has the Ukrainian dataset: https://commonvoice.mozilla.org/uk/datasets
- M-AILABS Ukrainian Corpus: http://www.caito.de/data/Training/stt_tts/uk_UK.tgz
- Espreso TV subset: https://blog.gdeltproject.org/visual-explorer-quick-workflow-for-downloading-belarusian-russian-ukrainian-transcripts-translations/
- VoxForge Repository: http://www.repository.voxforge1.org/downloads/uk/Trunk/
- Ukrainian LMs: https://huggingface.co/Yehor/kenlm-uk
- WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst
- Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert (demo: https://huggingface.co/spaces/Yehor/punctuation-uk)
- NeMo Forced Aligner: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner
- Aligner for wav2vec2-bert models: https://github.com/egorsmkv/w2v2-bert-aligner
- Aligner based on FasterWhisper (mostly for TTS): https://github.com/patriotyk/narizaka
- Aligner based on Kaldi: https://github.com/proger/uk
- A space to calculate ASR metrics: https://huggingface.co/spaces/Yehor/evaluate-asr-outputs
Test sentence with stresses:
К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.
Without stresses:
Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
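The two test sentences differ only in the `+` markers, which in the sample above appear immediately before each stressed vowel. Assuming that convention, the following sketch shows how stressed text could be converted back to plain text, or to the conventional combining acute accent (U+0301) placed after the vowel. The function names are hypothetical, and individual TTS models may expect a different stress notation.

```python
import re

STRESS_MARK = "+"            # marker used in the test sentence above
COMBINING_ACUTE = "\u0301"   # Unicode combining acute accent
UKRAINIAN_VOWELS = "аеєиіїоуюяАЕЄИІЇОУЮЯ"

# Matches a stress marker directly followed by a Ukrainian vowel.
_STRESSED_VOWEL = re.compile(re.escape(STRESS_MARK) + "([" + UKRAINIAN_VOWELS + "])")


def strip_stress(text):
    """Remove all '+' stress markers, yielding the plain sentence."""
    return text.replace(STRESS_MARK, "")


def plus_to_acute(text):
    """Rewrite '+vowel' as 'vowel' followed by a combining acute accent."""
    return _STRESSED_VOWEL.sub(lambda m: m.group(1) + COMBINING_ACUTE, text)


if __name__ == "__main__":
    stressed = "м+істо в Хмельн+ицькій +області Укра+їни"
    print(strip_stress(stressed))
    print(plus_to_acute(stressed))
```

Note that `plus_to_acute` leaves any `+` that is not followed by a vowel untouched, so ordinary plus signs in the text survive the conversion.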
RAD-TTS
- RAD-TTS, the voice "Lada"
- RAD-TTS with three voices, voices of Lada, Tetiana, and Mykyta
Coqui TTS
- v1.0.0 using M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
- v2.0.0 using Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)
Neon TTS
- Coqui TTS model implemented in the Neon Coqui TTS Python Plugin. An interactive demo is available on Hugging Face. This model and others can be downloaded from Hugging Face, and more information can be found at neon.ai.
Balacoon TTS
- Balacoon TTS, voices of Lada, Tetiana and Mykyta. Blog post on model release.
- Open Text-to-Speech voices for 🇺🇦 Ukrainian: https://huggingface.co/datasets/Yehor/opentts-uk
- Voice LADA, female
- Voice TETIANA, female
- Voice KATERYNA, female
- Voice MYKYTA, male
- Voice OLEKSA, male
- https://github.com/NeonBohdan/ukrainian-accentor-transformer
- https://github.com/lang-uk/ukrainian-word-stress
- https://github.com/egorsmkv/ukrainian-accentor
ipa-uk:
Charsiu G2P:
- https://huggingface.co/charsiu/g2p_multilingual_byT5_tiny_16_layers_100
- https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100
- https://huggingface.co/charsiu/g2p_multilingual_mT5_small
Other:
- https://github.com/dmort27/epitran
- https://montreal-forced-aligner.readthedocs.io/en/v1.0/pretrained_models.html
- https://huggingface.co/darkproger/ukpron
- Tool to build a high-quality text-to-speech (TTS) corpus from audio and text books: https://github.com/patriotyk/narizaka
- A model to do Text Normalization: https://huggingface.co/skypro1111/mbart-large-50-verbalization
- Audio Aesthetics for opentts-uk: https://huggingface.co/datasets/Yehor/opentts-uk-aesthetics