⚡ Some experiment with NeMo ⚡
- Model: QuartzNet is a smaller version of Jaser model
- I list the word error rate (WER) with and without LM of major ASR tasks.
Task | CER (%) | WER (%) | +LM WER (%) |
---|---|---|---|
VIVOS (TEST) | 6.80 | 18.02 | 15.72 |
VLSP2018 | 6.87 | 16.26 | N/A |
VLSP2020 T1 | 14.73 | 30.96 | N/A |
VLSP2020 T2 | 41.67 | 69.15 | N/A |
Model was trained with ~500 hours Vietnamese speech dataset, was collected from youtube, radio, call center(8k), text to speech data and some public dataset (vlsp, vivos, fpt). It is very small model (13M parameters) make it inference so fast ⚡
- ctcdecoder, kemlm for LM Decode
pip install ds-ctcdecoder
- and some python libraries:
torch, numpy, librosa, flask, flask_socketio, requests,...
-
Vietnamese Model (pretrained):
python flask_upload_record_vn.py
-
Video demo in Youtube: https://youtu.be/P3mhEngL1us
-
English Model (pretrained):
python flask_upload_record_en.py
- Conformer Model
- Transformer LM instead of kenlm
- Data augumentation: speed, noise, pitch shift, time shift,...
- FastAPI