Compares the performance of STT (speech-to-text transcription) models on classroom audio data.
https://tcu.app.box.com/file/1339927312124?s=gdr527kvqai17wtnhqc03kr8yq2iwc2h
The audio is completely unprocessed.
AWS Transcribe 1
- https://aws.amazon.com/transcribe/
- Used the English general model
- NO CUSTOMIZATION
- NO special vocabulary
- Results in aws_transcribe_1.json (see the sketch below)
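A minimal sketch of how this run could be reproduced with boto3; the region, bucket, audio key, and job name below are placeholders, not the ones actually used:

```python
# Hypothetical reproduction of the AWS Transcribe 1 run: general English model,
# no custom language model, no custom vocabulary.
import time
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")  # placeholder region

job_name = "classroom-audio-transcribe-1"          # placeholder job name
audio_uri = "s3://my-bucket/classroom_audio.wav"   # placeholder S3 location

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": audio_uri},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the job finishes; the transcript JSON can then be downloaded
# from the returned URI and saved as aws_transcribe_1.json.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if job["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)

print(job["TranscriptionJob"].get("Transcript", {}).get("TranscriptFileUri"))
```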
Whisper 1
- https://github.com/openai/whisper/
- Used Large v2 model
- NO CUSTOMIZATION
- NO special vocabulary
- Results in whisper_1.json (see the sketch below)
- Took around 10 minutes to run on 1 hour of audio, on ml.cs.tcu.edu
- RESULTS: better than AWS Transcribe 1
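A minimal sketch of the Whisper 1 run, assuming the openai-whisper package; the audio path is a placeholder:

```python
# Hypothetical reproduction of the Whisper 1 run: Large v2, default decoding,
# no prompt and no special vocabulary.
import json
import whisper

model = whisper.load_model("large-v2")

# Plain transcription of the (placeholder) classroom recording.
result = model.transcribe("classroom_audio.wav", language="en")

with open("whisper_1.json", "w") as f:
    json.dump(result, f, indent=2)
```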
Whisper 2
- First used noise reduction on the audio
- Then used Large v2 model
- NO CUSTOMIZATION
- NO special vocabulary
- Results in whisper_2.json (see the sketch below)
- Took around 10 minutes to run on 1 hour of audio, on ml.cs.tcu.edu
- RESULTS: worse than Whisper 1
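A minimal sketch of the Whisper 2 pipeline. The notes do not name the noise-reduction tool, so the noisereduce package stands in for it here as an assumption; file paths are placeholders:

```python
# Hypothetical reproduction of the Whisper 2 run: denoise first, then transcribe
# with the same Large v2 model and no customization.
import json
import noisereduce as nr
import soundfile as sf
import whisper

# Load the original recording (assumed mono) and apply spectral-gating noise reduction.
audio, sample_rate = sf.read("classroom_audio.wav")
reduced = nr.reduce_noise(y=audio, sr=sample_rate)
sf.write("classroom_audio_denoised.wav", reduced, sample_rate)

model = whisper.load_model("large-v2")
result = model.transcribe("classroom_audio_denoised.wav", language="en")

with open("whisper_2.json", "w") as f:
    json.dump(result, f, indent=2)
```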
Files:
- Original audio file
- Reduced noise audio file
- AWS Transcribe 1
- Whisper 1
- Whisper 2
- AssemblyAI results
I used the WER (word error rate) metric to compare the results of the models; the lower the WER, the better the model.
Unfortunately, I do not have a ground-truth transcript for this audio.
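If a ground-truth transcript becomes available, the comparison could be scored with the jiwer package. A sketch, assuming placeholder file names and that each model's output has already been flattened to plain text:

```python
# Hypothetical WER comparison: requires a reference transcript, which does not
# exist yet for this audio. Lower WER is better.
import jiwer

with open("ground_truth.txt") as f:          # placeholder reference transcript
    reference = f.read()

hypotheses = {
    "aws_transcribe_1": open("aws_transcribe_1.txt").read(),
    "whisper_1": open("whisper_1.txt").read(),
    "whisper_2": open("whisper_2.txt").read(),
}

for name, hypothesis in hypotheses.items():
    print(name, jiwer.wer(reference, hypothesis))
```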