This blueprint enables you to create your own Speech-to-Text / Automatic Speech Recognition (ASR) dataset, or use the Common Voice dataset, to finetune an ASR model and improve performance for your specific language and use-case. All of this can be done locally (even on your laptop!), ensuring no data leaves your machine and safeguarding your privacy. Using Common Voice as a backbone enables this blueprint to support an impressively wide variety of languages! For the exact list of supported languages, please visit the Common Voice website.
📘 To explore this project further and discover other Blueprints, visit the Blueprints Hub.
👉 📖 For more detailed guidance on using this project, please visit our Docs here
- Python 3.10+
- Common Voice
- Hugging Face
- Gradio
Note: All scripts should be executed from the root directory of the repository.
This blueprint consists of three independent, yet complementary, components:
- Transcription app: A simple UI that lets you record your voice, pick any HF ASR model, and get an instant transcription.
- Dataset maker app: Another UI app that enables you to easily and quickly create your own Speech-to-Text dataset.
- Finetuning script: A script to finetune your own STT model, either using Common Voice data or your own local data created by the Dataset maker app.
- Use a virtual environment and install dependencies:
pip install -e .
  You will also need ffmpeg, e.g. for Ubuntu: `sudo apt install ffmpeg`, for Mac: `brew install ffmpeg`
- Try existing transcription HF models on your own language & voice locally:
python demo/transcribe_app.py
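Under the hood, an app like this wraps a Hugging Face ASR pipeline. A minimal sketch of the same idea (the model id and audio file name below are placeholders, not values mandated by the blueprint):

```python
# Minimal sketch of local transcription with a Hugging Face ASR pipeline.
# "openai/whisper-tiny" and "my_recording.wav" are placeholders -- swap in
# any ASR model from the Hub and a recording of your own voice.
from transformers import pipeline


def transcribe(audio_path: str, model_id: str = "openai/whisper-tiny") -> str:
    """Transcribe a local audio file with an ASR model from the HF Hub."""
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


if __name__ == "__main__":
    # Any ffmpeg-readable audio format works (wav, mp3, ...).
    print(transcribe("my_recording.wav"))
```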
- If you are not happy with the results, you can finetune a model with data in your language from Common Voice:
- Configure `config.yaml` with the model, the Common Voice dataset id from HF, and the hyperparameters of your choice.
- Finetune a model:

python src/speech_to_text_finetune/finetune_whisper.py
- Try the transcription app again with your newly finetuned model.
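The finetuning script is driven by `config.yaml`. The exact schema is defined by the blueprint itself, so treat the snippet below as an illustrative sketch rather than the authoritative format:

```yaml
# Illustrative sketch of a config.yaml -- the key names here are assumptions;
# check the repository's example config for the authoritative schema.
model_id: openai/whisper-small                     # HF model to finetune
dataset_id: mozilla-foundation/common_voice_17_0   # or a local data directory
language: Hindi
training_hp:                                       # training hyperparameters
  num_train_epochs: 3
  learning_rate: 1.0e-5
  per_device_train_batch_size: 16
```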
- If the results are still not satisfactory, create your own Speech-to-Text dataset and model.
- Create a dataset:
python demo/make_local_dataset_app.py
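A Speech-to-Text dataset like the one this app produces boils down to audio recordings paired with their transcriptions. Assuming a simple CSV index (the file and column names below are hypothetical; the app's actual layout may differ), pairing clips with text looks like:

```python
# Sketch of a local STT dataset: audio files plus a text index pairing each
# clip with its transcription. The "text.csv" name and its "audio"/"sentence"
# columns are assumptions for illustration, not the app's guaranteed layout.
import csv
from pathlib import Path


def build_index(data_dir: str) -> list[dict]:
    """Pair each recorded clip with its transcription from text.csv."""
    rows = []
    with open(Path(data_dir) / "text.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: audio, sentence
            rows.append({
                "audio": str(Path(data_dir) / row["audio"]),
                "sentence": row["sentence"],
            })
    return rows
```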
- Configure `config.yaml` with the model, the local data directory, and the hyperparameters of your choice.
- Finetune a model:

python src/speech_to_text_finetune/finetune_whisper.py
- Finally, try the transcription app again with the new model finetuned specifically for your own voice!
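Beyond listening to the results, a quick way to judge whether finetuning actually helped is word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-array edit distance: d[j] = distance(ref[:i], hyp[:j]).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (free if words match)
            prev = cur
    return d[-1] / max(len(ref), 1)


# Lower is better: 0.0 means a perfect transcription.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution out of 6 words
```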
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.