In this tutorial, we'll learn how to use the Google Cloud Speech API to transcribe an audio file. The trickiest part of this API is converting your audio data into the correct format, which we'll do using FFmpeg.
YouTube Audio Library has a number of public domain audio files. The audio file transcribed in this tutorial (John_F_Kennedy_Inaugural_Speech_January_20_1961.mp3) was downloaded from this library.
Audio longer than 1 minute must reside on Google Cloud Storage (GCS), and at most 80 minutes of audio can be processed at a time (usage limit).
Since most audio files are longer than 1 minute, we'll skip transcribing files stored locally on your laptop and instead learn how to transcribe files stored on GCS.
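The two limits above can be turned into a quick planning check. The sketch below is only an illustration: the function names and constants are hypothetical, built from the numbers quoted in this tutorial, and are not part of the Speech API. Check the current quotas before relying on them.

```python
import math

# Limits as quoted in this tutorial (assumptions; quotas may have changed).
MAX_LOCAL_SECONDS = 60          # longer audio must be read from GCS
MAX_REQUEST_SECONDS = 80 * 60   # at most 80 minutes of audio at a time


def must_use_gcs(duration_seconds):
    """Return True when the audio is longer than 1 minute."""
    return duration_seconds > MAX_LOCAL_SECONDS


def chunks_needed(duration_seconds):
    """Return how many chunks of at most 80 minutes cover the audio."""
    if duration_seconds <= 0:
        return 0
    return math.ceil(duration_seconds / MAX_REQUEST_SECONDS)


if __name__ == "__main__":
    # A 45-second clip, a 15-minute speech, and a 3-hour recording.
    for seconds in (45, 15 * 60, 3 * 60 * 60):
        print(seconds, must_use_gcs(seconds), chunks_needed(seconds))
```

A 15-minute speech, for example, has to be read from GCS but fits in a single chunk.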
- If you haven't already, sign up for the free GCP trial credit
- Set up your project on GCP
- Create a GCS bucket
- Upload the JSON file for the service account key that you just created in the step above
- Clone this repo on the cloud shell:
$ git clone https://github.com/dsnair/GCP.git
$ cd GCP/speech
$ ls
On the cloud shell, run the vm-setup.sh script to create an Ubuntu-based VM instance named instance-1:
$ chmod +x vm-setup.sh
$ ./vm-setup.sh
OR
Do it manually on the Console:
- Create a VM instance
- Select Ubuntu under 'Boot disk' so that the installation commands in this tutorial work
- Select 'Allow full access to all Cloud APIs' under 'Access scopes'
- Connect to your VM instance
- Clone this repo on your VM (different from the cloud shell)
- On your VM, run the install.sh script to install FFmpeg, the Speech API client for Python, and gcsfuse:
$ chmod +x install.sh
$ ./install.sh
gcsfuse mounts a directory on your VM to a bucket on GCS. This lets the two locations, which live on different machines, see each other's content and stay in sync when that content changes.
Now, let's mount a local directory named local_bucket on your VM to your GCS bucket.
$ mkdir local_bucket
$ gcsfuse your-bucket-name local_bucket
$ cd local_bucket/
$ ls
Set the environment variable to point to the service account key:
$ export GOOGLE_APPLICATION_CREDENTIALS=path_to/service_account_file.json
Then transcribe the audio file in your bucket:
$ python transcribe_audio.py gs://your-bucket-name/John_F_Kennedy_Inaugural_Speech_January_20_1961.mp3
Output: transcribe_audio.py converts the audio file to a .wav file and prints the audio transcription to your shell.
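For orientation, here is a minimal sketch of how a long audio file on GCS can be transcribed from Python. It is not the contents of transcribe_audio.py: the helper names `collect_transcript` and `transcribe_gcs` are hypothetical, the language code is an assumption, and the calls assume a recent release (2.x or later) of the google-cloud-speech client library.

```python
def collect_transcript(results):
    """Join the top alternative of each result into one string."""
    return " ".join(
        result.alternatives[0].transcript.strip()
        for result in results
        if result.alternatives
    )


def transcribe_gcs(gcs_uri, timeout_seconds=3600):
    """Transcribe a LINEAR16 .wav file stored on GCS (hypothetical helper)."""
    # Imported here so the module loads even without the client library.
    from google.cloud import speech

    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        language_code="en-US",  # assumption: the recording is US English
    )
    # Asynchronous recognition is the call intended for audio over 1 minute.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout_seconds)
    return collect_transcript(response.results)
```

Calling `transcribe_gcs("gs://your-bucket-name/some_file.wav")` would block until the operation finishes or the timeout expires, then return the transcript as a single string.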
Here are the details on how the audio file is formatted using FFmpeg:
$ ffmpeg -i John_F_Kennedy_Inaugural_Speech_January_20_1961.mp3 -acodec pcm_s16le -ac 1 -f segment -segment_time 4800 John_F_Kennedy_Inaugural_Speech_January_20_1961_%d.wav
Output: The command above creates John_F_Kennedy_Inaugural_Speech_January_20_1961_0.wav in local_bucket, which is also visible in your-bucket-name.
- -i: takes an input audio file
- -acodec pcm_s16le: sets linear16 audio encoding
- -ac 1: sets mono channel
- -segment_time 4800: chunks the input audio file every 4800 seconds (80 minutes) and names the chunks filename_0.wav, filename_1.wav, etc.
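The same FFmpeg call can be assembled from Python, which makes each flag easy to see and adjust. This is a sketch under stated assumptions: `build_ffmpeg_command` and `convert` are hypothetical helper names, not functions taken from transcribe_audio.py, and `convert` assumes ffmpeg is installed and on your PATH.

```python
import os
import subprocess


def build_ffmpeg_command(input_path, segment_seconds=4800):
    """Build the argument list for the ffmpeg call described above."""
    stem, _ = os.path.splitext(input_path)  # drop the file extension
    return [
        "ffmpeg",
        "-i", input_path,                       # input audio file
        "-acodec", "pcm_s16le",                 # linear16 audio encoding
        "-ac", "1",                             # mono channel
        "-f", "segment",                        # write the output in chunks
        "-segment_time", str(segment_seconds),  # chunk length in seconds
        f"{stem}_%d.wav",                       # stem_0.wav, stem_1.wav, ...
    ]


def convert(input_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(input_path), check=True)


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_ffmpeg_command("speech.mp3")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a file name contains spaces.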
- Unmount your local directory
$ cd
$ fusermount -u local_bucket
- Delete your bucket to avoid incurring charges to your account
- Delete your VM instance