This is the code for the paper "Vid2speech: Speech Reconstruction from Silent Video" by Ariel Ephrat and Shmuel Peleg, to appear at ICASSP 2017.
If you find this code useful for your research, please cite:
@inproceedings{ephrat2017vid2speech,
  title     = {Vid2Speech: speech reconstruction from silent video},
  author    = {Ariel Ephrat and Shmuel Peleg},
  booktitle = {2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2017},
}
The code depends on Keras, h5py, NumPy, OpenCV (cv2), SciPy, and MoviePy, all of which can be installed with pip:
pip install keras h5py numpy scipy opencv-python moviepy
Keras was used with the TensorFlow backend.
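Keras reads its backend setting from ~/.keras/keras.json. If you want to verify which backend is active, a minimal check (not part of this repo) is:

from keras import backend as K
print(K.backend())  # should print 'tensorflow'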
Download one speaker's videos from the GRID Corpus and save them directly in the dataset/ folder.
This code has been tested on the high-quality videos of speakers 2 (male) and 4 (female).
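As a quick sanity check (not part of the repo), you can count the downloaded videos; run this from the repository root:

import glob
videos = glob.glob('dataset/*.mpg')
print('found %d videos' % len(videos))  # one full GRID speaker has 1000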
Next, strip the audio track from each video and save it under the same filename with the .mpg extension replaced by .wav. The supplied strip_audio.sh script can be used for this (it requires ffmpeg):
cd dataset
sh strip_audio.sh
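If you prefer not to use the shell script, a rough Python equivalent is sketched below (assuming Python 3, ffmpeg on your PATH, and that you run it from the dataset/ folder):

import glob, os, subprocess

for mpg in glob.glob('*.mpg'):
    wav = os.path.splitext(mpg)[0] + '.wav'
    # -vn drops the video stream, leaving only the audio track as a .wav
    subprocess.run(['ffmpeg', '-y', '-i', mpg, '-vn', wav], check=True)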
Then preprocess the data and train the network:
cd ../code
python process_data.py
python train.py
Training on one entire GRID speaker (1000 videos) with the supplied settings takes ~12 hours on a single Titan Black GPU.
To generate speech samples from the trained model, run:
python gen_samples.py
Samples will appear under ../results/samples/
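A small optional sketch (not part of the repo) to list whatever gen_samples.py produced; the exact file types depend on the script:

import os
sample_dir = '../results/samples/'
for name in sorted(os.listdir(sample_dir)):
    size = os.path.getsize(os.path.join(sample_dir, name))
    print('%s\t%d bytes' % (name, size))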
To use pre-trained weights, the data must first be preprocessed with process_data.py. Then run:
python predict.py --weight_path <path_to_weights>
python gen_samples.py --respath '../pretrained_results'
Weights for a pre-trained model of speaker 2 are supplied in pretrained_weights/s2.hdf5. For example:
python predict.py --weight_path '../pretrained_weights/s2.hdf5'
python gen_samples.py --respath '../pretrained_results'
Samples will appear under ../pretrained_results/samples/
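If you are curious about the contents of the supplied weights file, h5py (already a dependency) can list its groups and datasets. A minimal sketch, assuming you run it from the code/ folder:

import h5py

def show(name):
    print(name)  # name of a group or dataset holding layer weights

with h5py.File('../pretrained_weights/s2.hdf5', 'r') as f:
    f.visit(show)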