This is the official code repository of Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis.
Demo Page: https://shivammehta25.github.io/Diff-TTSG/
Huggingface Space: https://huggingface.co/spaces/shivammehta25/Diff-TTSG
We present Diff-TTSG, the first diffusion model that jointly learns to synthesise speech and gestures together. Our method is probabilistic and non-autoregressive, and can be trained on small datasets from scratch. In addition, to showcase the efficacy of these systems and pave the way for their evaluation, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
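To give a feel for what "denoising probabilistic" synthesis means here, below is a minimal, generic sketch of a DDPM-style reverse (denoising) sampling loop. The network, noise schedule, and feature dimensions are all assumptions for illustration; this is not the actual Diff-TTSG implementation or API.

```python
import numpy as np

def toy_denoiser(x, t):
    # Stand-in for a trained network that predicts the noise in x at step t
    # (hypothetical; the real model conditions on text and more).
    return 0.1 * x

def ddpm_sample(shape, n_steps=50, seed=0):
    """Generic DDPM-style reverse loop: start from Gaussian noise and
    iteratively denoise. Schedule values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)   # assumed noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)             # pure noise at step T
    for t in reversed(range(n_steps)):
        eps = toy_denoiser(x, t)               # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])  # posterior mean
        if t > 0:                              # no noise on the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# E.g. jointly sampling a stacked (acoustic + gesture) feature matrix:
sample = ddpm_sample((100, 80 + 45))  # 100 frames; feature sizes assumed
```

In the integrated setting, speech and gesture features share one such sampled matrix, which is what lets the model generate both modalities together rather than in separate pipelines.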
1. Clone this repository

```bash
git clone https://github.com/shivammehta25/Diff-TTSG.git
cd Diff-TTSG
```

2. Create a new environment (optional)

```bash
conda create -n diff-ttsg python=3.10 -y
conda activate diff-ttsg
```

3. Set up Diff-TTSG (this will install all the dependencies and download the pretrained models)

   - If you are using Linux or macOS, run:

     ```bash
     make install
     ```

   - Otherwise, install the dependencies and build the alignment module with:

     ```bash
     pip install -e .
     ```

4. Run the Gradio UI

```bash
gradio app.py
```

or use `synthesis.ipynb`.
The pretrained checkpoint is downloaded automatically when you run either `make install` or `gradio app.py`.
If you use or build on our method or code for your research, please cite our paper:
@inproceedings{mehta2023diff,
author={Mehta, Shivam and Wang, Siyang and Alexanderson, Simon and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
title={{D}iff-{TTSG}: {D}enoising probabilistic integrated speech and gesture synthesis},
year={2023},
booktitle={Proc. ISCA Speech Synthesis Workshop (SSW)},
pages={150--156},
doi={10.21437/SSW.2023-24}
}
The code in the repository is heavily inspired by the source code of
