Adding fa/en matcha-tts model for android #1779
Yes, you have to do that. Please refer to
You can use an arbitrary sample rate, as long as it matches the one used in the training. Please follow the above link
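For reference, a quick way to check which sample rate an exported model carries — assuming the export wrote it into the ONNX metadata, as the icefall exports do — is to read the custom metadata map; the `sample_rate` key name is an assumption, so check your own export:

```python
import onnxruntime as ort

# Assumption: the export stored a "sample_rate" entry in the ONNX
# metadata; the key name may differ in your export.
sess = ort.InferenceSession("vocoder.onnx")
meta = sess.get_modelmeta().custom_metadata_map  # dict of str -> str
print(meta.get("sample_rate"))
```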
Yes, it is possible, but we chose the other way. The current C++ code in sherpa-onnx assumes you split it. The pro is that you can replace the vocoder as you like and keep the acoustic model unchanged.
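To make the split concrete, here is a minimal sketch of the two-stage pipeline; all file and tensor names are hypothetical placeholders, so inspect your own exported models (e.g. with Netron) for the real ones:

```python
import numpy as np
import onnxruntime as ort

# Placeholder file and tensor names, not sherpa-onnx's actual names.
acoustic = ort.InferenceSession("matcha.onnx")
vocoder = ort.InferenceSession("vocoder.onnx")

tokens = np.array([[10, 4, 27, 3]], dtype=np.int64)  # dummy token IDs

# Stage 1: text tokens -> mel spectrogram (assumes the mel is output 0).
mel = acoustic.run(None, {"tokens": tokens})[0]

# Stage 2: mel -> waveform. Swapping the vocoder only changes this call;
# the acoustic model above stays untouched.
audio = vocoder.run(None, {"mel": mel})[0]
```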
You can use any one you like. However, bear in mind that v1, v2, v3 differ not only in model file size, but also in speed.
Please follow https://github.com/k2-fsa/icefall/blob/master/egs/ljspeech/TTS/matcha/export_onnx_hifigan.py#L40. If you want to use sherpa-onnx, then you must ensure that your vocoder's input and output match the ones in the above link. (You can consider the vocoder as an API: as long as your vocoder matches the API specification, it is OK. The internals of the API are invisible to sherpa-onnx, so you can use any implementation other than hifigan.)
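As an illustration of the "vocoder as an API" idea, a thin wrapper can pin the ONNX-facing signature regardless of the implementation inside. This is only a sketch assuming a HiFi-GAN-style generator that returns (N, 1, L); the authoritative interface remains the icefall script linked above:

```python
import torch

class VocoderWrapper(torch.nn.Module):
    """Pins the exported signature to mel (N, num_mels, T) -> audio (N, L).

    Assumes the wrapped generator returns (N, 1, L), as HiFi-GAN-style
    models do; adapt the squeeze if your vocoder's output shape differs.
    """

    def __init__(self, vocoder: torch.nn.Module):
        super().__init__()
        self.vocoder = vocoder

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.vocoder(mel).squeeze(1)
```

You would then export `VocoderWrapper(my_generator)` instead of the raw generator, so any implementation presents the same mel-in/audio-out interface.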
Please see the comment above.
If you make your model public, then we can add it to sherpa-onnx and you don't need to write any code on your side.
Thank you. I will report back.
Uploaded two fa/en matcha models.
By the way, thank you for introducing the matcha models in your repository. I could not get close to this quality/speed with piper-vits...
Notes:
In the end, icefall was the only way I could export a good ONNX model with metadata. The original repo was messy and needed many hard-codings to get right, but it was simpler, so I shared the ONNX created by the original repo on Hugging Face so that it could be reproduced. I don't know if I did the right thing...
Here are the files I used to get icefall up and running, in case they are helpful.
I also needed to do this; otherwise I could not export hifi_gan with 32 GB of PC RAM.
As an inexperienced user, here is a noob opinion: it might be better to provide a minimal script that just adds metadata to an ONNX model, like the one you provided for piper-vits...
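For what it's worth, attaching metadata to an existing ONNX file takes only a few lines with the `onnx` package; the key names below are illustrative guesses, so check which keys sherpa-onnx actually reads for matcha models:

```python
import onnx

model = onnx.load("model.onnx")
# Illustrative keys only; verify the exact set sherpa-onnx expects.
for key, value in {"model_type": "matcha-tts",
                   "language": "Persian+English",
                   "sample_rate": "24000"}.items():
    entry = model.metadata_props.add()
    entry.key = key
    entry.value = value
onnx.save(model, "model-with-metadata.onnx")
```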
It is OK not to use icefall for the matcha-tts training. It is also OK to use the original matcha-tts repo for training. Please wait while we add support for your model in sherpa-onnx.
Thanks.
I also noticed that the male voice (musa) used in https://huggingface.co/spaces/k2-fsa/text-to-speech is quite noisy (worse than my samples). Are you using the right vocoder? It works best with univ_v1.
Where is your univ_v1 vocoder? The Hugging Face space uses the hifigan vocoder.
Sure, I will rename it tomorrow.
It is mentioned in the hifigan repo: they say the Universal vocoder is better for languages other than English. I tested this and can confirm it. Note: I do not know if this is your main problem, as the female voice seems OK.
Another note: we tested the arm64 and armv7a Android versions and they did not run on either of the two devices we tested. Previous versions worked... @csukuangfj
Can you show the logcat logs?
02-10 23:25:31.120 25999 25999 I sherpa-onnx-tts-engine: Init Next-gen Kaldi TTS |
The language should be set to "fas" inside TtsEngine.kt (not "fa"). Is that the problem? Could you provide the exact modification to TtsEngine.kt as a reference? The tokens are stored as metadata in the ONNX model, so there should be no need for tokens.txt. Am I correct?
I wish I could check the model's speed and set the optimal number of steps accordingly...
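One rough way to do that (a sketch, not sherpa-onnx's API) is to measure the real-time factor of the exported model and raise the number of ODE steps, however your export exposes it, while RTF stays comfortably below 1.0:

```python
import time
import onnxruntime as ort

def real_time_factor(session, feed, sample_rate, n_runs=5):
    """Synthesis time divided by audio duration.

    Assumes the session's first output is the waveform; adapt the
    indexing to your model's actual outputs.
    """
    session.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        audio = session.run(None, feed)[0]
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / (audio.size / sample_rate)
```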
Fixed in #1841
Please provide the onnx model for your vocoder by following what we are doing in icefall. |
Hi
I have trained a matcha-tts model for fa/en, and it sounds very good. After reading the docs, I wanted to get your opinion on a few points before trying to merge it. First of all, I saw the matcha model sample in the TtsEngine.kt file.
Should I add the metadata to the model as in vits/piper?
I have a 24 kHz dataset that degrades when converted to 22050 Hz, so I switched to 24 kHz and also trained a vocoder at that rate. Since 22050 Hz was hardcoded in matcha-tts itself, I wondered whether any change would be needed in sherpa.
In the matcha-tts repo, there is an option to embed the vocoder in the ONNX model (so it performs like an end-to-end model). I saw in the documentation that you supply the vocoder explicitly. Is it possible (and isn't it better) to embed the vocoder and treat the model as end-to-end (like vits)?
Which hifi-gan version do you propose (v1, v2, or v3)? You used v2 in the examples, but matcha-tts defaults to v1... I trained v1 and v3 but could not hear any difference.
I found two repositories training hifigan:
rhasspy/hifi-gan-train (containing onnx export)
jik876/hifi-gan (which I currently use, but I did not find an ONNX export, so I wrote one and am working on it)
Does it make any difference to sherpa which one I use?
P.S.: It is not directly relevant to sherpa, but do you know of any standard way to convert jik876/hifi-gan to ONNX usable in sherpa? I did not find any.
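For reference, a plausible export along these lines: `Generator` and `AttrDict` come from the jik876/hifi-gan repo itself, while the checkpoint path, tensor names, dynamic axes, and opset below are assumptions that should be matched against icefall's export_onnx_hifigan.py before use with sherpa-onnx:

```python
import json
import torch
from env import AttrDict          # from jik876/hifi-gan
from models import Generator      # from jik876/hifi-gan

with open("config.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("g_02500000", map_location="cpu")  # your checkpoint
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()    # required before export/inference

# Dummy mel input: (batch, num_mels, frames).
mel = torch.randn(1, h.num_mels, 100)

# Tensor names and opset are assumptions; align them with the
# icefall export script before using the model with sherpa-onnx.
torch.onnx.export(
    generator,
    mel,
    "hifigan.onnx",
    input_names=["mel"],
    output_names=["audio"],
    dynamic_axes={"mel": {0: "N", 2: "T"}, "audio": {0: "N", 2: "L"}},
    opset_version=13,
)
```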