
Adding fa/en matcha-tts model for android #1779

Closed
mah92 opened this issue Jan 31, 2025 · 22 comments · Fixed by #1834

Comments

@mah92

mah92 commented Jan 31, 2025

Hi,
I have trained a matcha-tts model for fa/en, and it sounds very good. After reading the docs, I need your opinion on some points before trying to merge it. First of all, I saw the matcha model sample in the TtsEngine.kt file.

  1. Should I add the metadata to the model as in vits/piper?

  2. I have a 24 kHz dataset which degrades when converted to 22050 Hz, so I switched to 24 kHz and also trained a vocoder. Since 22050 Hz is hardcoded in matcha-tts itself, I wondered whether any change would be needed in sherpa.

  3. In the matcha-tts repo, there is an option to embed the vocoder in the onnx model (so it performs like an end-to-end model). I saw in the documentation that you pass the vocoder explicitly. Is it possible (and wouldn't it be better) to embed the vocoder and treat the model like an end-to-end model (like vits)?

  4. Which hifi-gan version do you propose (v1, v2, or v3)? You used v2 in the examples, but matcha-tts defaults to v1... I trained v1 and v3 but could not hear any difference.

  5. I found two repositories for training hifigan:
    rhasspy/hifi-gan-train (containing an onnx export)
    jik876/hifi-gan (which I currently use, but I could not find an onnx export, so I wrote one and am working on it)
    Does it matter to sherpa which one I use?

P.S.: It is not directly relevant to sherpa, but do you know any standard way to convert jik876/hifi-gan to onnx usable in sherpa? I did not find any.

@mah92 mah92 changed the title Adding a matcha-tts model for android Adding fa/en matcha-tts model for android Jan 31, 2025
@csukuangfj
Collaborator

csukuangfj commented Feb 1, 2025

Should I add the metadata to the model as in vits/piper?

Yes, you have to do that. Please refer to
https://github.com/k2-fsa/icefall/blob/master/egs/ljspeech/TTS/matcha/export_onnx.py#L174
for how to do that.
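For reference, the metadata step can be sketched as follows. This is a minimal sketch, not the icefall implementation; the key names and values below are assumptions modeled on the linked export_onnx.py script, so verify them against that script before use.

```python
# Sketch: attach sherpa-onnx style metadata to an exported ONNX model.
# The key names/values are assumptions based on the icefall
# export_onnx.py script linked above; check that script for the
# authoritative list.

meta_data = {
    "model_type": "matcha-tts",
    "language": "Persian+English",  # hypothetical value for a fa/en model
    "voice": "fa",
    "sample_rate": "22050",         # must match the training sample rate
    "version": "1",
}

def add_meta_data(model, meta_data):
    """Add key/value pairs to a model's metadata_props.

    `model` is expected to behave like an onnx.ModelProto: its
    metadata_props container provides an add() method returning an
    entry with writable .key and .value fields.
    """
    for key, value in meta_data.items():
        prop = model.metadata_props.add()
        prop.key = key
        prop.value = str(value)
    return model

# With the real onnx package this would be used roughly as:
#   import onnx
#   model = onnx.load("model-steps-3.onnx")
#   add_meta_data(model, meta_data)
#   onnx.save(model, "model-steps-3.onnx")
```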

As 22050Hz was hardcoded in the matcha-tts itself, I wondered if any change would be needed in sherpa.

You can use an arbitrary sample rate, as long as it matches the one used in training. Please follow the above link
to specify the exact sample rate in the metadata. You don't need to change anything in sherpa-onnx to use a sample rate other than 22.05 kHz.

Is it possible (and isn't it better) to embed the vocoder and treat the model like and end to end model (like vits)?

Yes, it is possible, but we chose the other way. The current C++ code in sherpa-onnx assumes you split them.

The advantage is that you can replace the vocoder as you like and keep the acoustic model unchanged.

Which hifi-gan version do you propose (v1, v2, or v3)?

You can use any one you like. However, bear in mind that v1, v2, and v3 differ not only in model file size but also in speed.
If you don't care about speed or model file size, you can use v1.

Is there any difference in sherpa to use which?

Please follow https://github.com/k2-fsa/icefall/blob/master/egs/ljspeech/TTS/matcha/export_onnx_hifigan.py#L40

If you want to use sherpa-onnx, then you must ensure that your vocoder's inputs and outputs match the ones in the above link.

(Here, "matches" means the number of inputs/outputs and the shapes of those inputs/outputs.)

You can consider the vocoder as an API, as long as your vocoder matches the API specification, then it is OK.

The internals of the API are invisible to you. You can use any implementation other than hifigan.
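The "vocoder as an API" idea above can be sketched as a shape-contract check. The tensor names and shapes below are assumptions modeled on the export_onnx_hifigan.py reference, not a verified specification.

```python
# Sketch: any vocoder ONNX model is acceptable as long as its
# inputs/outputs match the reference export. The names/shapes here
# are assumptions based on export_onnx_hifigan.py, not a spec.

EXPECTED = {
    "inputs":  {"mel": ("N", 80, "T")},  # (batch, mel bins, frames)
    "outputs": {"audio": ("N", "L")},    # (batch, samples)
}

def matches_api(model_io, expected=EXPECTED):
    """Check input/output counts and shapes against the contract.

    String dimensions ("N", "T", "L") are treated as free/dynamic
    axes; integer dimensions must match exactly.
    """
    for kind in ("inputs", "outputs"):
        if len(model_io.get(kind, {})) != len(expected[kind]):
            return False
        for shape, ref in zip(model_io[kind].values(), expected[kind].values()):
            if len(shape) != len(ref):
                return False
            if any(isinstance(r, int) and r != s for s, r in zip(shape, ref)):
                return False
    return True
```

Under this view, an alternative vocoder (hifigan or otherwise) is usable as soon as `matches_api` would accept its I/O description.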

but do you know any standard way to convert jik876/hifi-gan to onnx usable in sherpa?

Please see the comment above.

@csukuangfj
Collaborator

If you make your model public, then we can add it to sherpa-onnx and you don't need to write any code on your side.

@mah92
Author

mah92 commented Feb 1, 2025

Thank you. I will report back.

@mah92
Author

mah92 commented Feb 9, 2025

Uploaded two fa/en matcha models:
https://huggingface.co/mah92/Khadijah-FA_EN-Matcha-TTS-Model (female)
https://huggingface.co/mah92/Musa-FA_EN-Matcha-TTS-Model (male)
They seem to be excellent models.
They use the standard universal_v1 vocoders (22050 Hz).
The tokens.txt file is provided in the above repos (they needed some extra IPA tokens for Persian).
For these models to work, the system language in Android should be set by the user to Persian (not "system"); otherwise, Persian sentences are not read.
I did not add the sherpa metadata for the above models (should I?).
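For anyone reusing those tokens.txt files: piper/sherpa style token files commonly hold one "<symbol> <integer id>" pair per line (this layout is an assumption; verify it against the actual files in the repos above). A small parsing sketch:

```python
# Sketch: parse a piper/sherpa style tokens.txt into {symbol: id}.
# Assumed layout (verify against the real files): one token per line,
# "<symbol> <integer id>", where the symbol may itself be a space.

def load_tokens(lines):
    """Map each token symbol (including extra IPA symbols) to its id."""
    token2id = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the LAST space so a literal space symbol survives.
        sym, _, idx = line.rpartition(" ")
        token2id[sym] = int(idx)
    return token2id
```

This kind of check makes it easy to confirm that the extra Persian IPA tokens actually made it into the file with unique ids.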

@mah92
Author

mah92 commented Feb 9, 2025

By the way, thank you for introducing the matcha models in your repository. I could not get close to this quality/speed with piper-vits...

@mah92
Author

mah92 commented Feb 10, 2025

Notes:
I trained many hifigan models, but I could not get close to the standard universal v1 hifigan (22050 Hz).
So I retreated and switched back to 22050 Hz. The main problem with my voice quality was the tokens, not the vocoder; it was fixed when I switched to the sherpa tokens (my augmented version, which I already used for Reza and Ibrahim, now available on Hugging Face).
I still used the original matcha repo for training instead of icefall, mainly because of these problems:

  • It was not clear how to use another language.
  • It was not clear how to use different cleaners.
  • It was not clear how to use piper_phonemize.
  • icefall was also hard for me to install. After some struggles with many different Python versions, I switched to docker, which still needed some packages installed (the icefall installation guide is outdated).
  • The docker image could not use the graphics card with the provided commands (here); I needed to use a recipe which shared the graphics device completely (maybe that went too far).
  • In the end it was OK, but the usage documentation was still lacking (yes, I read here). I had to create a dummy ljspeech model and apply some hacks...

After all that, icefall was the only way I could export a good onnx model with metadata. The original repo was messy and needed many hard-codings to get right, but it was simpler, so I shared the onnx created by the original repo on Hugging Face so that it is reproducible. I don't know if I did right...

@mah92
Author

mah92 commented Feb 10, 2025

Here are the files I used to set up icefall, in case they are helpful.

icefall-recipe.zip

@mah92
Author

mah92 commented Feb 10, 2025

I also needed the following change; otherwise I could not export hifigan with 32 GB of PC RAM.
Inside export_onnx_hifigan.py, I reduced memory usage by changing 100000 to 10000:
x = torch.ones(1, 80, 10000, dtype=torch.float32)
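The change described in that comment can be written out as a small sketch. The 80 mel bins are an assumption based on the usual HiFi-GAN configuration; the real script builds the dummy input with torch.

```python
# Sketch of the memory-saving tweak to export_onnx_hifigan.py:
# shrink the dummy mel input used for ONNX export tracing from
# 100000 frames to 10000. A smaller tracing input lowers peak RAM
# during export without changing the exported graph's dynamic axes.

N_MELS = 80          # assumed mel-bin count for the HiFi-GAN vocoder
FRAMES = 10_000      # reduced from 100000

def dummy_mel_shape(batch=1, n_mels=N_MELS, frames=FRAMES):
    """Shape of the tracing input, i.e. x = torch.ones(1, 80, 10000)."""
    return (batch, n_mels, frames)

# In the export script the line would become:
#   x = torch.ones(*dummy_mel_shape(), dtype=torch.float32)
```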

@mah92
Author

mah92 commented Feb 10, 2025

As an inexperienced user, I have a noob opinion: maybe it would be better to provide a minimal script just for adding metadata to an onnx model, like the one you provided for piper-vits...
Or maybe remove the need for metadata in onnx, as it is sometimes hard for other noobs like me to get it right :)

@csukuangfj
Collaborator

It is OK not to use icefall for the matcha-tts training.

It is also OK to use the original matcha-tts repo for training.

Please wait while we add support for your model in sherpa-onnx.

@mah92
Author

mah92 commented Feb 10, 2025

Thanks.
Would you kindly not remove the names? The male is Musa and the female is Khadijah... Without the names, they would not be recognizable...
@csukuangfj

@mah92
Author

mah92 commented Feb 10, 2025

I also noticed that the male voice (Musa) used in https://huggingface.co/spaces/k2-fsa/text-to-speech is quite noisy (worse than my samples). Are you using the right vocoder? It works best with univ_v1.
The female (Khadijah) voice is good.
@csukuangfj

@csukuangfj
Collaborator

Where is your universal v1 vocoder?

The Hugging Face space uses the hifigan vocoder.

@csukuangfj
Collaborator

Thanks.
Would you kindly not remove the names? The male is musa and the female is khadijah... They would not be recognizable...
@csukuangfj

Sure, I will rename it tomorrow.

@mah92
Author

mah92 commented Feb 10, 2025

It is mentioned in the hifigan repo: they say that the universal vocoder is better for languages other than English. I tested this and can confirm it:
https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y

Note: I do not know if this is your main problem, as the female voice seems OK.
@csukuangfj

@mah92
Author

mah92 commented Feb 10, 2025

Another note: we tested the arm64 and armv7a android versions, and they did not run on either of the two devices we tested. Previous versions worked... @csukuangfj

@csukuangfj
Collaborator

Can you show the logcat logs?

@mah92
Author

mah92 commented Feb 10, 2025

02-10 23:25:31.120 25999 25999 I sherpa-onnx-tts-engine: Init Next-gen Kaldi TTS
02-10 23:25:31.120 25999 25999 I sherpa-onnx-tts-engine: data dir is matcha-tts-fa_en-female/espeak-ng-data
02-10 23:25:31.124 464 464 I Layer : id=5417 onRemoved com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0
02-10 23:25:31.124 464 464 I BufferQueueConsumer: com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0 disconnect(C)
02-10 23:25:31.124 464 464 I BufferQueue: com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0 ~BufferQueueCore
02-10 23:25:31.170 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.299 1292 1608 D NetworkController.MobileSignalController(0/1): onSignalStrengthsChanged signalStrength=SignalStrength: 19 99 -12 -200 -12 -200 -1 36 -74 -13 40 2147483647 0 2147483647 19 46 -74 0x4000 P gsm|lte EEGE use_rsrp_and_rssnr_for_lte_level [-128, -118, -108, -98] [-115, -105, -95, -85] level=4
02-10 23:25:31.299 1292 1608 D NetworkController.MobileSignalController(0/1): getMobileIconGroup(): 13
02-10 23:25:31.374 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.614 2617 2617 I chatty : uid=10129(com.sec.android.inputmethod) identical 1 line
02-10 23:25:31.854 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.960 497 561 D AALLightSensor: newLux = 86, [76, 76] -> 86
02-10 23:25:32.069 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:32.171 1944 1944 D io_stats: !@ 179,0 r 4593819 95103978 w 759157 13512676 d 169491 7255924 f 204761 279542 iot 2947124 2836740 th 51200 51200 47232 pt 6848 inp 0 0 55080.821
02-10 23:25:32.318 1018 1235 D GameManagerService: handleForegroundChange(). pkgName: com.k2fsa.sherpa.onnx.tts.engine, clsName: com.k2fsa.sherpa.onnx.tts.engine.MainActivity,FgActivityName:com.k2fsa.sherpa.onnx.tts.engine/.MainActivity,userID:0
02-10 23:25:32.318 1018 1235 D GameManagerService: handleForegroundChange(). same package. game has never resumed yet. ignore
02-10 23:25:32.310 2617 2617 I chatty : uid=10129(com.sec.android.inputmethod) identical 1 line
02-10 23:25:32.550 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:32.552 25999 25999 I sherpa-onnx-tts-engine: newDataDir: /storage/emulated/0/Android/data/com.k2fsa.sherpa.onnx.tts.engine/files
02-10 23:25:32.554 25999 25999 D AndroidRuntime: Shutting down VM
02-10 23:25:32.562 25999 25999 E AndroidRuntime: FATAL EXCEPTION: main
02-10 23:25:32.562 25999 25999 E AndroidRuntime: Process: com.k2fsa.sherpa.onnx.tts.engine, PID: 25999
02-10 23:25:32.562 25999 25999 E AndroidRuntime: java.lang.RuntimeException: Unable to start activity ComponentInfo{com.k2fsa.sherpa.onnx.tts.engine/com.k2fsa.sherpa.onnx.tts.engine.MainActivity}: java.lang.IllegalArgumentException: Please specify a TTS model
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3160)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3303)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:78)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:108)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:68)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1991)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:106)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.os.Looper.loop(Looper.java:216)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.main(ActivityThread.java:7258)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:494)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:975)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: Caused by: java.lang.IllegalArgumentException: Please specify a TTS model
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.TtsKt.getOfflineTtsConfig(Tts.kt:218)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.TtsKt.getOfflineTtsConfig$default(Tts.kt:186)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.TtsEngine.initTts(TtsEngine.kt:194)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.TtsEngine.createTts(TtsEngine.kt:174)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.MainActivity.onCreate(MainActivity.kt:73)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Activity.performCreate(Activity.java:7353)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Activity.performCreate(Activity.java:7344)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1275)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3140)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: ... 11 more

@mah92
Author

mah92 commented Feb 10, 2025

The language should be set to "fas" inside TtsEngine.kt (not "fa"). Could that be the problem? Could you provide the exact modification to TtsEngine.kt as a reference? The tokens.txt is set as metadata in the onnx, so there is no need for it. Am I correct?

    // Example 8
    // matcha-icefall-en_US-ljspeech
    // https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/matcha.html#matcha-icefall-en-us-ljspeech-american-english-1-female-speaker
    // modelDir = "matcha-icefall-en_US-ljspeech"
    // acousticModelName = "model-steps-3.onnx"
    // vocoder = "hifigan_v2.onnx"
    // dataDir = "matcha-icefall-en_US-ljspeech/espeak-ng-data"
    // lang = "eng"

@mah92
Author

mah92 commented Feb 10, 2025

I wish I could check the model speed and set the optimal number of steps accordingly...

@csukuangfj
Collaborator

Another note: we tested the arm64 and armv7a android versions, and they did not run on either of the two devices we tested. Previous versions worked... @csukuangfj

Fixed in #1841

@csukuangfj
Collaborator

I also noticed that the male voice (Musa) used in https://huggingface.co/spaces/k2-fsa/text-to-speech is quite noisy (worse than my samples). Are you using the right vocoder? It works best with univ_v1. The female (Khadijah) voice is good. @csukuangfj

Please provide the onnx model for your vocoder by following what we are doing in icefall.
