
Adding fa/en matcha-tts model for android #1779

Closed
mah92 opened this issue Jan 31, 2025 · 22 comments · Fixed by #1834

Comments

@mah92

mah92 commented Jan 31, 2025

Hi,
I have trained a matcha-tts model for fa/en, and it sounds very good. After reading the docs, I need your opinion on some points before trying to merge it. First of all, I saw the matcha model sample in the TtsEngine.kt file.

  1. Should I add the metadata to the model as in vits/piper?

  2. I have a 24 kHz dataset which degrades when converted to 22050 Hz, so I switched to 24 kHz and also trained a vocoder. Since 22050 Hz is hardcoded in matcha-tts itself, I wondered whether any change would be needed in sherpa.

  3. In the matcha-tts repo, there is an option to embed the vocoder in the onnx model (so it performs like an end-to-end model). I saw in the documentation that you pass the vocoder explicitly. Is it possible (and wouldn't it be better) to embed the vocoder and treat the model like an end-to-end model (like vits)?

  4. Which hifi-gan version do you propose (v1, v2, or v3)? You used v2 in the examples, but matcha-tts defaults to v1... I trained v1 and v3 but could not hear any difference.

  5. I found two repositories for training hifigan:
    rhasspy/hifi-gan-train (containing an onnx export)
    jik876/hifi-gan (which I currently use, but I could not find an onnx export, so I wrote one and am working on it)
    Does it matter to sherpa which one I use?

P.S.: It is not directly relevant to sherpa, but do you know any standard way to convert jik876/hifi-gan to onnx usable in sherpa? I did not find any.

@mah92 mah92 changed the title Adding a matcha-tts model for android Adding fa/en matcha-tts model for android Jan 31, 2025
@csukuangfj
Collaborator

csukuangfj commented Feb 1, 2025

Should I add the metadata to the model as in vits/piper?

Yes, you have to do that. Please refer to
https://github.com/k2-fsa/icefall/blob/master/egs/ljspeech/TTS/matcha/export_onnx.py#L174
for how to do that.
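For reference, the metadata step can be sketched as follows. This is a minimal sketch, not the icefall implementation; the key names and values below are assumptions modeled on the linked export_onnx.py script, so verify them against that script before use.

```python
# Sketch: attach sherpa-onnx style metadata to an exported ONNX model.
# The key names/values are assumptions based on the icefall
# export_onnx.py script linked above; check that script for the
# authoritative list.

meta_data = {
    "model_type": "matcha-tts",
    "language": "Persian+English",  # hypothetical value for a fa/en model
    "voice": "fa",
    "sample_rate": "22050",         # must match the training sample rate
    "version": "1",
}

def add_meta_data(model, meta_data):
    """Add key/value pairs to a model's metadata_props.

    `model` is expected to behave like an onnx.ModelProto: its
    metadata_props container provides an add() method returning an
    entry with writable .key and .value fields.
    """
    for key, value in meta_data.items():
        prop = model.metadata_props.add()
        prop.key = key
        prop.value = str(value)
    return model

# With the real onnx package this would be used roughly as:
#   import onnx
#   model = onnx.load("model-steps-3.onnx")
#   add_meta_data(model, meta_data)
#   onnx.save(model, "model-steps-3.onnx")
```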

As 22050Hz was hardcoded in the matcha-tts itself, I wondered if any change would be needed in sherpa.

You can use an arbitrary sample rate, as long as it matches the one used in training. Please follow the above link
to specify the exact sample rate in the metadata. You don't need to change anything in sherpa-onnx to use a sample rate other than 22.05 kHz.

Is it possible (and isn't it better) to embed the vocoder and treat the model like and end to end model (like vits)?

Yes, it is possible, but we chose the other way. The current C++ code in sherpa-onnx assumes you split them.

The advantage is that you can replace the vocoder as you like and keep the acoustic model unchanged.

Which hifi-gan version do you propose (v1, v2, or v3)?

You can use any one you like. However, bear in mind that v1, v2, and v3 differ not only in model file size but also in speed.
If you don't care about speed or model file size, you can use v1.

Is there any difference in sherpa to use which?

Please follow https://github.com/k2-fsa/icefall/blob/master/egs/ljspeech/TTS/matcha/export_onnx_hifigan.py#L40

If you want to use sherpa-onnx, then you must ensure that your vocoder's inputs and outputs match the ones in the above link.

(Here, "matches" means the number of inputs/outputs and the shapes of those inputs/outputs.)

You can consider the vocoder as an API, as long as your vocoder matches the API specification, then it is OK.

The internals of the API are invisible to you. You can use any implementation other than hifigan.
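The "vocoder as an API" idea above can be sketched as a shape-contract check. The tensor names and shapes below are assumptions modeled on the export_onnx_hifigan.py reference, not a verified specification.

```python
# Sketch: any vocoder ONNX model is acceptable as long as its
# inputs/outputs match the reference export. The names/shapes here
# are assumptions based on export_onnx_hifigan.py, not a spec.

EXPECTED = {
    "inputs":  {"mel": ("N", 80, "T")},  # (batch, mel bins, frames)
    "outputs": {"audio": ("N", "L")},    # (batch, samples)
}

def matches_api(model_io, expected=EXPECTED):
    """Check input/output counts and shapes against the contract.

    String dimensions ("N", "T", "L") are treated as free/dynamic
    axes; integer dimensions must match exactly.
    """
    for kind in ("inputs", "outputs"):
        if len(model_io.get(kind, {})) != len(expected[kind]):
            return False
        for shape, ref in zip(model_io[kind].values(), expected[kind].values()):
            if len(shape) != len(ref):
                return False
            if any(isinstance(r, int) and r != s for s, r in zip(shape, ref)):
                return False
    return True
```

Under this view, an alternative vocoder (hifigan or otherwise) is usable as soon as `matches_api` would accept its I/O description.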

but do you know any standard way to convert jik876/hifi-gan to onnx usable in sherpa?

Please see the comment above.

@csukuangfj
Collaborator

If you make your model public, then we can add it to sherpa-onnx and you don't need to write any code on your side.

@mah92
Author

mah92 commented Feb 1, 2025

Thank you. I will report back.

@mah92
Author

mah92 commented Feb 9, 2025

Uploaded two fa/en matcha models:
https://huggingface.co/mah92/Khadijah-FA_EN-Matcha-TTS-Model (female)
https://huggingface.co/mah92/Musa-FA_EN-Matcha-TTS-Model (male)
They seem to be excellent models.
They use the standard universal_v1 vocoders (22050 Hz).
The tokens.txt file is provided in the above repos (they needed some extra IPA tokens for Persian).
For these models to work, the system language in Android should be set by the user to Persian (not "system"); otherwise, Persian sentences are not read.
I did not add the sherpa metadata for the above models (should I?).
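For anyone reusing those tokens.txt files: piper/sherpa style token files commonly hold one "<symbol> <integer id>" pair per line (this layout is an assumption; verify it against the actual files in the repos above). A small parsing sketch:

```python
# Sketch: parse a piper/sherpa style tokens.txt into {symbol: id}.
# Assumed layout (verify against the real files): one token per line,
# "<symbol> <integer id>", where the symbol may itself be a space.

def load_tokens(lines):
    """Map each token symbol (including extra IPA symbols) to its id."""
    token2id = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the LAST space so a literal space symbol survives.
        sym, _, idx = line.rpartition(" ")
        token2id[sym] = int(idx)
    return token2id
```

This kind of check makes it easy to confirm that the extra Persian IPA tokens actually made it into the file with unique ids.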

@mah92
Author

mah92 commented Feb 9, 2025

By the way, thank you for introducing the matcha models in your repository. I could not get close to this quality/speed with piper-vits...

@mah92
Author

mah92 commented Feb 10, 2025

Notes:
I trained many hifigan models, but I could not get close to the standard universal v1 hifigan (22050 Hz).
So I retreated and switched back to 22050 Hz. The main problem with my voice quality was the tokens, not the vocoder; it was fixed when I switched to the sherpa tokens (my augmented version, which I already used for Reza and Ibrahim, now available on Hugging Face).
I still used the original matcha repo for training instead of icefall, mainly because of these problems:

  • It was not clear how to use another language.
  • It was not clear how to use different cleaners.
  • It was not clear how to use piper_phonemize.
  • icefall was also hard for me to install. After some struggles with many different Python versions, I switched to docker, which still needed some packages installed (the icefall installation guide is outdated).
  • The docker image could not use the graphics card with the provided commands (here); I needed to use a recipe which shared the graphics device completely (maybe that went too far).
  • In the end it was OK, but the usage documentation was still lacking (yes, I read here). I had to create a dummy ljspeech model and apply some hacks...

After all that, icefall was the only way I could export a good onnx model with metadata. The original repo was messy and needed many hard-codings to get right, but it was simpler, so I shared the onnx created by the original repo on Hugging Face so that it is reproducible. I don't know if I did right...

@mah92
Author

mah92 commented Feb 10, 2025

Here are the files I used to set up icefall, in case they are helpful.

icefall-recipe.zip

@mah92
Author

mah92 commented Feb 10, 2025

I also needed the following change; otherwise I could not export hifigan with 32 GB of PC RAM.
Inside export_onnx_hifigan.py, I reduced memory usage by changing 100000 to 10000:
x = torch.ones(1, 80, 10000, dtype=torch.float32)
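The change described in that comment can be written out as a small sketch. The 80 mel bins are an assumption based on the usual HiFi-GAN configuration; the real script builds the dummy input with torch.

```python
# Sketch of the memory-saving tweak to export_onnx_hifigan.py:
# shrink the dummy mel input used for ONNX export tracing from
# 100000 frames to 10000. A smaller tracing input lowers peak RAM
# during export without changing the exported graph's dynamic axes.

N_MELS = 80          # assumed mel-bin count for the HiFi-GAN vocoder
FRAMES = 10_000      # reduced from 100000

def dummy_mel_shape(batch=1, n_mels=N_MELS, frames=FRAMES):
    """Shape of the tracing input, i.e. x = torch.ones(1, 80, 10000)."""
    return (batch, n_mels, frames)

# In the export script the line would become:
#   x = torch.ones(*dummy_mel_shape(), dtype=torch.float32)
```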

@mah92
Author

mah92 commented Feb 10, 2025

As an inexperienced user, I have a noob opinion: maybe it would be better to provide a minimal script just for adding metadata to an onnx model, like the one you provided for piper-vits...
Or maybe remove the need for metadata in onnx, as it is sometimes hard for other noobs like me to get it right :)

@csukuangfj
Collaborator

It is OK not to use icefall for the matcha-tts training.

It is also OK to use the original matcha-tts repo for training.

Please wait while we add support for your model in sherpa-onnx.

@mah92
Author

mah92 commented Feb 10, 2025

Thanks.
Would you kindly not remove the names? The male is Musa and the female is Khadijah... Without the names, they would not be recognizable...
@csukuangfj

@mah92
Author

mah92 commented Feb 10, 2025

I also noticed that the male voice (Musa) used in https://huggingface.co/spaces/k2-fsa/text-to-speech is quite noisy (worse than my samples). Are you using the right vocoder? It works best with univ_v1.
The female (Khadijah) voice is good.
@csukuangfj

@csukuangfj
Collaborator

Where is your universal v1 vocoder?

The Hugging Face space uses the hifigan vocoder.

@csukuangfj
Collaborator

Thanks.
Would you kindly not remove the names? The male is musa and the female is khadijah... They would not be recognizable...
@csukuangfj

Sure, I will rename it tomorrow.

@mah92
Author

mah92 commented Feb 10, 2025

It is mentioned in the hifigan repo: they say that the universal vocoder is better for languages other than English. I tested this and can confirm it:
https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y

Note: I do not know if this is your main problem, as the female voice seems OK.
@csukuangfj

@mah92
Author

mah92 commented Feb 10, 2025

Another note: we tested the arm64 and armv7a android versions, and they did not run on either of the two devices we tested. Previous versions worked... @csukuangfj

@csukuangfj
Collaborator

Can you show the logcat logs?

@mah92
Author

mah92 commented Feb 10, 2025

02-10 23:25:31.120 25999 25999 I sherpa-onnx-tts-engine: Init Next-gen Kaldi TTS
02-10 23:25:31.120 25999 25999 I sherpa-onnx-tts-engine: data dir is matcha-tts-fa_en-female/espeak-ng-data
02-10 23:25:31.124 464 464 I Layer : id=5417 onRemoved com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0
02-10 23:25:31.124 464 464 I BufferQueueConsumer: com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0 disconnect(C)
02-10 23:25:31.124 464 464 I BufferQueue: com.sec.android.app.launcher/com.sec.android.app.launcher.activities.LauncherActivity[2273]#0 ~BufferQueueCore
02-10 23:25:31.170 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.299 1292 1608 D NetworkController.MobileSignalController(0/1): onSignalStrengthsChanged signalStrength=SignalStrength: 19 99 -12 -200 -12 -200 -1 36 -74 -13 40 2147483647 0 2147483647 19 46 -74 0x4000 P gsm|lte EEGE use_rsrp_and_rssnr_for_lte_level [-128, -118, -108, -98] [-115, -105, -95, -85] level=4
02-10 23:25:31.299 1292 1608 D NetworkController.MobileSignalController(0/1): getMobileIconGroup(): 13
02-10 23:25:31.374 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.614 2617 2617 I chatty : uid=10129(com.sec.android.inputmethod) identical 1 line
02-10 23:25:31.854 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:31.960 497 561 D AALLightSensor: newLux = 86, [76, 76] -> 86
02-10 23:25:32.069 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:32.171 1944 1944 D io_stats: !@ 179,0 r 4593819 95103978 w 759157 13512676 d 169491 7255924 f 204761 279542 iot 2947124 2836740 th 51200 51200 47232 pt 6848 inp 0 0 55080.821
02-10 23:25:32.318 1018 1235 D GameManagerService: handleForegroundChange(). pkgName: com.k2fsa.sherpa.onnx.tts.engine, clsName: com.k2fsa.sherpa.onnx.tts.engine.MainActivity,FgActivityName:com.k2fsa.sherpa.onnx.tts.engine/.MainActivity,userID:0
02-10 23:25:32.318 1018 1235 D GameManagerService: handleForegroundChange(). same package. game has never resumed yet. ignore
02-10 23:25:32.310 2617 2617 I chatty : uid=10129(com.sec.android.inputmethod) identical 1 line
02-10 23:25:32.550 2617 2617 I SKBD : anc isTosAccept false
02-10 23:25:32.552 25999 25999 I sherpa-onnx-tts-engine: newDataDir: /storage/emulated/0/Android/data/com.k2fsa.sherpa.onnx.tts.engine/files
02-10 23:25:32.554 25999 25999 D AndroidRuntime: Shutting down VM
02-10 23:25:32.562 25999 25999 E AndroidRuntime: FATAL EXCEPTION: main
02-10 23:25:32.562 25999 25999 E AndroidRuntime: Process: com.k2fsa.sherpa.onnx.tts.engine, PID: 25999
02-10 23:25:32.562 25999 25999 E AndroidRuntime: java.lang.RuntimeException: Unable to start activity ComponentInfo{com.k2fsa.sherpa.onnx.tts.engine/com.k2fsa.sherpa.onnx.tts.engine.MainActivity}: java.lang.IllegalArgumentException: Please specify a TTS model
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3160)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3303)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:78)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:108)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:68)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1991)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:106)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.os.Looper.loop(Looper.java:216)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.main(ActivityThread.java:7258)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:494)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:975)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: Caused by: java.lang.IllegalArgumentException: Please specify a TTS model
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.TtsKt.getOfflineTtsConfig(Tts.kt:218)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.TtsKt.getOfflineTtsConfig$default(Tts.kt:186)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.TtsEngine.initTts(TtsEngine.kt:194)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.TtsEngine.createTts(TtsEngine.kt:174)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at com.k2fsa.sherpa.onnx.tts.engine.MainActivity.onCreate(MainActivity.kt:73)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Activity.performCreate(Activity.java:7353)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Activity.performCreate(Activity.java:7344)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1275)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:3140)
02-10 23:25:32.562 25999 25999 E AndroidRuntime: ... 11 more

@mah92
Author

mah92 commented Feb 10, 2025

The language should be set to "fas" inside TtsEngine.kt (not "fa"). Could that be the problem? Could you provide the exact modification to TtsEngine.kt as a reference? The tokens.txt is set as metadata in the onnx, so there is no need for it. Am I correct?

    // Example 8
    // matcha-icefall-en_US-ljspeech
    // https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/matcha.html#matcha-icefall-en-us-ljspeech-american-english-1-female-speaker
    // modelDir = "matcha-icefall-en_US-ljspeech"
    // acousticModelName = "model-steps-3.onnx"
    // vocoder = "hifigan_v2.onnx"
    // dataDir = "matcha-icefall-en_US-ljspeech/espeak-ng-data"
    // lang = "eng"

@mah92
Author

mah92 commented Feb 10, 2025

I wish I could check the model speed and set the optimal number of steps accordingly...

@csukuangfj
Collaborator

Another note: we tested the arm64 and armv7a android versions, and they did not run on either of the two devices we tested. Previous versions worked... @csukuangfj

Fixed in #1841

@csukuangfj
Collaborator

I also noticed that the male voice (Musa) used in https://huggingface.co/spaces/k2-fsa/text-to-speech is quite noisy (worse than my samples). Are you using the right vocoder? It works best with univ_v1. The female (Khadijah) voice is good. @csukuangfj

Please provide the onnx model for your vocoder by following what we are doing in icefall.
