fix(mofa-fm): make custom voice cloning end-to-end with ominix-api by ymote · Pull Request #49 · mofa-org/mofa-skills

ymote · 2026-04-25T05:52:47Z

Summary

Warning

The original heavy approach below was retired in favour of a lighter fix
that does not depend on a model bundle. The previous body is preserved
at the bottom for history.

Current approach (commit `9802b7d`)

A mini2 retest of the v1 binary (which propagated `fm_voice_save` →
`/v1/voices/train`) confirmed that this path is unusable on the prod fleet:
ominix-api's training endpoint requires the gpt-sovits-mlx model bundle
(`~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors`), which is not
provisioned on the minis. Training fails on the first call, the local catalog
rolls back, and the user is left with no voice.

The pre-flight regression that originally broke the working path was
introduced in commit 6a3f629 ("Validate voice against ominix-api
/v1/voices") in mofa-fm itself. Before that commit, fm_tts for custom
voices went straight to /v1/audio/tts/clone (multipart WAV upload),
which works WITHOUT any registration because the reference audio is
streamed inline per request. The new pre-flight cross-checked every voice
against /v1/voices and rejected anything not in voices.json, even
though tts/clone would have happily synthesised it.

This PR's two changes:

- Relax the pre-flight in `fm_tts`. Voices that resolve in mofa-fm's
  local catalog (`resolve_custom_voice` returns `Some`) skip the
  `GET /v1/voices` validation and fall through to multipart `tts/clone`.
  Preset/empty voices keep the existing graceful-degradation behaviour
  so the original mini2 yangmi diagnostic stays useful.
- Revert the heavy `/v1/voices/train` propagation added in
  3fc83b7. `handle_voice_save` is back to its original contract:
  validate, normalise wav, write the local catalog entry, return
  success. `submit_voice_training`, `parse_training_status`,
  `poll_voice_training`, `default_transcript_for`,
  `default_training_language`, and the `base64`/`thread`/`Instant`
  imports are removed. The 6 training-status unit tests are removed
  alongside their implementation.
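
The relaxed pre-flight described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Route` enum, the `route_tts` function, the `HashMap`-based catalog, and the `fetch_registry` closure (standing in for `GET /v1/voices`, returning `None` when the server is unreachable) are all assumptions made for the example. Only the name `resolve_custom_voice` and the overall decision order come from the description.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Hypothetical routing decision for one TTS request.
#[derive(Debug, PartialEq)]
enum Route {
    /// Custom voice: send the reference WAV inline to /v1/audio/tts/clone.
    Clone { reference_wav: PathBuf },
    /// Preset or empty voice: use the regular preset TTS call.
    Preset,
    /// Voice is neither in the local catalog nor registered on the server.
    Reject { available: Vec<String> },
}

/// Stand-in for the local catalog lookup; the real signature is not shown in the PR text.
fn resolve_custom_voice(catalog: &HashMap<String, PathBuf>, voice: &str) -> Option<PathBuf> {
    catalog.get(voice).cloned()
}

/// `fetch_registry` models GET /v1/voices and is only invoked when needed.
fn route_tts<F>(catalog: &HashMap<String, PathBuf>, voice: &str, fetch_registry: F) -> Route
where
    F: FnOnce() -> Option<HashSet<String>>,
{
    // Custom voices skip the registry check entirely.
    if let Some(reference_wav) = resolve_custom_voice(catalog, voice) {
        return Route::Clone { reference_wav };
    }
    // The empty voice means "server default" and is never rejected.
    if voice.is_empty() {
        return Route::Preset;
    }
    match fetch_registry() {
        // Registry unreachable: degrade gracefully and try the TTS call anyway.
        None => Route::Preset,
        Some(registry) if registry.contains(voice) => Route::Preset,
        Some(registry) => {
            let mut available: Vec<String> = registry.into_iter().collect();
            available.sort();
            Route::Reject { available }
        }
    }
}
```

Passing the registry lookup as a closure keeps the function pure and makes it easy to check in a unit test that a locally resolvable voice never triggers the network call.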

The 887f613 prompt fixes are intentionally kept — manifest.json tool
descriptions still clarify clone-before-tts and retract the
"voice cloning is automatic" claim, and SKILL.md keeps
always: true so the workflow body reaches the LLM. The
transcript / language schema fields on fm_voice_save are kept too
(deserialised but unused) so the LLM keeps populating them without breakage.

Tests / build

- `cargo test -p mofa-fm`: 21 tests pass.
- `cargo build --release -p mofa-fm`: clean.
- `cargo fmt --all` and `cargo clippy --all-targets -- -D warnings`: clean.

Commits

- 887f613 fix(voice): clarify clone-before-tts workflow in tool descriptions and always-inject SKILL.md (kept)
- 3fc83b7 feat(voice): propagate fm_voice_save to OminiX-API via /v1/voices/train (superseded, reverted by 9802b7d)
- 9802b7d fix(voice): use existing tts/clone path; relax pre-flight that misrouted custom voices (current fix)

Squash on merge is fine.

Original (retired) heavy-approach description

Three fixes so the LLM workflow wav → `fm_voice_save` → `fm_tts` actually works against ominix-api on prod minis. Previously the LLM picked `fm_tts` for clone requests (the wrong tool), and even when it picked `fm_voice_save`, the call only updated mofa-fm's local catalog without registering the voice with ominix-api. The next `fm_tts` call then failed with a 404: `voice 'X' is not registered on ominix-api`.

- Tool descriptions — fm_voice_save now states it MUST be called BEFORE fm_tts for any clone (克隆) request, with the audio length hint and step ordering called out. fm_tts retracts the false claim that voice cloning is automatic and points to fm_voice_save instead.
- SKILL.md — Switched frontmatter to always: true so the clone-before-tts workflow is injected into the system prompt. The body now spells out the strict sequence and warns that brand-new voice names will not auto-clone.
- fm_voice_save → ominix-api — After the local catalog/wav-copy step, the handler now POSTs {voice_name, audio (base64), transcript, quality, language, denoise} to /v1/voices/train, captures task_id, and polls /v1/voices/train/status?task_id=... every 5s up to a 10-minute hard timeout. On complete it returns success; on failed (or timeout / unreachable URL) it rolls back the local registry entry and removes the copied wav so the local catalog stays in sync with what the server can actually synthesize. New optional input args transcript and language are surfaced on the tool so the LLM can pass them.
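
For reference, the retired poll-then-roll-back behaviour can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the removed implementation: `TrainingStatus`, `PollOutcome`, `poll_training`, and `finish_save` are hypothetical names, the status fetch and the sleep are injected as closures so no HTTP client is needed, and the catalog is reduced to a list of voice names (removal of the copied wav is omitted).

```rust
use std::time::Duration;

/// Hypothetical view of one status response from /v1/voices/train/status.
#[derive(Debug, Clone, PartialEq)]
enum TrainingStatus {
    Pending,
    Complete,
    Failed(String),
}

#[derive(Debug, PartialEq)]
enum PollOutcome {
    Complete,
    Failed(String),
    TimedOut,
}

/// Poll until the task completes, fails, or the hard timeout elapses.
/// `fetch_status` returns Err when the server cannot be reached.
fn poll_training<S, W>(
    mut fetch_status: S,
    mut wait: W,
    interval: Duration,
    timeout: Duration,
) -> PollOutcome
where
    S: FnMut() -> Result<TrainingStatus, String>,
    W: FnMut(Duration),
{
    let mut elapsed = Duration::ZERO;
    loop {
        match fetch_status() {
            Ok(TrainingStatus::Complete) => return PollOutcome::Complete,
            Ok(TrainingStatus::Failed(reason)) => return PollOutcome::Failed(reason),
            // An unreachable server is treated like a failed task.
            Err(reason) => return PollOutcome::Failed(reason),
            Ok(TrainingStatus::Pending) => {}
        }
        if elapsed >= timeout {
            return PollOutcome::TimedOut;
        }
        wait(interval);
        elapsed += interval;
    }
}

/// Keep the catalog entry only when training completed; otherwise roll it back.
fn finish_save(catalog: &mut Vec<String>, voice: &str, outcome: &PollOutcome) -> bool {
    if *outcome == PollOutcome::Complete {
        return true;
    }
    catalog.retain(|name| name != voice);
    false
}
```

In real use the `wait` closure would be `std::thread::sleep`, with a 5-second interval and a 600-second timeout to match the numbers above. The rollback is what leaves the user with no voice when the model bundle is missing, which is the failure mode that motivated the lighter fix.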

End-to-end clone retest BLOCKED on mini2: ominix-api requires `~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors` for VITS fine-tuning, and that directory is not provisioned on mini2. The training task fails immediately with `Failed to load HuBERT: Path must point to a local file`.

`fm_tts` and `fm_voice_list` now cross-check mofa-fm's local catalog against what ominix-api will actually synthesise.

- `fm_tts`: before POSTing to `/v1/audio/tts/qwen3`, fetch `/v1/voices`. If the requested voice isn't registered (and isn't the empty server-default), return an error up-front listing the available voices. This plugs the mini2 yangmi symptom: mofa-fm had yangmi locally, ominix-api had no voices.json entries, and `/v1/audio/tts/qwen3` silently substituted a different preset, so the user heard the wrong voice. When `/v1/voices` is unreachable, fall through to the TTS call (better to try than to block on a transient outage).
- `fm_voice_list`: intersect the catalog with `/v1/voices` and annotate each entry as `registered`, `orphaned_in_catalog`, or `ominix_only`. Append a summary like "N registered, M orphaned (not synth-capable on this ominix-api)". If `/v1/voices` is unreachable, emit a warning banner but still render the catalog.

Pure helpers (`parse_registered_voices`, `validate_requested_voice`, `classify_voice_entries`, `voice_in_registry`) are unit tested; the unreachable path uses a closed TCP port to exercise graceful degradation without standing up an HTTP server.
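
The catalog/registry intersection used by `fm_voice_list` could look something like the sketch below. The helper name `classify_voice_entries` appears in this PR, but its real signature is not shown here, so the types, the `VoiceState` enum, and the `summarize` function are assumptions chosen for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical annotation for one voice name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VoiceState {
    /// Present in the local catalog and in /v1/voices.
    Registered,
    /// Present locally only; the server does not list it.
    OrphanedInCatalog,
    /// Listed by the server only.
    OminixOnly,
}

/// Annotate every name seen in either set, in sorted order.
fn classify_voice_entries(
    catalog: &BTreeSet<String>,
    registry: &BTreeSet<String>,
) -> Vec<(String, VoiceState)> {
    catalog
        .union(registry)
        .map(|name| {
            let state = match (catalog.contains(name), registry.contains(name)) {
                (true, true) => VoiceState::Registered,
                (true, false) => VoiceState::OrphanedInCatalog,
                // Names from the union are in at least one set, so this is registry-only.
                _ => VoiceState::OminixOnly,
            };
            (name.clone(), state)
        })
        .collect()
}

/// Build the one-line summary appended to the listing.
fn summarize(entries: &[(String, VoiceState)]) -> String {
    let registered = entries
        .iter()
        .filter(|(_, state)| *state == VoiceState::Registered)
        .count();
    let orphaned = entries
        .iter()
        .filter(|(_, state)| *state == VoiceState::OrphanedInCatalog)
        .count();
    format!(
        "{} registered, {} orphaned (not synth-capable on this ominix-api)",
        registered, orphaned
    )
}
```

Using `BTreeSet` keeps the output order deterministic, which makes a pure helper like this straightforward to unit test without a running server.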


…ted custom voices

Reverts the heavy `/v1/voices/train` propagation added in 3fc83b7 and relaxes the pre-flight introduced in 6a3f629 so that custom voices stop being blocked at the gate.

Why: the `voices/train` endpoint requires a gpt-sovits-mlx model bundle (`~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors`) that is not installed on every host. When the bundle is missing, training fails and mofa-fm rolls back the local catalog, leaving the user with no usable custom voice. The lighter path, `/v1/audio/tts/clone` with a multipart reference WAV, works today on every host running ominix-api because it streams the reference inline per request and does not consult the voices.json registry.

What changes:

* `handle_voice_save` no longer POSTs to `/v1/voices/train` or polls the status endpoint. It validates the audio, normalises it to WAV, and writes the entry to the local catalog. `submit_voice_training`, `parse_training_status`, `poll_voice_training`, `default_transcript_for` and `default_training_language` are removed alongside their tests and the unused `base64`, `std::thread`, `std::time::Instant` imports.
* `handle_tts` only pre-flights `/v1/voices` for voices that are NOT resolvable as local custom voices. Custom voices fall straight through to `/v1/audio/tts/clone` (which uploads the reference inline). Empty voices and registered presets keep their existing graceful-degradation behaviour.

The 887f613 prompt fixes (manifest descriptions, SKILL.md `always: true`, the new `transcript`/`language` input fields) are intentionally kept so that the schema stays stable for the LLM. The two extra fields are deserialised but not consumed in this lighter implementation.

Tests: 21 pass (the 6 training-status tests are removed with their implementation). `cargo fmt --all` and `cargo clippy --all-targets -- -D warnings` clean.

ymote and others added 4 commits April 24, 2026 14:33

- 887f613 fix(voice): clarify clone-before-tts workflow in tool descriptions and always-inject SKILL.md
- 3fc83b7 feat(voice): propagate fm_voice_save to OminiX-API via /v1/voices/train

ymote wants to merge 4 commits into main from fix/voice-clone-workflow
