fix(mofa-fm): make custom voice cloning end-to-end with ominix-api#49

Open
ymote wants to merge 4 commits into main from fix/voice-clone-workflow

Conversation

@ymote ymote commented Apr 25, 2026

Collaborator

Summary

Warning

The original heavy approach below was retired in favour of a lighter fix
that does not depend on a model bundle. The previous body is preserved
at the bottom for history.

Current approach (commit 9802b7d)

mini2 retest of the v1 binary (which propagated fm_voice_save to
/v1/voices/train) confirmed that path is unusable on the prod fleet:
ominix-api's training endpoint requires the gpt-sovits-mlx model bundle
(~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors), which is not
provisioned on the minis. Training fails on first call, the local catalog
rolls back, and the user is left with no voice.

The pre-flight regression that originally broke the working path was
introduced in commit 6a3f629 ("Validate voice against ominix-api
/v1/voices") in mofa-fm itself. Before that commit, fm_tts for custom
voices went straight to /v1/audio/tts/clone (multipart WAV upload),
which works WITHOUT any registration because the reference audio is
streamed inline per request. The new pre-flight cross-checked every voice
against /v1/voices and rejected anything not in voices.json, even
though tts/clone would have happily synthesised it.

This PR's two changes:

  1. Relax the pre-flight in fm_tts. Voices that resolve in mofa-fm's
    local catalog (resolve_custom_voice returns Some) skip the
    GET /v1/voices validation and fall through to multipart tts/clone.
    Preset/empty voices keep the existing graceful-degradation behaviour
    so the original mini2 yangmi diagnostic stays useful.
  2. Revert the heavy /v1/voices/train propagation added in
    3fc83b7. handle_voice_save is back to its original contract:
    validate, normalise wav, write the local catalog entry, return
    success. submit_voice_training, parse_training_status,
    poll_voice_training, default_transcript_for,
    default_training_language, and the base64/thread/Instant
    imports are removed. The 6 training-status unit tests are removed
    alongside their implementation.
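
The relaxed pre-flight in change 1 amounts to a single routing decision. The following is a minimal sketch of that decision, not the code in this PR: `TtsRoute`, `route_tts`, and the slice-based catalog are hypothetical, and the `resolve_custom_voice` shown here is a stand-in with an assumed signature.

```rust
/// Where an `fm_tts` request goes after the relaxed pre-flight.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum TtsRoute {
    /// Voice resolves in the local catalog: skip GET /v1/voices and
    /// upload the reference WAV inline via multipart tts/clone.
    CloneWithReference { wav_path: String },
    /// Preset or empty voice: keep the existing /v1/voices pre-flight.
    PreflightThenPreset,
}

/// Stand-in for `resolve_custom_voice` (assumed signature): looks the
/// voice up in a local catalog of (name, wav path) pairs.
fn resolve_custom_voice(catalog: &[(&str, &str)], voice: &str) -> Option<String> {
    catalog
        .iter()
        .find(|(name, _)| *name == voice)
        .map(|(_, path)| path.to_string())
}

/// Custom voices bypass the registry check; everything else keeps it.
fn route_tts(catalog: &[(&str, &str)], voice: &str) -> TtsRoute {
    match resolve_custom_voice(catalog, voice) {
        Some(wav_path) => TtsRoute::CloneWithReference { wav_path },
        None => TtsRoute::PreflightThenPreset,
    }
}
```

With a catalog containing only `yangmi`, `route_tts(&catalog, "yangmi")` takes the clone path, while an empty voice string still goes through the pre-flight.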

The 887f613 prompt fixes are intentionally kept — manifest.json tool
descriptions still clarify clone-before-tts and retract the
"voice cloning is automatic" claim, and SKILL.md keeps
always: true so the workflow body reaches the LLM. The
transcript / language schema fields on fm_voice_save are kept too
(deserialised but unused) so the LLM keeps populating them without breakage.

Tests / build

  • cargo test -p mofa-fm — 21 tests pass.
  • cargo build --release -p mofa-fm clean.
  • cargo fmt --all + cargo clippy --all-targets -- -D warnings clean.

Commits

  1. 887f613 fix(voice): clarify clone-before-tts workflow in tool descriptions and always-inject SKILL.md — kept
  2. 3fc83b7 feat(voice): propagate fm_voice_save to OminiX-API via /v1/voices/train — superseded, reverted by 9802b7d
  3. 9802b7d fix(voice): use existing tts/clone path; relax pre-flight that misrouted custom voices — current fix

Squash on merge is fine.


Original (retired) heavy-approach description

Three fixes so the LLM workflow wav → fm_voice_save → fm_tts actually works against ominix-api on prod minis. Previously the LLM picked fm_tts for clone requests (the wrong tool), and even when it picked fm_voice_save, the call only updated mofa-fm's local catalog without registering the voice with ominix-api, leading to "voice 'X' is not registered on ominix-api" 404s on the next fm_tts.

- Tool descriptions: fm_voice_save now states it MUST be called BEFORE fm_tts for any clone (克隆) request, with the audio length hint and step ordering called out. fm_tts retracts the false claim that voice cloning is automatic and points to fm_voice_save instead.
- SKILL.md: Switched frontmatter to always: true so the clone-before-tts workflow is injected into the system prompt. The body now spells out the strict sequence and warns that brand-new voice names will not auto-clone.
- fm_voice_save → ominix-api: After the local catalog/wav-copy step, the handler now POSTs {voice_name, audio (base64), transcript, quality, language, denoise} to /v1/voices/train and captures task_id. It then polls /v1/voices/train/status?task_id=... every 5s, up to a 10-minute hard timeout. On complete it returns success. On failed (or on timeout or an unreachable URL) it rolls back the local registry entry and removes the copied wav, so the local catalog stays in sync with what the server can actually synthesize. New optional input args transcript and language are surfaced on the tool so the LLM can pass them.

End-to-end clone retest BLOCKED on mini2: ominix-api requires ~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors for VITS fine-tuning and the directory is not provisioned on mini2 (training task fails immediately with Failed to load HuBERT: Path must point to a local file).
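
For the record, the poll-then-roll-back behaviour described in the third bullet can be sketched as below. This is an illustration of the retired approach under stated assumptions, not the removed `poll_voice_training`: the `TrainingStatus` and `PollOutcome` types, the `poll_training` name, and the closure-based signature are all hypothetical. The interval and timeout are parameters; the description above corresponds to 5 seconds and 10 minutes.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical view of what the status endpoint reports.
#[derive(Debug, Clone, PartialEq)]
enum TrainingStatus {
    Pending,
    Complete,
    Failed(String),
}

#[derive(Debug, PartialEq)]
enum PollOutcome {
    Trained,
    /// The local catalog entry was rolled back; carries the reason.
    RolledBack(String),
}

/// Polls `fetch_status` every `interval` until it reports completion,
/// failure, an error, or until `timeout` elapses. Any non-success
/// outcome triggers `rollback` exactly once.
fn poll_training<F, R>(
    mut fetch_status: F,
    mut rollback: R,
    interval: Duration,
    timeout: Duration,
) -> PollOutcome
where
    F: FnMut() -> Result<TrainingStatus, String>,
    R: FnMut(),
{
    let deadline = Instant::now() + timeout;
    loop {
        let failure = match fetch_status() {
            Ok(TrainingStatus::Complete) => return PollOutcome::Trained,
            Ok(TrainingStatus::Failed(reason)) => Some(reason),
            Err(e) => Some(format!("status endpoint unreachable: {}", e)),
            Ok(TrainingStatus::Pending) if Instant::now() >= deadline => {
                Some("timed out".to_string())
            }
            Ok(TrainingStatus::Pending) => None,
        };
        if let Some(reason) = failure {
            rollback();
            return PollOutcome::RolledBack(reason);
        }
        thread::sleep(interval);
    }
}
```

Passing the status fetch and the rollback as closures keeps the loop testable without an HTTP server, which is also why the failure on mini2 (training fails on the first status check) maps directly to an immediate rollback.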

ymote and others added 4 commits April 24, 2026 14:33
fm_tts and fm_voice_list now cross-check mofa-fm's local catalog
against what ominix-api will actually synthesise.

fm_tts: before POSTing to /v1/audio/tts/qwen3, fetch /v1/voices. If
the requested voice isn't registered (and isn't the empty
server-default), return an error up-front listing the available
voices. This plugs the mini2 yangmi symptom: mofa-fm had yangmi locally,
ominix-api had no voices.json entries, and /v1/audio/tts/qwen3
silently substituted a different preset so the user heard the wrong
voice. When /v1/voices is unreachable, fall through to the TTS call
(better to try than block on a transient outage).
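
A rough sketch of the check described above, assuming a simplified signature for `validate_requested_voice` (the real helper's signature is not shown in this PR). The `Registry` enum and the exact wording after "available voices" are hypothetical.

```rust
/// Result of the pre-flight lookup against /v1/voices (hypothetical type).
enum Registry {
    /// The endpoint answered with these voice names.
    Reachable(Vec<String>),
    /// The endpoint could not be reached.
    Unreachable,
}

/// Returns Ok(()) when the TTS call should proceed, or an up-front error
/// listing the available voices when the requested voice is unknown.
fn validate_requested_voice(voice: &str, registry: &Registry) -> Result<(), String> {
    // An empty voice means "server default": nothing to validate.
    if voice.is_empty() {
        return Ok(());
    }
    match registry {
        // Better to try the TTS call than block on a transient outage.
        Registry::Unreachable => Ok(()),
        Registry::Reachable(names) if names.iter().any(|n| n == voice) => Ok(()),
        Registry::Reachable(names) => Err(format!(
            "voice '{}' is not registered on ominix-api; available voices: [{}]",
            voice,
            names.join(", ")
        )),
    }
}
```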

fm_voice_list: intersect catalog with /v1/voices and annotate each
entry as registered, orphaned_in_catalog, or ominix_only. Append a
summary like "N registered, M orphaned (not synth-capable on this
ominix-api)". If /v1/voices is unreachable, emit a warning banner but
still render the catalog.
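
The three-way annotation is a set comparison between the local catalog and the server registry. A minimal sketch, assuming both sides are reduced to plain voice names; the `VoiceState` enum, the `summary` helper, and this signature for `classify_voice_entries` are illustrative guesses rather than the helper's actual shape.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum VoiceState {
    /// In the local catalog and in /v1/voices.
    Registered,
    /// In the local catalog only.
    OrphanedInCatalog,
    /// In /v1/voices only.
    OminixOnly,
}

/// Classifies every name seen on either side, in sorted order.
fn classify_voice_entries(catalog: &[&str], registered: &[&str]) -> Vec<(String, VoiceState)> {
    let catalog: BTreeSet<&str> = catalog.iter().copied().collect();
    let registered: BTreeSet<&str> = registered.iter().copied().collect();
    catalog
        .union(&registered)
        .map(|name| {
            let state = match (catalog.contains(name), registered.contains(name)) {
                (true, true) => VoiceState::Registered,
                (true, false) => VoiceState::OrphanedInCatalog,
                _ => VoiceState::OminixOnly,
            };
            (name.to_string(), state)
        })
        .collect()
}

/// Builds the trailing summary line.
fn summary(entries: &[(String, VoiceState)]) -> String {
    let count = |wanted: &VoiceState| entries.iter().filter(|(_, s)| s == wanted).count();
    format!(
        "{} registered, {} orphaned (not synth-capable on this ominix-api)",
        count(&VoiceState::Registered),
        count(&VoiceState::OrphanedInCatalog)
    )
}
```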

Pure helpers (parse_registered_voices, validate_requested_voice,
classify_voice_entries, voice_in_registry) are unit tested; the
unreachable path uses a closed TCP port to exercise graceful
degradation without standing up an HTTP server.
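
One way to obtain a closed TCP port for such a test, sketched with the standard library only. `closed_local_addr` and `registry_reachable` are hypothetical names; the probe stands in for the GET /v1/voices call and only checks whether a TCP connection can be opened.

```rust
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// Reserves an ephemeral localhost port, then closes it so that nothing
/// is listening there. Another process could in principle grab the port
/// afterwards, which is unlikely within a single short test.
fn closed_local_addr() -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind ephemeral port");
    let addr = listener.local_addr().expect("read local address");
    drop(listener);
    addr
}

/// Hypothetical reachability probe: true if a TCP connection succeeds.
fn registry_reachable(addr: &SocketAddr) -> bool {
    TcpStream::connect_timeout(addr, Duration::from_millis(500)).is_ok()
}
```

A test can then assert that `registry_reachable(&closed_local_addr())` is false and that the caller falls through to its degraded path.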
…ted custom voices

Reverts the heavy `/v1/voices/train` propagation added in 3fc83b7 and
relaxes the pre-flight introduced in 6a3f629 so that custom voices stop
being blocked at the gate.

Why: the `voices/train` endpoint requires a gpt-sovits-mlx model bundle
(`~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors`) that is not
installed on every host. When the bundle is missing, training fails and
mofa-fm rolls back the local catalog, leaving the user with no usable
custom voice. The lighter path — `/v1/audio/tts/clone` with a multipart
reference WAV — works today on every host running ominix-api because it
streams the reference inline per request and does not consult the
voices.json registry.

What changes:

* `handle_voice_save` no longer POSTs to `/v1/voices/train` or polls the
  status endpoint. It validates the audio, normalises it to WAV, and
  writes the entry to the local catalog. `submit_voice_training`,
  `parse_training_status`, `poll_voice_training`, `default_transcript_for`
  and `default_training_language` are removed alongside their tests
  and the unused `base64`, `std::thread`, `std::time::Instant` imports.

* `handle_tts` only pre-flights `/v1/voices` for voices that are NOT
  resolvable as local custom voices. Custom voices fall straight through
  to `/v1/audio/tts/clone` (which uploads the reference inline). Empty
  voices and registered presets keep their existing graceful-degradation
  behaviour.

The 887f613 prompt fixes (manifest descriptions, SKILL.md `always: true`,
the new `transcript`/`language` input fields) are intentionally kept —
the schema stays stable for the LLM. The two extra fields are deserialised
but not consumed in this lighter implementation.

Tests: 21 pass (the 6 training-status tests are removed with their
implementation). `cargo fmt --all` and
`cargo clippy --all-targets -- -D warnings` clean.