fix(mofa-fm): make custom voice cloning end-to-end with ominix-api#49
Open
ymote wants to merge 4 commits into
fm_tts and fm_voice_list now cross-check mofa-fm's local catalog against what ominix-api will actually synthesise.

fm_tts: before POSTing to /v1/audio/tts/qwen3, fetch /v1/voices. If the requested voice isn't registered (and isn't the empty server-default), return an error up-front listing the available voices. This plugs the mini2 yangmi symptom: mofa-fm had yangmi locally, ominix-api had no voices.json entries, and /v1/audio/tts/qwen3 silently substituted a different preset, so the user heard the wrong voice. When /v1/voices is unreachable, fall through to the TTS call (better to try than block on a transient outage).

fm_voice_list: intersect the catalog with /v1/voices and annotate each entry as registered, orphaned_in_catalog, or ominix_only. Append a summary like "N registered, M orphaned (not synth-capable on this ominix-api)". If /v1/voices is unreachable, emit a warning banner but still render the catalog.

Pure helpers (parse_registered_voices, validate_requested_voice, classify_voice_entries, voice_in_registry) are unit tested; the unreachable path uses a closed TCP port to exercise graceful degradation without standing up an HTTP server.
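The fm_voice_list annotation described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the type `VoiceStatus`, the functions `classify_voices` and `summary`, and their signatures are assumptions made for the example, and the real `classify_voice_entries` helper may look different.

```rust
use std::collections::BTreeSet;

/// How a voice name relates to the local catalog and the server registry.
/// (Hypothetical type for illustration; not taken from mofa-fm.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VoiceStatus {
    /// Present locally and in /v1/voices.
    Registered,
    /// Present locally only: the server cannot synthesise it by name.
    OrphanedInCatalog,
    /// Present on the server only.
    OminixOnly,
}

/// Classify every name seen in either list; the result is sorted by name.
fn classify_voices(catalog: &[&str], registry: &[&str]) -> Vec<(String, VoiceStatus)> {
    let local: BTreeSet<&str> = catalog.iter().copied().collect();
    let remote: BTreeSet<&str> = registry.iter().copied().collect();
    local
        .union(&remote)
        .map(|name| {
            let status = match (local.contains(name), remote.contains(name)) {
                (true, true) => VoiceStatus::Registered,
                (true, false) => VoiceStatus::OrphanedInCatalog,
                _ => VoiceStatus::OminixOnly,
            };
            ((*name).to_string(), status)
        })
        .collect()
}

/// Build the one-line summary appended to the listing.
fn summary(entries: &[(String, VoiceStatus)]) -> String {
    let registered = entries
        .iter()
        .filter(|(_, s)| *s == VoiceStatus::Registered)
        .count();
    let orphaned = entries
        .iter()
        .filter(|(_, s)| *s == VoiceStatus::OrphanedInCatalog)
        .count();
    format!("{registered} registered, {orphaned} orphaned (not synth-capable on this ominix-api)")
}
```

Keeping the classification free of I/O is what makes it unit-testable: the caller fetches /v1/voices (or fails to) and passes plain name lists in.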
…d always-inject SKILL.md
…ted custom voices

Reverts the heavy `/v1/voices/train` propagation added in 3fc83b7 and relaxes the pre-flight introduced in 6a3f629 so that custom voices stop being blocked at the gate.

Why: the `voices/train` endpoint requires a gpt-sovits-mlx model bundle (`~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors`) that is not installed on every host. When the bundle is missing, training fails and mofa-fm rolls back the local catalog, leaving the user with no usable custom voice. The lighter path, `/v1/audio/tts/clone` with a multipart reference WAV, works today on every host running ominix-api because it streams the reference inline per request and does not consult the voices.json registry.

What changes:

* `handle_voice_save` no longer POSTs to `/v1/voices/train` or polls the status endpoint. It validates the audio, normalises it to WAV, and writes the entry to the local catalog. `submit_voice_training`, `parse_training_status`, `poll_voice_training`, `default_transcript_for` and `default_training_language` are removed alongside their tests and the unused `base64`, `std::thread`, `std::time::Instant` imports.
* `handle_tts` only pre-flights `/v1/voices` for voices that are NOT resolvable as local custom voices. Custom voices fall straight through to `/v1/audio/tts/clone` (which uploads the reference inline). Empty voices and registered presets keep their existing graceful-degradation behaviour.

The 887f613 prompt fixes (manifest descriptions, SKILL.md `always: true`, the new `transcript`/`language` input fields) are intentionally kept so the schema stays stable for the LLM. The two extra fields are deserialised but not consumed in this lighter implementation.

Tests: 21 pass (the 6 training-status tests are removed with their implementation). `cargo fmt --all` and `cargo clippy --all-targets -- -D warnings` clean.
Summary
Warning
The original heavy approach below was retired in favour of a lighter fix
that does not depend on a model bundle. The previous body is preserved
at the bottom for history.
Current approach (commit `9802b7d`)

mini2 retest of the v1 binary (which propagated `fm_voice_save` → `/v1/voices/train`) confirmed that path is unusable on the prod fleet: ominix-api's training endpoint requires the gpt-sovits-mlx model bundle (`~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors`), which is not provisioned on the minis. Training fails on first call, the local catalog rolls back, and the user is left with no voice.

The pre-flight regression that originally broke the working path was introduced in commit `6a3f629` ("Validate voice against ominix-api /v1/voices") in mofa-fm itself. Before that commit, `fm_tts` for custom voices went straight to `/v1/audio/tts/clone` (multipart WAV upload), which works WITHOUT any registration because the reference audio is streamed inline per request. The new pre-flight cross-checked every voice against `/v1/voices` and rejected anything not in `voices.json`, even though tts/clone would have happily synthesised it.
This PR's two changes:

1. Relax the pre-flight in `fm_tts`. Voices that resolve in mofa-fm's local catalog (`resolve_custom_voice` returns Some) skip the `GET /v1/voices` validation and fall through to multipart tts/clone. Preset/empty voices keep the existing graceful-degradation behaviour so the original mini2 yangmi diagnostic stays useful.
2. Revert the `/v1/voices/train` propagation added in `3fc83b7`. `handle_voice_save` is back to its original contract: validate, normalise wav, write the local catalog entry, return success. `submit_voice_training`, `parse_training_status`, `poll_voice_training`, `default_transcript_for`, `default_training_language`, and the `base64`/`thread`/`Instant` imports are removed. The 6 training-status unit tests are removed alongside their implementation.
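The relaxed pre-flight in the first change amounts to a small routing decision. A minimal sketch of it, assuming the caller has already resolved whether the voice is a local custom voice and has tried to fetch the registry: the names `Route` and `route_voice`, and the use of `None` for "`/v1/voices` unreachable", are invented for this example and are not mofa-fm's actual API.

```rust
/// Outcome of the relaxed pre-flight (hypothetical names, not mofa-fm's API).
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Upload the local reference WAV inline to /v1/audio/tts/clone.
    CloneInline,
    /// POST to /v1/audio/tts/qwen3 with the requested voice or the server default.
    Preset,
    /// Fail fast and report which voices the server does have.
    Reject { available: Vec<String> },
}

/// Decide where a TTS request goes.
///
/// `registry` is `None` when /v1/voices could not be reached (assumption:
/// the caller maps any fetch error to `None`).
fn route_voice(voice: &str, is_local_custom: bool, registry: Option<&[String]>) -> Route {
    // Custom voices never consult the registry: the reference audio is sent inline.
    if is_local_custom {
        return Route::CloneInline;
    }
    // An empty voice means "use the server default".
    if voice.is_empty() {
        return Route::Preset;
    }
    match registry {
        // Registry unreachable: better to try than block on a transient outage.
        None => Route::Preset,
        Some(names) if names.iter().any(|n| n.as_str() == voice) => Route::Preset,
        Some(names) => Route::Reject {
            available: names.to_vec(),
        },
    }
}
```

Checking `is_local_custom` first is the whole fix: the earlier pre-flight effectively ran the registry match for every voice, so a locally cloned voice ended in the reject branch even though tts/clone could have served it.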
The 887f613 prompt fixes are intentionally kept: `manifest.json` tool descriptions still clarify clone-before-tts and retract the "voice cloning is automatic" claim, and `SKILL.md` keeps `always: true` so the workflow body reaches the LLM. The `transcript`/`language` schema fields on `fm_voice_save` are kept too (deserialised but unused) so the LLM keeps populating them without breakage.
Tests / build

- `cargo test -p mofa-fm`: 21 tests pass.
- `cargo build --release -p mofa-fm` clean.
- `cargo fmt --all` + `cargo clippy --all-targets -- -D warnings` clean.

Commits

- `887f613` fix(voice): clarify clone-before-tts workflow in tool descriptions and always-inject SKILL.md (kept)
- `3fc83b7` feat(voice): propagate fm_voice_save to OminiX-API via /v1/voices/train (superseded, reverted by `9802b7d`)
- `9802b7d` fix(voice): use existing tts/clone path; relax pre-flight that misrouted custom voices (current fix)

Squash on merge is fine.
Original (retired) heavy-approach description
Three fixes so the LLM workflow wav → `fm_voice_save` → `fm_tts` actually works against ominix-api on prod minis. Previously the LLM picked `fm_tts` for clone requests (wrong tool), and even when it picked `fm_voice_save` the call only updated mofa-fm's local catalog without registering the voice with ominix-api, leading to `voice 'X' is not registered on ominix-api` 404s on the next `fm_tts`.

- Tool descriptions: `fm_voice_save` now states it MUST be called BEFORE `fm_tts` for any clone (克隆) request, with the audio length hint and step ordering called out. `fm_tts` retracts the false claim that voice cloning is automatic and points to `fm_voice_save` instead.
- SKILL.md: Switched frontmatter to `always: true` so the clone-before-tts workflow is injected into the system prompt. The body now spells out the strict sequence and warns that brand-new voice names will not auto-clone.
- `fm_voice_save` → ominix-api: After the local catalog/wav-copy step, the handler now POSTs `{voice_name, audio (base64), transcript, quality, language, denoise}` to `/v1/voices/train`, captures `task_id`, and polls `/v1/voices/train/status?task_id=...` every 5s up to a 10-minute hard timeout. On `complete` it returns success; on `failed` (or timeout / unreachable URL) it rolls back the local registry entry and removes the copied wav so the local catalog stays in sync with what the server can actually synthesize. New optional input args `transcript` and `language` are surfaced on the tool so the LLM can pass them.

End-to-end clone retest BLOCKED on mini2: ominix-api requires `~/.OminiX/models/gpt-sovits-mlx/hubert.safetensors` for VITS fine-tuning and the directory is not provisioned on mini2 (training task fails immediately with `Failed to load HuBERT: Path must point to a local file`).
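For readers reconstructing the retired flow, the poll-with-hard-timeout step can be sketched as below. This is not the removed `poll_voice_training` code: `TrainingStatus`, `PollOutcome`, and `poll_until_done` are hypothetical names, and the status fetch is abstracted as a closure so the sketch makes no assumptions about the HTTP client or the response format.

```rust
use std::time::{Duration, Instant};

/// One observation of the training task (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
enum TrainingStatus {
    Pending,
    Complete,
    Failed(String),
}

/// Final result of polling.
#[derive(Debug, PartialEq, Eq)]
enum PollOutcome {
    Complete,
    Failed(String),
    TimedOut,
}

/// Call `fetch_status` every `interval` until it reports a terminal state
/// or `timeout` has elapsed. The status is always checked at least once.
fn poll_until_done<F>(mut fetch_status: F, interval: Duration, timeout: Duration) -> PollOutcome
where
    F: FnMut() -> TrainingStatus,
{
    let deadline = Instant::now() + timeout;
    loop {
        match fetch_status() {
            TrainingStatus::Complete => return PollOutcome::Complete,
            TrainingStatus::Failed(reason) => return PollOutcome::Failed(reason),
            TrainingStatus::Pending => {}
        }
        if Instant::now() >= deadline {
            return PollOutcome::TimedOut;
        }
        std::thread::sleep(interval);
    }
}
```

With the values from the description, a caller would pass `Duration::from_secs(5)` and `Duration::from_secs(600)`, and treat both `Failed` and `TimedOut` as the signal to roll back the local catalog entry.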