feat(chatterbox): autoregressive codec-LM engine — d-vector cloning + exaggeration#181
JarbasAl wants to merge 16 commits.
…tion

Chatterbox (Resemble AI): phoonnx's first autoregressive engine. A Llama-based codec LM that generates speech tokens conditioned on text + a reference speaker, driven through the iterative BaseOnnxAdapter.synthesize() hook.

- ChatterboxAdapter: the 4-ONNX pipeline (speech_encoder -> embed_tokens -> KV-cached Llama AR loop with repetition penalty + greedy decode -> conditional_decoder), KV shape read from the LM's own signature. d-vector cloning (reference wav, no transcription) via the speaker_reference API; exaggeration via a new SynthesisConfig.exaggeration field.
- Tokenizer generalized: TTSTokenizer (vocab-lookup) is now one impl; BPETokenizer (subword, HF tokenizers) is another, for text-token models. TTSVoice gains a tokenizes_raw_text path (Chatterbox's BPE does its own normalization; phoneme front ends would strip punctuation / expand numbers and corrupt the input).

Validated end-to-end against onnx-community/chatterbox-ONNX: clones a reference clip into coherent speech (voicing 1.05), punctuation preserved. 7 tests; suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/chatterbox.md — the autoregressive codec-LM architecture, the 4-ONNX pipeline, the BPE/raw-text tokenization, d-vector cloning + exaggeration usage, and the upstream/converter links. Added to the d-vector cloning table, the engines adapter table, and the docs index. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Automated checks for this PR:

- Coverage: ❌ 41.3% total coverage; 45 files are below 80% coverage.
- Security (pip-audit): ✅ No known vulnerabilities found (61 packages scanned).
- Lint: ❌ ruff: issues found (see job log).
- License check: ❌ License violations detected (43 packages); review required before merging. License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more. Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.
- Build tests: ✅ All versions pass.
Replace the tokenizes_raw_text if/else in TTSVoice with a BaseOnnxAdapter.encode_text the voice calls polymorphically — every engine receives text and returns its own model-input ids. Default = phonemize + vocab-tokenize; ChatterboxAdapter overrides with BPE. Engine-agnostic text preprocessing (pronunciation overrides + diacritics) stays in the voice, before encode_text — it operates on the text regardless of engine. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…one path

Drive embed_tokens and the LM from each graph's actual input signature instead of hardcoding the base contract, so the three Chatterbox variants run unchanged:

- base / multilingual (Llama, 30 layers): embed_tokens takes input_ids + position_ids + exaggeration; LM has no position_ids.
- turbo (GPT-2, 24 layers, meanflow decoder): embed_tokens takes input_ids alone; LM takes a cumulative position_ids.

exaggeration is simply not fed when the graph lacks it.

Validated end-to-end against onnx-community/chatterbox-ONNX (base, voicing 1.05), onnx-community/chatterbox-multilingual-ONNX (same contract), and ResembleAI/chatterbox-turbo-ONNX (voicing 1.41); all clone a reference clip into coherent speech. Suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
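A minimal sketch of the signature-driven feeding described in this commit. `build_feeds` and the candidate values are hypothetical, not phoonnx's actual code; with onnxruntime the input names would come from `session.get_inputs()`, which is replaced here by plain lists so the example runs without a model.

```python
from typing import Any, Dict, Iterable


def build_feeds(input_names: Iterable[str], candidates: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the candidate tensors that the graph declares as inputs.

    `input_names` stands in for [i.name for i in session.get_inputs()].
    """
    declared = set(input_names)
    return {name: value for name, value in candidates.items() if name in declared}


# Hypothetical candidates the adapter could offer to any embed_tokens graph.
candidates = {
    "input_ids": [[1, 2, 3]],
    "position_ids": [[0, 1, 2]],
    "exaggeration": [0.5],
}

# base / multilingual contract: all three inputs are declared.
base_feeds = build_feeds(["input_ids", "position_ids", "exaggeration"], candidates)

# turbo contract: input_ids alone, so exaggeration is simply not fed.
turbo_feeds = build_feeds(["input_ids"], candidates)
```

Filtering on the declared inputs keeps a single code path for every variant: a graph that lacks an input never receives it, so no per-variant branch is needed.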
Greedy argmax made the codec-LM loop without emitting the speech-EOS (32s of babble for a one-line sentence). Replace it with temperature + nucleus (top-p) sampling (defaults 0.8/0.8, overridable) — base/multilingual now stop cleanly (~5s, reproducible). Turbo (GPT-2 + meanflow) loads through the I/O-driven adapter but conditions generation differently from the Llama models and produces unintelligible output; it needs a turbo-specific reference. Marked not-yet-supported in the docs. The earlier 'turbo works' read was a false positive — the voicing metric measures energy, not intelligibility. Base + multilingual validated end-to-end. 8 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
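For illustration, temperature + nucleus (top-p) sampling over one step's logits can be sketched in pure Python as below. The function name and signature are assumptions, not the adapter's API, and the greedy fallback for a non-positive temperature follows what a later commit in this PR describes.

```python
import math
import random
from typing import List, Optional


def sample_top_p(logits: List[float], temperature: float = 0.8, top_p: float = 0.8,
                 rng: Optional[random.Random] = None) -> int:
    """Temperature + nucleus sampling over one step's logits; returns a token id."""
    if temperature <= 0:
        # Degenerate case: fall back to greedy argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for idx in order:
        nucleus.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    # Sample inside the nucleus, weighted by the original probabilities.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Passing a seeded `random.Random` makes a run reproducible, which is one way to get the repeatable stop behaviour mentioned above.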
…telligible Per the reference T3 inference, turbo's GPT-2 (absolute PE) resets speech positions to 0; wire that when the LM exposes position_ids. It shifts turbo from 15s to ~10s of output but does not yet make it intelligible — its full ONNX inference (prefill layout, possible CFG, start-speech handling) is not publicly documented and the torch reference doesn't map 1:1 onto the exported graphs. Marked experimental; base + multilingual remain the supported, validated variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…riants)

Match the reference turbo ONNX inference exactly:

- repetition penalty over ALL emitted tokens (not just the last); this was the babble.
- cumulative LM position_ids (position_ids[:,-1:]+1), not a reset.
- append trailing silence tokens (4299) before the conditional decoder.
- and the core fix: each variant has its OWN tokenizer. turbo uses a GPT-2 BPE (vocab 50276), base/multilingual a different one; feeding the base tokenizer to turbo was producing garbage input. The BPETokenizer matches the HF AutoTokenizer per model.

Decoding is now temperature + top-p sampling (defaults 0.8 / 0.95), exposed via SynthesisConfig.temperature / top_p; temperature=0 falls back to greedy.

Validated end-to-end: base 5.0s, turbo 4.8s, multilingual (Portuguese) 4.2s; all clone a reference clip into coherent, correctly-terminated speech. Suite 254.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
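A sketch of a repetition penalty applied over every token emitted so far. It uses the common formulation (divide positive logits by the penalty, multiply negative ones); whether the reference inference uses exactly this form, and the default of 1.2, are assumptions of this example.

```python
from typing import Iterable, List


def apply_repetition_penalty(logits: List[float], emitted: Iterable[int],
                             penalty: float = 1.2) -> List[float]:
    """Return a copy of `logits` with every already-emitted token made less likely."""
    out = list(logits)
    for tok in set(emitted):  # ALL emitted tokens, not just the last one
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out


# Tokens 0 and 1 were already generated; token 2 is untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], emitted=[0, 1, 0], penalty=2.0)
```

Penalizing only the last token still lets the model alternate between two or three tokens indefinitely; covering the whole history closes that loop.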
… NFKD) The multilingual tokenizer needs a specific front end, missing before (it spoke the wrong language): lowercase + NFKD-normalise, prefix a [<lang>] token derived from the voice's lang_code, and replace spaces with the [SPACE] token. encode_text applies it only when the tokenizer actually has those tokens (multilingual), leaving base/turbo as-is. Validated: Portuguese, Spanish, French all clone the reference voice in the correct language. Five languages (zh/ja/ko/he/ru) need extra per-language preprocessing (Cangjie / hiragana / diacritics / stress) that isn't ported yet — documented. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ngual front end Move the multilingual language preprocessing out of the adapter and into a dedicated ChatterboxMTLTokenizer(BPETokenizer): the [<lang>] prefix, lowercase + NFKD, [SPACE] substitution, and a _script_transform hook for the not-yet-ported zh/ja/ko/he/ru transforms. BPETokenizer.tokenize gains a no-op language parameter so the voice calls tokenize(text, language=...) uniformly; encode_text is back to a one-liner. Validated: the subclass emits the [pt] token and renders correct Portuguese (4.2s). Suite 255. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
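The front end described here (lowercase + NFKD, a `[<lang>]` prefix, `[SPACE]` substitution) can be sketched as a plain function. `mtl_front_end` is a hypothetical name, and deriving the language from the part of `lang_code` before the hyphen is an assumption; the real `ChatterboxMTLTokenizer` may differ.

```python
import unicodedata
from typing import Set


def mtl_front_end(text: str, lang_code: str, vocab: Set[str]) -> str:
    """Prepare raw text for a multilingual BPE vocab (illustrative only)."""
    lang = lang_code.split("-")[0].lower()  # assumption: "pt-PT" -> "pt"
    token = f"[{lang}]"
    if token not in vocab:
        raise ValueError(f"language token {token} is not in the vocab")
    # Lowercase, then decompose accented characters (NFKD).
    normalized = unicodedata.normalize("NFKD", text.lower())
    # Spaces become an explicit [SPACE] token.
    return token + normalized.replace(" ", "[SPACE]")


prepared = mtl_front_end("Ol\u00e1 mundo", "pt-PT", {"[pt]", "[SPACE]"})
```

NFKD splits `á` into `a` plus a combining acute accent, so the BPE sees the decomposed form.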
…reprocess

Vendor the multilingual script transforms into phoonnx/lang_preprocess.py with a SCRIPT_TRANSFORMS dispatch, consumed by ChatterboxMTLTokenizer._script_transform:

- ko: Hangul -> Jamo (pure Python, always on)
- ja: kanji -> hiragana (pykakasi)
- zh: Cangjie codes (spacy-pkuseg + HF Cangjie5_TC.json)
- ru: stress marks (russian_text_stresser; heavy + not on PyPI, best-effort)

Hebrew/Arabic are NOT a tokenizer transform: they are the universal add_diacritics SynthesisConfig flag (applied before encode_text), so those voices just set it. Optional ja/zh deps added as the extra. Validated: ja 私の声->わたしのこえ, ko decomposes. Suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
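As an illustration of the pure-Python Korean step, the standard Unicode arithmetic for splitting a precomposed Hangul syllable into conjoining Jamo is sketched below. `hangul_to_jamo` is a hypothetical name, and the vendored transform may target a different Jamo inventory.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT = 28            # trailing-consonant slots (including "none")
N_COUNT = 21 * T_COUNT  # vowels x trailing slots = 588
S_COUNT = 19 * N_COUNT  # 11172 precomposed syllables


def hangul_to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables; leave other characters untouched."""
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            out.append(chr(L_BASE + idx // N_COUNT))               # leading consonant
            out.append(chr(V_BASE + (idx % N_COUNT) // T_COUNT))   # vowel
            trailing = idx % T_COUNT
            if trailing:
                out.append(chr(T_BASE + trailing))                 # optional final consonant
        else:
            out.append(ch)
    return "".join(out)


# The arithmetic agrees with Unicode canonical decomposition for Hangul syllables.
same_as_nfd = hangul_to_jamo("\ud55c\uae00") == unicodedata.normalize("NFD", "\ud55c\uae00")
```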
Add speech_encoder_url / embed_tokens_url / conditional_decoder_url to TTSModelInfo, resolved into engine_params (speech_encoder_path / embed_tokens_path / conditional_decoder_path) — mirroring the vocoder_url / style_url / speaker_encoder_url pattern. Factor the model download into _fetch_onnx, which also pulls the external-data sidecar (<name>.onnx_data, saved under the referenced name so it resolves) — Chatterbox's four graphs all use external weights. Tokenizer construction for engine=chatterbox in VoiceConfig.from_dict is the remaining piece before a chatterbox voice loads end-to-end via the manager. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…MTL)

- load_chatterbox_tokenizer factory: ChatterboxMTLTokenizer when the vocab has the [SPACE] token (multilingual), else a plain BPETokenizer (base/turbo); raw tokenizer loaded once and shared.
- VoiceConfig.from_dict: an explicit chatterbox branch (keyed on engine, not the is_phoonnx/is_xxx foreign-config detection) that builds the tokenizer from a tokenizer.json, sets sample_rate 24000 and num_symbols from the vocab.
- TTSModelInfo: download_bpe_tokenizer (tokenizer_config_url -> tokenizer.json) + a chatterbox branch in the config property that passes its path through.

Validated end-to-end: from_dict builds ChatterboxMTLTokenizer for the multilingual checkpoint and BPETokenizer for base; the [pt] language front end fires. Suite 258.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sonnx

- from_dict chatterbox branch defaults add_diacritics for Arabic/Hebrew voices, so the 30-language multilingual index's ar/he entries get vocalization (niqqud/tashkeel) via the universal diacritics flag.
- russian_add_stress now calls stressonnx (pure-onnxruntime, no torch at runtime) instead of the heavy russian_text_stresser (spaCy + Wiktionary DB, not on PyPI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ct models The multilingual tokenizer normally derives its [<lang>] token from the voice's lang_code. Dialect models break that: lahgtna repurposes the base 30 language tokens and selects a dialect with a literal token (e.g. language_id='eg' -> '[eg]', which isn't even a single vocab id, it BPE-splits and the model was fine-tuned on it), and lang-code normalization mangles such codes (eg -> eg-US). Add an optional lang_tokens map (BCP47 -> token string) on VoiceConfig + TTSModelInfo, plumbed through from_dict. ChatterboxMTLTokenizer prefers it and prepends the mapped token LITERALLY (no vocab check); otherwise it derives [<lang>] and requires it in the vocab as before. base/turbo ignore it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
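The lookup order described in this commit could look roughly like the sketch below. `resolve_lang_token` is a hypothetical helper, and the `ar-EG` key and the way the generic token is derived are illustrative assumptions; only the precedence (explicit map first and taken literally, derived token second and vocab-checked) comes from the commit message.

```python
from typing import Dict, Optional, Set


def resolve_lang_token(lang_code: str, vocab: Set[str],
                       lang_tokens: Optional[Dict[str, str]] = None) -> str:
    """Pick the language prefix for a multilingual tokenizer (illustrative only)."""
    if lang_tokens and lang_code in lang_tokens:
        # Explicit mapping wins and is prepended literally: no vocab check,
        # since a dialect token such as "[eg]" may BPE-split into several ids.
        return lang_tokens[lang_code]
    # Otherwise derive [<lang>] from the code and require it in the vocab.
    token = f"[{lang_code.split('-')[0].lower()}]"
    if token not in vocab:
        raise ValueError(f"language token {token} is not in the vocab")
    return token


vocab = {"[ar]", "[pt]"}
dialect = resolve_lang_token("ar-EG", vocab, {"ar-EG": "[eg]"})  # mapped, literal
generic = resolve_lang_token("ar-EG", vocab)                     # derived fallback
```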
10 per-dialect entries (eg/sa/mo/iq/lb/sd/ly/sy/tn/ps) pointing at OpenVoiceOS/phoonnx-chatterbox-lahgtna with lang_tokens BCP47 maps so the MTL tokenizer prepends dialect-specific tokens (e.g. [eg]) rather than the generic [ar]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Chatterbox (Resemble AI) is phoonnx's first autoregressive engine: a Llama-based codec-LM that generates speech tokens conditioned on text + a reference speaker. It clones from a reference clip with no transcription (d-vector) and adds an `exaggeration` expressiveness control.

## Engine

`ChatterboxAdapter` overrides `synthesize()` to run the 4-ONNX pipeline: `speech_encoder` (ref→speaker conditioning) → `embed_tokens` (+`exaggeration`) → KV-cached Llama AR loop (repetition penalty + greedy + speech-EOS) → `conditional_decoder` → wav. The KV-cache shape (layers/heads/dim) is read from the LM's own input signature; nothing is hardcoded.

Cloning uses the `speaker_reference` API (no `speaker_reference_text`); `exaggeration` is a new `SynthesisConfig` field.

## The tokenizer generalization (per the design discussion)

The tokenizer stops being one concrete class:

- `TTSTokenizer` (vocab-lookup, phonemes→ids): the existing impl.
- `BPETokenizer` (subword, HF `tokenizers`): new, for text-token models.

Since phoneme front ends normalize (strip punctuation, expand numbers; confirmed: `UNICODE`/`GRAPHEMES` both do), the adapter sets `tokenizes_raw_text = True` and `TTSVoice` feeds Chatterbox raw text untouched; its BPE does its own normalization. No special-case path, just the tokenizer + a documented flag.

## Validation

Against `onnx-community/chatterbox-ONNX` (the external-data ONNX): clones a reference clip into coherent speech (voicing 1.05, no NaN), punctuation/numbers preserved through the BPE. 7 tests; suite 253.

## Docs

`docs/chatterbox.md` (architecture, pipeline, tokenization, usage) + cloning/engines/index cross-links.

## Models

Base ONNX exports exist (`onnx-community/chatterbox-ONNX`). The multilingual + turbo ONNX exports (PyTorch-only upstream) are running via VladOS95's converter; they'll be indexed as cloning voices once produced. The engine is contract-compatible with all three.

🤖 Generated with Claude Code