
feat(chatterbox): autoregressive codec-LM engine — d-vector cloning + exaggeration#181

Draft
JarbasAl wants to merge 16 commits into dev from feat/chatterbox
@JarbasAl JarbasAl commented Jun 8, 2026

Chatterbox (Resemble AI) — phoonnx's first autoregressive engine: a Llama-based codec-LM that generates speech tokens conditioned on text + a reference speaker. It clones from a reference clip with no transcription (d-vector) and adds an exaggeration expressiveness control.

Engine

  • ChatterboxAdapter — overrides synthesize() to run the 4-ONNX pipeline: speech_encoder (ref→speaker conditioning) → embed_tokens (+exaggeration) → KV-cached Llama AR loop (repetition penalty + greedy + speech-EOS) → conditional_decoder → wav. The KV-cache shape (layers/heads/dim) is read from the LM's own input signature — nothing hardcoded.
  • d-vector cloning via the existing speaker_reference API (no speaker_reference_text); exaggeration is a new SynthesisConfig field.
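
As an illustration of reading the KV-cache shape from the LM's input signature, here is a minimal sketch. It is not the adapter's actual code: the `past_key_values.<layer>.key` / `.value` input names, the `[batch, kv_heads, past_len, head_dim]` axis order, and the `InputSpec` stand-in type are all assumptions made for the example.

```python
import re
from typing import NamedTuple, Sequence, Tuple, Union

Dim = Union[int, str, None]  # dynamic axes are typically reported as strings or None


class InputSpec(NamedTuple):
    """Hypothetical stand-in for a graph input descriptor (.name and .shape only)."""
    name: str
    shape: Sequence[Dim]


# Assumed naming convention for cache inputs: past_key_values.<layer>.key / .value
_KV_INPUT = re.compile(r"^past_key_values\.(\d+)\.(key|value)$")


def kv_cache_layout(inputs: Sequence[InputSpec]) -> Tuple[int, int, int]:
    """Return (num_layers, num_kv_heads, head_dim) inferred from the input signature."""
    layers = set()
    heads = head_dim = None
    for spec in inputs:
        match = _KV_INPUT.match(spec.name)
        if match is None:
            continue  # not a cache input (embeddings, attention mask, ...)
        layers.add(int(match.group(1)))
        # Assumed axis order: [batch, kv_heads, past_len, head_dim]
        _, n_heads, _, dim = spec.shape
        if not (isinstance(n_heads, int) and isinstance(dim, int)):
            raise ValueError(f"{spec.name}: heads/head_dim must be static, got {spec.shape}")
        heads, head_dim = n_heads, dim
    if not layers:
        raise ValueError("no KV-cache inputs found in the signature")
    return len(layers), heads, head_dim
```

With onnxruntime, the objects returned by `InferenceSession.get_inputs()` expose `.name` and `.shape`, so they could be passed to a function like this directly.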

The tokenizer generalization (per the design discussion)

The tokenizer stops being one concrete class:

  • TTSTokenizer (vocab-lookup, phonemes→ids) — the existing impl.
  • BPETokenizer (subword, HF tokenizers) — new, for text-token models.

Phoneme front ends normalize their input (they strip punctuation and expand numbers; confirmed for both UNICODE and GRAPHEMES), so the adapter sets tokenizes_raw_text = True and TTSVoice feeds Chatterbox the raw text untouched; its BPE does its own normalization. There is no special-case path, just the tokenizer plus a documented flag.

Validation

Against onnx-community/chatterbox-ONNX (the external-data ONNX): clones a reference clip into coherent speech (voicing 1.05, no NaN), punctuation/numbers preserved through the BPE. 7 tests; suite 253.

Docs

docs/chatterbox.md (architecture, pipeline, tokenization, usage) + cloning/engines/index cross-links.

Models

Base ONNX exports exist (onnx-community/chatterbox-ONNX). The multilingual and turbo variants are PyTorch-only upstream; their ONNX exports are being produced with VladOS95's converter and will be indexed as cloning voices once available. The engine is contract-compatible with all three.

🤖 Generated with Claude Code

JarbasAl and others added 2 commits June 8, 2026 16:43
…tion

Chatterbox (Resemble AI) — phoonnx's first autoregressive engine. A Llama-based codec
LM that generates speech tokens conditioned on text + a reference speaker, driven
through the iterative BaseOnnxAdapter.synthesize() hook.

- ChatterboxAdapter: the 4-ONNX pipeline (speech_encoder -> embed_tokens -> KV-cached
  Llama AR loop with repetition penalty + greedy decode -> conditional_decoder), KV
  shape read from the LM's own signature. d-vector cloning (reference wav, no
  transcription) via the speaker_reference API; exaggeration via a new
  SynthesisConfig.exaggeration field.
- Tokenizer generalized: TTSTokenizer (vocab-lookup) is now one impl; BPETokenizer
  (subword, HF tokenizers) is another, for text-token models. TTSVoice gains a
  tokenizes_raw_text path (Chatterbox's BPE does its own normalization — phoneme front
  ends would strip punctuation / expand numbers and corrupt the input).

Validated end-to-end against onnx-community/chatterbox-ONNX: clones a reference clip
into coherent speech (voicing 1.05), punctuation preserved. 7 tests; suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/chatterbox.md — the autoregressive codec-LM architecture, the 4-ONNX pipeline,
the BPE/raw-text tokenization, d-vector cloning + exaggeration usage, and the
upstream/converter links. Added to the d-vector cloning table, the engines adapter
table, and the docs index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
coderabbitai Bot commented Jun 8, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b447eed1-267c-4b28-9d78-6c7287f48172

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@github-actions github-actions Bot added feature and removed feature labels Jun 8, 2026
github-actions Bot commented Jun 8, 2026

Hello there! I've processed your latest changes. 🌊

I've aggregated the results of the automated checks for this PR below.

🏷️ Release Preview

Ensuring the 'Dependency Updates' are documented. 📦

Current: 1.19.0a1
Next: 1.20.0a1

Signal Value
Label feature
PR title feat(chatterbox): autoregressive codec-LM engine — d-vector cloning + exaggeration
Bump minor

⚠️ No conventional commit prefix — alpha-only bump.
Suggested: fix: update the thing or feat: update the thing


🚀 Release Channel Compatibility

Predicted next version: 1.20.0a1

Channel Status Note Current Constraint
Stable Not in channel -
Testing Not in channel -
Alpha Not in channel -

📋 Repo Health

How's the repo's pulse? Let's take a look. 💓

⚠️ Some required files are missing.

Latest Version: 1.19.0a1

phoonnx/version.py — Version file
README.md — README
LICENSE — License file
pyproject.toml — pyproject.toml
⚠️ setup.py — setup.py
CHANGELOG.md — Changelog
phoonnx/version.py has valid version block markers

📊 Coverage

Measuring the footprint of your testing efforts. 👣

41.3% total coverage

Files below 80% coverage (45 files)
File Coverage Missing lines
phoonnx/cli.py 0.0% 98
phoonnx/thirdparty/kog2p/__init__.py 0.0% 203
phoonnx/thirdparty/mantoq/unicode_symbol2label.py 0.0% 1
phoonnx/thirdparty/bw2ipa.py 7.5% 86
phoonnx/thirdparty/mantoq/pyarabic/number.py 7.7% 371
phoonnx/thirdparty/mantoq/buck/phonetise_buckwalter.py 10.4% 180
phoonnx/thirdparty/hangul2ipa.py 16.6% 372
phoonnx/phonemizers/en.py 17.5% 104
phoonnx/thirdparty/mantoq/pyarabic/trans.py 18.2% 135
phoonnx/model_manager.py 19.3% 268
phoonnx/thirdparty/zh_num.py 23.1% 83
phoonnx/thirdparty/tashkeel/__init__.py 23.9% 89
phoonnx/voice.py 24.3% 234
phoonnx/lang_preprocess.py 26.6% 69
phoonnx/phonemizers/zh.py 27.0% 92
phoonnx/phonemizers/mul.py 27.6% 234
phoonnx/phonemizers/ko.py 30.4% 32
phoonnx/phonemizers/gl.py 31.1% 42
phoonnx/phonemizers/ar.py 31.2% 44
phoonnx/thirdparty/mantoq/buck/tokenization.py 32.5% 27
phoonnx/thirdparty/phonikud/__init__.py 35.3% 11
phoonnx/phonemizers/ja.py 36.0% 32
phoonnx/phonemizers/fa.py 36.4% 14
phoonnx/phonemizers/pt.py 38.1% 13
phoonnx/thirdparty/mantoq/pyarabic/normalize.py 38.1% 13
phoonnx/thirdparty/mantoq/pyarabic/araby.py 39.7% 298
phoonnx/engines/speaker_encoders/base.py 40.0% 12
phoonnx/phonemizers/he.py 40.0% 12
phoonnx/phonemizers/vi.py 40.0% 12
phoonnx/phonemizers/base.py 40.8% 71
phoonnx/engines/chatterbox.py 42.4% 80
phoonnx/thirdparty/mantoq/pyarabic/stack.py 45.5% 6
phoonnx/engines/zipvoice.py 46.9% 43
phoonnx/thirdparty/mantoq/num2words.py 47.6% 11
phoonnx/phonemizers/mwl.py 50.0% 8
phoonnx/tokenizer.py 55.1% 157
phoonnx/engines/speaker_encoders/styletts2_style.py 57.1% 6
phoonnx/thirdparty/mantoq/__init__.py 60.0% 10
phoonnx/thirdparty/mantoq/pyarabic/arabrepr.py 60.0% 6
phoonnx/engines/vocoders/griffinlim.py 61.4% 27
phoonnx/engines/speaker_encoders/coqui_resnet.py 61.5% 5
phoonnx/config.py 65.6% 130
phoonnx/engines/optispeech.py 69.6% 24
phoonnx/engines/speaker_encoders/__init__.py 73.9% 6
phoonnx/util.py 78.9% 59

Full report: download the coverage-report artifact.

🔒 Security (pip-audit)

Shields up! Scanning for potential threats. 🛡️

✅ No known vulnerabilities found (61 packages scanned).

🔍 Lint

Ensuring we're following our development process. 📏

ruff: issues found — see job log

⚖️ License Check

Navigating the maze of open-source compliance. 🧩

❌ License violations detected (43 packages) — review required before merging.

Dependency                          License Name                                            License Type         Misc                                    
phoonnx:1.3.3                       Error                                                   Error                                                        

License Type                        Found                                                  
Error                               1

License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more

Full breakdown — 43 packages
Package Version License URL
build 1.5.0 MIT link
certifi 2026.5.20 Mozilla Public License 2.0 (MPL 2.0) link
charset-normalizer 3.4.7 MIT link
click 8.4.1 BSD-3-Clause link
combo_lock 0.3.1 Apache-2.0 link
dateparser 1.4.0 BSD License link
filelock 3.29.1 MIT link
flatbuffers 25.12.19 Apache Software License link
idna 3.18 BSD-3-Clause link
json-database 0.10.1 MIT link
kthread 0.2.3 MIT License link
langcodes 3.5.1 MIT License link
markdown-it-py 4.2.0 MIT License link
mdurl 0.1.2 MIT License link
memory-tempfile 2.2.3 MIT License link
numpy 2.4.6 BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 link
onnxruntime 1.26.0 MIT License link
ovos-config 2.1.1 Apache-2.0 link
ovos-date-parser 0.7.0a5 Apache Software License link
ovos-number-parser 0.5.1 Apache Software License link
ovos-utils 0.8.5 Apache-2.0 link
packaging 26.2 Apache-2.0 OR BSD-2-Clause link
pexpect 4.9.0 ISC License (ISCL) link
phoonnx 1.19.0a1 Apache Software License link
protobuf 7.35.0 3-Clause BSD License link
ptyprocess 0.7.0 ISC License (ISCL) link
pyee 13.0.1 MIT License link
Pygments 2.20.0 BSD-2-Clause link
pyproject_hooks 1.2.0 MIT License link
python-dateutil 2.9.0.post0 Apache Software License; BSD License link
pytz 2026.2 MIT License link
PyYAML 6.0.3 MIT License link
quebra-frases 0.3.7 Apache Software License link
regex 2026.5.9 Apache-2.0 AND CNRI-Python link
requests 2.34.2 Apache Software License link
rich 13.9.4 MIT License link
rich-click 1.9.8 MIT License link
six 1.17.0 MIT License link
typing_extensions 4.15.0 PSF-2.0 link
tzlocal 5.3.1 MIT License link
unicode-rbnf 2.4.0 MIT License
urllib3 2.7.0 MIT link
watchdog 6.0.0 Apache Software License link

Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests

The blueprints match the build! 📐

✅ All versions pass

Python Build Install Tests
3.10
3.11
3.12
3.13
3.14

Keeping the OVOS ecosystem thriving 🌿

JarbasAl and others added 14 commits June 8, 2026 16:52
Replace the tokenizes_raw_text if/else in TTSVoice with a BaseOnnxAdapter.encode_text
the voice calls polymorphically — every engine receives text and returns its own
model-input ids. Default = phonemize + vocab-tokenize; ChatterboxAdapter overrides with
BPE. Engine-agnostic text preprocessing (pronunciation overrides + diacritics) stays in
the voice, before encode_text — it operates on the text regardless of engine.
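
A rough sketch of the polymorphic call described here, under stated assumptions: the class names, constructor arguments, and the `text_to_ids` helper are hypothetical and only illustrate the shape of the design, not the real BaseOnnxAdapter or TTSVoice code.

```python
from typing import Callable, Dict, List


class DefaultAdapterSketch:
    """Default path: phonemize, then look each symbol up in a vocab."""

    def __init__(self, phonemize: Callable[[str], str], vocab: Dict[str, int]):
        self.phonemize = phonemize
        self.vocab = vocab

    def encode_text(self, text: str) -> List[int]:
        phonemes = self.phonemize(text)
        # Symbols missing from the vocab are dropped in this sketch.
        return [self.vocab[p] for p in phonemes if p in self.vocab]


class RawTextAdapterSketch(DefaultAdapterSketch):
    """Override: hand the raw text to a subword encoder instead."""

    def __init__(self, encode: Callable[[str], List[int]]):
        self.encode = encode

    def encode_text(self, text: str) -> List[int]:
        return self.encode(text)


def text_to_ids(adapter: DefaultAdapterSketch, text: str, overrides: Dict[str, str]) -> List[int]:
    """Voice-side flow: engine-agnostic preprocessing first, then encode_text."""
    for source, replacement in overrides.items():
        text = text.replace(source, replacement)
    return adapter.encode_text(text)
```

The voice never branches on the engine type; it only calls `encode_text`, so adding another text-token engine would mean adding another override.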

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…one path

Drive embed_tokens and the LM from each graph's actual input signature instead of
hardcoding the base contract, so the three Chatterbox variants run unchanged:
- base / multilingual (Llama, 30 layers): embed_tokens takes input_ids + position_ids
  + exaggeration; LM has no position_ids.
- turbo (GPT-2, 24 layers, meanflow decoder): embed_tokens takes input_ids alone; LM
  takes a cumulative position_ids. exaggeration is simply not fed when the graph lacks
  it.
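
One way to picture the signature-driven feed is the sketch below. The `build_feed` helper is hypothetical; it assumes the adapter prepares every tensor it knows how to produce and then keeps only the ones a given graph declares.

```python
from typing import Any, Dict, Iterable


def build_feed(declared_inputs: Iterable[str], available: Dict[str, Any]) -> Dict[str, Any]:
    """Select the tensors a graph declares; fail loudly if one cannot be supplied."""
    declared = list(declared_inputs)
    missing = [name for name in declared if name not in available]
    if missing:
        raise KeyError(f"graph expects inputs the adapter cannot supply: {missing}")
    # Tensors the graph does not declare (e.g. exaggeration on turbo) are simply not fed.
    return {name: available[name] for name in declared}
```

For a base-style embed_tokens graph the declared names would be input_ids, position_ids and exaggeration; for a turbo-style graph only input_ids, so the same call yields a smaller feed.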

Validated end-to-end against onnx-community/chatterbox-ONNX (base, voicing 1.05),
onnx-community/chatterbox-multilingual-ONNX (same contract), and
ResembleAI/chatterbox-turbo-ONNX (voicing 1.41) — all clone a reference clip into
coherent speech. Suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Greedy argmax made the codec-LM loop without emitting the speech-EOS (32s of babble
for a one-line sentence). Replace it with temperature + nucleus (top-p) sampling
(defaults 0.8/0.8, overridable) — base/multilingual now stop cleanly (~5s, reproducible).

Turbo (GPT-2 + meanflow) loads through the I/O-driven adapter but conditions generation
differently from the Llama models and produces unintelligible output; it needs a
turbo-specific reference. Marked not-yet-supported in the docs. The earlier 'turbo works'
read was a false positive — the voicing metric measures energy, not intelligibility.

Base + multilingual validated end-to-end. 8 tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…telligible

Per the reference T3 inference, turbo's GPT-2 (absolute PE) resets speech positions to
0; wire that when the LM exposes position_ids. It shifts turbo from 15s to ~10s of
output but does not yet make it intelligible — its full ONNX inference (prefill layout,
possible CFG, start-speech handling) is not publicly documented and the torch reference
doesn't map 1:1 onto the exported graphs. Marked experimental; base + multilingual
remain the supported, validated variants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…riants)

Match the reference turbo ONNX inference exactly:
- repetition penalty over ALL emitted tokens (not just the last) — this was the babble.
- cumulative LM position_ids (position_ids[:,-1:]+1), not a reset.
- append trailing silence tokens (4299) before the conditional decoder.
- and the core fix: each variant has its OWN tokenizer — turbo uses a GPT-2 BPE
  (vocab 50276), base/multilingual a different one; feeding the base tokenizer to turbo
  was producing garbage input. The BPETokenizer matches the HF AutoTokenizer per model.
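
For readers unfamiliar with the first fix, this is a minimal sketch of a repetition penalty applied over every token emitted so far. It uses the common formulation (divide positive logits, multiply negative ones); the default penalty value is an assumption, since the PR does not state the one it uses.

```python
from typing import Iterable, List


def apply_repetition_penalty(logits: List[float], emitted: Iterable[int],
                             penalty: float = 1.2) -> List[float]:
    """Make every already-emitted token less likely; returns a new list."""
    out = list(logits)
    for token in set(emitted):  # each distinct token is penalized once
        if out[token] < 0:
            out[token] *= penalty  # push negative logits further down
        else:
            out[token] /= penalty  # shrink positive logits
    return out
```

Penalizing only the most recent token leaves earlier tokens free to recur, which is consistent with the looping behaviour the commit describes.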

Decoding is now temperature + top-p sampling (defaults 0.8 / 0.95), exposed via
SynthesisConfig.temperature / top_p; temperature=0 falls back to greedy.
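
A standard-library sketch of temperature plus nucleus (top-p) sampling with a greedy fallback follows. The function name and signature are illustrative, not the adapter's; only the 0.8 / 0.95 defaults and the temperature=0 behaviour come from the text above.

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], temperature: float = 0.8, top_p: float = 0.95,
                 rng: Optional[random.Random] = None) -> int:
    """Pick a token id from raw logits."""
    if temperature <= 0:
        # Greedy fallback: index of the largest logit.
        return max(range(len(logits)), key=logits.__getitem__)
    if rng is None:
        rng = random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the kept weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Passing a seeded `random.Random` makes a run reproducible.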

Validated end-to-end: base 5.0s, turbo 4.8s, multilingual (Portuguese) 4.2s — all clone
a reference clip into coherent, correctly-terminated speech. Suite 254.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… NFKD)

The multilingual tokenizer needs a specific front end, missing before (it spoke the
wrong language): lowercase + NFKD-normalise, prefix a [<lang>] token derived from the
voice's lang_code, and replace spaces with the [SPACE] token. encode_text applies it
only when the tokenizer actually has those tokens (multilingual), leaving base/turbo
as-is.
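
The front end can be sketched roughly as follows. The function name is hypothetical, the vocab is reduced to a set of token strings, and taking the primary subtag of lang_code as the language token is an assumption about how the token is derived.

```python
import unicodedata
from typing import Collection


def mtl_front_end(text: str, lang_code: str, vocab: Collection[str]) -> str:
    """Apply the multilingual preprocessing only if the vocab has the special tokens."""
    lang_token = f"[{lang_code.split('-')[0].lower()}]"  # assumed: 'pt-PT' -> '[pt]'
    if lang_token not in vocab or "[SPACE]" not in vocab:
        return text  # base/turbo tokenizers: leave the text untouched
    normalized = unicodedata.normalize("NFKD", text.lower())
    return lang_token + normalized.replace(" ", "[SPACE]")
```

NFKD splits accented letters into a base letter plus a combining mark, so the tokenizer sees the decomposed form.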

Validated: Portuguese, Spanish, French all clone the reference voice in the correct
language. Five languages (zh/ja/ko/he/ru) need extra per-language preprocessing
(Cangjie / hiragana / diacritics / stress) that isn't ported yet — documented.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ngual front end

Move the multilingual language preprocessing out of the adapter and into a dedicated
ChatterboxMTLTokenizer(BPETokenizer): the [<lang>] prefix, lowercase + NFKD, [SPACE]
substitution, and a _script_transform hook for the not-yet-ported zh/ja/ko/he/ru
transforms. BPETokenizer.tokenize gains a no-op language parameter so the voice calls
tokenize(text, language=...) uniformly; encode_text is back to a one-liner.

Validated: the subclass emits the [pt] token and renders correct Portuguese (4.2s).
Suite 255.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…reprocess

Vendor the multilingual script transforms into phoonnx/lang_preprocess.py with a
SCRIPT_TRANSFORMS dispatch, consumed by ChatterboxMTLTokenizer._script_transform:
- ko: Hangul -> Jamo (pure Python, always on)
- ja: kanji -> hiragana (pykakasi)
- zh: Cangjie codes (spacy-pkuseg + HF Cangjie5_TC.json)
- ru: stress marks (russian_text_stresser; heavy + not on PyPI, best-effort)

Hebrew/Arabic are NOT a tokenizer transform — they are the universal add_diacritics
SynthesisConfig flag (applied before encode_text), so those voices just set it. Optional
ja/zh deps added as the  extra. Validated: ja 私の声->わたしのこえ,
ko decomposes. Suite green.
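
To make the dispatch concrete, here is a sketch with only the pure-Python Korean transform filled in. The decomposition uses the standard Unicode arithmetic for Hangul syllables and emits conjoining Jamo; whether the vendored transform emits the same Jamo block is an assumption, and the table here is illustrative rather than the contents of phoonnx/lang_preprocess.py.

```python
from typing import Callable, Dict

_S_BASE, _L_BASE, _V_BASE, _T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
_V_COUNT, _T_COUNT = 21, 28
_S_COUNT = 19 * _V_COUNT * _T_COUNT  # 11172 precomposed syllables


def hangul_to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo."""
    out = []
    for ch in text:
        index = ord(ch) - _S_BASE
        if not 0 <= index < _S_COUNT:
            out.append(ch)  # not a Hangul syllable: pass through
            continue
        out.append(chr(_L_BASE + index // (_V_COUNT * _T_COUNT)))
        out.append(chr(_V_BASE + (index % (_V_COUNT * _T_COUNT)) // _T_COUNT))
        trailing = index % _T_COUNT
        if trailing:  # 0 means the syllable has no final consonant
            out.append(chr(_T_BASE + trailing))
    return "".join(out)


# Illustrative dispatch table keyed by language code.
SCRIPT_TRANSFORMS: Dict[str, Callable[[str], str]] = {"ko": hangul_to_jamo}


def script_transform(text: str, lang: str) -> str:
    """Apply the language's transform if one is registered, else return text unchanged."""
    transform = SCRIPT_TRANSFORMS.get(lang)
    return transform(text) if transform else text
```

Transforms with optional dependencies (ja, zh, ru) would register in the same table and could be skipped when their dependency is not installed.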

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add speech_encoder_url / embed_tokens_url / conditional_decoder_url to TTSModelInfo,
resolved into engine_params (speech_encoder_path / embed_tokens_path /
conditional_decoder_path) — mirroring the vocoder_url / style_url / speaker_encoder_url
pattern. Factor the model download into _fetch_onnx, which also pulls the external-data
sidecar (<name>.onnx_data, saved under the referenced name so it resolves) — Chatterbox's
four graphs all use external weights.
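
A small sketch of the sidecar naming follows, assuming the sidecar lives next to the model at `<name>.onnx_data`. The `onnx_download_plan` helper and the example URL are hypothetical; the real _fetch_onnx may resolve the referenced name differently.

```python
import posixpath
from typing import List, Tuple
from urllib.parse import urlsplit, urlunsplit


def onnx_download_plan(model_url: str) -> List[Tuple[str, str]]:
    """Return (url, local_filename) pairs for a graph and its external-data sidecar."""
    parts = urlsplit(model_url)
    name = posixpath.basename(parts.path)
    # Assumed convention: the sidecar URL is the model URL path with '_data' appended.
    sidecar = parts._replace(path=parts.path + "_data")
    return [(model_url, name), (urlunsplit(sidecar), name + "_data")]
```

Saving the sidecar under the exact filename the graph references matters because the runtime resolves external weights relative to the model file.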

Tokenizer construction for engine=chatterbox in VoiceConfig.from_dict is the remaining
piece before a chatterbox voice loads end-to-end via the manager.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…MTL)

- load_chatterbox_tokenizer factory: ChatterboxMTLTokenizer when the vocab has the
  [SPACE] token (multilingual), else a plain BPETokenizer (base/turbo); raw tokenizer
  loaded once and shared.
- VoiceConfig.from_dict: an explicit chatterbox branch (keyed on engine, not the
  is_phoonnx/is_xxx foreign-config detection) that builds the tokenizer from a
  tokenizer.json, sets sample_rate 24000 and num_symbols from the vocab.
- TTSModelInfo: download_bpe_tokenizer (tokenizer_config_url -> tokenizer.json) + a
  chatterbox branch in the config property that passes its path through.
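
The factory's selection rule can be sketched as below. The two classes are hypothetical placeholders for BPETokenizer and ChatterboxMTLTokenizer, and the vocab is passed in as a plain dict to keep the example self-contained.

```python
from typing import Dict


class BPETokenizerSketch:
    """Placeholder for the plain subword tokenizer (base/turbo)."""

    def __init__(self, vocab: Dict[str, int]):
        self.vocab = vocab


class MTLTokenizerSketch(BPETokenizerSketch):
    """Placeholder for the multilingual tokenizer with the language front end."""


def load_tokenizer_sketch(vocab: Dict[str, int]) -> BPETokenizerSketch:
    """Pick the multilingual class only when the vocab carries the [SPACE] token."""
    cls = MTLTokenizerSketch if "[SPACE]" in vocab else BPETokenizerSketch
    return cls(vocab)
```

With the HF tokenizers library, the vocab could come from `Tokenizer.from_file(path).get_vocab()`, loaded once and shared as the commit describes.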

Validated end-to-end: from_dict builds ChatterboxMTLTokenizer for the multilingual
checkpoint and BPETokenizer for base; the [pt] language front end fires. Suite 258.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sonnx

- from_dict chatterbox branch defaults add_diacritics for Arabic/Hebrew voices, so the
  30-language multilingual index's ar/he entries get vocalization (niqqud/tashkeel) via
  the universal diacritics flag.
- russian_add_stress now calls stressonnx (pure-onnxruntime, no torch at runtime)
  instead of the heavy russian_text_stresser (spaCy + Wiktionary DB, not on PyPI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ct models

The multilingual tokenizer normally derives its [<lang>] token from the voice's
lang_code. Dialect models break that: lahgtna repurposes the base 30 language tokens and
selects a dialect with a literal token (e.g. language_id='eg' -> '[eg]', which isn't even
a single vocab id, it BPE-splits and the model was fine-tuned on it), and lang-code
normalization mangles such codes (eg -> eg-US).

Add an optional lang_tokens map (BCP47 -> token string) on VoiceConfig + TTSModelInfo,
plumbed through from_dict. ChatterboxMTLTokenizer prefers it and prepends the mapped token
LITERALLY (no vocab check); otherwise it derives [<lang>] and requires it in the vocab as
before. base/turbo ignore it.
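
The precedence between lang_tokens and the derived token might look like the sketch below. The function name is hypothetical, and deriving the default token from the primary subtag of the language code is an assumption.

```python
from typing import Collection, Mapping, Optional


def resolve_lang_prefix(lang: str, vocab: Collection[str],
                        lang_tokens: Optional[Mapping[str, str]] = None) -> str:
    """Return the token string to prepend for this voice language."""
    if lang_tokens and lang in lang_tokens:
        # Mapped tokens are used literally, with no vocab check:
        # dialect tokens may BPE-split instead of being a single id.
        return lang_tokens[lang]
    token = f"[{lang.split('-')[0].lower()}]"  # assumed derivation: 'ar-EG' -> '[ar]'
    if token not in vocab:
        raise ValueError(f"language token {token} is not in the tokenizer vocab")
    return token
```

A dialect entry would then carry a map such as `{"ar-EG": "[eg]"}`, while voices without a map keep the earlier behaviour.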

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
10 per-dialect entries (eg/sa/mo/iq/lb/sd/ly/sy/tn/ps) pointing at
OpenVoiceOS/phoonnx-chatterbox-lahgtna with lang_tokens BCP47 maps so
the MTL tokenizer prepends dialect-specific tokens (e.g. [eg]) rather
than the generic [ar].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>