feat(chatterbox): autoregressive codec-LM engine — d-vector cloning + exaggeration by JarbasAl · Pull Request #181 · TigreGotico/phoonnx

JarbasAl · 2026-06-08T15:45:11Z

Chatterbox (Resemble AI) — phoonnx's first autoregressive engine: a Llama-based codec-LM that generates speech tokens conditioned on text + a reference speaker. It clones from a reference clip with no transcription (d-vector) and adds an exaggeration expressiveness control.

Engine

ChatterboxAdapter — overrides synthesize() to run the 4-ONNX pipeline: speech_encoder (ref→speaker conditioning) → embed_tokens (+exaggeration) → KV-cached Llama AR loop (repetition penalty + greedy + speech-EOS) → conditional_decoder → wav. The KV-cache shape (layers/heads/dim) is read from the LM's own input signature — nothing hardcoded.
d-vector cloning via the existing speaker_reference API (no speaker_reference_text); exaggeration is a new SynthesisConfig field.
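
As an illustration of reading the KV-cache layout from the LM's input signature, here is a minimal sketch. It assumes the cache inputs are named `past_key_values.<layer>.key` / `.value` with shape `[batch, heads, past_len, head_dim]`; that naming, the `InputSpec` stand-in, and the head/dim numbers are assumptions for the example, not taken from this PR. With onnxruntime, the same name/shape pairs would come from `session.get_inputs()`.

```python
from typing import NamedTuple, Sequence, Union

Dim = Union[int, str, None]  # ONNX dims can be concrete, symbolic, or unknown


class InputSpec(NamedTuple):
    """Hypothetical stand-in for the name/shape of one graph input."""
    name: str
    shape: Sequence[Dim]


def infer_kv_cache_layout(inputs, prefix="past_key_values."):
    """Return (num_layers, num_heads, head_dim) derived from the cache inputs."""
    layers = set()
    heads = head_dim = None
    for spec in inputs:
        if not spec.name.startswith(prefix):
            continue  # not a cache tensor (e.g. embeddings, attention mask)
        layers.add(int(spec.name[len(prefix):].split(".")[0]))
        _, heads, _, head_dim = spec.shape
    if not layers:
        raise ValueError("no KV-cache inputs found in the signature")
    return len(layers), heads, head_dim


# Illustrative signature: 30 layers, with made-up head count and head size.
signature = [InputSpec("inputs_embeds", ("batch", "seq", 1024))]
signature += [
    InputSpec(f"past_key_values.{layer}.{kind}", ("batch", 16, "past", 64))
    for layer in range(30)
    for kind in ("key", "value")
]
layout = infer_kv_cache_layout(signature)
```

Because the layout is derived rather than configured, a graph with a different layer count needs no adapter change in this sketch.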

The tokenizer generalization (per the design discussion)

The tokenizer stops being one concrete class:

TTSTokenizer (vocab-lookup, phonemes→ids) — the existing impl.
BPETokenizer (subword, HF tokenizers) — new, for text-token models.

Since phoneme front ends normalize (strip punctuation, expand numbers — confirmed: UNICODE/GRAPHEMES both do), the adapter sets tokenizes_raw_text = True and TTSVoice feeds Chatterbox raw text untouched; its BPE does its own normalization. No special-case path — just the tokenizer + a documented flag.

Validation

Against onnx-community/chatterbox-ONNX (the external-data ONNX): clones a reference clip into coherent speech (voicing 1.05, no NaN), punctuation/numbers preserved through the BPE. 7 tests; suite 253.

Docs

docs/chatterbox.md (architecture, pipeline, tokenization, usage) + cloning/engines/index cross-links.

Models

The base ONNX export already exists (onnx-community/chatterbox-ONNX). The multilingual and turbo ONNX exports (PyTorch-only upstream) are being produced with VladOS95's converter and will be indexed as cloning voices once they are ready. The engine is contract-compatible with all three.

🤖 Generated with Claude Code

…tion

Chatterbox (Resemble AI): phoonnx's first autoregressive engine. A Llama-based codec LM that generates speech tokens conditioned on text + a reference speaker, driven through the iterative BaseOnnxAdapter.synthesize() hook.

- ChatterboxAdapter: the 4-ONNX pipeline (speech_encoder -> embed_tokens -> KV-cached Llama AR loop with repetition penalty + greedy decode -> conditional_decoder), KV shape read from the LM's own signature. d-vector cloning (reference wav, no transcription) via the speaker_reference API; exaggeration via a new SynthesisConfig.exaggeration field.
- Tokenizer generalized: TTSTokenizer (vocab-lookup) is now one impl; BPETokenizer (subword, HF tokenizers) is another, for text-token models. TTSVoice gains a tokenizes_raw_text path (Chatterbox's BPE does its own normalization; phoneme front ends would strip punctuation / expand numbers and corrupt the input).

Validated end-to-end against onnx-community/chatterbox-ONNX: clones a reference clip into coherent speech (voicing 1.05), punctuation preserved. 7 tests; suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs/chatterbox.md — the autoregressive codec-LM architecture, the 4-ONNX pipeline, the BPE/raw-text tokenization, d-vector cloning + exaggeration usage, and the upstream/converter links. Added to the d-vector cloning table, the engines adapter table, and the docs index. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-08T15:45:19Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b447eed1-267c-4b28-9d78-6c7287f48172

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


github-actions · 2026-06-08T15:45:56Z

Hello there! I've processed your latest changes. 🌊

I've aggregated the results of the automated checks for this PR below.

🏷️ Release Preview

Ensuring the 'Dependency Updates' are documented. 📦

Current: 1.19.0a1 → Next: 1.20.0a1

Signal	Value
Label	`feature`
PR title	`feat(chatterbox): autoregressive codec-LM engine — d-vector cloning + exaggeration`
Bump	minor

⚠️ No conventional commit prefix — alpha-only bump.
Suggested: fix: update the thing or feat: update the thing

🚀 Release Channel Compatibility

Predicted next version: 1.20.0a1

Channel	Status	Note	Current Constraint
Stable	⚪	Not in channel	-
Testing	⚪	Not in channel	-
Alpha	⚪	Not in channel	-

📋 Repo Health

How's the repo's pulse? Let's take a look. 💓

⚠️ Some required files are missing.

Latest Version: 1.19.0a1

✅ phoonnx/version.py — Version file
✅ README.md — README
❌ LICENSE — License file
✅ pyproject.toml — pyproject.toml
⚠️ setup.py — setup.py
✅ CHANGELOG.md — Changelog
✅ phoonnx/version.py has valid version block markers

📊 Coverage

Measuring the footprint of your testing efforts. 👣

❌ 41.3% total coverage

Files below 80% coverage (45 files)

File	Coverage	Missing lines
`phoonnx/cli.py`	0.0%	98
`phoonnx/thirdparty/kog2p/__init__.py`	0.0%	203
`phoonnx/thirdparty/mantoq/unicode_symbol2label.py`	0.0%	1
`phoonnx/thirdparty/bw2ipa.py`	7.5%	86
`phoonnx/thirdparty/mantoq/pyarabic/number.py`	7.7%	371
`phoonnx/thirdparty/mantoq/buck/phonetise_buckwalter.py`	10.4%	180
`phoonnx/thirdparty/hangul2ipa.py`	16.6%	372
`phoonnx/phonemizers/en.py`	17.5%	104
`phoonnx/thirdparty/mantoq/pyarabic/trans.py`	18.2%	135
`phoonnx/model_manager.py`	19.3%	268
`phoonnx/thirdparty/zh_num.py`	23.1%	83
`phoonnx/thirdparty/tashkeel/__init__.py`	23.9%	89
`phoonnx/voice.py`	24.3%	234
`phoonnx/lang_preprocess.py`	26.6%	69
`phoonnx/phonemizers/zh.py`	27.0%	92
`phoonnx/phonemizers/mul.py`	27.6%	234
`phoonnx/phonemizers/ko.py`	30.4%	32
`phoonnx/phonemizers/gl.py`	31.1%	42
`phoonnx/phonemizers/ar.py`	31.2%	44
`phoonnx/thirdparty/mantoq/buck/tokenization.py`	32.5%	27
`phoonnx/thirdparty/phonikud/__init__.py`	35.3%	11
`phoonnx/phonemizers/ja.py`	36.0%	32
`phoonnx/phonemizers/fa.py`	36.4%	14
`phoonnx/phonemizers/pt.py`	38.1%	13
`phoonnx/thirdparty/mantoq/pyarabic/normalize.py`	38.1%	13
`phoonnx/thirdparty/mantoq/pyarabic/araby.py`	39.7%	298
`phoonnx/engines/speaker_encoders/base.py`	40.0%	12
`phoonnx/phonemizers/he.py`	40.0%	12
`phoonnx/phonemizers/vi.py`	40.0%	12
`phoonnx/phonemizers/base.py`	40.8%	71
`phoonnx/engines/chatterbox.py`	42.4%	80
`phoonnx/thirdparty/mantoq/pyarabic/stack.py`	45.5%	6
`phoonnx/engines/zipvoice.py`	46.9%	43
`phoonnx/thirdparty/mantoq/num2words.py`	47.6%	11
`phoonnx/phonemizers/mwl.py`	50.0%	8
`phoonnx/tokenizer.py`	55.1%	157
`phoonnx/engines/speaker_encoders/styletts2_style.py`	57.1%	6
`phoonnx/thirdparty/mantoq/__init__.py`	60.0%	10
`phoonnx/thirdparty/mantoq/pyarabic/arabrepr.py`	60.0%	6
`phoonnx/engines/vocoders/griffinlim.py`	61.4%	27
`phoonnx/engines/speaker_encoders/coqui_resnet.py`	61.5%	5
`phoonnx/config.py`	65.6%	130
`phoonnx/engines/optispeech.py`	69.6%	24
`phoonnx/engines/speaker_encoders/__init__.py`	73.9%	6
`phoonnx/util.py`	78.9%	59

Full report: download the coverage-report artifact.

🔒 Security (pip-audit)

Shields up! Scanning for potential threats. 🛡️

✅ No known vulnerabilities found (61 packages scanned).

🔍 Lint

Ensuring we're following our development process. 📏

❌ ruff: issues found — see job log

⚖️ License Check

Navigating the maze of open-source compliance. 🧩

❌ License violations detected (43 packages) — review required before merging.

Dependency                          License Name                                            License Type         Misc                                    
phoonnx:1.3.3                       Error                                                   Error                                                        

License Type                        Found                                                  
Error                               1

License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more

Full breakdown — 43 packages

Package	Version	License	URL
`build`	1.5.0	MIT	link
`certifi`	2026.5.20	Mozilla Public License 2.0 (MPL 2.0)	link
`charset-normalizer`	3.4.7	MIT	link
`click`	8.4.1	BSD-3-Clause	link
`combo_lock`	0.3.1	Apache-2.0	link
`dateparser`	1.4.0	BSD License	link
`filelock`	3.29.1	MIT	link
`flatbuffers`	25.12.19	Apache Software License	link
`idna`	3.18	BSD-3-Clause	link
`json-database`	0.10.1	MIT	link
`kthread`	0.2.3	MIT License	link
`langcodes`	3.5.1	MIT License	link
`markdown-it-py`	4.2.0	MIT License	link
`mdurl`	0.1.2	MIT License	link
`memory-tempfile`	2.2.3	MIT License	link
`numpy`	2.4.6	BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0	link
`onnxruntime`	1.26.0	MIT License	link
`ovos-config`	2.1.1	Apache-2.0	link
`ovos-date-parser`	0.7.0a5	Apache Software License	link
`ovos-number-parser`	0.5.1	Apache Software License	link
`ovos-utils`	0.8.5	Apache-2.0	link
`packaging`	26.2	Apache-2.0 OR BSD-2-Clause	link
`pexpect`	4.9.0	ISC License (ISCL)	link
`phoonnx`	1.19.0a1	Apache Software License	link
`protobuf`	7.35.0	3-Clause BSD License	link
`ptyprocess`	0.7.0	ISC License (ISCL)	link
`pyee`	13.0.1	MIT License	link
`Pygments`	2.20.0	BSD-2-Clause	link
`pyproject_hooks`	1.2.0	MIT License	link
`python-dateutil`	2.9.0.post0	Apache Software License; BSD License	link
`pytz`	2026.2	MIT License	link
`PyYAML`	6.0.3	MIT License	link
`quebra-frases`	0.3.7	Apache Software License	link
`regex`	2026.5.9	Apache-2.0 AND CNRI-Python	link
`requests`	2.34.2	Apache Software License	link
`rich`	13.9.4	MIT License	link
`rich-click`	1.9.8	MIT License


Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests

The blueprints match the build! 📐

✅ All versions pass

Python	Build	Install	Tests
3.10	✅	✅	✅
3.11	✅	✅	✅
3.12	✅	✅	✅
3.13	✅	✅	✅
3.14	✅	✅	✅

Keeping the OVOS ecosystem thriving 🌿

Replace the tokenizes_raw_text if/else in TTSVoice with a BaseOnnxAdapter.encode_text the voice calls polymorphically — every engine receives text and returns its own model-input ids. Default = phonemize + vocab-tokenize; ChatterboxAdapter overrides with BPE. Engine-agnostic text preprocessing (pronunciation overrides + diacritics) stays in the voice, before encode_text — it operates on the text regardless of engine. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
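
The polymorphic split described in this commit can be sketched as follows. The class and function names here (`BaseAdapterSketch`, `RawTextAdapterSketch`, `voice_encode`) are hypothetical and the phonemizer and subword encoder are toy callables; only the shape of the design (shared preprocessing in the voice, then a per-engine `encode_text`) comes from the commit message.

```python
from typing import Callable, Dict, List


class BaseAdapterSketch:
    """Default path: phonemize, then look each phoneme up in a vocabulary."""

    def __init__(self, phonemize: Callable[[str], List[str]], vocab: Dict[str, int]):
        self.phonemize = phonemize
        self.vocab = vocab

    def encode_text(self, text: str) -> List[int]:
        return [self.vocab[p] for p in self.phonemize(text) if p in self.vocab]


class RawTextAdapterSketch(BaseAdapterSketch):
    """Override: hand the text to a subword encoder without phonemizing it."""

    def __init__(self, subword_encode: Callable[[str], List[int]]):
        self.subword_encode = subword_encode

    def encode_text(self, text: str) -> List[int]:
        return self.subword_encode(text)


def voice_encode(adapter: BaseAdapterSketch, text: str, overrides: Dict[str, str]) -> List[int]:
    """Engine-agnostic preprocessing first, then the adapter's own encoding."""
    for source, target in overrides.items():
        text = text.replace(source, target)
    return adapter.encode_text(text)


# Toy stand-ins: letters as "phonemes", code points as "subword ids".
default_adapter = BaseAdapterSketch(
    lambda t: [c for c in t.lower() if c.isalpha()], {"h": 1, "i": 2}
)
raw_adapter = RawTextAdapterSketch(lambda t: [ord(c) for c in t])
```

The caller never branches on the engine type: the toy default path drops the punctuation in `"Hi!"`, while the raw-text path keeps it.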

…one path

Drive embed_tokens and the LM from each graph's actual input signature instead of hardcoding the base contract, so the three Chatterbox variants run unchanged:

- base / multilingual (Llama, 30 layers): embed_tokens takes input_ids + position_ids + exaggeration; LM has no position_ids.
- turbo (GPT-2, 24 layers, meanflow decoder): embed_tokens takes input_ids alone; LM takes a cumulative position_ids.

exaggeration is simply not fed when the graph lacks it.

Validated end-to-end against onnx-community/chatterbox-ONNX (base, voicing 1.05), onnx-community/chatterbox-multilingual-ONNX (same contract), and ResembleAI/chatterbox-turbo-ONNX (voicing 1.41); all clone a reference clip into coherent speech. Suite 253.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
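
A minimal sketch of the signature-driven feeding this commit describes: build the feed dict from whatever inputs the graph declares, so an input such as exaggeration is dropped when the graph lacks it. `build_feed` is a hypothetical helper, and plain lists stand in for tensors.

```python
from typing import Any, Dict, Iterable


def build_feed(declared_inputs: Iterable[str], candidates: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the candidate tensors whose names the graph actually declares."""
    declared = list(declared_inputs)
    missing = [name for name in declared if name not in candidates]
    if missing:
        raise KeyError(f"graph requires inputs we cannot provide: {missing}")
    return {name: candidates[name] for name in declared}


# Everything the adapter could supply (placeholder values).
candidates = {
    "input_ids": [[1, 2, 3]],
    "position_ids": [[0, 1, 2]],
    "exaggeration": [0.5],
}

# Input names per variant as listed in the commit message.
base_feed = build_feed(["input_ids", "position_ids", "exaggeration"], candidates)
turbo_feed = build_feed(["input_ids"], candidates)
```

With onnxruntime, the declared names would be read from the session's inputs rather than written out by hand.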

Greedy argmax made the codec-LM loop without emitting the speech-EOS (32s of babble for a one-line sentence). Replace it with temperature + nucleus (top-p) sampling (defaults 0.8/0.8, overridable) — base/multilingual now stop cleanly (~5s, reproducible). Turbo (GPT-2 + meanflow) loads through the I/O-driven adapter but conditions generation differently from the Llama models and produces unintelligible output; it needs a turbo-specific reference. Marked not-yet-supported in the docs. The earlier 'turbo works' read was a false positive — the voicing metric measures energy, not intelligibility. Base + multilingual validated end-to-end. 8 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
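
For readers unfamiliar with the decoding change, here is a pure-Python sketch of temperature plus nucleus (top-p) sampling over one step's logits. It is not the adapter's code: `sample_top_p` is a hypothetical name, and treating `temperature <= 0` as greedy is a choice made for this sketch.

```python
import math
import random
from typing import List, Optional


def sample_top_p(logits: List[float], temperature: float = 0.8, top_p: float = 0.8,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id with temperature scaling and nucleus filtering."""
    if temperature <= 0:
        # Degenerate case: plain argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    rng = rng or random.Random()

    # Numerically stable softmax over the temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = [], 0.0
    for token_id in order:
        nucleus.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    # rng.choices renormalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Sampling lets a lower-ranked token such as the speech-EOS be drawn occasionally, whereas pure argmax never picks it unless it is the single most likely token. Passing a seeded `random.Random` makes runs reproducible.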

…telligible Per the reference T3 inference, turbo's GPT-2 (absolute PE) resets speech positions to 0; wire that when the LM exposes position_ids. It shifts turbo from 15s to ~10s of output but does not yet make it intelligible — its full ONNX inference (prefill layout, possible CFG, start-speech handling) is not publicly documented and the torch reference doesn't map 1:1 onto the exported graphs. Marked experimental; base + multilingual remain the supported, validated variants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…riants)

Match the reference turbo ONNX inference exactly:

- repetition penalty over ALL emitted tokens (not just the last); this was the babble.
- cumulative LM position_ids (position_ids[:,-1:]+1), not a reset.
- append trailing silence tokens (4299) before the conditional decoder.
- and the core fix: each variant has its OWN tokenizer. turbo uses a GPT-2 BPE (vocab 50276), base/multilingual a different one; feeding the base tokenizer to turbo was producing garbage input. The BPETokenizer matches the HF AutoTokenizer per model.

Decoding is now temperature + top-p sampling (defaults 0.8 / 0.95), exposed via SynthesisConfig.temperature / top_p; temperature=0 falls back to greedy.

Validated end-to-end: base 5.0s, turbo 4.8s, multilingual (Portuguese) 4.2s; all clone a reference clip into coherent, correctly-terminated speech. Suite 254.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
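
The first fix in this commit, penalizing every token emitted so far rather than only the last one, can be sketched as below. The commit does not spell out the formula; the divide-if-positive / multiply-if-negative rule is the one commonly used in LM decoding and is an assumption here, as is the default penalty value.

```python
from typing import Iterable, List


def apply_repetition_penalty(logits: List[float], emitted: Iterable[int],
                             penalty: float = 1.2) -> List[float]:
    """Return a copy of logits with every previously emitted token made less likely."""
    out = list(logits)
    for token_id in set(emitted):  # each distinct token is penalized once
        value = out[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[token_id] = value / penalty if value > 0 else value * penalty
    return out


# Tokens 0 and 1 were already emitted; token 2 was not and stays untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], emitted=[0, 1, 0], penalty=2.0)
```

Passing only the most recent token as `emitted` would leave older repeats unpenalized, which is the behavior the commit identifies as the source of the babble.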

… NFKD) The multilingual tokenizer needs a specific front end, missing before (it spoke the wrong language): lowercase + NFKD-normalise, prefix a [<lang>] token derived from the voice's lang_code, and replace spaces with the [SPACE] token. encode_text applies it only when the tokenizer actually has those tokens (multilingual), leaving base/turbo as-is. Validated: Portuguese, Spanish, French all clone the reference voice in the correct language. Five languages (zh/ja/ko/he/ru) need extra per-language preprocessing (Cangjie / hiragana / diacritics / stress) that isn't ported yet — documented. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
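
The front end described here can be sketched in a few lines. `multilingual_front_end` is a hypothetical name, and deriving the language token from the primary subtag of `lang_code` (e.g. `pt-BR` to `[pt]`) is an assumption; the commit only says the token is derived from the voice's lang_code.

```python
import unicodedata
from typing import Collection


def multilingual_front_end(text: str, lang_code: str, vocab: Collection[str]) -> str:
    """Lowercase + NFKD-normalise, prefix the language token, swap spaces for [SPACE]."""
    lang_token = f"[{lang_code.split('-')[0].lower()}]"
    if lang_token not in vocab or "[SPACE]" not in vocab:
        return text  # not a multilingual vocabulary: leave the text as-is
    normalised = unicodedata.normalize("NFKD", text.lower())
    return lang_token + normalised.replace(" ", "[SPACE]")


prepared = multilingual_front_end("Olá Mundo", "pt-BR", {"[pt]", "[SPACE]"})
untouched = multilingual_front_end("Olá Mundo", "pt-BR", set())
```

NFKD splits `á` into `a` plus a combining acute accent (U+0301), so the prepared string carries the accent as a separate code point.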

…ngual front end Move the multilingual language preprocessing out of the adapter and into a dedicated ChatterboxMTLTokenizer(BPETokenizer): the [<lang>] prefix, lowercase + NFKD, [SPACE] substitution, and a _script_transform hook for the not-yet-ported zh/ja/ko/he/ru transforms. BPETokenizer.tokenize gains a no-op language parameter so the voice calls tokenize(text, language=...) uniformly; encode_text is back to a one-liner. Validated: the subclass emits the [pt] token and renders correct Portuguese (4.2s). Suite 255. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…reprocess

Vendor the multilingual script transforms into phoonnx/lang_preprocess.py with a SCRIPT_TRANSFORMS dispatch, consumed by ChatterboxMTLTokenizer._script_transform:

- ko: Hangul -> Jamo (pure Python, always on)
- ja: kanji -> hiragana (pykakasi)
- zh: Cangjie codes (spacy-pkuseg + HF Cangjie5_TC.json)
- ru: stress marks (russian_text_stresser; heavy + not on PyPI, best-effort)

Hebrew/Arabic are NOT a tokenizer transform: they are the universal add_diacritics SynthesisConfig flag (applied before encode_text), so those voices just set it. Optional ja/zh deps added as the extra.

Validated: ja 私の声->わたしのこえ, ko decomposes. Suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
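
To make the dispatch idea concrete, here is a sketch with only the pure-Python Korean transform filled in. The arithmetic is the standard Unicode Hangul syllable decomposition; whether the vendored code emits conjoining Jamo as done here is an assumption, and `SCRIPT_TRANSFORMS_SKETCH` / `script_transform` are hypothetical names.

```python
from typing import Callable, Dict

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3
LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0x1100, 0x1161, 0x11A7
VOWEL_COUNT, TAIL_COUNT = 21, 28


def hangul_to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo."""
    out = []
    for char in text:
        code = ord(char)
        if not HANGUL_BASE <= code <= HANGUL_LAST:
            out.append(char)  # non-Hangul characters pass through unchanged
            continue
        lead, rest = divmod(code - HANGUL_BASE, VOWEL_COUNT * TAIL_COUNT)
        vowel, tail = divmod(rest, TAIL_COUNT)
        out.append(chr(LEAD_BASE + lead) + chr(VOWEL_BASE + vowel))
        if tail:  # tail index 0 means the syllable has no final consonant
            out.append(chr(TAIL_BASE + tail))
    return "".join(out)


# Language code -> transform; languages without an entry pass through.
SCRIPT_TRANSFORMS_SKETCH: Dict[str, Callable[[str], str]] = {"ko": hangul_to_jamo}


def script_transform(text: str, lang: str) -> str:
    transform = SCRIPT_TRANSFORMS_SKETCH.get(lang)
    return transform(text) if transform else text
```

Transforms that need optional dependencies (ja, zh, ru) would register in the same table, so the tokenizer hook stays a single lookup.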

Add speech_encoder_url / embed_tokens_url / conditional_decoder_url to TTSModelInfo, resolved into engine_params (speech_encoder_path / embed_tokens_path / conditional_decoder_path) — mirroring the vocoder_url / style_url / speaker_encoder_url pattern. Factor the model download into _fetch_onnx, which also pulls the external-data sidecar (<name>.onnx_data, saved under the referenced name so it resolves) — Chatterbox's four graphs all use external weights. Tokenizer construction for engine=chatterbox in VoiceConfig.from_dict is the remaining piece before a chatterbox voice loads end-to-end via the manager. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…MTL)

- load_chatterbox_tokenizer factory: ChatterboxMTLTokenizer when the vocab has the [SPACE] token (multilingual), else a plain BPETokenizer (base/turbo); raw tokenizer loaded once and shared.
- VoiceConfig.from_dict: an explicit chatterbox branch (keyed on engine, not the is_phoonnx/is_xxx foreign-config detection) that builds the tokenizer from a tokenizer.json, sets sample_rate 24000 and num_symbols from the vocab.
- TTSModelInfo: download_bpe_tokenizer (tokenizer_config_url -> tokenizer.json) + a chatterbox branch in the config property that passes its path through.

Validated end-to-end: from_dict builds ChatterboxMTLTokenizer for the multilingual checkpoint and BPETokenizer for base; the [pt] language front end fires. Suite 258.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…sonnx

- from_dict chatterbox branch defaults add_diacritics for Arabic/Hebrew voices, so the 30-language multilingual index's ar/he entries get vocalization (niqqud/tashkeel) via the universal diacritics flag.
- russian_add_stress now calls stressonnx (pure-onnxruntime, no torch at runtime) instead of the heavy russian_text_stresser (spaCy + Wiktionary DB, not on PyPI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ct models The multilingual tokenizer normally derives its [<lang>] token from the voice's lang_code. Dialect models break that: lahgtna repurposes the base 30 language tokens and selects a dialect with a literal token (e.g. language_id='eg' -> '[eg]', which isn't even a single vocab id, it BPE-splits and the model was fine-tuned on it), and lang-code normalization mangles such codes (eg -> eg-US). Add an optional lang_tokens map (BCP47 -> token string) on VoiceConfig + TTSModelInfo, plumbed through from_dict. ChatterboxMTLTokenizer prefers it and prepends the mapped token LITERALLY (no vocab check); otherwise it derives [<lang>] and requires it in the vocab as before. base/turbo ignore it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
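
The lookup order described in this commit can be sketched as follows. `resolve_lang_token` is a hypothetical helper, the `ar-EG` key is an illustrative BCP47 tag rather than one confirmed by the PR, and deriving `[<lang>]` from the primary subtag is an assumption.

```python
from typing import Collection, Dict, Optional


def resolve_lang_token(lang_code: str, vocab: Collection[str],
                       lang_tokens: Optional[Dict[str, str]] = None) -> str:
    """Pick the language prefix: explicit map first, derived [<lang>] otherwise."""
    if lang_tokens and lang_code in lang_tokens:
        # Mapped tokens are prepended literally, without a vocabulary check.
        return lang_tokens[lang_code]
    derived = f"[{lang_code.split('-')[0].lower()}]"
    if derived not in vocab:
        raise ValueError(f"{derived} is not in the tokenizer vocabulary")
    return derived


vocab = {"[ar]", "[pt]", "[SPACE]"}
dialect_token = resolve_lang_token("ar-EG", vocab, {"ar-EG": "[eg]"})
generic_token = resolve_lang_token("ar-EG", vocab)
```

Skipping the vocabulary check for mapped tokens matters because, per the commit, a token like `[eg]` is not a single vocab id and is expected to BPE-split.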

10 per-dialect entries (eg/sa/mo/iq/lb/sd/ly/sy/tn/ps) pointing at OpenVoiceOS/phoonnx-chatterbox-lahgtna with lang_tokens BCP47 maps so the MTL tokenizer prepends dialect-specific tokens (e.g. [eg]) rather than the generic [ar]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JarbasAl and others added 2 commits June 8, 2026 16:43

github-actions Bot added feature and removed feature labels Jun 8, 2026

JarbasAl and others added 14 commits June 8, 2026 16:52

feat(voice_index): add chatterbox voice index (base/turbo/multilingual)

be7b920

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JarbasAl wants to merge 16 commits into dev from feat/chatterbox
