Chatterbox Voice Conversion

A Python library for zero-shot voice conversion built on Resemble AI's Chatterbox S3Gen model. Converts the speaker identity of any speech audio to match a target speaker while preserving the original linguistic content and prosody.

How It Works

Chatterbox VC uses a three-stage pipeline:

Source audio (any speaker)     ─→  S3Tokenizer (16kHz)  ─→  Content tokens (25 tokens/sec)
                                                                          ↓
Target speaker audio (6-10s)   ─→  CAMPPlus encoder     ─→  Speaker embedding (192-dim x-vector)
                                                                          ↓
                                                              S3Gen Flow-Matching Decoder
                                                              (Conditional Flow Matching, 10 steps)
                                                                          ↓
                                                              Mel-spectrogram (80 bins)
                                                                          ↓
                                                              HiFi-GAN Vocoder + F0 Predictor
                                                                          ↓
                                                              24kHz output waveform
                                                                          ↓
                                                              PerTH Watermark (imperceptible)

Stage 1 — Content Extraction: The S3Tokenizer encodes the source speech at 16kHz into discrete content tokens at 25Hz (one token per 40ms). These tokens capture what is said — phonetic content, prosody timing, and rhythm — but discard speaker identity information.

Stage 2 — Speaker Conditioning: The CAMPPlus speaker encoder extracts a 192-dimensional x-vector from the target speaker's reference audio. This embedding captures the target's vocal identity: timbre, pitch range, and speaking style.

Stage 3 — Waveform Synthesis: The S3Gen decoder takes the content tokens and speaker embedding, then uses Conditional Flow Matching (CFM) to iteratively refine noise into a mel-spectrogram over N timesteps (default 10). The mel-spectrogram is converted to a 24kHz waveform by the HiFi-GAN vocoder, which includes an F0 (pitch) predictor for natural intonation. Finally, an imperceptible PerTH watermark is embedded.
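The numbers above can be sanity-checked with back-of-envelope shape arithmetic. This sketch uses the constants from the diagram; the function is illustrative, not part of the library's API:

```python
# Back-of-envelope shape check for the three-stage pipeline described above.
def pipeline_shapes(source_seconds: float) -> dict:
    token_rate = 25          # Stage 1: S3Tokenizer emits 25 content tokens/sec
    spk_dim = 192            # Stage 2: CAMPPlus x-vector dimension
    mel_hop_ms = 20          # Stage 3: one mel frame every 20 ms (hop 480 @ 24 kHz)
    out_sr = 24_000          # HiFi-GAN output sample rate

    return {
        "content_tokens": int(source_seconds * token_rate),
        "speaker_embedding_dim": spk_dim,
        "mel_frames": int(source_seconds * 1000 / mel_hop_ms),
        "output_samples": int(source_seconds * out_sr),
    }

# 3 s of source audio -> 75 tokens, 150 mel frames, 72,000 output samples
print(pipeline_shapes(3.0))
```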

Installation

Prerequisites

  • Python 3.10+
  • PyTorch 2.0+ with CUDA support (GPU strongly recommended — CPU inference is ~20x slower)
  • ~4 GB GPU memory

Install from source

git clone https://github.com/LAION-AI/chatterbox-voice-conversion.git
cd chatterbox-voice-conversion
pip install -e .

This installs the chatterbox_vc package and its dependencies, including chatterbox-tts which provides the underlying model.

Install dependencies only

If you prefer to install dependencies manually:

pip install "chatterbox-tts>=0.1.1" torch torchaudio librosa soundfile numpy

Quick Start

Python API

from chatterbox_vc import VoiceConverter

# Load model (~1.5 GB, auto-downloaded from HuggingFace on first run)
vc = VoiceConverter(device="cuda:0")

# Convert source audio to sound like the target speaker
wav = vc.convert("source_speech.wav", "target_speaker.wav", "output.wav")

print(f"Output: {len(wav)} samples at {vc.sample_rate}Hz")
# Output: 72000 samples at 24000Hz

Command-line

# Single file conversion
python examples/basic_conversion.py \
    --source source_speech.wav \
    --target target_speaker.wav \
    --output converted.wav \
    --device cuda:0

# Batch conversion (extracts target speaker embedding once)
python examples/batch_conversion.py \
    --sources audio1.wav audio2.wav audio3.wav \
    --target target_speaker.wav \
    --output-dir ./converted/ \
    --device cuda:0

API Reference

VoiceConverter

VoiceConverter(device="cuda:0", model_path=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "cuda:0" | PyTorch device. Use "cuda:0", "cuda:1", etc. for GPU, or "cpu" (very slow). |
| model_path | str or None | None | Path to a local checkpoint directory. If None, weights are auto-downloaded from HuggingFace (ResembleAI/chatterbox). |

convert(source_audio, target_voice, output_path=None)

Convert a single audio file.

| Parameter | Type | Description |
|---|---|---|
| source_audio | str | Path to the source WAV file (speech to convert). Any sample rate. |
| target_voice | str | Path to the target speaker's WAV file (6-10s of clean speech recommended). Any sample rate. |
| output_path | str or None | If provided, saves the output WAV to this path. Directories are created automatically. |

Returns: NumPy array of float32 audio at 24kHz, shape (num_samples,).
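Since the return value is a plain NumPy array, it can be post-processed before writing. A minimal sketch (the array here is a synthetic stand-in for a real convert() result):

```python
import numpy as np

def postprocess(wav: np.ndarray, sample_rate: int = 24_000):
    """Peak-normalize a converted waveform and report its duration in seconds."""
    peak = float(np.abs(wav).max()) if wav.size else 0.0
    if peak > 0:
        wav = wav / peak
    return wav, wav.size / sample_rate

# Synthetic stand-in for a real convert() result: 3 s of float32 audio at 24 kHz.
fake_wav = np.random.default_rng(0).standard_normal(72_000).astype(np.float32)
normalized, duration_s = postprocess(fake_wav)
print(duration_s)  # 3.0
```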

convert_batch(source_paths, target_voice, output_dir)

Convert multiple source files to the same target voice. More efficient than calling convert() in a loop because the target speaker embedding is extracted only once.

| Parameter | Type | Description |
|---|---|---|
| source_paths | list[str] | List of source WAV file paths. |
| target_voice | str | Path to the target speaker's WAV file. |
| output_dir | str | Directory for output files (named {stem}_converted.wav). |

Returns: List of result dicts with keys: source, output, duration_s, elapsed_s, success, and error (on failure).
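A sketch of consuming the returned list, assuming the keys documented above (the results here are hand-written stand-ins, not real conversions):

```python
# Hypothetical convert_batch() results: one success, one failure.
results = [
    {"source": "audio1.wav", "output": "converted/audio1_converted.wav",
     "duration_s": 4.2, "elapsed_s": 1.3, "success": True},
    {"source": "audio2.wav", "output": None,
     "duration_s": 0.0, "elapsed_s": 0.1, "success": False,
     "error": "file not found"},
]

ok = [r for r in results if r["success"]]
failed = [r for r in results if not r["success"]]
print(f"{len(ok)} converted, {len(failed)} failed")
for r in failed:
    print(f"  {r['source']}: {r['error']}")
```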

set_target_voice(target_audio_path)

Pre-compute and cache the target speaker embedding. Useful when converting many files to the same target voice.

sample_rate

Property returning the output sample rate (always 24000).

Subprocess Worker

For production deployments where you need GPU isolation or a separate Python environment, use the subprocess worker:

from chatterbox_vc.worker import WorkerClient

# Spawn a worker subprocess on a specific GPU
client = WorkerClient(device="cuda:1")

# Convert audio (synchronous)
result = client.convert("source.wav", "target.wav", "output.wav")
print(f"Done in {result['elapsed']}s at {result['sample_rate']}Hz")

# Clean up
client.shutdown()

Or run the worker directly and communicate via JSON-line protocol:

python -m chatterbox_vc.worker --device cuda:0

The worker reads JSON requests from stdin and writes responses to fd 3 (or stdout if fd 3 is unavailable):

{"source": "/path/to/source.wav", "target": "/path/to/target.wav", "output": "/path/to/output.wav"}

Response:

{"status": "ok", "output": "/path/to/output.wav", "sample_rate": 24000, "elapsed": 2.34}
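A minimal client sketch for this protocol, reading responses from stdout (the fd-3 fallback). The subprocess wiring is shown commented out, since it needs the installed package and a GPU:

```python
import json
import subprocess

def make_request(source: str, target: str, output: str) -> str:
    """Serialize one JSON-line request in the format shown above."""
    return json.dumps({"source": source, "target": target, "output": output}) + "\n"

def parse_response(line: str) -> dict:
    """Parse one JSON-line response; raise on a non-ok status."""
    resp = json.loads(line)
    if resp.get("status") != "ok":
        raise RuntimeError(resp.get("error", "conversion failed"))
    return resp

# Wiring it to a live worker (not run here):
# proc = subprocess.Popen(
#     ["python", "-m", "chatterbox_vc.worker", "--device", "cuda:0"],
#     stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
# proc.stdin.write(make_request("in.wav", "spk.wav", "out.wav")); proc.stdin.flush()
# result = parse_response(proc.stdout.readline())

demo = parse_response('{"status": "ok", "output": "/tmp/out.wav", "sample_rate": 24000, "elapsed": 2.34}')
print(demo["sample_rate"])  # 24000
```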

Tunable Hyperparameters

The Chatterbox VC model has several hyperparameters that affect output quality, speed, and characteristics. Most are architectural constants, but a few can be tuned at inference time.

Inference-Time Parameters

CFM Timesteps (n_cfm_timesteps)

Default: 10 · Range: 1–100 · Location: Passed to model.generate() internally

The number of Conditional Flow Matching (CFM) denoising steps. This is the most impactful quality/speed trade-off:

| Steps | Quality | Speed | Use Case |
|---|---|---|---|
| 1–5 | Lower, may have artifacts | Very fast | Real-time/streaming, previews |
| 10 | Good (default) | ~0.3x real-time on GPU | Production use |
| 20–50 | Marginally better | 2–5x slower | Maximum quality, offline processing |
| 50+ | Diminishing returns | Very slow | Not recommended |

To modify: Edit the generate() call in chatterbox/vc.py or patch the model's inference() method.

Classifier-Free Guidance Rate (inference_cfg_rate)

Default: 0.7 · Range: 0.0–2.0 · Location: chatterbox/models/s3gen/configs.py

Controls how strongly the model follows the speaker conditioning signal:

| Value | Effect |
|---|---|
| 0.0 | No guidance — output may not match target speaker well |
| 0.5 | Light guidance — more natural but less speaker-similar |
| 0.7 | Default — good balance of naturalness and speaker match |
| 1.0+ | Strong guidance — closer speaker match but may sound less natural |

Formula: output = (1 + cfg_rate) * conditioned - cfg_rate * unconditioned
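The formula can be checked numerically on toy decoder outputs:

```python
import numpy as np

def apply_cfg(conditioned, unconditioned, cfg_rate=0.7):
    # output = (1 + cfg_rate) * conditioned - cfg_rate * unconditioned
    return (1 + cfg_rate) * conditioned - cfg_rate * unconditioned

cond = np.array([1.0, 2.0])      # toy "with speaker conditioning" output
uncond = np.array([0.5, 1.0])    # toy "unconditioned" output

print(apply_cfg(cond, uncond, cfg_rate=0.0))  # [1. 2.] -- no guidance, pure conditioned
print(apply_cfg(cond, uncond, cfg_rate=0.7))  # pushed past cond, away from uncond
```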

Temperature

Default: 1.0 · Range: 0.1–2.0 · Location: chatterbox/models/s3gen/flow_matching.py

Scales the initial noise fed to the flow-matching decoder:

| Value | Effect |
|---|---|
| < 1.0 | More deterministic, potentially less expressive |
| 1.0 | Default stochasticity |
| > 1.0 | More variation, potentially more expressive but less stable |
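In effect, temperature is just a scale on the starting Gaussian noise. A sketch of the idea (the real sampling lives in chatterbox/models/s3gen/flow_matching.py):

```python
import numpy as np

def initial_noise(shape, temperature=1.0, seed=0):
    """Draw the starting noise the CFM decoder refines, scaled by temperature."""
    rng = np.random.default_rng(seed)
    return temperature * rng.standard_normal(shape)

quiet = initial_noise((80, 100), temperature=0.5)    # more deterministic start
default = initial_noise((80, 100), temperature=1.0)  # default stochasticity
print(round(float(quiet.std() / default.std()), 2))  # 0.5 -- same noise, half the scale
```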

Reference Audio Parameters

Target Speaker Reference Length

  • Encoding: 6 seconds @ 16kHz (ENC_COND_LEN = 96,000 samples)
  • Decoding: 10 seconds @ 24kHz (DEC_COND_LEN = 240,000 samples)

Location: chatterbox/vc.py

The target reference audio is truncated to these lengths. Longer audio is clipped; shorter audio is used as-is.

Recommendations:

  • Minimum: 3 seconds (shorter clips produce less stable speaker embeddings)
  • Optimal: 6–10 seconds of clean, single-speaker speech
  • Content: Use diverse phonetic content (not just one word repeated)
  • Quality: Clean recording, minimal background noise, no music
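The truncation rule is simple clipping, sketched here with the constants above (the function is illustrative, not the library's internal code):

```python
ENC_COND_LEN = 6 * 16_000    # 96,000 samples: encoder reference @ 16 kHz
DEC_COND_LEN = 10 * 24_000   # 240,000 samples: decoder reference @ 24 kHz

def truncate_reference(samples: list, max_len: int) -> list:
    # Longer audio is clipped; shorter audio passes through unchanged.
    return samples[:max_len]

ref_16k = [0.0] * (8 * 16_000)   # an 8 s reference at 16 kHz
print(len(truncate_reference(ref_16k, ENC_COND_LEN)) / 16_000)  # 6.0
```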

Source Audio

No explicit length limit on source audio, but:

  • Very short clips (< 1 second) may produce artifacts
  • Very long clips (> 60 seconds) work but increase processing time linearly
  • Any sample rate is accepted (automatically resampled to 16kHz internally)

Architecture Constants (Advanced)

These require model retraining to change but are documented for understanding:

| Parameter | Value | Description |
|---|---|---|
| Output sample rate | 24,000 Hz | Fixed by HiFi-GAN vocoder architecture |
| Mel bins | 80 | Mel-spectrogram resolution |
| Mel hop size | 480 samples (20ms) | Temporal resolution of mel frames |
| Mel frequency range | 0–8,000 Hz | Captured frequency band |
| Content token rate | 25 tokens/sec | S3Tokenizer output rate |
| Content vocabulary | 6,561 tokens | Discrete codebook size (3^8) |
| Speaker embedding dim | 192 | CAMPPlus x-vector output |
| Conformer blocks | 6 | Encoder depth |
| Decoder mid-blocks | 12 | UNet1D bottleneck depth |
| Upsample rates | 8 × 5 × 3 = 120x | Mel-to-waveform upsampling |
| Time scheduler | Cosine | CFM step distribution: t′ = 1 − cos(t × π/2) |
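The cosine schedule in the last row front-loads small steps near t = 0. A quick numerical check (a sketch; the actual scheduler is inside the CFM sampler):

```python
import math

def cosine_schedule(n_steps: int) -> list[float]:
    # t' = 1 - cos(t * pi / 2) for t evenly spaced in [0, 1]
    return [1 - math.cos((i / n_steps) * math.pi / 2) for i in range(n_steps + 1)]

ts = cosine_schedule(10)
first_gap = ts[1] - ts[0]    # small step near t = 0
last_gap = ts[-1] - ts[-2]   # much larger step near t = 1
print(round(first_gap, 3), round(last_gap, 3))
```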

Tips for Best Results

  1. Target reference quality matters most. Use 6-10 seconds of clean speech with varied phonetic content. Background noise in the reference degrades all conversions.

  2. Source audio should be intelligible. The model preserves content from the source — if the source is unclear, the output will be too.

  3. Same-language works best. Cross-language conversion (e.g., English source → Japanese target speaker) may produce accented output.

  4. Batch conversion is faster. When converting multiple files to the same target voice, use convert_batch() or set_target_voice() to avoid redundant speaker embedding extraction.

  5. GPU memory: The model requires ~4 GB of GPU memory. If you have multiple GPUs, use the device parameter to spread load.

Project Structure

chatterbox-voice-conversion/
├── chatterbox_vc/
│   ├── __init__.py          # Package entry point, exports VoiceConverter
│   ├── convert.py           # Core VoiceConverter class with convert/batch APIs
│   └── worker.py            # Subprocess worker + WorkerClient for GPU isolation
├── examples/
│   ├── basic_conversion.py  # Single-file conversion CLI example
│   └── batch_conversion.py  # Multi-file batch conversion CLI example
├── tests/
│   └── test_conversion.py   # Integration test that verifies end-to-end conversion
├── LICENSE                  # Apache 2.0
├── pyproject.toml           # Package metadata and dependencies
└── README.md                # This file

License

This project is licensed under the Apache License 2.0.

The underlying Chatterbox model is developed by Resemble AI and has its own license terms.

Acknowledgments

  • Resemble AI for the Chatterbox TTS/VC model
  • LAION for supporting open-source AI research
