A Python library for zero-shot voice conversion built on Resemble AI's Chatterbox S3Gen model. Converts the speaker identity of any speech audio to match a target speaker while preserving the original linguistic content and prosody.
Chatterbox VC uses a three-stage pipeline:
```
Source audio (any speaker) ──→ S3Tokenizer (16 kHz) ──→ Content tokens (25 tokens/sec)
                                                              ↓
Target speaker audio (6–10 s) ──→ CAMPPlus encoder ──→ Speaker embedding (192-dim x-vector)
                                                              ↓
                                            S3Gen Flow-Matching Decoder
                                      (Conditional Flow Matching, 10 steps)
                                                              ↓
                                               Mel-spectrogram (80 bins)
                                                              ↓
                                          HiFi-GAN Vocoder + F0 Predictor
                                                              ↓
                                              24 kHz output waveform
                                                              ↓
                                           PerTH Watermark (imperceptible)
```
Stage 1 — Content Extraction: The S3Tokenizer encodes the source speech at 16kHz into discrete content tokens at 25Hz (one token per 40ms). These tokens capture what is said — phonetic content, prosody timing, and rhythm — but discard speaker identity information.
Stage 2 — Speaker Conditioning: The CAMPPlus speaker encoder extracts a 192-dimensional x-vector from the target speaker's reference audio. This embedding captures the target's vocal identity: timbre, pitch range, and speaking style.
Stage 3 — Waveform Synthesis: The S3Gen decoder takes the content tokens and speaker embedding, then uses Conditional Flow Matching (CFM) to iteratively refine noise into a mel-spectrogram over N timesteps (default 10). The mel-spectrogram is converted to a 24kHz waveform by the HiFi-GAN vocoder, which includes an F0 (pitch) predictor for natural intonation. Finally, an imperceptible PerTH watermark is embedded.
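As a quick sanity check on the rates above, the intermediate sizes for a given clip can be computed directly (`pipeline_sizes` is a hypothetical helper for illustration, not part of the library):

```python
# Back-of-envelope sizing for the three pipeline stages, using the documented
# rates: 25 content tokens/sec, 20 ms mel hop, 24 kHz output.
def pipeline_sizes(source_seconds: float) -> dict:
    """Estimate intermediate tensor sizes for a source clip of given length."""
    tokens = int(source_seconds * 25)             # S3Tokenizer: one token per 40 ms
    mel_frames = int(source_seconds * 1000 / 20)  # 20 ms hop -> 50 frames/sec
    out_samples = int(source_seconds * 24_000)    # HiFi-GAN output at 24 kHz
    return {"content_tokens": tokens, "mel_frames": mel_frames,
            "speaker_dim": 192, "output_samples": out_samples}

print(pipeline_sizes(3.0))
# {'content_tokens': 75, 'mel_frames': 150, 'speaker_dim': 192, 'output_samples': 72000}
```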
- Python 3.10+
- PyTorch 2.0+ with CUDA support (GPU strongly recommended — CPU inference is ~20x slower)
- ~4 GB GPU memory
```
git clone https://github.com/LAION-AI/chatterbox-voice-conversion.git
cd chatterbox-voice-conversion
pip install -e .
```

This installs the `chatterbox_vc` package and its dependencies, including `chatterbox-tts`, which provides the underlying model.
If you prefer to install dependencies manually:
```
pip install "chatterbox-tts>=0.1.1" torch torchaudio librosa soundfile numpy
```

```python
from chatterbox_vc import VoiceConverter

# Load model (~1.5 GB, auto-downloaded from HuggingFace on first run)
vc = VoiceConverter(device="cuda:0")

# Convert source audio to sound like the target speaker
wav = vc.convert("source_speech.wav", "target_speaker.wav", "output.wav")
print(f"Output: {len(wav)} samples at {vc.sample_rate}Hz")
# Output: 72000 samples at 24000Hz
```

```
# Single file conversion
python examples/basic_conversion.py \
    --source source_speech.wav \
    --target target_speaker.wav \
    --output converted.wav \
    --device cuda:0

# Batch conversion (extracts target speaker embedding once)
python examples/batch_conversion.py \
    --sources audio1.wav audio2.wav audio3.wav \
    --target target_speaker.wav \
    --output-dir ./converted/ \
    --device cuda:0
```

`VoiceConverter(device="cuda:0", model_path=None)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `device` | `str` | `"cuda:0"` | PyTorch device. Use `"cuda:0"`, `"cuda:1"`, etc. for GPU, or `"cpu"` (very slow). |
| `model_path` | `str` or `None` | `None` | Path to a local checkpoint directory. If `None`, weights are auto-downloaded from HuggingFace (`ResembleAI/chatterbox`). |
`convert(source_audio, target_voice, output_path=None)`

Convert a single audio file.

| Parameter | Type | Description |
|---|---|---|
| `source_audio` | `str` | Path to the source WAV file (speech to convert). Any sample rate. |
| `target_voice` | `str` | Path to the target speaker's WAV file (6–10 s of clean speech recommended). Any sample rate. |
| `output_path` | `str` or `None` | If provided, saves the output WAV to this path. Directories are created automatically. |

Returns: NumPy array of float32 audio at 24 kHz, shape `(num_samples,)`.
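The returned array can be post-processed directly; a small sketch using synthetic data in place of a real conversion result:

```python
import numpy as np

# Working with the returned array. `wav` below is synthetic stand-in data
# shaped like a real 3-second conversion result (72,000 samples at 24 kHz).
wav = 0.25 * np.ones(72_000, dtype=np.float32)

duration_s = len(wav) / 24_000                   # sample rate is fixed at 24 kHz
peak = float(np.abs(wav).max())
wav_normalized = wav * (0.95 / max(peak, 1e-9))  # simple peak normalization

print(f"{duration_s:.1f}s, peak {peak:.2f} -> {np.abs(wav_normalized).max():.2f}")
```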
`convert_batch(source_paths, target_voice, output_dir)`

Convert multiple source files to the same target voice. More efficient than calling `convert()` in a loop because the target speaker embedding is extracted only once.

| Parameter | Type | Description |
|---|---|---|
| `source_paths` | `list[str]` | List of source WAV file paths. |
| `target_voice` | `str` | Path to the target speaker's WAV file. |
| `output_dir` | `str` | Directory for output files (named `{stem}_converted.wav`). |

Returns: List of result dicts with keys: `source`, `output`, `duration_s`, `elapsed_s`, `success`, and `error` (on failure).
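A sketch of summarizing batch results, using fabricated entries that follow the documented key schema:

```python
# Summarizing convert_batch() results. The keys below follow the documented
# schema (source, output, duration_s, elapsed_s, success, error); the sample
# data itself is fabricated for illustration.
results = [
    {"source": "a.wav", "output": "converted/a_converted.wav",
     "duration_s": 3.0, "elapsed_s": 0.9, "success": True},
    {"source": "b.wav", "output": None, "duration_s": 0.0,
     "elapsed_s": 0.1, "success": False, "error": "file not found"},
]

ok = [r for r in results if r["success"]]
failed = [r for r in results if not r["success"]]
rtf = sum(r["elapsed_s"] for r in ok) / sum(r["duration_s"] for r in ok)
print(f"{len(ok)} ok, {len(failed)} failed, real-time factor {rtf:.2f}")
# 1 ok, 1 failed, real-time factor 0.30
```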
`set_target_voice(target_voice)`

Pre-compute and cache the target speaker embedding. Useful when converting many files to the same target voice.
`sample_rate`

Property returning the output sample rate (always 24000).
For production deployments where you need GPU isolation or a separate Python environment, use the subprocess worker:
```python
from chatterbox_vc.worker import WorkerClient

# Spawn a worker subprocess on a specific GPU
client = WorkerClient(device="cuda:1")

# Convert audio (synchronous)
result = client.convert("source.wav", "target.wav", "output.wav")
print(f"Done in {result['elapsed']}s at {result['sample_rate']}Hz")

# Clean up
client.shutdown()
```

Or run the worker directly and communicate via a JSON-line protocol:

```
python -m chatterbox_vc.worker --device cuda:0
```

The worker reads JSON requests from stdin and writes responses to fd 3 (or stdout if fd 3 is unavailable):

```json
{"source": "/path/to/source.wav", "target": "/path/to/target.wav", "output": "/path/to/output.wav"}
```

Response:

```json
{"status": "ok", "output": "/path/to/output.wav", "sample_rate": 24000, "elapsed": 2.34}
```

The Chatterbox VC model has several hyperparameters that affect output quality, speed, and characteristics. Most are architectural constants, but a few can be tuned at inference time.
**Denoising steps** · Default: 10 · Range: 1–100 · Location: passed to `model.generate()` internally
The number of Conditional Flow Matching (CFM) denoising steps. This is the most impactful quality/speed trade-off:
| Steps | Quality | Speed | Use Case |
|---|---|---|---|
| 1–5 | Lower, may have artifacts | Very fast | Real-time/streaming, previews |
| 10 | Good (default) | ~0.3x real-time on GPU | Production use |
| 20–50 | Marginally better | 2–5x slower | Maximum quality, offline processing |
| 50+ | Diminishing returns | Very slow | Not recommended |
To modify: edit the `generate()` call in `chatterbox/vc.py` or patch the model's `inference()` method.
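The speed/quality trade-off exists because flow matching numerically integrates an ODE from noise toward data, one Euler-style update per step. A toy ODE (`dx/dt = -x`, not the model's actual vector field) shows how the integration error shrinks with step count, with diminishing returns:

```python
import math

# More steps -> smaller integration error, but with diminishing returns.
def euler(steps: int) -> float:
    """Integrate dx/dt = -x from t=0 to t=1 with fixed-step Euler."""
    x, dt = 1.0, 1.0 / steps
    for _ in range(steps):
        x += dt * (-x)        # one integration ("denoising") step
    return x

exact = math.exp(-1.0)        # analytic solution at t = 1
for n in (1, 5, 10, 50):
    print(f"{n:>3} steps: abs error {abs(euler(n) - exact):.4f}")
```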
**`cfg_rate` (classifier-free guidance)** · Default: 0.7 · Range: 0.0–2.0 · Location: `chatterbox/models/s3gen/configs.py`
Controls how strongly the model follows the speaker conditioning signal:
| Value | Effect |
|---|---|
| 0.0 | No guidance — output may not match target speaker well |
| 0.5 | Light guidance — more natural but less speaker-similar |
| 0.7 | Default — good balance of naturalness and speaker match |
| 1.0+ | Strong guidance — closer speaker match but may sound less natural |
Formula: `output = (1 + cfg_rate) * conditioned - cfg_rate * unconditioned`
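The formula, evaluated with NumPy on toy vectors (`apply_cfg` is an illustrative helper, not a library function):

```python
import numpy as np

# Classifier-free guidance exactly as the formula above: extrapolate from the
# unconditioned prediction toward (and past) the speaker-conditioned one.
def apply_cfg(conditioned, unconditioned, cfg_rate=0.7):
    return (1 + cfg_rate) * conditioned - cfg_rate * unconditioned

cond = np.array([1.0, 2.0])     # toy "conditioned" prediction
uncond = np.array([0.5, 1.0])   # toy "unconditioned" prediction

print(apply_cfg(cond, uncond, 0.0))  # equals cond: the guidance term vanishes
print(apply_cfg(cond, uncond, 0.7))  # pushed past cond, away from uncond
```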
**Initial noise scale** · Default: 1.0 · Range: 0.1–2.0 · Location: `chatterbox/models/s3gen/flow_matching.py`
Scales the initial noise fed to the flow-matching decoder:
| Value | Effect |
|---|---|
| < 1.0 | More deterministic, potentially less expressive |
| 1.0 | Default stochasticity |
| > 1.0 | More variation, potentially more expressive but less stable |
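A quick NumPy illustration of what this scale does to the decoder's starting point (illustrative only; the actual noise tensors live inside the model):

```python
import numpy as np

# The decoder starts from sigma * N(0, I), so sigma directly scales the
# spread of the initial noise fed to the flow-matching ODE.
rng = np.random.default_rng(0)
base = rng.standard_normal(100_000)   # stand-in for the initial noise tensor

for sigma in (0.5, 1.0, 1.5):
    print(f"sigma={sigma}: initial-noise std ~ {np.std(sigma * base):.2f}")
```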
- Encoding: 6 seconds @ 16 kHz (`ENC_COND_LEN` = 96,000 samples)
- Decoding: 10 seconds @ 24 kHz (`DEC_COND_LEN` = 240,000 samples)
- Location: `chatterbox/vc.py`

The target reference audio is truncated to these lengths. Longer audio is clipped; shorter audio is used as-is.
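The clip-don't-pad behavior can be sketched as follows (`truncate_reference` is a hypothetical helper mirroring the documented behavior, not the library's internal code):

```python
import numpy as np

# Audio beyond the conditioning length is clipped; shorter audio
# passes through unchanged (no padding).
ENC_COND_LEN = 96_000  # 6 s @ 16 kHz, as documented

def truncate_reference(wav: np.ndarray, max_len: int = ENC_COND_LEN) -> np.ndarray:
    return wav[:max_len]  # slicing never pads, so short clips are untouched

long_ref = np.zeros(160_000, dtype=np.float32)   # 10 s @ 16 kHz
short_ref = np.zeros(48_000, dtype=np.float32)   # 3 s @ 16 kHz
print(len(truncate_reference(long_ref)))   # 96000
print(len(truncate_reference(short_ref)))  # 48000
```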
Recommendations:
- Minimum: 3 seconds (shorter clips produce less stable speaker embeddings)
- Optimal: 6–10 seconds of clean, single-speaker speech
- Content: Use diverse phonetic content (not just one word repeated)
- Quality: Clean recording, minimal background noise, no music
No explicit length limit on source audio, but:
- Very short clips (< 1 second) may produce artifacts
- Very long clips (> 60 seconds) work but increase processing time linearly
- Any sample rate is accepted (automatically resampled to 16kHz internally)
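Resampling happens inside the library, but the idea can be sketched with a minimal linear-interpolation resampler (real pipelines use a proper polyphase or windowed-sinc resampler, e.g. librosa or torchaudio; `resample_linear` is illustrative only):

```python
import numpy as np

# Minimal resampler: evaluate the input signal at the output timestamps
# via linear interpolation. Fine for illustration, not for production audio.
def resample_linear(wav: np.ndarray, sr_in: int, sr_out: int = 16_000) -> np.ndarray:
    n_out = int(round(len(wav) * sr_out / sr_in))
    t_in = np.arange(len(wav)) / sr_in     # input sample timestamps
    t_out = np.arange(n_out) / sr_out      # output sample timestamps
    return np.interp(t_out, t_in, wav).astype(np.float32)

wav_48k = np.zeros(48_000, dtype=np.float32)   # 1 s @ 48 kHz
print(resample_linear(wav_48k, 48_000).shape)  # (16000,)
```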
These require model retraining to change but are documented for understanding:
| Parameter | Value | Description |
|---|---|---|
| Output sample rate | 24,000 Hz | Fixed by HiFi-GAN vocoder architecture |
| Mel bins | 80 | Mel-spectrogram resolution |
| Mel hop size | 480 samples (20ms) | Temporal resolution of mel frames |
| Mel frequency range | 0–8,000 Hz | Captured frequency band |
| Content token rate | 25 tokens/sec | S3Tokenizer output rate |
| Content vocabulary | 6,561 tokens | Discrete codebook size (3^8) |
| Speaker embedding dim | 192 | CAMPPlus x-vector output |
| Conformer blocks | 6 | Encoder depth |
| Decoder mid-blocks | 12 | UNet1D bottleneck depth |
| Upsample rates | 8 × 5 × 3 = 120x | Mel-to-waveform upsampling |
| Time scheduler | Cosine | CFM step distribution: t = 1 - cos(t × π/2) |
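The cosine scheduler from the table can be reproduced directly: warping uniform steps with `t' = 1 - cos(t * pi/2)` makes early increments small, so the warped time points cluster near t = 0 (illustrative sketch, not the model's internal code):

```python
import numpy as np

# Cosine time schedule for CFM: warp n_steps+1 uniform points in [0, 1]
# with t' = 1 - cos(t * pi / 2).
def cosine_schedule(n_steps: int) -> np.ndarray:
    t = np.linspace(0.0, 1.0, n_steps + 1)
    return 1.0 - np.cos(t * np.pi / 2)

sched = cosine_schedule(10)
print(np.round(sched, 3))  # increments grow from ~0 toward the end
```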
- **Target reference quality matters most.** Use 6–10 seconds of clean speech with varied phonetic content. Background noise in the reference degrades all conversions.
- **Source audio should be intelligible.** The model preserves content from the source — if the source is unclear, the output will be too.
- **Same-language works best.** Cross-language conversion (e.g., English source → Japanese target speaker) may produce accented output.
- **Batch conversion is faster.** When converting multiple files to the same target voice, use `convert_batch()` or `set_target_voice()` to avoid redundant speaker embedding extraction.
- **GPU memory.** The model requires ~4 GB of GPU memory. If you have multiple GPUs, use the `device` parameter to spread load.
```
chatterbox-voice-conversion/
├── chatterbox_vc/
│   ├── __init__.py             # Package entry point, exports VoiceConverter
│   ├── convert.py              # Core VoiceConverter class with convert/batch APIs
│   └── worker.py               # Subprocess worker + WorkerClient for GPU isolation
├── examples/
│   ├── basic_conversion.py     # Single-file conversion CLI example
│   └── batch_conversion.py     # Multi-file batch conversion CLI example
├── tests/
│   └── test_conversion.py      # Integration test that verifies end-to-end conversion
├── LICENSE                     # Apache 2.0
├── pyproject.toml              # Package metadata and dependencies
└── README.md                   # This file
```
This project is licensed under the Apache License 2.0.
The underlying Chatterbox model is developed by Resemble AI and has its own license terms.
- Resemble AI for the Chatterbox TTS/VC model
- LAION for supporting open-source AI research