wildminder/ComfyUI-Chatterbox

ComfyUI Chatterbox

High-quality Text-to-Speech (TTS) and Voice Conversion (VC) nodes for ComfyUI, powered by Resemble AI's Chatterbox model.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. Acknowledgments

About The Project

This project provides ComfyUI custom nodes for Resemble AI's Chatterbox library. The nodes enable in-workflow Text-to-Speech and Voice Conversion and integrate with ComfyUI's model management system for efficient VRAM usage.

Major Update Notice

Note

  • 1.2.0: This version has been deeply refactored for better performance, stability, and alignment with the ComfyUI codebase. All parameters have been unlocked.

Features

  • Long generation: No longer limited to 40 seconds.
  • Chatterbox TTS Node: Synthesize speech from text with optional voice cloning from an audio prompt.
  • Chatterbox Voice Conversion Node: Convert the voice in a source audio file to a target voice.
  • Automatic Model Downloading: Models are automatically downloaded from Hugging Face on first use.
  • Efficient VRAM Management: Full integration with ComfyUI's model patcher system to load models to GPU only when needed and offload them afterward.
  • Detailed Generation Control: Fine-tune your audio output with parameters for speed, expressiveness, creativity, and quality.
  • Accurate Progress Bars: Both console and UI progress bars reflect the true step-by-step generation process.

Getting Started

Installation

  1. Install via ComfyUI Manager (Recommended):

    • Search for ComfyUI-Chatterbox in the ComfyUI Manager and install it.
  2. Manual Installation:

    • Clone this repository into your ComfyUI/custom_nodes/ directory:
      git clone https://github.com/wildminder/ComfyUI-Chatterbox.git ComfyUI/custom_nodes/ComfyUI-Chatterbox
  3. Install Dependencies:

    • Navigate to the new directory and install the required packages:
      cd ComfyUI/custom_nodes/ComfyUI-Chatterbox
      pip install -r requirements.txt
  4. Model Management:

Important

For users of previous versions: This update changes the model directory. You must manually delete your old model folder to avoid conflicts:

Delete this folder: ComfyUI/models/chatterbox_tts/

The new version will automatically download models to the correct ComfyUI-standard directory: ComfyUI/models/tts/chatterbox/.

  5. Restart ComfyUI.
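
If you are upgrading from a previous version, the folder cleanup from the Model Management step can be scripted and run before restarting. This is a minimal Python sketch and is not part of this repository: the helper name `remove_legacy_models` and its `dry_run` flag are hypothetical, and it assumes you pass the path to your own ComfyUI root. Only the legacy folder path comes from the note above.

```python
import shutil
from pathlib import Path

# Legacy model folder named in the note above, relative to the ComfyUI root.
LEGACY_DIR = Path("models") / "chatterbox_tts"


def remove_legacy_models(comfy_root, dry_run=True):
    """Remove the old model folder if it exists; return True if it was found.

    Hypothetical helper for illustration only.
    With dry_run=True the folder is reported but not deleted.
    """
    legacy = Path(comfy_root) / LEGACY_DIR
    if not legacy.is_dir():
        return False
    if dry_run:
        print(f"Would delete: {legacy}")
    else:
        shutil.rmtree(legacy)
        print(f"Deleted: {legacy}")
    return True
```

Calling `remove_legacy_models("ComfyUI")` only reports what would be removed; pass `dry_run=False` once the printed path looks right.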

Usage

After installation, you will find two new nodes:

  • Chatterbox TTS 📢 under the audio/generation category.
  • Chatterbox Voice Conversion 🗣️ under the audio/generation category.

Load an example workflow from the workflow-examples/ directory in this repository to get started.

Node Parameters Explained

Chatterbox TTS 📢 Parameters

  • max_new_tokens: Maximum number of audio tokens to generate. Acts as a failsafe against run-on generations. Roughly 25 tokens correspond to 1 second of audio, so the model's hard limit of 4096 tokens is about 164 seconds.
  • flow_cfg_scale: CFG scale for the mel spectrogram decoder. Higher values increase adherence to the text content and speaker timbre but may reduce naturalness.
  • exaggeration: Controls the expressiveness and emotional intensity. Higher values lead to more exaggerated prosody.
  • temperature: Controls the randomness of the token sampling process. Higher values produce more diverse and creative speech, while lower values are more deterministic.
  • cfg_weight: Classifier-Free Guidance (CFG) weight for the token sampling process.
  • repetition_penalty: Penalizes repeated tokens to discourage monotonous or repetitive speech. 1.0 means no penalty.
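  • min_p / top_p: Sampling filters that control the pool of tokens the model can choose from at each step. top_p is the nucleus sampling threshold: only the smallest set of most likely tokens whose cumulative probability reaches top_p is kept. min_p drops tokens whose probability falls below min_p times the probability of the most likely token.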
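
To pick a max_new_tokens value for a target clip length, the token-to-duration relationship above can be turned into a small calculation. This sketch only encodes the two figures stated in the list (about 25 tokens per second and a 4096-token hard limit); the function names are illustrative and are not part of the node's API.

```python
import math

TOKENS_PER_SECOND = 25  # approximate rate stated in the parameter list
MAX_TOKENS = 4096       # model hard limit stated in the parameter list


def tokens_for_duration(seconds):
    """Estimate max_new_tokens for a target duration, capped at the hard limit."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return min(MAX_TOKENS, math.ceil(seconds * TOKENS_PER_SECOND))


def duration_for_tokens(tokens):
    """Estimate the audio duration in seconds for a given token budget."""
    return tokens / TOKENS_PER_SECOND
```

For example, `tokens_for_duration(40)` gives a budget of 1000 tokens, and any request longer than the hard limit allows is capped at 4096. Since the rate is approximate, leaving some headroom above the estimate is a reasonable choice.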

Chatterbox Voice Conversion 🗣️ Parameters

  • n_timesteps: Number of diffusion steps for the flow matching process. Higher values can improve quality but will take longer to generate.
  • temperature: Controls the randomness of the initial noise for the diffusion process. 1.0 is standard. Lower values are more deterministic; higher values are more random.
  • flow_cfg_scale: CFG scale for the mel spectrogram decoder. Higher values increase adherence to the target voice's timbre but may reduce the naturalness of the speech prosody.
  • target_voice_audio: The audio file containing the target voice timbre. If not provided, the default voice from the selected model pack will be used.
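
For intuition on why a higher n_timesteps can improve quality, here is a toy illustration. It assumes the flow matching process is solved with fixed-size Euler steps from t = 0 to t = 1, which is a common choice but is not confirmed by this README. Nothing below is Chatterbox code: `euler_integrate` and `toy_velocity` are made-up names, and the velocity field is a simple stand-in for a learned model.

```python
import math


def euler_integrate(velocity, x0, n_timesteps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    if n_timesteps < 1:
        raise ValueError("n_timesteps must be at least 1")
    dt = 1.0 / n_timesteps
    x, t = x0, 0.0
    for _ in range(n_timesteps):
        x += dt * velocity(x, t)
        t += dt
    return x


def toy_velocity(x, t):
    """Stand-in velocity field; the exact solution at t=1 is x0 * exp(-1)."""
    return -x


exact = math.exp(-1.0)
for steps in (2, 10, 50):
    approx = euler_integrate(toy_velocity, 1.0, steps)
    print(f"n_timesteps={steps:>2}  error={abs(approx - exact):.5f}")
```

The printed error shrinks as the step count grows, while the number of velocity evaluations (and so the runtime) grows linearly with it. That is the same quality versus time trade-off described for n_timesteps.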

Acknowledgments
