This is a fork of the original CSM project that adds a complete Gradio-based web interface, making CSM usable through an intuitive UI. The fork also adds Windows and WSL compatibility and several usability improvements that simplify speech generation.
- Windows Compatibility: Full support for Windows with triton-windows package
- WSL Support: Dedicated setup script for Linux/WSL environments
- Model Format: Updated to use safetensors format (more efficient and secure)
- Model Flexibility: Support for both local models and HuggingFace-hosted models
- Improved Local Model Loading: Fixed issues with Llama-3.2-1B model loading
- Windows-Specific Generator: Separate generator file for Windows environments
- Robust Setup Scripts: Comprehensive setup for both Windows and WSL
- Git Integration: Optional Git-based model downloading for better reliability
The original project had issues loading the models because:
- The models are gated on HuggingFace and require authentication
- The Windows version had compatibility issues
- The tokenizer wasn't configured to use local files
Our solution:
- Use the non-gated drbaph/CSM-1B version
- Create a Windows-specific generator with proper path handling
- Provide multiple fallback methods for model loading
- Create robust setup scripts for different environments
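The fallback idea above can be sketched roughly as follows. This is a minimal illustration, not the fork's actual loading code: `resolve_model_path` is a hypothetical helper, and the download step is passed in as a callable so the sketch stays independent of any particular hub client.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def resolve_model_path(
    candidates: Iterable[Path],
    download: Optional[Callable[[], Path]] = None,
) -> Path:
    """Return the first local candidate that exists, else try the download fallback.

    Hypothetical helper for illustration; the fork's own scripts may differ.
    """
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    if download is not None:
        # Last resort: fetch the file (for example from the Hugging Face Hub).
        return Path(download())
    raise FileNotFoundError("No local model file found and no download fallback given")
```

With `huggingface_hub` installed, the fallback could be something like `lambda: hf_hub_download(repo_id="drbaph/CSM-1B", filename="model.safetensors")`; the filename inside that repository is an assumption here and should be checked against the repository listing.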
Our setup ensures compatible versions of key packages:
- Windows:
  - triton-windows instead of triton
  - numpy==1.26.4
  - scipy==1.11.4
  - Flask==2.2.5
  - librosa==0.10.0
  - SoundFile==0.12.1
  - PyTorch 2.4.0 with CUDA 12.4 support
- WSL/Linux:
  - Standard triton package
  - Compatible numpy and scipy versions
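To confirm that an existing environment matches the Windows pins listed above, a small check along these lines may help. It is a sketch using only the standard library; `WINDOWS_PINS` and `find_mismatches` are illustrative names, and PyTorch is left out because its installed version string usually carries a build suffix that an exact comparison would reject.

```python
from importlib import metadata
from typing import Callable, Dict, Mapping, Optional, Tuple

# Pins copied from the Windows list above (PyTorch intentionally omitted).
WINDOWS_PINS = {
    "numpy": "1.26.4",
    "scipy": "1.11.4",
    "Flask": "2.2.5",
    "librosa": "0.10.0",
    "SoundFile": "0.12.1",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Mapping[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs from its pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(WINDOWS_PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

Run it inside the activated virtual environment; an empty result means every listed package matches its pin.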
The project uses the CSM-1B .safetensors model:
- CSM-1B: Available at drbaph/CSM-1B
Models are stored in specific directories:
- Windows: models/model.safetensors
- WSL: Same structure, but will default to the original model paths if not found locally
IMPORTANT: Before installation, you need to:
- Create a HuggingFace account at huggingface.co
- Visit meta-llama/Llama-3.2-1B and request access to the model
- Create a HuggingFace access token at huggingface.co/settings/tokens
- During the installation process, you will be prompted to authenticate with your HuggingFace token - paste the token when prompted
Without access to the Llama model and proper authentication, the program will not work as it uses the Llama backbone.
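Before starting the setup, it can be useful to check whether a token is already visible on the machine. The sketch below assumes the usual `huggingface_hub` conventions: the `HF_TOKEN` environment variable, and a `token` file under `HF_HOME` (defaulting to `~/.cache/huggingface`). `find_hf_token` is a hypothetical helper, and it only confirms that a token is stored, not that access to the gated Llama model has been granted.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look for a Hugging Face token in the environment, then in the token file.

    Assumes the default huggingface_hub locations; adjust if yours differ.
    """
    token = env.get("HF_TOKEN")
    if token and token.strip():
        return token.strip()
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    token_file = Path(hf_home) / "token"
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


if __name__ == "__main__":
    # Report presence only; never print the token itself.
    print("token found" if find_hf_token() else "no token found")
```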
# Clone the repository
git clone https://github.com/Saganaki22/CSM-WebUI.git
cd CSM-WebUI
# Step 1: Run the improved Windows setup script
verbose-win-setup.bat
# Step 2: Fix PyTorch compilation issues
fix-torch-compile.bat
# Step 3: Run the application using the generated script
run_fixed.bat
- Run verbose-win-setup.bat
  - Installs all dependencies with detailed output
  - Sets up PyTorch with CUDA 12.4 support
  - Creates virtual environment and installs required packages
  - Optionally downloads the model file
  - IMPORTANT: If the script just git pulls and stops without installing dependencies, close it and run it once more
  - IMPORTANT: If you encounter a .venv issue, delete the .venv folder and run verbose-win-setup.bat again - this should fix the issue
- Run fix-torch-compile.bat
  - Patches the Moshi library to fix PyTorch compilation errors
  - Creates the run_fixed.bat launcher
  - Makes a backup of the original file in case you need to restore it
- Use run_fixed.bat to launch the application
  - This script is automatically created by fix-torch-compile.bat
  - Properly activates the virtual environment and launches the application
# Clone the repository
git clone https://github.com/Saganaki22/CSM-WebUI.git
cd CSM-WebUI
# Run the WSL setup script
bash wsl-setup.sh
# After setup completes, run the application
python wsl-gradio.py
- HuggingFace Authentication Issues
  - Make sure you have a HuggingFace account and have requested access to meta-llama/Llama-3.2-1B
  - Create an access token at huggingface.co/settings/tokens before running the setup
  - When prompted during installation, paste your HuggingFace token for authentication
- Setup Script Stopping After Git Pull
  - If verbose-win-setup.bat only performs a git pull and stops without installing dependencies, simply close the script and run it again
  - This is a known issue that can occur on the first run
- Virtual Environment (.venv) Issues
  - If you encounter errors related to the .venv folder, delete the entire .venv folder and run verbose-win-setup.bat again
  - This completely recreates the virtual environment and resolves most initialization issues
- PyTorch Compilation Errors
  - If you encounter "dataclass errors" or "must be called with a dataclass type or instance" errors, make sure you've run the fix-torch-compile.bat script
  - This error occurs with PyTorch 2.4.0's compilation system on Windows
- Missing bitsandbytes Error
  - If you see a "No module named 'bitsandbytes'" error, run verbose-win-setup.bat again, which installs this package
  - Alternatively, install it manually: pip install bitsandbytes-windows
- CUDA Not Available
  - If PyTorch doesn't detect your CUDA GPU, verify your NVIDIA drivers are up to date
  - Check with: python -c "import torch; print(torch.cuda.is_available())"
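When the one-line CUDA check above prints False, a slightly fuller report can help narrow down the cause. The sketch below is illustrative (`cuda_report` is not part of this repository) and returns early if PyTorch is not installed. A `cuda_build` of None means a CPU-only PyTorch build is installed, in which case updating drivers alone will not help.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about the local PyTorch/CUDA setup."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
        "device_name": None,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```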
If you encounter model loading errors on WSL/Linux:
# Make sure you're using the correct paths
python wsl-gradio.py --model-path models/model.safetensors
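If the path is correct but loading still fails, the file itself may be incomplete or may not be a safetensors file at all (for example, a Git LFS pointer file left behind when a repository was cloned without LFS). A safetensors file begins with an 8-byte little-endian header length followed by a JSON header, so a quick sanity check can be sketched with the standard library alone. `read_safetensors_header` is a hypothetical helper, not something this repository provides.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path) -> dict:
    """Parse and return the JSON header of a safetensors file.

    Raises ValueError if the file is too short or the header cannot be valid.
    """
    size = Path(path).stat().st_size
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file is too short to be a safetensors file")
        (header_len,) = struct.unpack("<Q", prefix)  # little-endian uint64
        if header_len > size - 8:
            raise ValueError("header length exceeds file size; file may be truncated or not safetensors")
        header = json.loads(f.read(header_len).decode("utf-8"))
    if not isinstance(header, dict):
        raise ValueError("safetensors header is not a JSON object")
    return header


if __name__ == "__main__":
    header = read_safetensors_header("models/model.safetensors")
    tensor_names = [name for name in header if name != "__metadata__"]
    print(f"{len(tensor_names)} tensors listed in header")
```

This only validates the header; it does not verify the tensor data or confirm that the weights match the CSM architecture.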
This fork is designed to let you use both environments without conflicts:
- Windows will use the triton-windows package and the win-gradio.py file
- WSL will use the standard triton package and the wsl-gradio.py file
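The split between the two entry points can be illustrated with a tiny platform check. This is only a sketch of the idea (`pick_entry_script` is a made-up name), and on Windows the recommended way to launch remains run_fixed.bat. Inside WSL, `platform.system()` reports "Linux", so the WSL script is selected there.

```python
import platform
from typing import Optional


def pick_entry_script(system_name: Optional[str] = None) -> str:
    """Return the Gradio entry script that matches the current platform."""
    system_name = system_name or platform.system()
    # Native Windows gets the Windows-specific UI; everything else uses the WSL/Linux one.
    return "win-gradio.py" if system_name == "Windows" else "wsl-gradio.py"


if __name__ == "__main__":
    print(pick_entry_script())
```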
CSM-WebUI/
├── models/                  # Directory for model files
│   └── model.safetensors    # CSM model file (where setup scripts save model)
├── sounds/                  # Directory for example audio files
│   ├── man.mp3              # Male voice example
│   └── woman.mp3            # Female voice example
├── generator.py             # Generator for speech synthesis
├── watermarking.py          # Audio watermarking functionality
├── wsl-gradio.py            # Gradio UI for WSL/Linux
├── win-gradio.py            # Windows-specific Gradio UI
├── verbose-win-setup.bat    # Improved setup script for Windows with verbose output
├── fix-torch-compile.bat    # Script to fix PyTorch compilation issues
├── wsl-setup.sh             # Setup script for WSL/Linux
└── requirements.txt         # Python package requirements
- Model Storage: The original required manual downloads; our version simplifies this
- File Format: Using .safetensors for better security and compatibility
- Windows Support: Added comprehensive Windows support with separate setup script
- Dual Environments: Support for both Windows and WSL without conflicts
- Robust Error Handling: Multiple fallback methods for model loading
- Streamlined UI: Unified interface across platforms
- PyTorch Compatibility: Fixes for Windows-specific PyTorch compilation issues
2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.
A hosted Hugging Face space is also available for testing audio generation.
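For readers unfamiliar with RVQ (residual vector quantization), the toy example below shows the general idea: each codebook quantizes whatever the previous one left over, so a vector is represented by one code index per codebook. The codebooks and values are invented for illustration and have nothing to do with the actual Mimi codec or CSM weights.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage, each stage encoding the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse stage followed by a fine correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]

codes = rvq_encode([1.25, 1.0], [coarse, fine])
reconstruction = rvq_decode(codes, [coarse, fine])
```

In this toy case the second codebook captures exactly what the first one missed, so the reconstruction equals the input; real codecs are lossy and use many more entries per codebook.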
- A CUDA-compatible GPU
- The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
- Similarly, Python 3.10 is recommended, but newer versions may be fine
- For some audio operations, ffmpeg may be required
- Access to the following Hugging Face models:
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The triton package cannot be installed on Windows. Instead, use pip install triton-windows.
Generate a sentence
from huggingface_hub import hf_hub_download
from generator import load_csm_1b
import torchaudio
import torch
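# Pick the best available device: Apple MPS, then CUDA, then CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
# Download the checkpoint from the Hugging Face Hub and load the generator
model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, device)
# Generate up to 10 seconds of audio for speaker 0, with no prior context
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)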
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
CSM sounds best when provided with context. You can prompt or provide context to the model using a Segment for each speaker's utterance.
# Segment is defined alongside the generator; `generator` and `torchaudio`
# come from the previous example
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the file and resample it to the generator's sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# One Segment per prior utterance: transcript, speaker id, and audio
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Does this model come with any voices?
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model?
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
Does it support other languages?
The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.