Skip to content

Commit bce59c0

Browse files
mmaudetclaude
andcommitted
feat: Add automatic backend detection with MLX and PyTorch support
- Implemented dual backend architecture with auto-detection - MLX backend (macOS Metal GPU): Loads via moshi-mlx with native Metal acceleration - PyTorch backend (CUDA/CPU): Fallback for non-Mac platforms - Dummy backend: Test mode when no ML frameworks available Backend Detection Logic: - Tries MLX first (import mlx.core, moshi_mlx) - preferred on Apple Silicon - Falls back to PyTorch (import torch) if MLX unavailable - Uses dummy model if neither available MLX Implementation (app.py:299-359): - Loads config and weights using hf_get() from moshi_mlx - Creates Lm, Mimi, and tokenizer manually - Uses mx.bfloat16 for model dtype - Supports CFG distillation same as PyTorch version Synthesis Updates: - MLX path (app.py:480-541): Uses decode_step() for frame decoding - PyTorch path (app.py:544-604): Uses mimi.streaming() context - Both paths produce identical 24kHz WAV output Health Endpoint Enhancement: - Reports backend type: "mlx (Metal GPU)", "cuda (GPU)", or "cpu" - Helps users verify which backend is active Benefits: - 2-5x faster on Apple Silicon with MLX vs Docker/PyTorch CPU - No code changes needed - same API for all backends - Seamless deployment on Mac (MLX) or Linux/Windows (PyTorch) Updated CLAUDE.md with backend architecture documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 37766d5 commit bce59c0

3 files changed

Lines changed: 589 additions & 100 deletions

File tree

.gitignore

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ coverage.xml
2222
.hypothesis/
2323

2424
# Claude Code session files
25-
CLAUDE.md
25+
# Note: CLAUDE.md is kept in version control as project documentation
2626

2727
# Virtual Environments
2828
bin/

CLAUDE.md

Lines changed: 320 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,320 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
## Project Overview
6+
7+
Moshi TTS API is a REST API wrapper around Kyutai Labs' Moshi text-to-speech model. It provides a FastAPI-based service with bilingual support (French and English), 44 voice presets, Swagger documentation, and flexible deployment options (Docker with GPU/CPU, or native macOS with MLX).
8+
9+
**Backend Auto-Detection**: The application automatically detects and uses the best available backend:
10+
- **MLX** (macOS with Metal GPU) - Preferred on Apple Silicon, 2-5x faster
11+
- **PyTorch** (CUDA/CPU) - Fallback for other platforms
12+
- **Dummy** - Test mode when no ML backend is available
13+
14+
## Development Commands
15+
16+
### Local Development (Non-Docker)
17+
18+
```bash
19+
# Install dependencies (without Moshi - for testing API structure)
20+
pip install fastapi uvicorn pydantic pydantic-settings numpy scipy python-multipart aiofiles
21+
22+
# Install with Moshi TTS (requires PyTorch)
23+
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126 # For CUDA 12.6
24+
pip install moshi
25+
26+
# Run the API server locally
27+
python app.py
28+
# OR with uvicorn directly
29+
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
30+
31+
# Access API documentation
32+
# Swagger UI: http://localhost:8000/docs
33+
# ReDoc: http://localhost:8000/redoc
34+
```
35+
36+
### Native macOS Installation (Apple Silicon)
37+
38+
For Mac M1/M2/M3/M4/M5 users, use MLX for optimal Metal GPU acceleration:
39+
40+
**Requirements:**
41+
- macOS with Apple Silicon (ARM64)
42+
- Python 3.10, 3.11, or 3.12 (MLX does not support Python 3.13+ yet)
43+
44+
```bash
45+
# Check Python version
46+
python3 --version # Must be 3.10.x, 3.11.x, or 3.12.x
47+
48+
# If you have Python 3.13+, install a compatible version:
49+
brew install pyenv
50+
pyenv install 3.12
51+
pyenv local 3.12
52+
53+
# Run installation script
54+
./install-macos-mlx.sh
55+
56+
# Activate environment and start server
57+
source venv-moshi-mlx/bin/activate
58+
python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
59+
```
60+
61+
**Why MLX for macOS:**
62+
- Direct Metal GPU access (Docker cannot access Metal framework)
63+
- 2-5x faster than CPU/Docker versions
64+
- Optimized for Apple Silicon
65+
66+
**Python Version Issues:**
67+
If installation fails with "no matching distributions available for mlx", you're likely using Python 3.13+. The installation script will now detect this and provide instructions.
68+
69+
### Testing
70+
71+
```bash
72+
# Run pytest tests
73+
pytest tests/ -v
74+
75+
# Run tests with coverage
76+
pytest tests/ --cov=./ --cov-report=xml
77+
78+
# Test the API (requires running server)
79+
chmod +x test_api.sh
80+
./test_api.sh
81+
```
82+
83+
### Docker Development
84+
85+
```bash
86+
# Quick build and run (GPU)
87+
./build-and-run.sh
88+
89+
# Docker Compose with GPU (recommended)
90+
docker compose up -d --build
91+
92+
# Manual Docker with GPU
93+
docker build -t moshi-tts-api:latest .
94+
docker run -d --name moshi-tts-api -p 8000:8000 -v $(pwd)/models:/app/models --gpus all moshi-tts-api:latest
95+
96+
# View logs
97+
docker compose logs -f
98+
# OR
99+
docker logs -f moshi-tts-api
100+
101+
# Rebuild after code changes
102+
docker compose up -d --build
103+
104+
# Stop and remove
105+
docker compose down
106+
docker rm -f moshi-tts-api
107+
```
108+
109+
## Architecture
110+
111+
### Application Structure
112+
113+
The codebase has a clean, modular structure:
114+
115+
- **app.py**: Main FastAPI application with all endpoints, model loading, and synthesis logic
116+
- **config.py**: Type-safe configuration management using pydantic-settings
117+
- **client.py**: Python client for programmatic API access (can be used as CLI or library)
118+
119+
### Key Architecture Patterns
120+
121+
**Configuration Management** (config.py):
122+
- Uses pydantic-settings for type-safe configuration
123+
- Supports `.env` file (local dev), environment variables (Docker), and defaults
124+
- Cached singleton pattern via `@lru_cache()` for performance
125+
- All settings documented with Field descriptions
126+
127+
**FastAPI Application** (app.py):
128+
- **Backend Auto-Detection**: Tries MLX first, falls back to PyTorch, then dummy mode
129+
- **Pydantic Models**: TTSRequest, HealthResponse, ErrorResponse with validation
130+
- **Enums**: LanguageCode (fr/en), AudioFormat (wav/raw), VoicePreset (44 voices)
131+
- **Global State**: Model loaded at startup, ThreadPoolExecutor for async synthesis
132+
- **Audio Processing**: 24kHz mono, NumPy → int16 PCM → WAV/RAW
133+
- **API Versioning**: All endpoints prefixed with `/api/v1/`
134+
- **CORS Middleware**: Configurable via settings
135+
136+
**Dual Backend Support**:
137+
The app automatically detects which ML backend is available:
138+
1. **MLX Backend** (app.py:33-44): Detects `mlx.core` and `moshi_mlx` packages
139+
- Model loading (app.py:299-359): Uses `hf_get()`, manual weight loading
140+
- Synthesis (app.py:480-541): Uses `decode_step()` for frame decoding
141+
- Device: Reports as "mlx (Metal GPU)"
142+
2. **PyTorch Backend** (app.py:45-50): Detects `torch` package
143+
- Model loading (app.py:362-426): Uses `CheckpointInfo.from_hf_repo()`
144+
- Synthesis (app.py:544-604): Uses `mimi.streaming()` context manager
145+
- Device: Auto-detects CUDA or CPU
146+
147+
**Threading Model**:
148+
- CPU-bound synthesis runs in ThreadPoolExecutor (2 workers default)
149+
- Uses `asyncio.run_in_executor()` to prevent blocking FastAPI event loop
150+
- Model lives in global state, shared across requests
151+
152+
**Model Integration** (app.py:262-343):
153+
- Attempts to load real Moshi TTS model from `moshi.models.tts`
154+
- Device selection: CUDA auto-detected or forced via `MODEL_DEVICE` env var
155+
- Dtype: Auto (bfloat16 for CUDA, float32 for CPU) or forced via config
156+
- CFG Distillation: Handles distilled models by setting `cfg_coef_conditioning`
157+
- Fallback: Uses dummy sine wave generator if Moshi unavailable (for testing)
158+
159+
**Synthesis Flow** (app.py:370-451):
160+
- Text → `prepare_script()` → voice selection → `make_condition_attributes()`
161+
- Generate frames → decode with MIMI → trim to `end_steps` → convert to NumPy
162+
- Handles both multi-speaker (voices in attributes) and single-speaker (voices as prefixes) models
163+
164+
### API Endpoints
165+
166+
All endpoints are under `/api/v1/` for versioning:
167+
168+
- `GET /` - API info and endpoint list
169+
- `GET /api/v1/health` - Health check with model status, device info
170+
- `GET /api/v1/languages` - List supported languages (fr, en)
171+
- `GET /api/v1/voices` - List all 44 voice presets with descriptions
172+
- `POST /api/v1/tts` - Main TTS endpoint (JSON → audio file)
173+
- `POST /api/v1/tts/file` - TTS from uploaded text file
174+
175+
### Voice Presets
176+
177+
44 voices from 4 collections (see VoicePreset enum in app.py:127-182):
178+
- **VCTK** (10 voices): British English speakers (p225-p234)
179+
- **CML-TTS** (10 voices): High-quality French speakers
180+
- **Expresso** (9 voices): English with emotions (happy, angry, calm, confused) and styles (whisper, fast, enunciated)
181+
- **EARS** (14 voices): Diverse English speakers (subset of 50)
182+
183+
Voice selection: Pass `"voice": "vctk/p226_023.wav"` or use enum name `"voice": "vctk_p226"`
184+
185+
### Docker Architecture
186+
187+
**GPU Image** (Dockerfile):
188+
- Base: `nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04`
189+
- Python 3.12, system deps (git, libsndfile1, build tools)
190+
- Uses `uv` package manager (10-100x faster than pip)
191+
- Installs PyTorch + moshi together to avoid duplicate downloads
192+
- Runs as non-root user `appuser` (UID 1001) for security
193+
- Health check on `/api/v1/health`
194+
- Model cache at `/app/models` (volume mount)
195+
196+
**Multi-architecture**: GitHub Actions workflow supports `linux/amd64` (GPU) builds
197+
198+
### Configuration
199+
200+
Environment variables (see config.py for all options):
201+
202+
```bash
203+
# Server
204+
HOST=0.0.0.0
205+
PORT=8000
206+
LOG_LEVEL=info
207+
WORKERS=1
208+
209+
# Model
210+
DEFAULT_TTS_REPO=kyutai/tts-1.6b-en_fr
211+
DEFAULT_VOICE_REPO=kyutai/tts-voices
212+
SAMPLE_RATE=24000
213+
MODEL_DEVICE=cuda # or cpu, auto if not set
214+
MODEL_DTYPE=auto # auto, bfloat16, or float32
215+
MODEL_N_Q=32
216+
MODEL_TEMP=0.6
217+
MODEL_CFG_COEF=2.0
218+
219+
# CORS
220+
CORS_ORIGINS=*
221+
CORS_CREDENTIALS=true
222+
```
223+
224+
Set via `.env` file (local), Docker environment, or docker-compose.yml.
225+
226+
## Important Implementation Details
227+
228+
### Audio Processing
229+
- Sample rate: **24kHz** (fixed, do not change without model retraining)
230+
- Format: Mono channel, 16-bit signed integer PCM
231+
- WAV: Standard RIFF WAVE with headers
232+
- RAW: PCM only (convert: `ffmpeg -f s16le -ar 24000 -ac 1 -i input.raw output.wav`)
233+
234+
### Input Validation
235+
- Text length: 1-5000 characters (configurable via `MAX_TEXT_LENGTH`)
236+
- Whitespace normalized automatically (app.py:209-216)
237+
- Languages: "fr" or "en"
238+
- File upload: Must be UTF-8
239+
240+
### Error Handling
241+
- Custom HTTPException and ValueError handlers (app.py:761-783)
242+
- Model availability checks before synthesis
243+
- Graceful fallback to dummy model (generates sine waves for testing)
244+
245+
### Startup/Shutdown
246+
- `@app.on_event("startup")`: Loads model, handles errors gracefully
247+
- `@app.on_event("shutdown")`: Cleans up model, empties CUDA cache, shuts down executor
248+
249+
## CI/CD
250+
251+
GitHub Actions workflow (`.github/workflows/docker-publish.yml`):
252+
- **Triggers**: Push to main/master, PRs, tags (v*.*.*)
253+
- **Build**: Docker image for `linux/amd64` with buildx caching
254+
- **Push**: To Docker Hub (on non-PR events)
255+
- **Tags**: `latest` (main branch), semver (v1.0.0 → 1.0.0, 1.0, 1), SHA
256+
- **Description**: Updates Docker Hub README from repo README.md
257+
258+
Secrets required: `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`
259+
260+
## Client Integration
261+
262+
**Python Client** (client.py):
263+
```python
264+
from client import MoshiTTSClient
265+
266+
client = MoshiTTSClient("http://localhost:8000")
267+
client.health_check()
268+
client.synthesize("Hello world", language="en", output_file="output.wav")
269+
```
270+
271+
**CLI**:
272+
```bash
273+
python client.py -t "Bonjour" -l fr -o test.wav
274+
python client.py --health
275+
python client.py --languages
276+
```
277+
278+
## Testing Strategy
279+
280+
- **Unit tests**: tests/test_basic.py (module imports, API structure)
281+
- **Integration tests**: test_api.sh (bash script testing all endpoints)
282+
- **Pytest config**: pyproject.toml with coverage settings
283+
284+
## Common Tasks
285+
286+
### Adding a New Endpoint
287+
1. Define Pydantic request/response models in app.py
288+
2. Add endpoint function with `@app.get()` or `@app.post()` decorator
289+
3. Use appropriate tags (TTS or System) for documentation
290+
4. Add tests to test_api.sh
291+
292+
### Changing Model Configuration
293+
1. Update Settings class in config.py
294+
2. Add Field with description and default
295+
3. Use in app.py via `settings.your_field`
296+
4. Document in .env.example (if exists) or README
297+
298+
### Debugging Model Loading
299+
Check logs for:
300+
- "✅ Moshi TTS model loaded successfully!" - Real model loaded
301+
- "⚠️ Using dummy model for testing" - Fallback mode (generates sine waves)
302+
- "⚠️ PyTorch not available" - Missing PyTorch
303+
- "⚠️ Moshi library not available" - Missing moshi package
304+
305+
Verify model: `docker exec moshi-tts-api python3 -c "import moshi; print(moshi.__version__)"`
306+
307+
### Performance Optimization
308+
- **GPU**: Real-time or faster generation
309+
- **CPU**: 2-10x real-time depending on CPU
310+
- **Memory**: ~6GB for bf16 model
311+
- **First request**: Slower (model loading and caching)
312+
- **macOS MLX**: 2-5x faster than Docker/CPU on Apple Silicon
313+
314+
## Deployment Notes
315+
316+
- **Docker Hub**: Images at `mmaudet/moshi-tts-api:latest`
317+
- **Model caching**: Always mount `/app/models` volume to avoid re-downloading
318+
- **Security**: Container runs as non-root user (appuser UID 1001)
319+
- **CORS**: Default is `*` (all origins) - restrict in production
320+
- **Health checks**: Built into Docker with 30s interval

0 commit comments

Comments
 (0)