|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +Moshi TTS API is a REST API wrapper around Kyutai Labs' Moshi text-to-speech model. It provides a FastAPI-based service with bilingual support (French and English), 44 voice presets, Swagger documentation, and flexible deployment options (Docker with GPU/CPU, or native macOS with MLX). |
| 8 | + |
| 9 | +**Backend Auto-Detection**: The application automatically detects and uses the best available backend: |
| 10 | +- **MLX** (macOS with Metal GPU) - Preferred on Apple Silicon, 2-5x faster than the CPU/Docker versions |
| 11 | +- **PyTorch** (CUDA/CPU) - Fallback for other platforms |
| 12 | +- **Dummy** - Test mode when no ML backend is available |
| 13 | + |
| 14 | +## Development Commands |
| 15 | + |
| 16 | +### Local Development (Non-Docker) |
| 17 | + |
| 18 | +```bash |
| 19 | +# Install dependencies (without Moshi - for testing API structure) |
| 20 | +pip install fastapi uvicorn pydantic pydantic-settings numpy scipy python-multipart aiofiles |
| 21 | + |
| 22 | +# Install with Moshi TTS (requires PyTorch) |
| 23 | +pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126 # For CUDA 12.6 |
| 24 | +pip install moshi |
| 25 | + |
| 26 | +# Run the API server locally |
| 27 | +python app.py |
| 28 | +# OR with uvicorn directly |
| 29 | +uvicorn app:app --host 0.0.0.0 --port 8000 --reload |
| 30 | + |
| 31 | +# Access API documentation |
| 32 | +# Swagger UI: http://localhost:8000/docs |
| 33 | +# ReDoc: http://localhost:8000/redoc |
| 34 | +``` |
| 35 | + |
| 36 | +### Native macOS Installation (Apple Silicon) |
| 37 | + |
| 38 | +For Mac M1/M2/M3/M4/M5 users, use MLX for optimal Metal GPU acceleration: |
| 39 | + |
| 40 | +**Requirements:** |
| 41 | +- macOS with Apple Silicon (ARM64) |
| 42 | +- Python 3.10, 3.11, or 3.12 (MLX does not support Python 3.13+ yet) |
| 43 | + |
| 44 | +```bash |
| 45 | +# Check Python version |
| 46 | +python3 --version # Must be 3.10.x, 3.11.x, or 3.12.x |
| 47 | + |
| 48 | +# If you have Python 3.13+, install a compatible version: |
| 49 | +brew install pyenv |
| 50 | +pyenv install 3.12 |
| 51 | +pyenv local 3.12 |
| 52 | + |
| 53 | +# Run installation script |
| 54 | +./install-macos-mlx.sh |
| 55 | + |
| 56 | +# Activate environment and start server |
| 57 | +source venv-moshi-mlx/bin/activate |
| 58 | +python3 -m uvicorn app:app --host 0.0.0.0 --port 8000 |
| 59 | +``` |
| 60 | + |
| 61 | +**Why MLX for macOS:** |
| 62 | +- Direct Metal GPU access (Docker cannot access Metal framework) |
| 63 | +- 2-5x faster than CPU/Docker versions |
| 64 | +- Optimized for Apple Silicon |
| 65 | + |
| 66 | +**Python Version Issues:** |
| 67 | +If installation fails with "no matching distributions available for mlx", you're likely using Python 3.13+. The installation script detects this and prints instructions. |
| 68 | + |
| 69 | +### Testing |
| 70 | + |
| 71 | +```bash |
| 72 | +# Run pytest tests |
| 73 | +pytest tests/ -v |
| 74 | + |
| 75 | +# Run tests with coverage |
| 76 | +pytest tests/ --cov=./ --cov-report=xml |
| 77 | + |
| 78 | +# Test the API (requires running server) |
| 79 | +chmod +x test_api.sh |
| 80 | +./test_api.sh |
| 81 | +``` |
| 82 | + |
| 83 | +### Docker Development |
| 84 | + |
| 85 | +```bash |
| 86 | +# Quick build and run (GPU) |
| 87 | +./build-and-run.sh |
| 88 | + |
| 89 | +# Docker Compose with GPU (recommended) |
| 90 | +docker compose up -d --build |
| 91 | + |
| 92 | +# Manual Docker with GPU |
| 93 | +docker build -t moshi-tts-api:latest . |
| 94 | +docker run -d --name moshi-tts-api -p 8000:8000 -v "$(pwd)/models:/app/models" --gpus all moshi-tts-api:latest |
| 95 | + |
| 96 | +# View logs |
| 97 | +docker compose logs -f |
| 98 | +# OR |
| 99 | +docker logs -f moshi-tts-api |
| 100 | + |
| 101 | +# Rebuild after code changes |
| 102 | +docker compose up -d --build |
| 103 | + |
| 104 | +# Stop and remove |
| 105 | +docker compose down |
| 106 | +docker rm -f moshi-tts-api |
| 107 | +``` |
| 108 | + |
| 109 | +## Architecture |
| 110 | + |
| 111 | +### Application Structure |
| 112 | + |
| 113 | +The codebase has a clean, modular structure: |
| 114 | + |
| 115 | +- **app.py**: Main FastAPI application with all endpoints, model loading, and synthesis logic |
| 116 | +- **config.py**: Type-safe configuration management using pydantic-settings |
| 117 | +- **client.py**: Python client for programmatic API access (can be used as CLI or library) |
| 118 | + |
| 119 | +### Key Architecture Patterns |
| 120 | + |
| 121 | +**Configuration Management** (config.py): |
| 122 | +- Uses pydantic-settings for type-safe configuration |
| 123 | +- Supports `.env` file (local dev), environment variables (Docker), and defaults |
| 124 | +- Cached singleton pattern via `@lru_cache()` for performance |
| 125 | +- All settings documented with Field descriptions |
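
The cached-singleton idea can be illustrated without any dependencies. This is a minimal sketch, not the contents of config.py: it swaps in a plain `dataclass` where the project uses pydantic-settings, and the `get_settings()` name and the three fields shown are assumptions chosen for illustration (the variable names `HOST`, `PORT`, and `SAMPLE_RATE` come from the Configuration section).

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Illustrative subset of settings; the real class lives in config.py."""
    host: str
    port: int
    sample_rate: int


@lru_cache()
def get_settings() -> Settings:
    """Build Settings once from the environment; later calls reuse the same object."""
    return Settings(
        host=os.environ.get("HOST", "0.0.0.0"),
        port=int(os.environ.get("PORT", "8000")),
        sample_rate=int(os.environ.get("SAMPLE_RATE", "24000")),
    )
```

One consequence of caching: environment changes made after the first call are not picked up until `get_settings.cache_clear()` is called, which matters mainly in tests.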
| 126 | + |
| 127 | +**FastAPI Application** (app.py): |
| 128 | +- **Backend Auto-Detection**: Tries MLX first, falls back to PyTorch, then dummy mode |
| 129 | +- **Pydantic Models**: TTSRequest, HealthResponse, ErrorResponse with validation |
| 130 | +- **Enums**: LanguageCode (fr/en), AudioFormat (wav/raw), VoicePreset (44 voices) |
| 131 | +- **Global State**: Model loaded at startup, ThreadPoolExecutor for async synthesis |
| 132 | +- **Audio Processing**: 24kHz mono, NumPy → int16 PCM → WAV/RAW |
| 133 | +- **API Versioning**: All endpoints prefixed with `/api/v1/` |
| 134 | +- **CORS Middleware**: Configurable via settings |
| 135 | + |
| 136 | +**Dual Backend Support**: |
| 137 | +The app automatically detects which ML backend is available: |
| 138 | +1. **MLX Backend** (app.py:33-44): Detects `mlx.core` and `moshi_mlx` packages |
| 139 | + - Model loading (app.py:299-359): Uses `hf_get()`, manual weight loading |
| 140 | + - Synthesis (app.py:480-541): Uses `decode_step()` for frame decoding |
| 141 | + - Device: Reports as "mlx (Metal GPU)" |
| 142 | +2. **PyTorch Backend** (app.py:45-50): Detects `torch` package |
| 143 | + - Model loading (app.py:362-426): Uses `CheckpointInfo.from_hf_repo()` |
| 144 | + - Synthesis (app.py:544-604): Uses `mimi.streaming()` context manager |
| 145 | + - Device: Auto-detects CUDA or CPU |
| 146 | + |
| 147 | +**Threading Model**: |
| 148 | +- CPU-bound synthesis runs in ThreadPoolExecutor (2 workers default) |
| 149 | +- Uses the event loop's `loop.run_in_executor()` to avoid blocking the FastAPI event loop |
| 150 | +- Model lives in global state, shared across requests |
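
A minimal sketch of this threading pattern, using only the standard library. The function `synthesize_blocking` is a hypothetical stand-in for the real synthesis call, and the 0.05 s sleep merely simulates inference time; only the two-worker pool size is taken from the description above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Two workers, matching the default mentioned above.
executor = ThreadPoolExecutor(max_workers=2)


def synthesize_blocking(text: str) -> bytes:
    """Hypothetical stand-in for the real, blocking synthesis call."""
    time.sleep(0.05)  # simulate slow model inference
    return text.encode("utf-8")


async def synthesize(text: str) -> bytes:
    """Run the blocking call in the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, synthesize_blocking, text)


async def main() -> list:
    # Both calls are submitted at once and run on separate worker threads.
    return await asyncio.gather(synthesize("Bonjour"), synthesize("Hello"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the model is shared global state, a pool larger than the model can safely serve concurrently would need extra synchronization; the sketch does not address that.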
| 151 | + |
| 152 | +**Model Integration** (app.py:262-343): |
| 153 | +- Attempts to load real Moshi TTS model from `moshi.models.tts` |
| 154 | +- Device selection: CUDA auto-detected or forced via `MODEL_DEVICE` env var |
| 155 | +- Dtype: Auto (bfloat16 for CUDA, float32 for CPU) or forced via config |
| 156 | +- CFG Distillation: Handles distilled models by setting `cfg_coef_conditioning` |
| 157 | +- Fallback: Uses dummy sine wave generator if Moshi unavailable (for testing) |
| 158 | + |
| 159 | +**Synthesis Flow** (app.py:370-451): |
| 160 | +- Text → `prepare_script()` → voice selection → `make_condition_attributes()` |
| 161 | +- Generate frames → decode with MIMI → trim to `end_steps` → convert to NumPy |
| 162 | +- Handles both multi-speaker (voices in attributes) and single-speaker (voices as prefixes) models |
| 163 | + |
| 164 | +### API Endpoints |
| 165 | + |
| 166 | +All endpoints are under `/api/v1/` for versioning: |
| 167 | + |
| 168 | +- `GET /` - API info and endpoint list |
| 169 | +- `GET /api/v1/health` - Health check with model status, device info |
| 170 | +- `GET /api/v1/languages` - List supported languages (fr, en) |
| 171 | +- `GET /api/v1/voices` - List all 44 voice presets with descriptions |
| 172 | +- `POST /api/v1/tts` - Main TTS endpoint (JSON → audio file) |
| 173 | +- `POST /api/v1/tts/file` - TTS from uploaded text file |
| 174 | + |
| 175 | +### Voice Presets |
| 176 | + |
| 177 | +44 voices from 4 collections (see VoicePreset enum in app.py:127-182): |
| 178 | +- **VCTK** (10 voices): British English speakers (p225-p234) |
| 179 | +- **CML-TTS** (10 voices): High-quality French speakers |
| 180 | +- **Expresso** (9 voices): English with emotions (happy, angry, calm, confused) and styles (whisper, fast, enunciated) |
| 181 | +- **EARS** (14 voices): Diverse English speakers (subset of 50) |
| 182 | + |
| 183 | +Voice selection: Pass `"voice": "vctk/p226_023.wav"` or use enum name `"voice": "vctk_p226"` |
| 184 | + |
| 185 | +### Docker Architecture |
| 186 | + |
| 187 | +**GPU Image** (Dockerfile): |
| 188 | +- Base: `nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04` |
| 189 | +- Python 3.12, system deps (git, libsndfile1, build tools) |
| 190 | +- Uses `uv` package manager (10-100x faster than pip) |
| 191 | +- Installs PyTorch + moshi together to avoid duplicate downloads |
| 192 | +- Runs as non-root user `appuser` (UID 1001) for security |
| 193 | +- Health check on `/api/v1/health` |
| 194 | +- Model cache at `/app/models` (volume mount) |
| 195 | + |
| 196 | +**Build platform**: GitHub Actions workflow builds for `linux/amd64` (GPU) only |
| 197 | + |
| 198 | +### Configuration |
| 199 | + |
| 200 | +Environment variables (see config.py for all options): |
| 201 | + |
| 202 | +```bash |
| 203 | +# Server |
| 204 | +HOST=0.0.0.0 |
| 205 | +PORT=8000 |
| 206 | +LOG_LEVEL=info |
| 207 | +WORKERS=1 |
| 208 | + |
| 209 | +# Model |
| 210 | +DEFAULT_TTS_REPO=kyutai/tts-1.6b-en_fr |
| 211 | +DEFAULT_VOICE_REPO=kyutai/tts-voices |
| 212 | +SAMPLE_RATE=24000 |
| 213 | +MODEL_DEVICE=cuda # or cpu, auto if not set |
| 214 | +MODEL_DTYPE=auto # auto, bfloat16, or float32 |
| 215 | +MODEL_N_Q=32 |
| 216 | +MODEL_TEMP=0.6 |
| 217 | +MODEL_CFG_COEF=2.0 |
| 218 | + |
| 219 | +# CORS |
| 220 | +CORS_ORIGINS=* |
| 221 | +CORS_CREDENTIALS=true |
| 222 | +``` |
| 223 | + |
| 224 | +Set via `.env` file (local), Docker environment, or docker-compose.yml. |
| 225 | + |
| 226 | +## Important Implementation Details |
| 227 | + |
| 228 | +### Audio Processing |
| 229 | +- Sample rate: **24kHz** (fixed, do not change without model retraining) |
| 230 | +- Format: Mono channel, 16-bit signed integer PCM |
| 231 | +- WAV: Standard RIFF WAVE with headers |
| 232 | +- RAW: PCM only (convert: `ffmpeg -f s16le -ar 24000 -ac 1 -i input.raw output.wav`) |
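
The float-to-PCM-to-WAV path can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: app.py works on NumPy arrays, whereas this sketch uses plain lists and `struct`, and the helper names (`sine_samples`, `to_pcm16`, `to_wav`) are hypothetical. The sine generator only mimics the spirit of the dummy fallback.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # Hz, as stated above


def sine_samples(freq_hz: float, seconds: float) -> list:
    """Float samples in [-1.0, 1.0] for a pure tone."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def to_pcm16(samples) -> bytes:
    """Clip to [-1, 1], scale to 16-bit signed ints, pack little-endian (RAW output)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def to_wav(pcm: bytes) -> bytes:
    """Wrap mono 16-bit PCM in a RIFF WAVE container (WAV output)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    pcm = to_pcm16(sine_samples(440.0, 0.5))
    with open("tone.wav", "wb") as f:
        f.write(to_wav(pcm))
```

The bytes returned by `to_pcm16` correspond to the RAW format, which is why the `ffmpeg` command above needs `-f s16le -ar 24000 -ac 1` to interpret them.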
| 233 | + |
| 234 | +### Input Validation |
| 235 | +- Text length: 1-5000 characters (configurable via `MAX_TEXT_LENGTH`) |
| 236 | +- Whitespace normalized automatically (app.py:209-216) |
| 237 | +- Languages: "fr" or "en" |
| 238 | +- File upload: Must be UTF-8 |
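
As a rough illustration of the text checks, the sketch below collapses whitespace and enforces the 1-5000 character bound. The function name `normalize_text` is hypothetical, and whether app.py checks the length before or after normalization is an assumption here (this sketch checks after).

```python
MAX_TEXT_LENGTH = 5000  # default upper bound mentioned above


def normalize_text(text: str, max_length: int = MAX_TEXT_LENGTH) -> str:
    """Collapse runs of whitespace to single spaces, then validate the length."""
    cleaned = " ".join(text.split())
    if not 1 <= len(cleaned) <= max_length:
        raise ValueError(
            f"text must be 1-{max_length} characters after normalization, "
            f"got {len(cleaned)}"
        )
    return cleaned


if __name__ == "__main__":
    print(normalize_text("  Bonjour \n\t le   monde "))
```

Raising `ValueError` fits the error handling described below, where a dedicated handler turns it into an error response.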
| 239 | + |
| 240 | +### Error Handling |
| 241 | +- Custom HTTPException and ValueError handlers (app.py:761-783) |
| 242 | +- Model availability checks before synthesis |
| 243 | +- Graceful fallback to dummy model (generates sine waves for testing) |
| 244 | + |
| 245 | +### Startup/Shutdown |
| 246 | +- `@app.on_event("startup")`: Loads model, handles errors gracefully |
| 247 | +- `@app.on_event("shutdown")`: Cleans up model, empties CUDA cache, shuts down executor |
| 248 | + |
| 249 | +## CI/CD |
| 250 | + |
| 251 | +GitHub Actions workflow (`.github/workflows/docker-publish.yml`): |
| 252 | +- **Triggers**: Push to main/master, PRs, tags (v*.*.*) |
| 253 | +- **Build**: Docker image for `linux/amd64` with buildx caching |
| 254 | +- **Push**: To Docker Hub (on non-PR events) |
| 255 | +- **Tags**: `latest` (main branch), semver (v1.0.0 → 1.0.0, 1.0, 1), SHA |
| 256 | +- **Description**: Updates Docker Hub README from repo README.md |
| 257 | + |
| 258 | +Secrets required: `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN` |
| 259 | + |
| 260 | +## Client Integration |
| 261 | + |
| 262 | +**Python Client** (client.py): |
| 263 | +```python |
| 264 | +from client import MoshiTTSClient |
| 265 | + |
| 266 | +client = MoshiTTSClient("http://localhost:8000") |
| 267 | +client.health_check() |
| 268 | +client.synthesize("Hello world", language="en", output_file="output.wav") |
| 269 | +``` |
| 270 | + |
| 271 | +**CLI**: |
| 272 | +```bash |
| 273 | +python client.py -t "Bonjour" -l fr -o test.wav |
| 274 | +python client.py --health |
| 275 | +python client.py --languages |
| 276 | +``` |
| 277 | + |
| 278 | +## Testing Strategy |
| 279 | + |
| 280 | +- **Unit tests**: tests/test_basic.py (module imports, API structure) |
| 281 | +- **Integration tests**: test_api.sh (bash script testing all endpoints) |
| 282 | +- **Pytest config**: pyproject.toml with coverage settings |
| 283 | + |
| 284 | +## Common Tasks |
| 285 | + |
| 286 | +### Adding a New Endpoint |
| 287 | +1. Define Pydantic request/response models in app.py |
| 288 | +2. Add endpoint function with `@app.get()` or `@app.post()` decorator |
| 289 | +3. Use appropriate tags (TTS or System) for documentation |
| 290 | +4. Add tests to test_api.sh |
| 291 | + |
| 292 | +### Changing Model Configuration |
| 293 | +1. Update Settings class in config.py |
| 294 | +2. Add Field with description and default |
| 295 | +3. Use in app.py via `settings.your_field` |
| 296 | +4. Document in .env.example (if it exists) or README |
| 297 | + |
| 298 | +### Debugging Model Loading |
| 299 | +Check logs for: |
| 300 | +- "✅ Moshi TTS model loaded successfully!" - Real model loaded |
| 301 | +- "⚠️ Using dummy model for testing" - Fallback mode (generates sine waves) |
| 302 | +- "⚠️ PyTorch not available" - Missing PyTorch |
| 303 | +- "⚠️ Moshi library not available" - Missing moshi package |
| 304 | + |
| 305 | +Verify model: `docker exec moshi-tts-api python3 -c "import moshi; print(moshi.__version__)"` |
| 306 | + |
| 307 | +### Performance Optimization |
| 308 | +- **GPU**: Real-time or faster generation |
| 309 | +- **CPU**: Roughly 2-10x slower than real time, depending on the CPU |
| 310 | +- **Memory**: ~6GB for bf16 model |
| 311 | +- **First request**: Slower (model loading and caching) |
| 312 | +- **macOS MLX**: 2-5x faster than Docker/CPU on Apple Silicon |
| 313 | + |
| 314 | +## Deployment Notes |
| 315 | + |
| 316 | +- **Docker Hub**: Images at `mmaudet/moshi-tts-api:latest` |
| 317 | +- **Model caching**: Always mount `/app/models` volume to avoid re-downloading |
| 318 | +- **Security**: Container runs as non-root user (appuser UID 1001) |
| 319 | +- **CORS**: Default is `*` (all origins) - restrict in production |
| 320 | +- **Health checks**: Built into Docker with 30s interval |