## Best Practices for Serving CosyVoice with NVIDIA Triton Inference Server

### Quick Start
Launch the service directly with Docker Compose:
```sh
docker compose up
```

### Build the Docker Image
Build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```

### Understanding `run.sh`
The `run.sh` script orchestrates the entire workflow through numbered stages.

Run a subset of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>` – stage to start from (0-5).
- `<stop_stage>` – stage to stop after (0-5).
- `[service_type]` – optional; `streaming` or `offline`, passed to the benchmark client in stage 5.

Stages:
- **Stage 0** – Download the CosyVoice2-0.5B model from HuggingFace.
- **Stage 1** – Convert the HuggingFace checkpoint to TensorRT-LLM format and build TensorRT engines.
- **Stage 2** – Create the Triton model repository and configure the model files (adjusted depending on whether `Decoupled=True/False` will be used later; see the example below).
- **Stage 3** – Launch the Triton Inference Server.
- **Stage 4** – Run the single-utterance HTTP client.
- **Stage 5** – Run the gRPC benchmark client.

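For example, after switching between `Decoupled=True` and `Decoupled=False` you only need to rebuild the model repository and relaunch the server, which maps to stages 2 and 3:
```sh
# Re-run only stage 2 (model repository) and stage 3 (server launch)
bash run.sh 2 3
```
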
### Export Models to TensorRT-LLM and Launch the Server
Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# Runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
*Note: Stage 2 prepares the model repository differently depending on whether you intend to run with `Decoupled=False` or `Decoupled=True`. Rerun stage 2 if you switch the service type.*

### Single-Utterance HTTP Client
Send a single HTTP inference request:
```sh
bash run.sh 4 4
```
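
Before sending the request, you can confirm that the stage-3 Triton server is ready via its standard health endpoint (Triton serves HTTP on port 8000 by default). The client invocation below is only a sketch: the flag names and the `cosyvoice2` model name are assumptions, so check `client_http.py` itself for the exact interface.
```sh
# Check that the Triton server launched in stage 3 is ready (Triton's default HTTP port is 8000).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Hypothetical direct call of the stage-4 client with your own prompt and text;
# the flags and model name are assumptions, check client_http.py for the real ones.
python3 client_http.py \
    --reference-audio ./prompt.wav \
    --reference-text "Transcript of the prompt audio." \
    --target-text "The text you want to synthesise." \
    --model-name cosyvoice2
```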

### Benchmark with a Dataset
Benchmark the running Triton server. Pass either `streaming` or `offline` as the third argument.
```sh
# e.g. offline mode
bash run.sh 5 5 offline

# You can also customise parameters such as the number of tasks and the dataset split directly:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```
> [!TIP]
> Only offline CosyVoice TTS is currently supported. Setting the client to `streaming` simply enables NVIDIA Triton's decoupled mode so that responses are returned as soon as they are ready.

### Benchmark Results
Decoding on a single L20 GPU with 26 prompt_audio/target_text [pairs](https://huggingface.co/datasets/yuekai/seed_tts) (≈221 s of audio):

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|------|------|-------------|------------------|------------------|-----|
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 758.04 | 615.79 | 0.0891 |
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 1025.93 | 901.68 | 0.0657 |
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1914.13 | 1783.58 | 0.0610 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1790.91 | 1668.63 | 0.0604 |

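RTF (real-time factor) here denotes processing time divided by the duration of the generated audio, so lower is better; for example, an RTF of 0.06 corresponds to roughly 0.06 × 221 ≈ 13 s of processing for the ≈221 s test set.
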
### OpenAI-Compatible Server
To launch an OpenAI-compatible service, run:
```sh
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt
# After the Triton service is up, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000
# Test with curl
bash test/test_cosyvoice.sh
```
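
You can also exercise the bridge the same way you would the OpenAI speech API. The request below is a sketch only: the `/v1/audio/speech` route, the JSON fields, and the `voice` value are assumptions, so consult the Triton-OpenAI-Speech repository for the actual interface.
```sh
# Hypothetical request against the bridge started above (port 10086 from the command above);
# the route and JSON fields follow the OpenAI speech API convention and are assumptions here.
curl http://localhost:10086/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "cosyvoice2", "input": "Hello from the OpenAI-compatible endpoint.", "voice": "your_reference_voice"}' \
  --output output.wav
```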

### Acknowledgements
This section originates from the NVIDIA CISI project. We also provide other multimodal resources; see [mair-hub](https://github.com/nvidia-china-sae/mair-hub) for details.