
Commit b44f121

Author: yuekaiz
Commit message: update readme
Parent: dc196df

File tree: 2 files changed, +57 −62 lines


runtime/triton_trtllm/README.md

Lines changed: 55 additions & 60 deletions
## Best Practices for Serving CosyVoice with NVIDIA Triton Inference Server

### Quick Start
Launch the service directly with Docker Compose:
```sh
docker compose up
```
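
To keep the service running in the background and follow its logs, the usual Compose flags apply (`tts` is the service name defined in `docker-compose.yml` below):

```sh
# Start detached, then tail the logs of the tts service
docker compose up -d
docker compose logs -f tts
```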

### Build the Docker Image
Build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
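
Once inside the container, it is worth confirming that the GPU is visible before building any engines:

```sh
# Should list the GPU(s) passed through by --gpus all
nvidia-smi
```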

### Understanding `run.sh`
The `run.sh` script orchestrates the entire workflow through numbered stages.

Run a subset of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>` – stage to start from (0-5).
- `<stop_stage>` – stage to stop after (0-5).
- `[service_type]` – optional; `streaming` or `offline`, used by the stage-5 benchmark client.

Stages:
- **Stage 0**: Download the CosyVoice 2 0.5B model from HuggingFace.
- **Stage 1**: Convert the HuggingFace checkpoint to TensorRT-LLM format and build the TensorRT engines.
- **Stage 2**: Create the Triton model repository and configure the model files (the layout differs depending on whether `Decoupled=True` or `Decoupled=False` is used later).
- **Stage 3**: Launch the Triton Inference Server.
- **Stage 4**: Run the single-utterance HTTP client.
- **Stage 5**: Run the gRPC benchmark client.

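For example, to regenerate only the model repository (stage 2) without re-downloading the model or rebuilding the engines:

```sh
# Re-run stage 2 alone
bash run.sh 2 2
```
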
### Export Models to TensorRT-LLM and Launch the Server
Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# Runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
*Note: Stage 2 prepares the model repository differently depending on whether you intend to run with `Decoupled=False` or `Decoupled=True`. Rerun stage 2 if you switch the service type.*
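
Once stage 3 is running, Triton's standard HTTP health endpoint (served on port 8000, which `docker-compose.yml` also exposes) can confirm that all models are loaded:

```sh
# Returns HTTP 200 once the server and all models are ready
curl -v localhost:8000/v2/health/ready
```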

### Single-Utterance HTTP Client
Send a single HTTP inference request:
```sh
bash run.sh 4 4
```

### Benchmark with a Dataset
Benchmark the running Triton server. Optionally pass `streaming` or `offline` as the third argument:
```sh
bash run.sh 5 5

# You can also customize parameters such as num_task and the dataset split directly:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```
> [!TIP]
> Only offline CosyVoice TTS is currently supported. Setting the client to `streaming` simply enables NVIDIA Triton's decoupled mode so that responses are returned as soon as they are ready.

### Benchmark Results
Decoding on a single L20 GPU with 26 prompt_audio/target_text [pairs](https://huggingface.co/datasets/yuekai/seed_tts) (≈221 s of audio in total):

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|------|------|-------------|------------------|------------------|-----|
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 758.04 | 615.79 | 0.0891 |
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 1025.93 | 901.68 | 0.0657 |
| Decoupled=False | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1914.13 | 1783.58 | 0.0610 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | [Commit](https://github.com/SparkAudio/cosyvoice/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1790.91 | 1668.63 | 0.0604 |
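
RTF is the real-time factor: compute time divided by the duration of the audio produced, so lower is better. At RTF ≈ 0.06, synthesizing the benchmark's ≈221 s of audio takes roughly 13 s.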
### OpenAI-Compatible Server
To launch an OpenAI-compatible service, run:
```sh
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt

# After the Triton service is up, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000

# Test with curl
bash test/test_cosyvoice.sh
```
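
Once the bridge is up, requests follow the OpenAI speech API shape. A minimal sketch, assuming the bridge mirrors OpenAI's `/v1/audio/speech` route and accepts a reference voice name as `voice` (both assumptions; `test/test_cosyvoice.sh` shows the authoritative request):

```sh
# Assumed request shape; check test/test_cosyvoice.sh for the exact fields
curl http://localhost:10086/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "cosyvoice", "input": "Hello, this is a test.", "voice": "some_reference_voice"}' \
  --output output.wav
```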

### Acknowledgements
This section originates from the NVIDIA CISI project. We also provide other multimodal resources; see [mair-hub](https://github.com/nvidia-china-sae/mair-hub) for details.

runtime/triton_trtllm/docker-compose.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,6 +1,6 @@
 services:
   tts:
-    image: soar97/triton-spark-tts:25.02
+    image: soar97/triton-cosyvoice:25.06
     shm_size: '1gb'
     ports:
       - "8000:8000"
@@ -17,4 +17,4 @@ services:
               device_ids: ['0']
               capabilities: [gpu]
     command: >
-      /bin/bash -c "rm -rf Spark-TTS && git clone https://github.com/SparkAudio/Spark-TTS.git && cd Spark-TTS/runtime/triton_trtllm && bash run.sh 0 3"
+      /bin/bash -c "pip install modelscope && cd /workspace && git clone https://github.com/FunAudioLLM/CosyVoice.git && cd CosyVoice && git submodule update --init --recursive && cd runtime/triton_trtllm && bash run.sh 0 3"
```
