
Commit 6e01309

Merge pull request #1598 from yuekaizhang/streaming
[Runtime] StepAudio2 Streaming DiT Token2Wav Integration

2 parents 4d60ff6 + 1fc8435, commit 6e01309

22 files changed: +1991 additions, -153 deletions

examples/grpo/cosyvoice2/README.md

Lines changed: 2 additions & 2 deletions

@@ -36,7 +36,7 @@ Stage `0` converts raw JSONL files into the parquet format expected by veRL:
 ```bash
 bash run.sh 0 0
 ```
-Create two JSONL files—`train.jsonl` and `test.jsonl`.
+Create two JSONL files—`train.jsonl` and `test.jsonl`.
 The script will then generate two Parquet files:
 
 ```
@@ -111,7 +111,7 @@ bash run.sh 5 5
 
 The script converts the Hugging Face checkpoint back into the format expected by the CosyVoice repository.
 > [!TIP]
-> However, we observed a slight accuracy drop when using the RL-trained model after conversion, compared with the Hugging Face format.
+> However, we observed a slight accuracy drop when using the RL-trained model after conversion, compared with the Hugging Face format.
 
 ## Results

examples/grpo/cosyvoice2/infer_dataset.py

Lines changed: 1 addition & 1 deletion

@@ -53,7 +53,7 @@
 pass
 
 
-TEMPLATE = "{% for message in messages %}{%- if message['role'] == 'user' %}{{- '<|im_start|>' + message['role'] + '\n' + 'Convert the text to speech: ' + message['content'] + '<|im_end|>\n'}}{%- elif message['role'] == 'assistant' %}{{- '<|im_start|>' + message['role'] + '\n' + '<|SPEECH_GENERATION_START|>' + message['content']}}{%- endif %}{%- endfor %}"
+TEMPLATE = "{% for message in messages %}{%- if message['role'] == 'user' %}{{- '<|im_start|>' + message['role'] + '\n' + 'Convert the text to speech: ' + message['content'] + '<|im_end|>\n'}}{%- elif message['role'] == 'assistant' %}{{- '<|im_start|>' + message['role'] + '\n' + '<|SPEECH_GENERATION_START|>' + message['content']}}{%- endif %}{%- endfor %}" # noqa: E501
 
 
 def audio_decode_cosyvoice2(
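For reference, the `TEMPLATE` above is a Jinja chat template. A minimal sketch of how it expands a user/assistant message pair, assuming `jinja2` is installed; the message contents below are made-up placeholders:

```python
from jinja2 import Template

# Same template string as in infer_dataset.py, split across lines for readability.
TEMPLATE = (
    "{% for message in messages %}"
    "{%- if message['role'] == 'user' %}"
    "{{- '<|im_start|>' + message['role'] + '\n' + 'Convert the text to speech: '"
    " + message['content'] + '<|im_end|>\n'}}"
    "{%- elif message['role'] == 'assistant' %}"
    "{{- '<|im_start|>' + message['role'] + '\n' + '<|SPEECH_GENERATION_START|>' + message['content']}}"
    "{%- endif %}"
    "{%- endfor %}"
)

# Hypothetical messages; a real sample would carry text plus discrete speech tokens.
messages = [
    {"role": "user", "content": "Hello world."},
    {"role": "assistant", "content": "<|s_123|><|s_456|>"},
]

print(Template(TEMPLATE).render(messages=messages))
# Roughly:
# <|im_start|>user
# Convert the text to speech: Hello world.<|im_end|>
# <|im_start|>assistant
# <|SPEECH_GENERATION_START|><|s_123|><|s_456|>
```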

examples/grpo/cosyvoice2/pretrained_to_huggingface.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-#!/usr/bin/env python3
-
 # SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #

examples/grpo/cosyvoice2/run.sh

Lines changed: 3 additions & 3 deletions

@@ -33,7 +33,7 @@ fi
 
 if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
 log "stage -1: download official CosyVoice2-0.5B LLM model and convert to huggingface compatible checkpoint"
-modelscope download --model iic/CosyVoice2-0.5B --local_dir $model_scope_model_path
+modelscope download --model iic/CosyVoice2-0.5B --local_dir $model_scope_model_path
 python3 pretrained_to_huggingface.py \
 --pretrained-cosyvoice2-path $model_scope_model_path \
 --save-path $sft_model_path
@@ -61,7 +61,7 @@ fi
 if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
 log "stage 1: start token2wav asr server for reward function"
 python3 token2wav_asr_server.py --number-of-devices 8
-fi
+fi
 
 exp_name=official_llm_aishell3_grpo
 if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
@@ -125,7 +125,7 @@ if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
 --backend fsdp \
 --local_dir $llm_path/actor \
 --target_dir $llm_path/merged_hf_model || exit 1
-fi
+fi
 
 if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
 log "stage 4: Test the model"

examples/grpo/cosyvoice2/scripts/offline-decode-files.py

Lines changed: 1 addition & 3 deletions

@@ -1,5 +1,3 @@
-#!/usr/bin/env python3
-#
 # Copyright (c) 2023 by manyeyes
 # Copyright (c) 2023 Xiaomi Corporation
 
@@ -195,7 +193,7 @@ def write_error_stats(
 hyp = list("".join(hyp))
 results[i] = (cut_id, ref, hyp)
 
-for cut_id, ref, hyp in results:
+for _cut_id, ref, hyp in results:
 ali = kaldialign.align(ref, hyp, ERR, sclite_mode=sclite_mode)
 for ref_word, hyp_word in ali:
 if ref_word == ERR:

examples/grpo/cosyvoice2/token2wav_asr_server.py

Lines changed: 1 addition & 1 deletion

@@ -295,7 +295,7 @@ def main():
 metrics_port=8002,
 )
 
-device_ids = [i for i in range(args.number_of_devices)]
+device_ids = list(range(args.number_of_devices))
 device_ids = device_ids * args.number_of_instances_per_device
 
 with Triton(config=triton_config) as triton:
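The switch to `list(range(...))` is behavior-preserving; the subsequent multiplication simply repeats the device ids so that several server instances share each GPU. A tiny illustrative check with made-up values:

```python
number_of_devices = 2               # hypothetical: two visible GPUs
number_of_instances_per_device = 2  # hypothetical: two server instances per GPU

device_ids = list(range(number_of_devices))                # [0, 1]
device_ids = device_ids * number_of_instances_per_device   # [0, 1, 0, 1]
print(device_ids)  # four instances in total, two pinned to each GPU
```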
Lines changed: 141 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,141 @@
1+
## Accelerating CosyVoice with DiT-based Token2Wav, NVIDIA Triton Inference Server and TensorRT-LLM
2+
3+
Contributed by Yuekai Zhang (NVIDIA).
4+
5+
This document describes how to accelerate CosyVoice with a DiT-based Token2Wav module from Step-Audio2, using NVIDIA Triton Inference Server and TensorRT-LLM.
6+
7+
### Quick Start
8+
9+
Launch the service directly with Docker Compose:
10+
```sh
11+
docker compose -f docker-compose.dit.yml up
12+
```
13+
14+
### Build the Docker Image
15+
16+
To build the image from scratch:
17+
```sh
18+
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
19+
```
20+
21+
### Run a Docker Container
22+
```sh
23+
your_mount_dir=/mnt:/mnt
24+
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
25+
```

### Understanding `run_stepaudio2_dit_token2wav.sh`

The `run_stepaudio2_dit_token2wav.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:
```sh
bash run_stepaudio2_dit_token2wav.sh <start_stage> <stop_stage>
```
- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.

**Stages:**

- **Stage -1**: Clones the `Step-Audio2` and `CosyVoice` repositories.
- **Stage 0**: Downloads the `cosyvoice2_llm`, `CosyVoice2-0.5B`, and `Step-Audio-2-mini` models.
- **Stage 1**: Converts the Hugging Face checkpoint of the LLM to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository, including configurations for `cosyvoice2_dit` and `token2wav_dit`.
- **Stage 3**: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice2 LLM.
- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark.
- **Stage 6**: Runs a standalone inference script for the Step-Audio2-mini DiT Token2Wav model.
- **Stage 7**: Launches servers in a disaggregated setup, with the LLM on GPU 0 and Token2Wav servers on GPUs 1-3.
- **Stage 8**: Runs the benchmark client for the disaggregated server configuration.

### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run_stepaudio2_dit_token2wav.sh 0 3
```

### Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:
```sh
bash run_stepaudio2_dit_token2wav.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```
The following results were obtained by decoding on a single L20 GPU with the `yuekai/seed_tts_cosy2` dataset.
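RTF below denotes the real-time factor. Assuming the usual TTS convention (processing time divided by the duration of the generated audio), a value below 1.0 means synthesis runs faster than real time; a minimal sketch:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / generated-audio duration (lower is better)."""
    return processing_seconds / audio_seconds

# Illustrative numbers only, not taken from the tables below:
print(real_time_factor(processing_seconds=0.5, audio_seconds=5.0))  # 0.1 -> 10x faster than real time
```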

#### Total Request Latency

| Concurrent Tasks | RTF | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| ---------------- | ------ | ------------ | -------------------- | -------------------- | -------------------- | -------------------- |
| 1 | 0.1228 | 833.66 | 779.98 | 1297.05 | 1555.97 | 1653.02 |
| 2 | 0.0901 | 1166.23 | 1124.69 | 1762.76 | 1900.64 | 2204.14 |
| 4 | 0.0741 | 1849.30 | 1759.42 | 2624.50 | 2822.20 | 3128.42 |
| 6 | 0.0774 | 2936.13 | 3054.64 | 3849.60 | 3900.49 | 4245.79 |
| 8 | 0.0691 | 3408.56 | 3434.98 | 4547.13 | 5047.76 | 5346.53 |
| 10 | 0.0707 | 4306.56 | 4343.44 | 5769.64 | 5876.09 | 5939.79 |

#### First Chunk Latency

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| ---------------- | ------------ | -------------------- | -------------------- | -------------------- | -------------------- |
| 1 | 197.50 | 196.13 | 214.65 | 215.96 | 229.21 |
| 2 | 281.15 | 278.20 | 345.18 | 361.79 | 395.97 |
| 4 | 510.65 | 530.50 | 630.13 | 642.44 | 666.65 |
| 6 | 921.54 | 918.86 | 1079.97 | 1265.22 | 1524.41 |
| 8 | 1019.95 | 1085.26 | 1371.05 | 1402.24 | 1410.66 |
| 10 | 1214.98 | 1293.54 | 1575.36 | 1654.51 | 2161.76 |

### Benchmark with Offline Inference Mode

For the offline inference benchmark, run stage 5:
```sh
bash run_stepaudio2_dit_token2wav.sh 5 5
```

The following results were obtained by decoding on a single L20 GPU with the `yuekai/seed_tts_cosy2` dataset.

#### Offline TTS (CosyVoice2 0.5B LLM + StepAudio2 DiT Token2Wav)

| Backend | Batch Size | llm_time_seconds | total_time_seconds | RTF |
|---------|------------|------------------|--------------------|--------|
| TRTLLM  | 16         | 2.01             | 5.03               | 0.0292 |

### Disaggregated Server

When the LLM and token2wav components are deployed on the same GPU, they compete for resources. To optimize performance, we use a disaggregated setup: the LLM is deployed on one dedicated L20 GPU, taking advantage of in-flight batching for inference, while the token2wav module is deployed on separate, dedicated GPUs.

The table below shows the first-chunk latency results for this configuration. In our tests, we deploy two token2wav instances on each dedicated token2wav GPU.

| token2wav_num_gpu | concurrent_task_per_instance | concurrent_tasks_per_gpu | avg (ms) | p50 (ms) | p90 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|
| 1 | 1 | 1.00 | 218.53 | 217.86 | 254.07 | 296.49 |
| 2 | 1 | 1.33 | 218.82 | 219.21 | 256.62 | 303.13 |
| 3 | 1 | 1.50 | 229.08 | 223.27 | 302.13 | 324.41 |
| 4 | 1 | 1.60 | 203.87 | 198.23 | 254.92 | 279.31 |
| 1 | 2 | 2.00 | 293.46 | 280.53 | 370.81 | 407.40 |
| 2 | 2 | 2.67 | 263.38 | 236.84 | 350.82 | 397.39 |
| 3 | 2 | 3.00 | 308.09 | 275.48 | 385.22 | 521.45 |
| 4 | 2 | 3.20 | 271.85 | 253.25 | 359.03 | 387.91 |
| 1 | 3 | 3.00 | 389.15 | 373.01 | 469.22 | 542.89 |
| 2 | 3 | 4.00 | 403.48 | 394.80 | 481.24 | 507.75 |
| 3 | 3 | 4.50 | 406.33 | 391.28 | 495.43 | 571.29 |
| 4 | 3 | 4.80 | 436.72 | 383.81 | 638.44 | 879.23 |
| 1 | 4 | 4.00 | 520.12 | 493.98 | 610.38 | 739.85 |
| 2 | 4 | 5.33 | 494.60 | 490.50 | 605.93 | 708.09 |
| 3 | 4 | 6.00 | 538.23 | 508.33 | 687.62 | 736.96 |
| 4 | 4 | 6.40 | 579.68 | 546.20 | 721.53 | 958.04 |
| 1 | 5 | 5.00 | 635.02 | 623.30 | 786.85 | 819.84 |
| 2 | 5 | 6.67 | 598.23 | 617.09 | 741.00 | 788.96 |
| 3 | 5 | 7.50 | 644.78 | 684.40 | 786.45 | 1009.45 |
| 4 | 5 | 8.00 | 733.92 | 642.26 | 1024.79 | 1281.55 |
| 1 | 6 | 6.00 | 715.38 | 745.68 | 887.04 | 906.68 |
| 2 | 6 | 8.00 | 748.31 | 753.94 | 873.59 | 1007.14 |
| 3 | 6 | 9.00 | 900.27 | 822.28 | 1431.14 | 1800.23 |
| 4 | 6 | 9.60 | 857.54 | 820.33 | 1150.30 | 1298.53 |

The `concurrent_tasks_per_gpu` column is calculated as:
`concurrent_tasks_per_gpu = concurrent_task_per_instance * num_token2wav_instances_per_gpu (2) * token2wav_num_gpu / (token2wav_num_gpu + llm_num_gpu (1))`
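As a sanity check, a small Python sketch that reproduces the `concurrent_tasks_per_gpu` column from this formula (the constants 2 and 1 are the two token2wav instances per GPU and the single LLM GPU used in these tests):

```python
def concurrent_tasks_per_gpu(token2wav_num_gpu: int,
                             concurrent_task_per_instance: int,
                             instances_per_token2wav_gpu: int = 2,
                             llm_num_gpu: int = 1) -> float:
    """Average number of concurrent tasks per GPU across the whole deployment."""
    total_tasks = concurrent_task_per_instance * instances_per_token2wav_gpu * token2wav_num_gpu
    total_gpus = token2wav_num_gpu + llm_num_gpu
    return total_tasks / total_gpus

# Matches the table rows above, e.g.:
print(concurrent_tasks_per_gpu(2, 3))  # 4.0
print(concurrent_tasks_per_gpu(4, 6))  # 9.6
```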

### Acknowledgements

This work originates from the NVIDIA CISI project. For more multimodal resources, please see [mair-hub](https://github.com/nvidia-china-sae/mair-hub).
