- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark test.
- **Stage 6**: Runs a standalone inference script for the Step-Audio2-mini DiT Token2Wav model.
- **Stage 7**: Launches servers in a disaggregated setup, with the LLM on GPU 0 and Token2Wav servers on GPUs 1-3.
- **Stage 8**: Runs the benchmark client for the disaggregated server configuration.
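
Each stage can typically be run on its own by passing a start and stop stage to the driver script. The sketch below is only an illustration, assuming the script is named `run.sh` and takes `<start_stage> <stop_stage>` positional arguments (both the script name and the argument convention are assumptions, not confirmed by this excerpt):

```bash
# Hypothetical invocation: run only Stage 4 (the gRPC benchmark client).
# Assumes run.sh accepts <start_stage> <stop_stage> positional arguments.
bash run.sh 4 4
```
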
### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
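
A minimal sketch of that command, again assuming a `run.sh` driver with `<start_stage> <stop_stage>` arguments (the actual script name and interface may differ):

```bash
# Inside the Docker container: hypothetical invocation of stages 0-3
# to prepare the models and start the Triton server.
bash run.sh 0 3
```
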
The following results were obtained by decoding on a single L20 GPU with the `yu

| TRTLLM | 16 | 2.01 | 5.03 | 0.0292 |

### Disaggregated Server

When the LLM and token2wav components are deployed on the same GPU, they compete for resources. To optimize performance, we use a disaggregated setup where the LLM is deployed on one dedicated L20 GPU, taking advantage of in-flight batching for inference. The token2wav module is deployed on separate, dedicated GPUs.
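
As a rough illustration of this GPU assignment, the sketch below pins each server to a device with `CUDA_VISIBLE_DEVICES`; the script names and ports are hypothetical, and in practice Stage 7 of the run script performs the launch:

```bash
# Hypothetical disaggregated launch (script names and ports are placeholders).
# LLM server with in-flight batching on the dedicated GPU 0:
CUDA_VISIBLE_DEVICES=0 python3 llm_server.py --port 8000 &

# Token2Wav servers on GPUs 1-3, two instances per GPU as in our tests:
for gpu in 1 2 3; do
  CUDA_VISIBLE_DEVICES=$gpu python3 token2wav_server.py --port $((9000 + 2 * gpu)) &
  CUDA_VISIBLE_DEVICES=$gpu python3 token2wav_server.py --port $((9001 + 2 * gpu)) &
done
wait
```
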
The table below shows the first chunk latency results for this configuration. In our tests, we deploy two token2wav instances on each dedicated token2wav GPU.