There are five models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
directory that will be used in this example:

- "preprocessing": This model is used for tokenizing, meaning the conversion from
  prompts (string) to input_ids (list of ints).
- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used
  for inferencing.
- "postprocessing": This model is used for de-tokenizing, meaning the conversion
  from output_ids (list of ints) to outputs (string).
- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm and
  postprocessing models together.
- "tensorrt_llm_bls": This model can also be used to chain the preprocessing,
  tensorrt_llm and postprocessing models together. The BLS model has an optional
  parameter `accumulate_tokens` which can be used in streaming mode to call the
  postprocessing model with all accumulated tokens, instead of only one token.
  This might be necessary for certain tokenizers.

To learn more about ensemble and BLS models, please see the
[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models)
and
[Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
sections of the Triton Inference Server documentation.
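
If you want to quickly exercise the full chained pipeline, one option is to send
a request to the ensemble model through Triton's HTTP generate endpoint. The
sketch below assumes the default HTTP port (8000) and the input names used by
the models in this example (`text_input`, `max_tokens`, `bad_words`,
`stop_words`); adjust them to match your deployment.

```bash
# Send a single prompt through preprocessing -> tensorrt_llm -> postprocessing
# via the ensemble model and print the generated text.
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```

The `tensorrt_llm_bls` model accepts the same request; replace `ensemble` with
`tensorrt_llm_bls` in the URL to route the request through the BLS model instead.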

You will find that the generation process is stopped early and therefore the
number of generated tokens is lower than 200. You can have a look at the
client code to see how early stopping is achieved.
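
As a sketch of how this can be exercised from the command line, the example
client in this repository accepts a flag to send a stop request shortly after
the generation request is issued; the invocation below assumes the
`inflight_batcher_llm_client.py` client and its `--stop-after-ms` option, so
adjust the path, flags, and tokenizer location to your setup.

```bash
# Request up to 200 output tokens, then send a stop request after ~200 ms
# (flag names assume the example inflight batcher client in this repo).
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 200 \
    --stop-after-ms 200 \
    --tokenizer-dir /path/to/tokenizer
```
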
### Launch Triton server *within Slurm based clusters*

When you are done, you can stop the Triton server with:

```bash
pkill tritonserver
```

## Triton Metrics

Starting with the 23.11 release of Triton, users can now obtain TRT LLM Batch
Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/ffd5af342a817a2689d38e4af2cc59ded877e339/docs/source/batch_manager.md#statistics)
by querying the Triton metrics endpoint. This can be accomplished by launching
a Triton server in any of the ways described above (ensuring the build code /
container is 23.11 or later) and querying the server. Upon receiving a
successful response, you can query the metrics endpoint by entering the
following:
```bash
curl localhost:8002/metrics
```

Batch manager statistics are reported by the metrics endpoint in fields that
are prefixed with `nv_trt_llm_`. Your output for these fields should look
similar to the following (assuming your model is an inflight batcher model):
```bash
# HELP nv_trt_llm_request_statistics TRT LLM request metrics
```
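
If you only want to see the batch manager fields, one convenient option is to
filter the endpoint output on that prefix:

```bash
# Show only the TRT LLM batch manager statistics.
curl -s localhost:8002/metrics | grep nv_trt_llm_
```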

If, instead, you launched a V1 model, your output will look similar to the
output above except the inflight batcher related fields will be replaced
with something similar to the following:
```bash
# HELP nv_trt_llm_v1_statistics TRT LLM v1-specific metrics
```

Please note that versions of Triton prior to the 23.12 release do not
support base Triton metrics. As such, the following fields will report 0:
```bash
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
```
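
On a 23.12 or later container, you can confirm that these base metrics are
being populated by filtering for them after sending a few requests, for
example:

```bash
# Base Triton metrics such as the inference request count (non-zero on 23.12+).
curl -s localhost:8002/metrics | grep nv_inference_request_success
```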