
Commit 3a61c37

Update TensorRT-LLM backend (triton-inference-server#223)
* Update TensorRT-LLM backend
1 parent 1309995 commit 3a61c37

32 files changed: 802 additions & 386 deletions

README.md

Lines changed: 67 additions & 45 deletions
@@ -164,14 +164,24 @@ python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
 
 There are five models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
 directory that will be used in this example:
-- "preprocessing": This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
-- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing
-- "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
-- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm and postprocessing models together.
-- "tensorrt_llm_bls": This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together. The BLS model has an optional parameter `accumulate_tokens` which can be used in streaming mode to call the preprocessing model with all accumulated tokens, instead of only one token. This might be necessary for certain tokenizers.
+- "preprocessing": This model is used for tokenizing, meaning the conversion from
+  prompts(string) to input_ids(list of ints).
+- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used
+  for inferencing
+- "postprocessing": This model is used for de-tokenizing, meaning the conversion
+  from output_ids(list of ints) to outputs(string).
+- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm
+  and postprocessing models together.
+- "tensorrt_llm_bls": This model can also be used to chain the preprocessing,
+  tensorrt_llm and postprocessing models together. The BLS model has an optional
+  parameter `accumulate_tokens` which can be used in streaming mode to call the
+  preprocessing model with all accumulated tokens, instead of only one token.
+  This might be necessary for certain tokenizers.
 
 To learn more about ensemble and BLS models, please see the
-[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting) sections of the Triton Inference Server documentation.
+[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models)
+and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
+sections of the Triton Inference Server documentation.
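
For readers new to the ensemble flow described above, here is a minimal, hedged sketch of an end-to-end request: the client sends plain text to the `ensemble` model and receives plain text back, with the preprocessing, tensorrt_llm and postprocessing models invoked server-side. The field names (`text_input`, `max_tokens`, `bad_words`, `stop_words`, `text_output`) are assumptions based on this repository's ensemble configuration and may differ in your model repository.

```python
# Hedged sketch: one end-to-end request through the "ensemble" model, so that
# tokenization and de-tokenization happen server-side. Field names are
# assumptions and may differ in your model repository.
import json
import urllib.request

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["text_output"])
```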
 
 ```bash
 # Create the model repository that will be used by the Triton server
@@ -333,14 +343,16 @@ He was a member of the French Academy of Sciences and the French Academy of Arts
 Soyer was a member of the French Academy of Sciences and
 ```
 
-You can also stop the generation process early by using the `--stop-after-ms` option to send a stop request after a few milliseconds:
+You can also stop the generation process early by using the `--stop-after-ms`
+option to send a stop request after a few milliseconds:
 
 ```bash
 python inflight_batcher_llm/client/inflight_batcher_llm_client.py --stop-after-ms 200 --request-output-len 200 --tokenizer-dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
 ```
 
-You will find that the generation process is stopped early and therefore the number of generated tokens is lower than 200.
-You can have a look at the client code to see how early stopping is achieved.
+You will find that the generation process is stopped early and therefore the
+number of generated tokens is lower than 200. You can have a look at the
+client code to see how early stopping is achieved.
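
As a rough sketch of the mechanism (not the client's exact code): early stopping is requested by sending a follow-up request that carries the same request id as the in-flight one, with a boolean `stop` input set to true. The tensor names and dtypes below are assumptions about the `tensorrt_llm` model's inputs; consult `inflight_batcher_llm_client.py` for the authoritative version.

```python
# Hedged sketch: cancel an in-flight generation by sending a second request
# with the SAME request_id and a "stop" input set to True. Tensor names and
# dtypes are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

def build_stop_inputs():
    inputs = []
    for name in ("input_ids", "input_lengths", "request_output_len"):
        t = grpcclient.InferInput(name, [1, 1], "INT32")
        t.set_data_from_numpy(np.zeros((1, 1), dtype=np.int32))
        inputs.append(t)
    stop = grpcclient.InferInput("stop", [1, 1], "BOOL")
    stop.set_data_from_numpy(np.array([[True]], dtype=bool))
    inputs.append(stop)
    return inputs

# Usage (assuming an open streaming connection and the original request's id):
# client.async_stream_infer(model_name="tensorrt_llm",
#                           inputs=build_stop_inputs(),
#                           request_id=request_id)
```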
 
 ### Launch Triton server *within Slurm based clusters*
 
@@ -391,49 +403,59 @@ pkill tritonserver
 ```
 
 ## Triton Metrics
-Starting with the 23.11 release of Triton, users can now obtain TRT LLM Batch Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/ffd5af342a817a2689d38e4af2cc59ded877e339/docs/source/batch_manager.md#statistics) by querying the Triton metrics endpoint. This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 23.11 or later) and querying the sever with the generate endpoint. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
+Starting with the 23.11 release of Triton, users can now obtain TRT LLM Batch
+Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/batch_manager.md#statistics)
+by querying the Triton metrics endpoint. This can be accomplished by launching
+a Triton server in any of the ways described above (ensuring the build code /
+container is 23.11 or later) and querying the server. Upon receiving a
+successful response, you can query the metrics endpoint by entering the
+following:
 ```bash
 curl localhost:8002/metrics
 ```
-Batch manager statistics are reported by the metrics endpoint in fields that are prefixed with `nv_trt_llm_`. Your output for these fields should look similar to the following (assuming your model is an inflight batcher model):
+Batch manager statistics are reported by the metrics endpoint in fields that
+are prefixed with `nv_trt_llm_`. Your output for these fields should look
+similar to the following (assuming your model is an inflight batcher model):
 ```bash
-# HELP nv_trt_llm_request_statistics TRT LLM request metrics
-# TYPE nv_trt_llm_request_statistics gauge
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="context",version="1"} 1
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="max",version="1"} 512
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="active",version="1"} 0
-# HELP nv_trt_llm_runtime_memory_statistics TRT LLM runtime memory metrics
-# TYPE nv_trt_llm_runtime_memory_statistics gauge
-nv_trt_llm_runtime_memory_statistics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_runtime_memory_statistics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
-nv_trt_llm_runtime_memory_statistics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
-# HELP nv_trt_llm_kv_cache_block_statistics TRT LLM KV cache block metrics
-# TYPE nv_trt_llm_kv_cache_block_statistics gauge
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
-# HELP nv_trt_llm_inflight_batcher_statistics TRT LLM inflight_batcher-specific metrics
-# TYPE nv_trt_llm_inflight_batcher_statistics gauge
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
-# HELP nv_trt_llm_general_statistics General TRT LLM statistics
-# TYPE nv_trt_llm_general_statistics gauge
-nv_trt_llm_general_statistics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_general_statistics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
+# HELP nv_trt_llm_request_metrics TRT LLM request metrics
+# TYPE nv_trt_llm_request_metrics gauge
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="context",version="1"} 1
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="max",version="1"} 512
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 0
+# HELP nv_trt_llm_runtime_memory_metrics TRT LLM runtime memory metrics
+# TYPE nv_trt_llm_runtime_memory_metrics gauge
+nv_trt_llm_runtime_memory_metrics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_runtime_memory_metrics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
+nv_trt_llm_runtime_memory_metrics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
+# HELP nv_trt_llm_kv_cache_block_metrics TRT LLM KV cache block metrics
+# TYPE nv_trt_llm_kv_cache_block_metrics gauge
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
+# HELP nv_trt_llm_inflight_batcher_metrics TRT LLM inflight_batcher-specific metrics
+# TYPE nv_trt_llm_inflight_batcher_metrics gauge
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
+# HELP nv_trt_llm_general_metrics General TRT LLM metrics
+# TYPE nv_trt_llm_general_metrics gauge
+nv_trt_llm_general_metrics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_general_metrics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
 ```
-If, instead, you launched a V1 model, your output will look similar to the output above except the inflight batcher related fields will be replaced with something similar to the following:
+If, instead, you launched a V1 model, your output will look similar to the
+output above except the inflight batcher related fields will be replaced
+with something similar to the following:
 ```bash
-# HELP nv_trt_llm_v1_statistics TRT LLM v1-specific metrics
-# TYPE nv_trt_llm_v1_statistics gauge
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
+# HELP nv_trt_llm_v1_metrics TRT LLM v1-specific metrics
+# TYPE nv_trt_llm_v1_metrics gauge
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
 ```
-Please note that as of the 23.11 Triton release, a link between base Triton metrics (such as inference request count and latency) is being actively developed, but is not yet supported.
-As such, the following fields will report 0:
+Please note that versions of Triton prior to the 23.12 release do not
+support base Triton metrics. As such, the following fields will report 0:
 ```bash
 # HELP nv_inference_request_success Number of successful inference requests, all batch sizes
 # TYPE nv_inference_request_success counter
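
As a small illustration of consuming the endpoint described above, the sketch below fetches `localhost:8002/metrics` and prints only the batch-manager fields prefixed with `nv_trt_llm_`; it assumes a server running locally with the default metrics port.

```python
# Minimal sketch: scrape the Triton metrics endpoint and keep only the
# TRT LLM batch manager fields (lines prefixed with nv_trt_llm_).
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith("nv_trt_llm_"):
        # e.g. nv_trt_llm_request_metrics{model="tensorrt_llm",...} 1
        metric, _, value = line.rpartition(" ")
        print(f"{metric} = {value}")
```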

all_models/gpt/postprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -37,8 +37,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')
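
The `trust_remote_code=True` argument added above (and in the three analogous tokenizer files below) lets `AutoTokenizer` execute tokenizer code shipped inside the tokenizer directory, which some community models require; it should only be enabled for directories you trust. A standalone sketch of the resulting behavior, with a hypothetical path:

```python
from transformers import AutoTokenizer

# Hypothetical tokenizer_dir; trust_remote_code=True allows custom tokenizer
# code bundled with the model to run, so use it only for trusted paths.
tokenizer = AutoTokenizer.from_pretrained(
    "/workspace/models/my_tokenizer",
    padding_side='left',
    trust_remote_code=True)

# Preprocessing: prompts (string) -> input_ids (list of ints)
input_ids = tokenizer("Hello, world").input_ids
# Postprocessing: output_ids (list of ints) -> outputs (string)
text = tokenizer.decode(input_ids)
```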

all_models/gpt/preprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -40,8 +40,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -67,8 +67,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/preprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -68,8 +68,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt

Lines changed: 2 additions & 2 deletions
@@ -289,8 +289,8 @@ parameters: {
   }
 }
 parameters: {
-  key: "use_context_fmha_for_generation"
+  key: "enable_kv_cache_reuse"
   value: {
-    string_value: "${use_context_fmha_for_generation}"
+    string_value: "${enable_kv_cache_reuse}"
   }
 }
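
The `${enable_kv_cache_reuse}` placeholder above is a template value that must be substituted before the model repository is loaded; the repository's `tools/fill_template.py` helper performs this kind of substitution. A minimal plain-Python equivalent, with `false` chosen arbitrarily as the value:

```python
# Minimal sketch: fill the ${enable_kv_cache_reuse} placeholder in the
# tensorrt_llm config.pbtxt template. string.Template understands the same
# ${name} syntax; the value "false" here is an arbitrary example.
from pathlib import Path
from string import Template

cfg = Path("all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt")
text = Template(cfg.read_text()).safe_substitute(enable_kv_cache_reuse="false")
cfg.write_text(text)
```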
