Commit a653f76 (parent 6e6e34e)

Update TensorRT-LLM backend (triton-inference-server#290)

* Update TensorRT-LLM backend

24 files changed: 658 additions, 69 deletions
README.md

Lines changed: 50 additions & 1 deletion
@@ -218,7 +218,7 @@ The following table shows the fields that may to be modified before deployment:
 | `max_beam_width` | Optional (default=1). The maximum beam width that any request may ask for when using beam search.|
 | `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, value is interpreted as 'infinite'. KV cache allocation is the min of max_tokens_in_paged_kv_cache and value derived from kv_cache_free_gpu_mem_fraction below. |
 | `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to generate one token. Defaults attends to all tokens in sequence. |
-| `kv_cache_free_gpu_mem_fraction` | Optional (default=0.85). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.|
+| `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.|
 | `max_num_sequences` | Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for.
 | `enable_trt_overlap` | Optional (default=`true`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
 | `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
@@ -346,6 +346,7 @@ He was a member of the French Academy of Sciences and the French Academy of Arts
 Soyer was a member of the French Academy of Sciences and
 ```
 
+#### Early stopping
 You can also stop the generation process early by using the `--stop-after-ms`
 option to send a stop request after a few milliseconds:
 
@@ -357,6 +358,54 @@ You will find that the generation process is stopped early and therefore the
 number of generated tokens is lower than 200. You can have a look at the
 client code to see how early stopping is achieved.
 
+#### Return context logits and/or generation logits
+If you want to get context logits and/or generation logits, you need to enable `--gather_context_logits` and/or `--gather_generation_logits` when building the engine (or `--enable gather_all_token_logits` to enable both at the same time). For more setting details about these two flags, please refer to [build.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gpt/build.py) or [gpt_runtime](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md).
+
+After launching the server, you could get the output of logits by passing the corresponding parameters `--return-context-logits` and/or `--return-generation-logits` in the client scripts (`end_to_end_grpc_client.py` and `inflight_batcher_llm_client.py`). For example:
+```bash
+python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 20 --tokenizer-dir /path/to/tokenizer/ \
+    --return-context-logits \
+    --return-generation-logits
+```
+
+The result should be similar to the following:
+```
+Input sequence: [28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257]
+Got completed request
+Input: Born in north-east France, Soyer trained as a
+Output beam 0: has since worked in restaurants in London,
+Output sequence: [21221, 878, 3867, 284, 3576, 287, 262, 1903, 6303, 82, 13, 679, 468, 1201, 3111, 287, 10808, 287, 3576, 11]
+context_logits.shape: (1, 12, 50257)
+context_logits: [[[ -65.9822    -62.267445  -70.08991  ...  -76.16964   -78.8893
+    -65.90678 ]
+  [-103.40278  -102.55243  -106.119026 ... -108.925415 -109.408585
+   -101.37687 ]
+  [ -63.971176  -64.03466   -67.58809  ...  -72.141235  -71.16892
+    -64.23846 ]
+  ...
+  [ -80.776375  -79.1815    -85.50916  ...  -87.07368   -88.02817
+    -79.28435 ]
+  [ -10.551408   -7.786484  -14.524468 ...  -13.805856  -15.767286
+     -7.9322424]
+  [-106.33096  -105.58956  -111.44852  ... -111.04858  -111.994194
+   -105.40376 ]]]
+generation_logits.shape: (1, 1, 20, 50257)
+generation_logits: [[[[-106.33096  -105.58956  -111.44852  ... -111.04858  -111.994194
+    -105.40376 ]
+   [ -77.867424  -76.96638   -83.119095 ...  -87.82542   -88.53957
+     -75.64877 ]
+   [-136.92282  -135.02484  -140.96051  ... -141.78284  -141.55045
+    -136.01668 ]
+   ...
+   [-100.03721   -98.98237  -105.25507  ... -108.49254  -109.45882
+     -98.95136 ]
+   [-136.78777  -136.16165  -139.13437  ... -142.21495  -143.57468
+    -134.94667 ]
+   [  19.222942   19.127287   14.804495 ...   10.556551    9.685863
+      19.625107]]]]
+```
+
+
 ### Launch Triton server *within Slurm based clusters*
 
 #### Prepare some scripts
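The shipped client scripts hide the request plumbing. For reference, the same request can be assembled with the generic Triton gRPC client. The sketch below is an illustration, not part of this commit: it assumes the default `ensemble` model from this repo, the `text_input`/`max_tokens`/`text_output` tensor names from its config, and a server listening on `localhost:8001`.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed model/tensor names; adjust to match your deployed ensemble config.
client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array([["Born in north-east France, Soyer trained as a"]], dtype=object)
max_tokens = np.array([[20]], dtype=np.int32)
return_ctx = np.array([[True]], dtype=bool)
return_gen = np.array([[True]], dtype=bool)

inputs = [
    grpcclient.InferInput("text_input", [1, 1], "BYTES"),
    grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
    grpcclient.InferInput("return_context_logits", [1, 1], "BOOL"),
    grpcclient.InferInput("return_generation_logits", [1, 1], "BOOL"),
]
for inp, data in zip(inputs, [text, max_tokens, return_ctx, return_gen]):
    inp.set_data_from_numpy(data)

outputs = [
    grpcclient.InferRequestedOutput("text_output"),
    grpcclient.InferRequestedOutput("context_logits"),
    grpcclient.InferRequestedOutput("generation_logits"),
]

result = client.infer("ensemble", inputs, outputs=outputs)
print(result.as_numpy("text_output"))
print(result.as_numpy("context_logits").shape)
print(result.as_numpy("generation_logits").shape)
```

The two `return_*` flags map onto the optional `BOOL` inputs that this commit adds to the ensemble config (see the `all_models/inflight_batcher_llm/ensemble/config.pbtxt` diff below).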

all_models/gpt/ensemble/config.pbtxt

Lines changed: 10 additions & 0 deletions
@@ -76,6 +76,12 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "frequency_penalty"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+    optional: true
+  },
   {
     name: "random_seed"
     data_type: TYPE_UINT64
@@ -187,6 +193,10 @@ ensemble_scheduling {
     key: "presence_penalty"
     value: "presence_penalty"
   }
+  input_map {
+    key: "frequency_penalty"
+    value: "frequency_penalty"
+  }
   input_map {
     key: "random_seed"
     value: "random_seed"

all_models/gpt/tensorrt_llm/1/model.py

Lines changed: 4 additions & 0 deletions
@@ -173,6 +173,8 @@ def execute(self, requests):
                 request, 'min_length')
             inputs['presence_penalty'] = get_input_scalar_by_name(
                 request, 'presence_penalty')
+            inputs['frequency_penalty'] = get_input_scalar_by_name(
+                request, 'frequency_penalty')
             inputs['random_seed'] = get_input_scalar_by_name(
                 request, 'random_seed')
             inputs['output_log_probs'] = get_input_scalar_by_name(
@@ -203,6 +205,8 @@ def execute(self, requests):
             sampling_config.min_length = inputs['min_length']
             if inputs['presence_penalty'] is not None:
                 sampling_config.presence_penalty = inputs['presence_penalty']
+            if inputs['frequency_penalty'] is not None:
+                sampling_config.frequency_penalty = inputs['frequency_penalty']
             sampling_config.random_seed = inputs['random_seed']
             sampling_config.output_log_probs = inputs['output_log_probs']
             if self.remove_input_padding:
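The helper used above is defined elsewhere in `model.py` and is not shown in this diff. A hypothetical reconstruction is sketched below only to show why the `is not None` guard matters for optional inputs; the real implementation may differ:

```python
import triton_python_backend_utils as pb_utils

def get_input_scalar_by_name(request, name):
    """Hypothetical sketch: fetch an optional per-request input and collapse it
    to a single Python scalar, returning None when the client did not send it."""
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    if tensor is None:  # optional input omitted by the client
        return None
    return tensor.as_numpy().reshape(-1)[0]
```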

all_models/gpt/tensorrt_llm/config.pbtxt

Lines changed: 7 additions & 0 deletions
@@ -92,6 +92,13 @@ input [
     reshape: { shape: [ ] }
     optional: true
   },
+  {
+    name: "frequency_penalty"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+    reshape: { shape: [ ] }
+    optional: true
+  },
   {
     name: "random_seed"
     data_type: TYPE_UINT64

all_models/inflight_batcher_llm/ensemble/config.pbtxt

(file mode changed from 100755 to 100644)
Lines changed: 82 additions & 2 deletions
@@ -104,6 +104,12 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "frequency_penalty"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+    optional: true
+  },
   {
     name: "random_seed"
     data_type: TYPE_UINT64
@@ -116,6 +122,18 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "return_context_logits"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+    optional: true
+  },
+  {
+    name: "return_generation_logits"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+    optional: true
+  },
   {
     name: "beam_width"
     data_type: TYPE_INT32
@@ -168,6 +186,16 @@ output [
     name: "output_log_probs"
     data_type: TYPE_FP32
     dims: [ -1, -1 ]
+  },
+  {
+    name: "context_logits"
+    data_type: TYPE_FP32
+    dims: [ -1, -1 ]
+  },
+  {
+    name: "generation_logits"
+    data_type: TYPE_FP32
+    dims: [ -1, -1, -1 ]
   }
 ]
 ensemble_scheduling {
@@ -199,6 +227,14 @@ ensemble_scheduling {
     key: "EMBEDDING_BIAS_WEIGHTS"
     value: "embedding_bias_weights"
   }
+  input_map {
+    key: "END_ID"
+    value: "end_id"
+  }
+  input_map {
+    key: "PAD_ID"
+    value: "pad_id"
+  }
   output_map {
     key: "REQUEST_INPUT_LEN"
     value: "_REQUEST_INPUT_LEN"
@@ -223,6 +259,14 @@ ensemble_scheduling {
     key: "EMBEDDING_BIAS"
     value: "_EMBEDDING_BIAS"
   }
+  output_map {
+    key: "OUT_END_ID"
+    value: "_PREPROCESSOR_END_ID"
+  }
+  output_map {
+    key: "OUT_PAD_ID"
+    value: "_PREPROCESSOR_PAD_ID"
+  }
 },
 {
   model_name: "tensorrt_llm"
@@ -241,11 +285,11 @@ ensemble_scheduling {
   }
   input_map {
     key: "end_id"
-    value: "end_id"
+    value: "_PREPROCESSOR_END_ID"
   }
   input_map {
     key: "pad_id"
-    value: "pad_id"
+    value: "_PREPROCESSOR_PAD_ID"
   }
   input_map {
     key: "embedding_bias"
@@ -279,6 +323,10 @@ ensemble_scheduling {
     key: "presence_penalty"
     value: "presence_penalty"
   }
+  input_map {
+    key: "frequency_penalty"
+    value: "frequency_penalty"
+  }
   input_map {
     key: "random_seed"
     value: "random_seed"
@@ -287,6 +335,14 @@ ensemble_scheduling {
     key: "return_log_probs"
     value: "return_log_probs"
   }
+  input_map {
+    key: "return_context_logits"
+    value: "return_context_logits"
+  }
+  input_map {
+    key: "return_generation_logits"
+    value: "return_generation_logits"
+  }
   input_map {
     key: "beam_width"
     value: "beam_width"
@@ -326,6 +382,14 @@ ensemble_scheduling {
   output_map {
     key: "output_log_probs"
     value: "_OUTPUT_LOG_PROBS"
+  },
+  output_map {
+    key: "context_logits"
+    value: "_CONTEXT_LOGITS"
+  },
+  output_map {
+    key: "generation_logits"
+    value: "_GENERATION_LOGITS"
   }
 },
 {
@@ -343,6 +407,14 @@ ensemble_scheduling {
     key: "OUTPUT_LOG_PROBS"
     value: "_OUTPUT_LOG_PROBS"
   }
+  input_map {
+    key: "CONTEXT_LOGITS"
+    value: "_CONTEXT_LOGITS"
+  }
+  input_map {
+    key: "GENERATION_LOGITS"
+    value: "_GENERATION_LOGITS"
+  }
   input_map {
     key: "SEQUENCE_LENGTH"
     value: "_SEQUENCE_LENGTH"
@@ -359,6 +431,14 @@ ensemble_scheduling {
     key: "OUT_CUM_LOG_PROBS"
     value: "cum_log_probs"
   }
+  output_map {
+    key: "OUT_CONTEXT_LOGITS"
+    value: "context_logits"
+  }
+  output_map {
+    key: "OUT_GENERATION_LOGITS"
+    value: "generation_logits"
+  }
 }
 ]
 }
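The new ensemble outputs come back to the client as numpy arrays (`result.as_numpy("context_logits")` / `result.as_numpy("generation_logits")` in the gRPC sketch after the README diff). Their shapes in the README example of this commit are consistent with `[batch, input_len, vocab]` for `context_logits` and `[batch, beams, output_len, vocab]` for `generation_logits`; that reading is an interpretation of the example, not a documented guarantee. A self-contained sketch of turning either tensor into per-position token distributions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Convert raw logits to probabilities along the vocab axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Shapes as printed in the README example above (zero-filled placeholders here):
context_logits = np.zeros((1, 12, 50257), dtype=np.float32)       # [batch, input_len, vocab]
generation_logits = np.zeros((1, 1, 20, 50257), dtype=np.float32)  # [batch, beams, output_len, vocab]

prompt_token_distributions = softmax(context_logits)
generated_token_distributions = softmax(generation_logits)
```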

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 16 additions & 1 deletion
@@ -126,6 +126,14 @@ def execute(self, requests):
             output_log_probs = pb_utils.get_input_tensor_by_name(
                 request, 'OUTPUT_LOG_PROBS').as_numpy()
 
+            # Get context logits
+            context_logits = pb_utils.get_input_tensor_by_name(
+                request, 'CONTEXT_LOGITS').as_numpy()
+
+            # Get generation logits
+            generation_logits = pb_utils.get_input_tensor_by_name(
+                request, 'GENERATION_LOGITS').as_numpy()
+
             # Reshape Input
             # tokens_batch = tokens_batch.reshape([-1, tokens_batch.shape[0]])
             # tokens_batch = tokens_batch.T
@@ -145,6 +153,12 @@ def execute(self, requests):
             out_output_log_probs = pb_utils.Tensor('OUT_OUTPUT_LOG_PROBS',
                                                    output_log_probs)
 
+            out_context_logits = pb_utils.Tensor('OUT_CONTEXT_LOGITS',
+                                                 context_logits)
+
+            out_generation_logits = pb_utils.Tensor('OUT_GENERATION_LOGITS',
+                                                    generation_logits)
+
             # Create InferenceResponse. You can set an error here in case
             # there was a problem with handling this inference request.
             # Below is an example of how you can set errors in inference
@@ -153,7 +167,8 @@ def execute(self, requests):
             #    pb_utils.InferenceResponse(
             #        output_tensors=..., TritonError("An error occurred"))
             inference_response = pb_utils.InferenceResponse(output_tensors=[
-                output_tensor, out_cum_log_probs, out_output_log_probs
+                output_tensor, out_cum_log_probs, out_output_log_probs,
+                out_context_logits, out_generation_logits
             ])
             responses.append(inference_response)
 
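One caveat worth noting: `CONTEXT_LOGITS` and `GENERATION_LOGITS` are declared `optional: true` in the postprocessing config below, while the lines added here call `.as_numpy()` unconditionally, which would fail if a request ever arrives without those tensors. A defensive variant is sketched below as an illustration only; the fallback shapes are assumptions, not part of this commit:

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def get_optional_logits(request, name, fallback_shape):
    """Return the optional input as a numpy array, or a zero placeholder of an
    assumed shape when the tensor was not supplied with the request."""
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    if tensor is None:
        return np.zeros(fallback_shape, dtype=np.float32)
    return tensor.as_numpy()

# Inside execute(), mirroring the added lines:
#   context_logits = get_optional_logits(request, 'CONTEXT_LOGITS', (1, 1, 1))
#   generation_logits = get_optional_logits(request, 'GENERATION_LOGITS', (1, 1, 1, 1))
```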

all_models/inflight_batcher_llm/postprocessing/config.pbtxt

(file mode changed from 100755 to 100644)
Lines changed: 22 additions & 0 deletions
@@ -47,6 +47,18 @@ input [
     name: "OUTPUT_LOG_PROBS"
     data_type: TYPE_FP32
     dims: [ -1, -1 ]
+  },
+  {
+    name: "CONTEXT_LOGITS"
+    data_type: TYPE_FP32
+    dims: [ -1, -1 ]
+    optional: true
+  },
+  {
+    name: "GENERATION_LOGITS"
+    data_type: TYPE_FP32
+    dims: [ -1, -1, -1 ]
+    optional: true
   }
 ]
 output [
@@ -64,6 +76,16 @@ output [
     name: "OUT_OUTPUT_LOG_PROBS"
     data_type: TYPE_FP32
     dims: [ -1, -1 ]
+  },
+  {
+    name: "OUT_CONTEXT_LOGITS"
+    data_type: TYPE_FP32
+    dims: [ -1, -1 ]
+  },
+  {
+    name: "OUT_GENERATION_LOGITS"
+    data_type: TYPE_FP32
+    dims: [ -1, -1, -1 ]
   }
 ]
 
