
Commit 171ed05

Update TensorRT-LLM backend (triton-inference-server#180)

1 parent e8ae70c
26 files changed: +3115 / -1320 lines

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -7,6 +7,5 @@ build/
 *.so
 *.egg-info/
 .coverage
-*.csv
 *.onnx
 tmp/

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -44,5 +44,6 @@ repos:
     rev: v2.2.4
     hooks:
       - id: codespell
+        exclude: tools/dataset/
         args:
           - --skip=".git,tensorrt_llm"

README.md

Lines changed: 14 additions & 9 deletions
@@ -162,16 +162,16 @@ python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
 
 ### Create the model repository
 
-There are four models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
+There are five models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
 directory that will be used in this example:
 - "preprocessing": This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
 - "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing
 - "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
-- "ensemble": This model is used to chain the three models above together:
-preprocessing -> tensorrt_llm -> postprocessing
+- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm and postprocessing models together.
+- "tensorrt_llm_bls": This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together. The BLS model has an optional parameter `accumulate_tokens` which can be used in streaming mode to call the preprocessing model with all accumulated tokens, instead of only one token. This might be necessary for certain tokenizers.
 
-To learn more about ensemble model, please see
-[here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models).
+To learn more about ensemble and BLS models, please see the
+[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting) sections of the Triton Inference Server documentation.
 
 ```bash
 # Create the model repository that will be used by the Triton server

@@ -258,8 +258,8 @@ environment/container:
 curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
 ```
 
-In the case of the models used in this example, you can replace MODEL_NAME with `ensemble`. Examining the
-ensemble model's config.pbtxt file, you can see that 4 parameters are required to generate a response
+In the case of the models used in this example, you can replace MODEL_NAME with `ensemble` or `tensorrt_llm_bls`. Examining the
+`ensemble` and `tensorrt_llm_bls` model's config.pbtxt file, you can see that 4 parameters are required to generate a response
 for this model:
 
 - "text_input": Input text to generate a response from

@@ -272,6 +272,11 @@ Therefore, we can query the server in the following way:
 ```bash
 curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
 ```
+if using the `ensemble` model or
+```
+curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
+```
+if using the `tensorrt_llm_bls` model.
 
 Which should return a result similar to (formatted for readability):
 ```json

@@ -292,7 +297,7 @@ You can send requests to the "tensorrt_llm" model with the provided
 as following:
 
 ```bash
-python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
+python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
 ```
 
 The result should be similar to the following:

@@ -323,7 +328,7 @@ Soyer was a member of the French Academy of Sciences and
 You can also stop the generation process early by using the `--stop-after-ms` option to send a stop request after a few milliseconds:
 
 ```bash
-python inflight_batcher_llm/client/inflight_batcher_llm_client.py --stop-after-ms 200 --request-output-len 200 --tokenizer_dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
+python inflight_batcher_llm/client/inflight_batcher_llm_client.py --stop-after-ms 200 --request-output-len 200 --tokenizer-dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
 ```
 
 You will find that the generation process is stopped early and therefore the number of generated tokens is lower than 200.
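The curl commands in this hunk can also be issued from Python. Below is a minimal illustrative sketch (not part of the commit) using the `requests` package, assuming Triton is reachable on localhost:8000 as in the README; the model name can be `ensemble` or `tensorrt_llm_bls`.

```python
# Illustrative only: same generate-endpoint request as the curl examples above.
# Assumes Triton is serving on localhost:8000 and `requests` is installed.
import requests

model_name = "ensemble"  # or "tensorrt_llm_bls", depending on the deployed model
url = f"http://localhost:8000/v2/models/{model_name}/generate"

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())  # JSON body containing the generated text
```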

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 8 additions & 1 deletion
@@ -57,6 +57,11 @@ def initialize(self, args):
             'string_value']
         tokenizer_type = model_config['parameters']['tokenizer_type'][
             'string_value']
+        self.skip_special_tokens = model_config['parameters'].get(
+            'skip_special_tokens',
+            {'string_value': "true"})['string_value'].lower() in [
+                'true', '1', 't', 'y', 'yes'
+            ]
 
         if tokenizer_type == 't5':
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,

@@ -168,6 +173,8 @@ def _postprocessing(self, tokens_batch, sequence_lengths):
         for batch_idx, beam_tokens in enumerate(tokens_batch):
             for beam_idx, tokens in enumerate(beam_tokens):
                 seq_len = sequence_lengths[batch_idx][beam_idx]
-                output = self.tokenizer.decode(tokens[:seq_len])
+                output = self.tokenizer.decode(
+                    tokens[:seq_len],
+                    skip_special_tokens=self.skip_special_tokens)
                 outputs.append(output.encode('utf8'))
         return outputs
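The new `skip_special_tokens` parameter is passed straight through to the standard Hugging Face decode flag. A small standalone illustration (not part of the diff), using the GPT-2 tokenizer as a stand-in for whatever tokenizer the config points at:

```python
# Illustrative only: how skip_special_tokens changes tokenizer.decode output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

# Append the EOS id to mimic a finished generation.
ids = tokenizer.encode("Hello world") + [tokenizer.eos_token_id]

print(tokenizer.decode(ids, skip_special_tokens=False))  # Hello world<|endoftext|>
print(tokenizer.decode(ids, skip_special_tokens=True))   # Hello world
```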

all_models/inflight_batcher_llm/postprocessing/config.pbtxt

Lines changed: 8 additions & 1 deletion
@@ -81,9 +81,16 @@ parameters {
   }
 }
 
+parameters {
+  key: "skip_special_tokens"
+  value: {
+    string_value: "True"
+  }
+}
+
 instance_group [
   {
-    count: 1
+    count: ${postprocessing_instance_count}
     kind: KIND_CPU
   }
 ]
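The `${postprocessing_instance_count}` placeholder has to be filled in before Triton loads the config. One possible way to do that with Python's `string.Template` is sketched below; this is an illustration only, the file path and value are examples, and the repository's own templating tooling may differ.

```python
# Illustrative only: fill the ${postprocessing_instance_count} placeholder
# in a copy of the config before handing the model repository to Triton.
from string import Template

config_path = "triton_model_repo/postprocessing/config.pbtxt"  # example path

with open(config_path) as f:
    config = Template(f.read())

# safe_substitute leaves any other ${...} placeholders untouched.
filled = config.safe_substitute(postprocessing_instance_count="1")

with open(config_path, "w") as f:
    f.write(filled)
```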

all_models/inflight_batcher_llm/preprocessing/1/model.py

Lines changed: 9 additions & 1 deletion
@@ -58,6 +58,11 @@ def initialize(self, args):
             'string_value']
         tokenizer_type = model_config['parameters']['tokenizer_type'][
             'string_value']
+        self.add_special_tokens = model_config['parameters'].get(
+            'add_special_tokens',
+            {'string_value': "false"})['string_value'].lower() in [
+                'true', '1', 't', 'y', 'yes'
+            ]
 
         if tokenizer_type == 't5':
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,

@@ -207,7 +212,10 @@ def _create_request(self, query):
             query : batch string (2D numpy array)
         """
         start_ids = [
-            np.array(self.tokenizer.encode(s[0].decode())).astype(int)
+            np.array(
+                self.tokenizer.encode(
+                    s[0].decode(),
+                    add_special_tokens=self.add_special_tokens)).astype(int)
             for s in query
         ]
         start_lengths = np.array([[len(ids)] for ids in start_ids]).astype(int)
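Analogous to the postprocessing change, `add_special_tokens` is the standard Hugging Face encode flag; with a T5-style tokenizer it controls whether the EOS marker is appended to the prompt ids. A small standalone illustration (not part of the diff):

```python
# Illustrative only: how add_special_tokens changes tokenizer.encode output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # stand-in tokenizer

with_specials = tokenizer.encode("What is machine learning?", add_special_tokens=True)
without_specials = tokenizer.encode("What is machine learning?", add_special_tokens=False)

print(with_specials)     # ends with the T5 EOS id (1)
print(without_specials)  # same ids without the trailing EOS
```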

all_models/inflight_batcher_llm/preprocessing/config.pbtxt

Lines changed: 8 additions & 1 deletion
@@ -110,9 +110,16 @@ parameters {
   }
 }
 
+parameters {
+  key: "add_special_tokens"
+  value: {
+    string_value: "False"
+  }
+}
+
 instance_group [
   {
-    count: 1
+    count: ${preprocessing_instance_count}
     kind: KIND_CPU
  }
 ]
