There are five models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
directory that will be used in this example:
- "preprocessing": This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing
- "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm and postprocessing models together (preprocessing -> tensorrt_llm -> postprocessing).
- "tensorrt_llm_bls": This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together. The BLS model has an optional parameter `accumulate_tokens` which can be used in streaming mode to call the preprocessing model with all accumulated tokens, instead of only one token. This might be necessary for certain tokenizers.
To learn more about ensemble and BLS models, please see the
[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting) sections of the Triton Inference Server documentation.
```bash
# Create the model repository that will be used by the Triton server
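# (Illustrative sketch only: the remaining commands are not shown in this
# excerpt; the directory name `triton_model_repo` and the copy step below
# are assumptions, not commands from this document.)
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/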
```

You can query the server using Triton's generate endpoint with a curl command of the following general format from within your client environment/container:

```bash
curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
```
In the case of the models used in this example, you can replace MODEL_NAME with `ensemble` or `tensorrt_llm_bls`. Examining the
`ensemble` and `tensorrt_llm_bls` models' config.pbtxt files, you can see that 4 parameters are required to generate a response
for these models:
- "text_input": Input text to generate a response from
- "max_tokens": The maximum number of tokens to generate
- "bad_words": A list of words that must not appear in the output (can be empty)
- "stop_words": A list of words that stop generation when encountered (can be empty)

Therefore, we can query the server in the following way:
```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
if using the `ensemble` model, or
```bash
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
if using the `tensorrt_llm_bls` model.

Which should return a result similar to the following (formatted for readability):
```json
{
  "model_name": "ensemble",
  "model_version": "1",
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "..."
}
```
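Triton also exposes a `generate_stream` endpoint for receiving tokens as they are produced, which is the streaming mode referenced by the `accumulate_tokens` option above. A minimal sketch, assuming the model is served in decoupled mode and supports the optional `stream` input:

```bash
# Hedged example: requires the model to be deployed in decoupled (streaming) mode.
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate_stream -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "stream": true}'
```

The response in this mode arrives as a series of server-sent events, one JSON object per generated chunk.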