
Commit cad2233

Update TensorRT-LLM backend (triton-inference-server#301)
* Update TensorRT-LLM backend
1 parent a653f76

7 files changed (+13, -13 lines)


README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -220,7 +220,7 @@ The following table shows the fields that may to be modified before deployment:
 | `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to generate one token. Defaults attends to all tokens in sequence. |
 | `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.|
 | `max_num_sequences` | Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for.
-| `enable_trt_overlap` | Optional (default=`true`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
+| `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
 | `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
 | `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs` |
```
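Any of the fields in this table can be written into the deployed `config.pbtxt` with the `fill_template.py` tool shown later in this commit; a minimal sketch, where the 0.5 value is purely illustrative:

```bash
# Illustrative only: cap the KV cache at 50% of the GPU memory left after
# the model is loaded (the documented default is 0.9).
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt "kv_cache_free_gpu_mem_fraction:0.5"
```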

dockerfile/Dockerfile.trt_llm_backend

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,5 +1,5 @@
 ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver
-ARG BASE_TAG=23.10-py3
+ARG BASE_TAG=23.12-py3
 
 FROM ${BASE_IMAGE}:${BASE_TAG} as base
 
```
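Because `BASE_TAG` is declared as an `ARG`, it can also be overridden at build time instead of editing the Dockerfile; a sketch, where the `triton_trt_llm` image name is an arbitrary placeholder:

```bash
# --build-arg overrides the ARG default declared at the top of the Dockerfile;
# the -t tag name is just a placeholder.
docker build -f dockerfile/Dockerfile.trt_llm_backend \
    --build-arg BASE_TAG=23.12-py3 \
    -t triton_trt_llm .
```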

inflight_batcher_llm/README.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -88,25 +88,25 @@ parameters: {
 }
 ```
 
-By default, in-flight batching will try to overlap the execution of batches of
+In-flight batching is able to overlap the execution of batches of
 requests. It may have a negative impact on performance when the number of
-requests is too small. To disable that feature, set the `enable_trt_overlap`
-parameter to `False` in the `config.pbtxt` file:
+requests is too small. To enable that feature, set the `enable_trt_overlap`
+parameter to `True` in the `config.pbtxt` file:
 
 ```
 parameters: {
   key: "enable_trt_overlap"
   value: {
-    string_value: "False"
+    string_value: "True"
   }
 }
 ```
 
-Or, equivalently, add `enable_trt_overlap:False` to the invocation of the
+Or, equivalently, add `enable_trt_overlap:True` to the invocation of the
 `fill_template.py` tool:
 
 ```bash
-python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt "enable_trt_overlap:False"
+python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt "enable_trt_overlap:True"
 ```
 
 To reuse previously computed KV cache values (e.g. for system prompt), set `enable_kv_cache_reuse`
````
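By analogy with `enable_trt_overlap` above, the same tool can presumably set this flag too; a sketch, assuming `enable_kv_cache_reuse` accepts the same boolean string values:

```bash
# Assumes enable_kv_cache_reuse takes "True"/"False" like enable_trt_overlap.
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt "enable_kv_cache_reuse:True"
```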

inflight_batcher_llm/src/model_instance_state.cc

Lines changed: 2 additions & 2 deletions
```diff
@@ -174,15 +174,15 @@ ModelInstanceState::ModelInstanceState(ModelState* model_state, TRITONBACKEND_Mo
         TLLM_LOG_WARNING("max_num_sequences is not specified, will be set to the TRT engine max_batch_size");
     }
 
-    bool enableTrtOverlap = true;
+    bool enableTrtOverlap = false;
     try
     {
         enableTrtOverlap = model_state_->GetParameter<bool>("enable_trt_overlap");
     }
     catch (const std::exception& e)
     {
         // If parameter is not specified, just ignore
-        TLLM_LOG_WARNING("enable_trt_overlap is not specified, will be set to true");
+        TLLM_LOG_WARNING("enable_trt_overlap is not specified, will be set to false");
     }
 
     bool normalizeLogProbs = true;
```

requirements.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 regex
 fire
 tritonclient[all]
-transformers==4.31.0
+transformers==4.36.1
 pandas
 tabulate
```

tensorrt_llm

Submodule tensorrt_llm updated 255 files

tools/version.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-77a564a261cdb68c9091ac04c87d5c704da48da5
+ad7d4adac6bebead80be01388b94d1f57a50245a
```
