Commit ae52bce

Update TensorRT-LLM backend (triton-inference-server#454)
* Update TensorRT-LLM backend

Co-authored-by: XiaobingSuper <[email protected]>
1 parent e239adc commit ae52bce

27 files changed (+1439, -858 lines)

README.md

Lines changed: 5 additions & 5 deletions
@@ -70,10 +70,10 @@ The below commands will build the same Triton TRT-LLM container as the one on th
 # Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
 cd tensorrtllm_backend
 # Specify the build args for the dockerfile.
-BASE_IMAGE=nvcr.io/nvidia/pytorch:24.02-py3
-TRT_VERSION=9.3.0.1
-TRT_URL_x86=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.linux.x86_64-gnu.cuda-12.2.tar.gz
-TRT_URL_ARM=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
+BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3
+TRT_VERSION=10.0.1.6
+TRT_URL_x86=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
+TRT_URL_ARM=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.ubuntu-22.04.aarch64-gnu.cuda-12.4.tar.gz

 docker build -t trtllm_base \
     --build-arg BASE_IMAGE="${BASE_IMAGE}" \
@@ -297,9 +297,9 @@ The following table shows the fields that may need to be modified before deployment:
 | `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. KV cache allocation is the min of max_tokens_in_paged_kv_cache and the value derived from kv_cache_free_gpu_mem_fraction below. |
 | `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to in order to generate one token. By default, attends to all tokens in the sequence. |
 | `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache. |
-| `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
 | `exclude_input_in_output` | Optional (default=`false`). Set to `true` to return only completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
 | `cancellation_check_period_ms` | Optional (default=100). The time for the cancellation check thread to sleep before doing the next check. It checks whether any of the currently active requests have been cancelled through Triton and prevents their further execution. |
+| `stats_check_period_ms` | Optional (default=100). The time for the statistics reporting thread to sleep before doing the next check. |
 | `iter_stats_max_iterations` | Optional (default=executor::kDefaultIterStatsMaxIterations). The number of iteration stats to be kept. |
 | `request_stats_max_iterations` | Optional (default=executor::kDefaultRequestStatsMaxIterations). The number of request stats to be kept. |
 | `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs` |
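
To make the KV-cache sizing rule above concrete: the allocated token budget is the minimum of `max_tokens_in_paged_kv_cache` and the count derived from `kv_cache_free_gpu_mem_fraction`. A back-of-the-envelope sketch, with every number hypothetical (the real computation happens inside the TRT-LLM executor):

```python
# All values are hypothetical, for illustration of the min() rule only.
free_gpu_mem_bytes = 40 * 2**30        # say, 40 GiB free after model load
kv_cache_free_gpu_mem_fraction = 0.9   # table default

# Rough per-token KV footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
bytes_per_token = 2 * 32 * 8 * 128 * 2  # fp16, made-up model shape

tokens_from_fraction = int(
    free_gpu_mem_bytes * kv_cache_free_gpu_mem_fraction / bytes_per_token)

max_tokens_in_paged_kv_cache = 100_000  # hypothetical explicit cap

# Per the table: the KV cache allocation is the min of the two limits.
kv_cache_tokens = min(max_tokens_in_paged_kv_cache, tokens_from_fraction)
print(kv_cache_tokens)
```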

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 22 additions & 5 deletions
@@ -55,11 +55,28 @@ def initialize(self, args):
         model_config = json.loads(args['model_config'])
         tokenizer_dir = model_config['parameters']['tokenizer_dir'][
             'string_value']
-        self.skip_special_tokens = model_config['parameters'].get(
-            'skip_special_tokens',
-            {'string_value': "true"})['string_value'].lower() in [
-                'true', '1', 't', 'y', 'yes'
-            ]
+
+        skip_special_tokens = model_config['parameters'].get(
+            'skip_special_tokens')
+        if skip_special_tokens is not None:
+            skip_special_tokens_str = skip_special_tokens[
+                'string_value'].lower()
+            if skip_special_tokens_str in [
+                    'true', 'false', '1', '0', 't', 'f', 'y', 'n', 'yes', 'no'
+            ]:
+                self.skip_special_tokens = skip_special_tokens_str in [
+                    'true', '1', 't', 'y', 'yes'
+                ]
+            else:
+                print(
+                    f"[TensorRT-LLM][WARNING] 'skip_special_tokens' is not set correctly (set value is {skip_special_tokens['string_value']}). Defaulting to True."
+                )
+                self.skip_special_tokens = True
+        else:
+            print(
+                "[TensorRT-LLM][WARNING] 'skip_special_tokens' is not set. Defaulting to True."
+            )
+            self.skip_special_tokens = True

         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
                                                        legacy=False,
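
The same string-to-boolean parsing now appears in both the pre- and postprocessing models. A minimal sketch of how it could be factored into a shared helper; the helper name and `default` argument are illustrative, not part of this commit:

```python
def parse_bool_parameter(parameters, key, default=True):
    """Parse a Triton string parameter into a bool.

    `parameters` is the `model_config['parameters']` dict, whose values are
    dicts carrying a 'string_value' entry. Missing or unrecognized values
    fall back to `default` with a warning, mirroring the inline code above.
    """
    truthy = ['true', '1', 't', 'y', 'yes']
    falsy = ['false', '0', 'f', 'n', 'no']
    param = parameters.get(key)
    if param is None:
        print(f"[TensorRT-LLM][WARNING] '{key}' is not set. "
              f"Defaulting to {default}.")
        return default
    value = param['string_value'].lower()
    if value in truthy or value in falsy:
        return value in truthy
    print(f"[TensorRT-LLM][WARNING] '{key}' is not set correctly "
          f"(set value is {param['string_value']}). Defaulting to {default}.")
    return default
```

With such a helper, the block above would reduce to `self.skip_special_tokens = parse_bool_parameter(model_config['parameters'], 'skip_special_tokens')`.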

all_models/inflight_batcher_llm/postprocessing/config.pbtxt

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ parameters {
 parameters {
   key: "skip_special_tokens"
   value: {
-    string_value: "True"
+    string_value: "${skip_special_tokens}"
   }
 }
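
The hard-coded "True" becomes a `${skip_special_tokens}` placeholder to be filled in at deployment time (the repository's `tools/fill_template.py` serves this purpose). Since the placeholder happens to match Python's `string.Template` syntax, a minimal stand-in substitution looks like this; the file path is hypothetical:

```python
from pathlib import Path
from string import Template

# Hypothetical path; point this at your copy of the model repo.
pbtxt = Path("all_models/inflight_batcher_llm/postprocessing/config.pbtxt")

# safe_substitute() fills ${skip_special_tokens} and leaves any other
# ${...} placeholders in the template untouched.
filled = Template(pbtxt.read_text()).safe_substitute(
    skip_special_tokens="true")
pbtxt.write_text(filled)
```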

all_models/inflight_batcher_llm/preprocessing/1/model.py

Lines changed: 21 additions & 5 deletions
@@ -56,11 +56,27 @@ def initialize(self, args):
         model_config = json.loads(args['model_config'])
         tokenizer_dir = model_config['parameters']['tokenizer_dir'][
             'string_value']
-        self.add_special_tokens = model_config['parameters'].get(
-            'add_special_tokens',
-            {'string_value': "false"})['string_value'].lower() in [
-                'true', '1', 't', 'y', 'yes'
-            ]
+
+        add_special_tokens = model_config['parameters'].get(
+            'add_special_tokens')
+        if add_special_tokens is not None:
+            add_special_tokens_str = add_special_tokens['string_value'].lower()
+            if add_special_tokens_str in [
+                    'true', 'false', '1', '0', 't', 'f', 'y', 'n', 'yes', 'no'
+            ]:
+                self.add_special_tokens = add_special_tokens_str in [
+                    'true', '1', 't', 'y', 'yes'
+                ]
+            else:
+                print(
+                    f"[TensorRT-LLM][WARNING] 'add_special_tokens' is not set correctly (set value is {add_special_tokens['string_value']}). Defaulting to True."
+                )
+                self.add_special_tokens = True
+        else:
+            print(
+                "[TensorRT-LLM][WARNING] 'add_special_tokens' is not set. Defaulting to True."
+            )
+            self.add_special_tokens = True

         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
                                                        legacy=False,
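
For context on what these two flags control: `add_special_tokens` determines whether the tokenizer wraps the prompt in its special tokens when encoding, and `skip_special_tokens` (postprocessing) determines whether those tokens are dropped when decoding. A quick illustration with a Hugging Face tokenizer; the model name is illustrative, and any tokenizer with special tokens behaves similarly:

```python
from transformers import AutoTokenizer

# Illustrative tokenizer; substitute your own tokenizer_dir.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# add_special_tokens=True wraps the ids in [CLS] ... [SEP] for BERT.
with_special = tokenizer.encode("hello world", add_special_tokens=True)
without_special = tokenizer.encode("hello world", add_special_tokens=False)
print(with_special)     # e.g. [101, 7592, 2088, 102]
print(without_special)  # e.g. [7592, 2088]

# skip_special_tokens=True strips [CLS]/[SEP] back out when decoding.
print(tokenizer.decode(with_special, skip_special_tokens=True))  # "hello world"
```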
