Commit e432c6a

Update TensorRT-LLM backend (triton-inference-server#312)
* Update TensorRT-LLM backend
1 parent: cad2233 · commit: e432c6a

File tree

18 files changed: 307 additions, 267 deletions


.github/ISSUE_TEMPLATE/bug_report.yml

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+name: "Bug Report"
+description: Submit a bug report to help us improve TensorRT-LLM backend
+labels: [ "bug" ]
+body:
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info
+      description: Please share your system info with us.
+      placeholder: |
+        - CPU architecture (e.g., x86_64, aarch64)
+        - CPU/Host memory size (if known)
+        - GPU properties
+          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
+          - GPU memory size (if known)
+          - Clock frequencies used (if applicable)
+        - Libraries
+          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
+          - TensorRT-LLM commit (if known)
+          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
+          - Container used (if running TensorRT-LLM in a container)
+          - NVIDIA driver version
+          - OS (Ubuntu 22.04, CentOS 7, Windows 10)
+        - Docker image version
+        - Any other information that may be useful in reproducing the bug
+    validations:
+      required: true
+
+  - type: textarea
+    id: who-can-help
+    attributes:
+      label: Who can help?
+      description: |
+        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
+        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
+
+        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
+        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
+
+        Please tag fewer than 3 people.
+
+        Quantization: @Tracin
+
+        Documentation: @juney-nvidia
+
+        Feature request: @ncomly-nvidia
+
+        Performance: @kaiyux
+
+        Others: @byshiue @schetlur-nv
+
+      placeholder: "@Username ..."
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information
+      description: 'The problem arises when using:'
+      options:
+        - label: "The official example scripts"
+        - label: "My own modified scripts"
+
+  - type: checkboxes
+    id: information-tasks
+    attributes:
+      label: Tasks
+      description: "The tasks I am working on are:"
+      options:
+        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
+        - label: "My own task or dataset (give details below)"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction
+      description: |
+        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
+        Additionally, please include any error messages or stack traces related to the problem.
+
+        Remember to use code tags to properly format your code. You can refer to
+        https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
+
+        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
+        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.
+
+      placeholder: |
+        Steps to reproduce the behavior:
+
+        1.
+        2.
+        3.
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior
+      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."
+
+  - type: textarea
+    id: actual-behavior
+    validations:
+      required: true
+    attributes:
+      label: Actual behavior
+      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."
+
+  - type: textarea
+    id: additioanl-notes
+    validations:
+      required: true
+    attributes:
+      label: Additional notes
+      description: "Provide any additional context you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."

README.md

Lines changed: 0 additions & 1 deletion
@@ -219,7 +219,6 @@ The following table shows the fields that may need to be modified before deployment:
 | `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below. |
 | `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to in order to generate one token. The default attends to all tokens in the sequence. |
 | `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache. |
-| `max_num_sequences` | Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for. |
 | `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime. |
 | `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens. |
 | `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs`. |
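The `${...}` values referenced in this table are placeholders in the `config.pbtxt` template that must be substituted before the model repository is deployed. The sketch below shows one minimal way to do that substitution from Python; `fill_placeholders` is a hypothetical helper written for illustration (not the repo's own tooling), and the parameter values are examples only:

import re
from pathlib import Path

def fill_placeholders(pbtxt_text: str, values: dict) -> str:
    """Substitute ${name} placeholders in a config.pbtxt template.

    Unknown placeholders are left intact so they remain easy to spot.
    """
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: str(values.get(m.group(1), m.group(0))),
        pbtxt_text,
    )

# Example values for fields from the table above (illustrative defaults).
values = {
    "kv_cache_free_gpu_mem_fraction": "0.9",
    "enable_trt_overlap": "false",
    "exclude_input_in_output": "false",
    "normalize_log_probs": "true",
}

cfg = Path("all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt")
cfg.write_text(fill_placeholders(cfg.read_text(), values))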

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt

Lines changed: 0 additions & 6 deletions
@@ -332,12 +332,6 @@ parameters: {
     string_value: "${kv_cache_free_gpu_mem_fraction}"
   }
 }
-parameters: {
-  key: "max_num_sequences"
-  value: {
-    string_value: "${max_num_sequences}"
-  }
-}
 parameters: {
   key: "enable_trt_overlap"
   value: {
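Because the commit removes the `max_num_sequences` parameter block from the template, model repositories generated from an older copy of this file may still carry the stale block. The following is a small illustrative scan (my addition, not part of the commit) for leftover references:

import sys
from pathlib import Path

def find_stale_configs(model_repo: str, param: str = "max_num_sequences"):
    """Yield config.pbtxt files that still mention a removed parameter."""
    for cfg in Path(model_repo).rglob("config.pbtxt"):
        if param in cfg.read_text():
            yield cfg

if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "all_models"
    for cfg in find_stale_configs(repo):
        print(f"stale max_num_sequences reference: {cfg}")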

ci/L0_backend_trtllm/custom_metrics_verification_tests.py

Lines changed: 2 additions & 0 deletions
@@ -46,6 +46,8 @@
     "inflight_batcher_specific_metric=micro_batch_id": "MicroBatch ID",
     "inflight_batcher_specific_metric=generation_requests":
         "Generation Requests",
+    "inflight_batcher_specific_metric=terminated_requests":
+        "Terminated Requests",
     "v1_specific_metric=total_context_tokens": "Total Context Tokens",
     "v1_specific_metric=total_generation_tokens": "Total Generation Tokens",
     "v1_specific_metric=empty_generation_slots": "Empty Generation Slots",

0 commit comments
