Update TensorRT-LLM backend (triton-inference-server#312)
* Update TensorRT-LLM backend
kaiyux authored Jan 23, 2024
1 parent cad2233 commit e432c6a
Showing 18 changed files with 307 additions and 267 deletions.
117 changes: 117 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -0,0 +1,117 @@
name: "Bug Report"
description: Submit a bug report to help us improve TensorRT-LLM backend
labels: [ "bug" ]
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your system info with us.
      placeholder: |
        - CPU architecture (e.g., x86_64, aarch64)
        - CPU/Host memory size (if known)
        - GPU properties
          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
          - GPU memory size (if known)
          - Clock frequencies used (if applicable)
        - Libraries
          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
          - TensorRT-LLM commit (if known)
          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
          - Container used (if running TensorRT-LLM in a container)
        - NVIDIA driver version
        - OS (Ubuntu 22.04, CentOS 7, Windows 10)
        - Docker image version
        - Any other information that may be useful in reproducing the bug
    validations:
      required: true

  - type: textarea
    id: who-can-help
    attributes:
      label: Who can help?
      description: |
        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
        Please tag fewer than 3 people.
        Quantization: @Tracin
        Documentation: @juney-nvidia
        Feature request: @ncomly-nvidia
        Performance: @kaiyux
        Others: @byshiue @schetlur-nv
      placeholder: "@Username ..."

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "The official example scripts"
        - label: "My own modified scripts"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The tasks I am working on are:"
      options:
        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
        - label: "My own task or dataset (give details below)"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
        Additionally, if you have any error messages or stack traces related to the problem, please include them here.
        Remember to use code tags to properly format your code. You can refer to the
        link https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.
      placeholder: |
        Steps to reproduce the behavior:
        1.
        2.
        3.

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."

  - type: textarea
    id: actual-behavior
    validations:
      required: true
    attributes:
      label: Actual behavior
      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."

  - type: textarea
    id: additional-notes
    validations:
      required: true
    attributes:
      label: Additional notes
      description: "Provide any additional context here you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
1 change: 0 additions & 1 deletion README.md
@@ -219,7 +219,6 @@ The following table shows the fields that may need to be modified before deployment:
| `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below. |
| `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to when generating one token. By default, all tokens in the sequence are attended to. |
| `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache. |
-| `max_num_sequences` | Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for.
| `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
| `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
| `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs` |
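
Note on the `max_tokens_in_paged_kv_cache` row in the hunk above: the table states that the KV cache allocation is the minimum of that value and a value derived from `kv_cache_free_gpu_mem_fraction`. A minimal Python sketch of that rule is given here. The function name, the `free_gpu_mem_bytes` input, and the `bytes_per_token` cost are hypothetical and chosen only for illustration; the backend's actual accounting is not part of this diff and may differ.

```python
from typing import Optional


def kv_cache_token_budget(
    free_gpu_mem_bytes: int,
    bytes_per_token: int,
    kv_cache_free_gpu_mem_fraction: float = 0.9,
    max_tokens_in_paged_kv_cache: Optional[int] = None,
) -> int:
    """Return an illustrative KV cache size in tokens.

    free_gpu_mem_bytes and bytes_per_token are assumed inputs (hypothetical):
    the free GPU memory after loading the model, and the KV cache cost of
    one token.
    """
    if not 0.0 <= kv_cache_free_gpu_mem_fraction <= 1.0:
        raise ValueError("kv_cache_free_gpu_mem_fraction must be between 0 and 1")
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")

    # Token count that fits in the allowed fraction of free GPU memory.
    usable_bytes = int(free_gpu_mem_bytes * kv_cache_free_gpu_mem_fraction)
    tokens_from_fraction = usable_bytes // bytes_per_token

    # An unspecified maximum is treated as 'infinite', so only the
    # memory-derived value applies.
    if max_tokens_in_paged_kv_cache is None:
        return tokens_from_fraction

    # Otherwise the smaller of the two limits wins.
    return min(max_tokens_in_paged_kv_cache, tokens_from_fraction)
```

Under these assumptions, an explicit `max_tokens_in_paged_kv_cache` only takes effect when it is smaller than the memory-derived limit.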
6 changes: 0 additions & 6 deletions all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
@@ -332,12 +332,6 @@ parameters: {
string_value: "${kv_cache_free_gpu_mem_fraction}"
}
}
-parameters: {
-key: "max_num_sequences"
-value: {
-string_value: "${max_num_sequences}"
-}
-}
parameters: {
key: "enable_trt_overlap"
value: {
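
Note on the `${...}` values in the hunk above: they are placeholders that have to be replaced with concrete settings before deployment. One way to perform such a substitution is Python's standard `string.Template`. The sketch below assumes that approach and uses a hypothetical helper name, `fill_placeholders`; it is not taken from this repository's tooling.

```python
from string import Template
from typing import Mapping


def fill_placeholders(template_text: str, values: Mapping[str, str]) -> str:
    """Replace ${name} placeholders with the given values.

    safe_substitute leaves any placeholder without a supplied value
    untouched instead of raising, so partially filled files stay readable.
    """
    return Template(template_text).safe_substitute(values)


# Fragment modeled on the config.pbtxt line shown in the hunk above.
fragment = 'string_value: "${kv_cache_free_gpu_mem_fraction}"'
filled = fill_placeholders(fragment, {"kv_cache_free_gpu_mem_fraction": "0.9"})
print(filled)  # string_value: "0.9"
```

Using `safe_substitute` rather than `substitute` is a design choice for this sketch: optional parameters can be left unfilled without an error, at the cost of not detecting misspelled keys.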
2 changes: 2 additions & 0 deletions ci/L0_backend_trtllm/custom_metrics_verification_tests.py
@@ -46,6 +46,8 @@
"inflight_batcher_specific_metric=micro_batch_id": "MicroBatch ID",
"inflight_batcher_specific_metric=generation_requests":
"Generation Requests",
+"inflight_batcher_specific_metric=terminated_requests":
+"Terminated Requests",
"v1_specific_metric=total_context_tokens": "Total Context Tokens",
"v1_specific_metric=total_generation_tokens": "Total Generation Tokens",
"v1_specific_metric=empty_generation_slots": "Empty Generation Slots",