TheCodeWrangler
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
Lines changed: 117 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 0 additions & 1 deletion
diff --git a/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt b/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
Lines changed: 0 additions & 6 deletions
diff --git a/ci/L0_backend_trtllm/custom_metrics_verification_tests.py b/ci/L0_backend_trtllm/custom_metrics_verification_tests.py
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,117 @@
+name: "Bug Report"
+description: Submit a bug report to help us improve TensorRT-LLM backend
+labels: [ "bug" ]
+body:
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info
+      description: Please share your system info with us.
+      placeholder: |
+        - CPU architecture (e.g., x86_64, aarch64)
+        - CPU/Host memory size (if known)
+        - GPU properties
+          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
+          - GPU memory size (if known)
+          - Clock frequencies used (if applicable)
+        - Libraries
+          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
+          - TensorRT-LLM commit (if known)
+          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
+          - Container used (if running TensorRT-LLM in a container)
+        - NVIDIA driver version
+        - OS (Ubuntu 22.04, CentOS 7, Windows 10)
+        - Docker image version
+        - Any other information that may be useful in reproducing the bug
+    validations:
+      required: true
+
+  - type: textarea
+    id: who-can-help
+    attributes:
+      label: Who can help?
+      description: |
+        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
+        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
+
+        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
+        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
+
+        Please tag fewer than 3 people.
+
+        Quantization: @Tracin
+
+        Documentation: @juney-nvidia
+
+        Feature request: @ncomly-nvidia
+
+        Performance: @kaiyux
+
+        Others: @byshiue @schetlur-nv
+
+      placeholder: "@Username ..."
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information
+      description: 'The problem arises when using:'
+      options:
+        - label: "The official example scripts"
+        - label: "My own modified scripts"
+
+  - type: checkboxes
+    id: information-tasks
+    attributes:
+      label: Tasks
+      description: "The tasks I am working on are:"
+      options:
+        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
+        - label: "My own task or dataset (give details below)"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction
+      description: |
+        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
+        Additionally, if you have any error messages, or stack traces related to the problem, please include them here.
+
+        Remember to use code tags to properly format your code. You can refer to the
+        link https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
+
+        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
+        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.
+
+      placeholder: |
+        Steps to reproduce the behavior:
+
+          1.
+          2.
+          3.
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior
+      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."
+
+  - type: textarea
+    id: actual-behavior
+    validations:
+      required: true
+    attributes:
+      label: Actual behavior
+      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."
+
+  - type: textarea
+    id: additional-notes
+    validations:
+      required: true
+    attributes:
+      label: Additional notes
+      description: "Provide any additional context here you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
@@ -219,7 +219,6 @@ The following table shows the fields that may to be modified before deployment:
 | `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, value is interpreted as 'infinite'. KV cache allocation is the min of max_tokens_in_paged_kv_cache and value derived from kv_cache_free_gpu_mem_fraction below. |
 | `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to generate one token. Defaults attends to all tokens in sequence. |
 | `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.|
-| `max_num_sequences` | Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for.
 | `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
 | `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens  |
 | `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs`  |
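
Note on the table above: the `max_tokens_in_paged_kv_cache` row says KV cache allocation is the minimum of that value and a value derived from `kv_cache_free_gpu_mem_fraction`. The following is a minimal sketch of that arithmetic, not the backend's actual implementation. The function name, the `free_gpu_mem_bytes` and `kv_bytes_per_token` inputs, and the range check are assumptions made for illustration only.

```python
from typing import Optional


def kv_cache_token_budget(
    free_gpu_mem_bytes: int,
    kv_bytes_per_token: int,
    kv_cache_free_gpu_mem_fraction: float = 0.9,
    max_tokens_in_paged_kv_cache: Optional[int] = None,
) -> int:
    """Illustrative KV cache budget in tokens (hypothetical helper).

    free_gpu_mem_bytes: GPU memory left after loading the model (assumed input).
    kv_bytes_per_token: KV cache bytes needed per token (assumed input).
    """
    if not 0.0 < kv_cache_free_gpu_mem_fraction <= 1.0:
        raise ValueError("kv_cache_free_gpu_mem_fraction must be in (0, 1]")
    if kv_bytes_per_token <= 0:
        raise ValueError("kv_bytes_per_token must be positive")

    # Tokens that fit into the allowed fraction of free GPU memory.
    tokens_from_fraction = (
        int(free_gpu_mem_bytes * kv_cache_free_gpu_mem_fraction) // kv_bytes_per_token
    )

    # An unspecified token cap is treated as 'infinite', so only the
    # memory-derived limit applies.
    if max_tokens_in_paged_kv_cache is None:
        return tokens_from_fraction

    # Otherwise the smaller of the two limits wins.
    return min(max_tokens_in_paged_kv_cache, tokens_from_fraction)
```

With both fields set, whichever limit is smaller determines the cache size, so raising only one of them may have no effect.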
 
@@ -332,12 +332,6 @@ parameters: {
     string_value: "${kv_cache_free_gpu_mem_fraction}"
   }
 }
-parameters: {
-  key: "max_num_sequences"
-  value: {
-    string_value: "${max_num_sequences}"
-  }
-}
 parameters: {
   key: "enable_trt_overlap"
   value: {
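
Note on the hunk above: the `${...}` tokens in `config.pbtxt` are placeholders that are filled in before deployment. The following sketch shows one way such placeholders could be substituted using Python's standard `string.Template`. It assumes plain string substitution; the `fill_placeholders` helper, the fragment text, and the `${enable_trt_overlap}` placeholder (not visible in the hunk) are illustrative assumptions. After this change there is no `max_num_sequences` placeholder left to fill.

```python
from string import Template

# Illustrative fragment modeled on the hunk above; the key names and the
# enable_trt_overlap placeholder are assumptions for this sketch.
CONFIG_FRAGMENT = """\
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "${kv_cache_free_gpu_mem_fraction}"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "${enable_trt_overlap}"
  }
}
"""


def fill_placeholders(text: str, values: dict) -> str:
    """Replace ${name} placeholders; unknown names are left untouched."""
    return Template(text).safe_substitute(values)


filled = fill_placeholders(
    CONFIG_FRAGMENT,
    {"kv_cache_free_gpu_mem_fraction": "0.9", "enable_trt_overlap": "false"},
)
print(filled)
```

`safe_substitute` is used so that a missing value leaves the placeholder visible in the output instead of raising an error.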
 
@@ -46,6 +46,8 @@
     "inflight_batcher_specific_metric=micro_batch_id": "MicroBatch ID",
     "inflight_batcher_specific_metric=generation_requests":
     "Generation Requests",
+    "inflight_batcher_specific_metric=terminated_requests":
+    "Terminated Requests",
     "v1_specific_metric=total_context_tokens": "Total Context Tokens",
     "v1_specific_metric=total_generation_tokens": "Total Generation Tokens",
     "v1_specific_metric=empty_generation_slots": "Empty Generation Slots",