
Commit e8ae70c

Update TensorRT-LLM backend (triton-inference-server#161)
* Update TensorRT-LLM backend
1 parent 37ed967 commit e8ae70c

30 files changed: +1278 -248 lines changed

README.md

Lines changed: 77 additions & 0 deletions
@@ -377,6 +377,83 @@ You might have to contact your cluster's administrator to help you customize the
pkill tritonserver
```

## Triton Metrics
Starting with the 23.11 release of Triton, users can obtain TRT LLM Batch Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/ffd5af342a817a2689d38e4af2cc59ded877e339/docs/source/batch_manager.md#statistics) by querying the Triton metrics endpoint. To do so, launch a Triton server in any of the ways described above (ensuring the build code / container is 23.11 or later) and send a request to the server's generate endpoint. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
```bash
curl localhost:8002/metrics
```
Batch manager statistics are reported by the metrics endpoint in fields that are prefixed with `nv_trt_llm_`. Your output for these fields should look similar to the following (assuming your model is an inflight batcher model):
```bash
# HELP nv_trt_llm_request_statistics TRT LLM request metrics
# TYPE nv_trt_llm_request_statistics gauge
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="context",version="1"} 1
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="max",version="1"} 512
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="active",version="1"} 0
# HELP nv_trt_llm_runtime_memory_statistics TRT LLM runtime memory metrics
# TYPE nv_trt_llm_runtime_memory_statistics gauge
nv_trt_llm_runtime_memory_statistics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
nv_trt_llm_runtime_memory_statistics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
nv_trt_llm_runtime_memory_statistics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
# HELP nv_trt_llm_kv_cache_block_statistics TRT LLM KV cache block metrics
# TYPE nv_trt_llm_kv_cache_block_statistics gauge
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
# HELP nv_trt_llm_inflight_batcher_statistics TRT LLM inflight_batcher-specific metrics
# TYPE nv_trt_llm_inflight_batcher_statistics gauge
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
# HELP nv_trt_llm_general_statistics General TRT LLM statistics
# TYPE nv_trt_llm_general_statistics gauge
nv_trt_llm_general_statistics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
nv_trt_llm_general_statistics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
```
If, instead, you launched a V1 model, your output will look similar to the output above, except that the inflight batcher related fields will be replaced with something similar to the following:
```bash
# HELP nv_trt_llm_v1_statistics TRT LLM v1-specific metrics
# TYPE nv_trt_llm_v1_statistics gauge
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
```
Please note that, as of the 23.11 Triton release, a link between the base Triton metrics (such as inference request count and latency) and the TRT LLM statistics is being actively developed but is not yet supported. As such, the following fields will report 0:
```bash
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="tensorrt_llm",version="1"} 0
```
## Testing the TensorRT-LLM Backend
Please follow the guide in [`ci/README.md`](ci/README.md) to see how to run
the testing for TensorRT-LLM backend.
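
As a quick complement to the metrics documentation added in the README above, the `nv_trt_llm_` fields can also be scraped programmatically. The following is a minimal sketch using only the Python standard library; it assumes the default metrics port 8002 shown in the curl example, so adjust the URL for your deployment:

```python
# Sketch: scrape the Triton metrics endpoint and print the TRT LLM
# batch manager statistics. Assumes Triton serves metrics on
# localhost:8002 as in the curl example above.
from urllib.request import urlopen

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics port


def trt_llm_metrics(url: str = METRICS_URL) -> dict:
    """Return a {metric_with_labels: value} dict for all nv_trt_llm_* gauges."""
    text = urlopen(url).read().decode("utf-8")
    metrics = {}
    for line in text.splitlines():
        # Skip HELP/TYPE comments and metrics that are not TRT LLM specific.
        if line.startswith("#") or not line.startswith("nv_trt_llm_"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


if __name__ == "__main__":
    for name, value in trt_llm_metrics().items():
        print(f"{name} = {value}")
```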

all_models/gpt/ensemble/config.pbtxt

Lines changed: 6 additions & 6 deletions
@@ -9,7 +9,7 @@ input [
   },
   {
     name: "max_tokens"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ -1 ]
   },
   {
@@ -24,19 +24,19 @@ input [
   },
   {
     name: "end_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
   {
     name: "pad_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
   {
     name: "top_k"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -66,7 +66,7 @@ input [
   },
   {
     name: "min_length"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -84,7 +84,7 @@ input [
   },
   {
     name: "beam_width"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
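
Since these ensemble inputs are now `TYPE_INT32`, client-side tensors need a signed 32-bit dtype. The sketch below shows the idea with `tritonclient`; the model name `ensemble`, the `text_input`/`text_output` tensor names, the HTTP port 8000, and sending only these two inputs are assumptions for illustration, so adapt them to your actual config:

```python
# Sketch: send ensemble inputs with the new INT32 dtypes.
# Assumes tritonclient[http] is installed and the ensemble is named "ensemble".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# TYPE_STRING tensors are sent as BYTES / object arrays.
text = np.array([["What is machine learning?"]], dtype=object)
# Integer controls must now be int32 rather than uint32.
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```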

all_models/gpt/preprocessing/config.pbtxt

Lines changed: 3 additions & 3 deletions
@@ -19,7 +19,7 @@ input [
   },
   {
     name: "REQUEST_OUTPUT_LEN"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ -1 ]
   }
 ]
@@ -46,12 +46,12 @@ output [
   },
   {
     name: "REQUEST_OUTPUT_LEN"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ -1 ]
   },
   {
     name: "PROMPT_LEARNING_TASK_NAME_IDS"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
   }
 ]

all_models/gpt/tensorrt_llm/config.pbtxt

Lines changed: 6 additions & 6 deletions
@@ -21,24 +21,24 @@ input [
   },
   {
     name: "request_output_len"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ -1 ]
   },
   {
     name: "end_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     reshape: { shape: [ ] }
   },
   {
     name: "pad_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     reshape: { shape: [ ] }
   },
   {
     name: "beam_width"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     reshape: { shape: [ ] }
     optional: true
@@ -52,7 +52,7 @@ input [
   },
   {
     name: "runtime_top_k"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     reshape: { shape: [ ] }
     optional: true
@@ -80,7 +80,7 @@ input [
   },
   {
     name: "min_length"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     reshape: { shape: [ ] }
     optional: true

all_models/inflight_batcher_llm/ensemble/config.pbtxt

Lines changed: 51 additions & 7 deletions
@@ -35,7 +35,7 @@ input [
   },
   {
     name: "max_tokens"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ -1 ]
   },
   {
@@ -52,19 +52,19 @@ input [
   },
   {
     name: "end_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
   {
     name: "pad_id"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
   {
     name: "top_k"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -94,7 +94,7 @@ input [
   },
   {
     name: "min_length"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -110,9 +110,15 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "return_log_probs"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+    optional: true
+  },
   {
     name: "beam_width"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -130,7 +136,7 @@ input [
   },
   {
     name: "prompt_vocab_size"
-    data_type: TYPE_UINT32
+    data_type: TYPE_INT32
     dims: [ 1 ]
     optional: true
   },
@@ -152,6 +158,16 @@ output [
     name: "text_output"
     data_type: TYPE_STRING
     dims: [ -1 ]
+  },
+  {
+    name: "cum_log_probs"
+    data_type: TYPE_FP32
+    dims: [ -1 ]
+  },
+  {
+    name: "output_log_probs"
+    data_type: TYPE_FP32
+    dims: [ -1, -1 ]
   }
 ]
 ensemble_scheduling {
@@ -267,6 +283,10 @@ ensemble_scheduling {
         key: "random_seed"
         value: "random_seed"
       }
+      input_map {
+        key: "return_log_probs"
+        value: "return_log_probs"
+      }
       input_map {
         key: "beam_width"
         value: "beam_width"
@@ -298,6 +318,14 @@ ensemble_scheduling {
       output_map {
         key: "sequence_length"
         value: "_SEQUENCE_LENGTH"
+      },
+      output_map {
+        key: "cum_log_probs"
+        value: "_CUM_LOG_PROBS"
+      }
+      output_map {
+        key: "output_log_probs"
+        value: "_OUTPUT_LOG_PROBS"
       }
     },
     {
@@ -307,6 +335,14 @@ ensemble_scheduling {
         key: "TOKENS_BATCH"
         value: "_TOKENS_BATCH"
       }
+      input_map {
+        key: "CUM_LOG_PROBS"
+        value: "_CUM_LOG_PROBS"
+      }
+      input_map {
+        key: "OUTPUT_LOG_PROBS"
+        value: "_OUTPUT_LOG_PROBS"
+      }
       input_map {
         key: "SEQUENCE_LENGTH"
         value: "_SEQUENCE_LENGTH"
@@ -315,6 +351,14 @@ ensemble_scheduling {
         key: "OUTPUT"
         value: "text_output"
       }
+      output_map {
+        key: "OUT_OUTPUT_LOG_PROBS"
+        value: "output_log_probs"
+      }
+      output_map {
+        key: "OUT_CUM_LOG_PROBS"
+        value: "cum_log_probs"
+      }
     }
   ]
 }
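
With `return_log_probs`, `cum_log_probs`, and `output_log_probs` wired through the ensemble above, log probabilities can be requested end to end. A minimal sketch via the HTTP generate endpoint follows; the model name `ensemble`, the default port 8000, and the prompt/token count are illustrative assumptions:

```python
# Sketch: request log probabilities through the generate endpoint.
# Assumes the ensemble above is served as model "ensemble" on port 8000.
import json
from urllib.request import Request, urlopen

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,           # now TYPE_INT32
    "return_log_probs": True,   # new TYPE_BOOL input added above
}
req = Request(
    "http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())

# The new ensemble outputs arrive alongside text_output.
print(result["text_output"])
print(result.get("cum_log_probs"))
print(result.get("output_log_probs"))
```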

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 17 additions & 2 deletions
@@ -113,6 +113,14 @@ def execute(self, requests):
             sequence_lengths = pb_utils.get_input_tensor_by_name(
                 request, 'SEQUENCE_LENGTH').as_numpy()
 
+            # Get cum log probs
+            cum_log_probs = pb_utils.get_input_tensor_by_name(
+                request, 'CUM_LOG_PROBS').as_numpy()
+
+            # Get output log probs
+            output_log_probs = pb_utils.get_input_tensor_by_name(
+                request, 'OUTPUT_LOG_PROBS').as_numpy()
+
             # Reshape Input
             # tokens_batch = tokens_batch.reshape([-1, tokens_batch.shape[0]])
             # tokens_batch = tokens_batch.T
@@ -126,15 +134,22 @@ def execute(self, requests):
                 'OUTPUT',
                 np.array(outputs).astype(self.output_dtype))
 
+            out_cum_log_probs = pb_utils.Tensor('OUT_CUM_LOG_PROBS',
+                                                cum_log_probs)
+
+            out_output_log_probs = pb_utils.Tensor('OUT_OUTPUT_LOG_PROBS',
+                                                   output_log_probs)
+
             # Create InferenceResponse. You can set an error here in case
             # there was a problem with handling this inference request.
             # Below is an example of how you can set errors in inference
             # response:
             #
             # pb_utils.InferenceResponse(
             #     output_tensors=..., TritonError("An error occurred"))
-            inference_response = pb_utils.InferenceResponse(
-                output_tensors=[output_tensor])
+            inference_response = pb_utils.InferenceResponse(output_tensors=[
+                output_tensor, out_cum_log_probs, out_output_log_probs
+            ])
             responses.append(inference_response)
 
             # You should return a list of pb_utils.InferenceResponse. Length
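
Once the log-prob tensors reach the client (for example via the generate request sketched earlier), they can be turned into scores with plain numpy. The snippet below is illustrative only; the values and shapes assume one cumulative log probability per beam and one log probability per generated token:

```python
# Sketch: derive simple scores from the returned log probabilities.
# The example values and shapes are illustrative, not taken from a real run.
import numpy as np

cum_log_probs = np.array([-12.7])                  # assumed: one value per beam
output_log_probs = np.array([[-1.9, -0.4, -2.1]])  # assumed: per-token log probs

token_probs = np.exp(output_log_probs)                   # per-token probabilities
normalized = cum_log_probs / output_log_probs.shape[-1]  # length-normalized log prob

print("token probabilities:", token_probs)
print("length-normalized log prob per beam:", normalized)
```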
