
Commit 3a61c37

Update TensorRT-LLM backend (triton-inference-server#223)
* Update TensorRT-LLM backend
1 parent 1309995 commit 3a61c37

32 files changed: 802 additions & 386 deletions

README.md

Lines changed: 67 additions & 45 deletions
@@ -164,14 +164,24 @@ python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
 
 There are five models in the [`all_models/inflight_batcher_llm`](./all_models/inflight_batcher_llm/)
 directory that will be used in this example:
-- "preprocessing": This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
-- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing
-- "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
-- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm and postprocessing models together.
-- "tensorrt_llm_bls": This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together. The BLS model has an optional parameter `accumulate_tokens` which can be used in streaming mode to call the preprocessing model with all accumulated tokens, instead of only one token. This might be necessary for certain tokenizers.
+- "preprocessing": This model is used for tokenizing, meaning the conversion from
+  prompts(string) to input_ids(list of ints).
+- "tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used
+  for inferencing
+- "postprocessing": This model is used for de-tokenizing, meaning the conversion
+  from output_ids(list of ints) to outputs(string).
+- "ensemble": This model can be used to chain the preprocessing, tensorrt_llm
+  and postprocessing models together.
+- "tensorrt_llm_bls": This model can also be used to chain the preprocessing,
+  tensorrt_llm and postprocessing models together. The BLS model has an optional
+  parameter `accumulate_tokens` which can be used in streaming mode to call the
+  preprocessing model with all accumulated tokens, instead of only one token.
+  This might be necessary for certain tokenizers.
 
 To learn more about ensemble and BLS models, please see the
-[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting) sections of the Triton Inference Server documentation.
+[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models)
+and [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
+sections of the Triton Inference Server documentation.
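
For readers new to the ensemble flow described above, here is a minimal, hedged sketch of an end-to-end request: the client sends plain text to the `ensemble` model and receives plain text back, with the preprocessing, tensorrt_llm and postprocessing models invoked server-side. The field names (`text_input`, `max_tokens`, `bad_words`, `stop_words`, `text_output`) are assumptions based on this repository's ensemble configuration and may differ in your model repository.

```python
# Hedged sketch: one end-to-end request through the "ensemble" model, so that
# tokenization and de-tokenization happen server-side. Field names are
# assumptions and may differ in your model repository.
import json
import urllib.request

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["text_output"])
```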
 
 ```bash
 # Create the model repository that will be used by the Triton server
@@ -333,14 +343,16 @@ He was a member of the French Academy of Sciences and the French Academy of Arts
 Soyer was a member of the French Academy of Sciences and
 ```
 
-You can also stop the generation process early by using the `--stop-after-ms` option to send a stop request after a few milliseconds:
+You can also stop the generation process early by using the `--stop-after-ms`
+option to send a stop request after a few milliseconds:
 
 ```bash
 python inflight_batcher_llm/client/inflight_batcher_llm_client.py --stop-after-ms 200 --request-output-len 200 --tokenizer-dir /workspace/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2
 ```
 
-You will find that the generation process is stopped early and therefore the number of generated tokens is lower than 200.
-You can have a look at the client code to see how early stopping is achieved.
+You will find that the generation process is stopped early and therefore the
+number of generated tokens is lower than 200. You can have a look at the
+client code to see how early stopping is achieved.
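
As a rough sketch of the mechanism (not the client's exact code): early stopping is requested by sending a follow-up request that carries the same request id as the in-flight one, with a boolean `stop` input set to true. The tensor names and dtypes below are assumptions about the `tensorrt_llm` model's inputs; consult `inflight_batcher_llm_client.py` for the authoritative version.

```python
# Hedged sketch: cancel an in-flight generation by sending a second request
# with the SAME request_id and a "stop" input set to True. Tensor names and
# dtypes are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

def build_stop_inputs():
    inputs = []
    for name in ("input_ids", "input_lengths", "request_output_len"):
        t = grpcclient.InferInput(name, [1, 1], "INT32")
        t.set_data_from_numpy(np.zeros((1, 1), dtype=np.int32))
        inputs.append(t)
    stop = grpcclient.InferInput("stop", [1, 1], "BOOL")
    stop.set_data_from_numpy(np.array([[True]], dtype=bool))
    inputs.append(stop)
    return inputs

# Usage (assuming an open streaming connection and the original request's id):
# client.async_stream_infer(model_name="tensorrt_llm",
#                           inputs=build_stop_inputs(),
#                           request_id=request_id)
```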
 
 ### Launch Triton server *within Slurm based clusters*
 
@@ -391,49 +403,59 @@ pkill tritonserver
 ```
 
 ## Triton Metrics
-Starting with the 23.11 release of Triton, users can now obtain TRT LLM Batch Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/ffd5af342a817a2689d38e4af2cc59ded877e339/docs/source/batch_manager.md#statistics) by querying the Triton metrics endpoint. This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 23.11 or later) and querying the sever with the generate endpoint. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
+Starting with the 23.11 release of Triton, users can now obtain TRT LLM Batch
+Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/batch_manager.md#statistics)
+by querying the Triton metrics endpoint. This can be accomplished by launching
+a Triton server in any of the ways described above (ensuring the build code /
+container is 23.11 or later) and querying the server. Upon receiving a
+successful response, you can query the metrics endpoint by entering the
+following:
 ```bash
 curl localhost:8002/metrics
 ```
-Batch manager statistics are reported by the metrics endpoint in fields that are prefixed with `nv_trt_llm_`. Your output for these fields should look similar to the following (assuming your model is an inflight batcher model):
+Batch manager statistics are reported by the metrics endpoint in fields that
+are prefixed with `nv_trt_llm_`. Your output for these fields should look
+similar to the following (assuming your model is an inflight batcher model):
 ```bash
-# HELP nv_trt_llm_request_statistics TRT LLM request metrics
-# TYPE nv_trt_llm_request_statistics gauge
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="context",version="1"} 1
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="max",version="1"} 512
-nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="active",version="1"} 0
-# HELP nv_trt_llm_runtime_memory_statistics TRT LLM runtime memory metrics
-# TYPE nv_trt_llm_runtime_memory_statistics gauge
-nv_trt_llm_runtime_memory_statistics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_runtime_memory_statistics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
-nv_trt_llm_runtime_memory_statistics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
-# HELP nv_trt_llm_kv_cache_block_statistics TRT LLM KV cache block metrics
-# TYPE nv_trt_llm_kv_cache_block_statistics gauge
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
-nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
-# HELP nv_trt_llm_inflight_batcher_statistics TRT LLM inflight_batcher-specific metrics
-# TYPE nv_trt_llm_inflight_batcher_statistics gauge
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
-# HELP nv_trt_llm_general_statistics General TRT LLM statistics
-# TYPE nv_trt_llm_general_statistics gauge
-nv_trt_llm_general_statistics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
-nv_trt_llm_general_statistics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
+# HELP nv_trt_llm_request_metrics TRT LLM request metrics
+# TYPE nv_trt_llm_request_metrics gauge
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="context",version="1"} 1
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="max",version="1"} 512
+nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 0
+# HELP nv_trt_llm_runtime_memory_metrics TRT LLM runtime memory metrics
+# TYPE nv_trt_llm_runtime_memory_metrics gauge
+nv_trt_llm_runtime_memory_metrics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_runtime_memory_metrics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
+nv_trt_llm_runtime_memory_metrics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
+# HELP nv_trt_llm_kv_cache_block_metrics TRT LLM KV cache block metrics
+# TYPE nv_trt_llm_kv_cache_block_metrics gauge
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
+nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
+# HELP nv_trt_llm_inflight_batcher_metrics TRT LLM inflight_batcher-specific metrics
+# TYPE nv_trt_llm_inflight_batcher_metrics gauge
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
+# HELP nv_trt_llm_general_metrics General TRT LLM metrics
+# TYPE nv_trt_llm_general_metrics gauge
+nv_trt_llm_general_metrics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
+nv_trt_llm_general_metrics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
 ```
-If, instead, you launched a V1 model, your output will look similar to the output above except the inflight batcher related fields will be replaced with something similar to the following:
+If, instead, you launched a V1 model, your output will look similar to the
+output above except the inflight batcher related fields will be replaced
+with something similar to the following:
 ```bash
-# HELP nv_trt_llm_v1_statistics TRT LLM v1-specific metrics
-# TYPE nv_trt_llm_v1_statistics gauge
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
-nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
+# HELP nv_trt_llm_v1_metrics TRT LLM v1-specific metrics
+# TYPE nv_trt_llm_v1_metrics gauge
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
+nv_trt_llm_v1_metrics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
 ```
-Please note that as of the 23.11 Triton release, a link between base Triton metrics (such as inference request count and latency) is being actively developed, but is not yet supported.
-As such, the following fields will report 0:
+Please note that versions of Triton prior to the 23.12 release do not
+support base Triton metrics. As such, the following fields will report 0:
 ```bash
 # HELP nv_inference_request_success Number of successful inference requests, all batch sizes
 # TYPE nv_inference_request_success counter
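
As a small illustration of consuming the endpoint described above, the sketch below fetches `localhost:8002/metrics` and prints only the batch-manager fields prefixed with `nv_trt_llm_`; it assumes a server running locally with the default metrics port.

```python
# Minimal sketch: scrape the Triton metrics endpoint and keep only the
# TRT LLM batch manager fields (lines prefixed with nv_trt_llm_).
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith("nv_trt_llm_"):
        # e.g. nv_trt_llm_request_metrics{model="tensorrt_llm",...} 1
        metric, _, value = line.rpartition(" ")
        print(f"{metric} = {value}")
```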

all_models/gpt/postprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -37,8 +37,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')
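
The `trust_remote_code=True` argument added above (and in the three analogous tokenizer files below) lets `AutoTokenizer` execute tokenizer code shipped inside the tokenizer directory, which some community models require; it should only be enabled for directories you trust. A standalone sketch of the resulting behavior, with a hypothetical path:

```python
from transformers import AutoTokenizer

# Hypothetical tokenizer_dir; trust_remote_code=True allows custom tokenizer
# code bundled with the model to run, so use it only for trusted paths.
tokenizer = AutoTokenizer.from_pretrained(
    "/workspace/models/my_tokenizer",
    padding_side='left',
    trust_remote_code=True)

# Preprocessing: prompts (string) -> input_ids (list of ints)
input_ids = tokenizer("Hello, world").input_ids
# Postprocessing: output_ids (list of ints) -> outputs (string)
text = tokenizer.decode(input_ids)
```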

all_models/gpt/preprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -40,8 +40,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/postprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -67,8 +67,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/preprocessing/1/model.py

Lines changed: 2 additions & 2 deletions
@@ -68,8 +68,8 @@ def initialize(self, args):
             self.tokenizer = T5Tokenizer(vocab_file=tokenizer_dir,
                                          padding_side='left')
         elif tokenizer_type == 'auto':
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
-                                                           padding_side='left')
+            self.tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_dir, padding_side='left', trust_remote_code=True)
         elif tokenizer_type == 'llama':
             self.tokenizer = LlamaTokenizer.from_pretrained(
                 tokenizer_dir, legacy=False, padding_side='left')

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt

Lines changed: 2 additions & 2 deletions
@@ -289,8 +289,8 @@ parameters: {
   }
 }
 parameters: {
-  key: "use_context_fmha_for_generation"
+  key: "enable_kv_cache_reuse"
   value: {
-    string_value: "${use_context_fmha_for_generation}"
+    string_value: "${enable_kv_cache_reuse}"
   }
 }
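
The `${enable_kv_cache_reuse}` placeholder above is a template value that must be substituted before the model repository is loaded; the repository's `tools/fill_template.py` helper performs this kind of substitution. A minimal plain-Python equivalent, with `false` chosen arbitrarily as the value:

```python
# Minimal sketch: fill the ${enable_kv_cache_reuse} placeholder in the
# tensorrt_llm config.pbtxt template. string.Template understands the same
# ${name} syntax; the value "false" here is an arbitrary example.
from pathlib import Path
from string import Template

cfg = Path("all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt")
text = Template(cfg.read_text()).safe_substitute(enable_kv_cache_reuse="false")
cfg.write_text(text)
```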
