Update TensorRT-LLM backend (triton-inference-server#161)
* Update TensorRT-LLM backend
kaiyux authored Nov 24, 2023
1 parent 37ed967 commit e8ae70c
Showing 30 changed files with 1,278 additions and 248 deletions.
77 changes: 77 additions & 0 deletions README.md
@@ -377,6 +377,83 @@ You might have to contact your cluster's administrator to help you customize the
pkill tritonserver
```

## Triton Metrics
Starting with the 23.11 release of Triton, users can obtain TensorRT-LLM Batch Manager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/ffd5af342a817a2689d38e4af2cc59ded877e339/docs/source/batch_manager.md#statistics) by querying the Triton metrics endpoint. To do so, launch a Triton server in any of the ways described above (ensuring the build code / container is 23.11 or later) and query the server with the generate endpoint. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
```bash
curl localhost:8002/metrics
```
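If you have not yet sent a request to the server, a minimal call to the generate endpoint might look like the sketch below. It assumes an ensemble model named `ensemble` served on the default HTTP port 8000; adjust the model name, prompt, and other inputs to match your deployment.
```bash
# Hypothetical example request; substitute your own model name, prompt, and port.
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```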
Batch manager statistics are reported by the metrics endpoint in fields prefixed with `nv_trt_llm_`. If you are serving an inflight batcher model, your output for these fields should look similar to the following:
```bash
# HELP nv_trt_llm_request_statistics TRT LLM request metrics
# TYPE nv_trt_llm_request_statistics gauge
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="context",version="1"} 1
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="scheduled",version="1"} 1
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="max",version="1"} 512
nv_trt_llm_request_statistics{model="tensorrt_llm",request_type="active",version="1"} 0
# HELP nv_trt_llm_runtime_memory_statistics TRT LLM runtime memory metrics
# TYPE nv_trt_llm_runtime_memory_statistics gauge
nv_trt_llm_runtime_memory_statistics{memory_type="pinned",model="tensorrt_llm",version="1"} 0
nv_trt_llm_runtime_memory_statistics{memory_type="gpu",model="tensorrt_llm",version="1"} 1610236
nv_trt_llm_runtime_memory_statistics{memory_type="cpu",model="tensorrt_llm",version="1"} 0
# HELP nv_trt_llm_kv_cache_block_statistics TRT LLM KV cache block metrics
# TYPE nv_trt_llm_kv_cache_block_statistics gauge
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 1
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6239
nv_trt_llm_kv_cache_block_statistics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6239
# HELP nv_trt_llm_inflight_batcher_statistics TRT LLM inflight_batcher-specific metrics
# TYPE nv_trt_llm_inflight_batcher_statistics gauge
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_statistics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
# HELP nv_trt_llm_general_statistics General TRT LLM statistics
# TYPE nv_trt_llm_general_statistics gauge
nv_trt_llm_general_statistics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 0
nv_trt_llm_general_statistics{general_type="timestamp",model="tensorrt_llm",version="1"} 1700074049
```
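To inspect only the TRT LLM fields, you can filter the metrics output with a standard shell pipeline, for example:
```bash
# Show only the TRT LLM batch manager statistics (matching HELP/TYPE lines included).
curl -s localhost:8002/metrics | grep nv_trt_llm_
```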
If you instead launched a V1 model, your output will look similar to the output above, except that the inflight-batcher-related fields will be replaced with fields similar to the following:
```bash
# HELP nv_trt_llm_v1_statistics TRT LLM v1-specific metrics
# TYPE nv_trt_llm_v1_statistics gauge
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_generation_tokens",version="1"} 20
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="empty_generation_slots",version="1"} 0
nv_trt_llm_v1_statistics{model="tensorrt_llm",v1_specific_metric="total_context_tokens",version="1"} 5
```
Please note that as of the 23.11 Triton release, support for linking base Triton metrics (such as inference request count and latency) to the TensorRT-LLM backend is still under active development and is not yet available.
As such, the following fields will report 0:
```bash
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="tensorrt_llm",version="1"} 0
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="tensorrt_llm",version="1"} 0
```

## Testing the TensorRT-LLM Backend
Please follow the guide in [`ci/README.md`](ci/README.md) to learn how to run
the tests for the TensorRT-LLM backend.
12 changes: 6 additions & 6 deletions all_models/gpt/ensemble/config.pbtxt
@@ -9,7 +9,7 @@ input [
},
{
name: "max_tokens"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ -1 ]
},
{
@@ -24,19 +24,19 @@ input [
},
{
name: "end_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -66,7 +66,7 @@ input [
},
{
name: "min_length"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -84,7 +84,7 @@ input [
},
{
name: "beam_width"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
6 changes: 3 additions & 3 deletions all_models/gpt/preprocessing/config.pbtxt
@@ -19,7 +19,7 @@ input [
},
{
name: "REQUEST_OUTPUT_LEN"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ -1 ]
}
]
@@ -46,12 +46,12 @@ output [
},
{
name: "REQUEST_OUTPUT_LEN"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "PROMPT_LEARNING_TASK_NAME_IDS"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
}
]
12 changes: 6 additions & 6 deletions all_models/gpt/tensorrt_llm/config.pbtxt
@@ -21,24 +21,24 @@ input [
},
{
name: "request_output_len"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "end_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "pad_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "beam_width"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
@@ -52,7 +52,7 @@ input [
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
@@ -80,7 +80,7 @@ input [
},
{
name: "min_length"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
58 changes: 51 additions & 7 deletions all_models/inflight_batcher_llm/ensemble/config.pbtxt
@@ -35,7 +35,7 @@ input [
},
{
name: "max_tokens"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ -1 ]
},
{
@@ -52,19 +52,19 @@ input [
},
{
name: "end_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -94,7 +94,7 @@ input [
},
{
name: "min_length"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -110,9 +110,15 @@ input [
dims: [ 1 ]
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -130,7 +136,7 @@ input [
},
{
name: "prompt_vocab_size"
data_type: TYPE_UINT32
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
@@ -152,6 +158,16 @@ output [
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
ensemble_scheduling {
@@ -267,6 +283,10 @@ ensemble_scheduling {
key: "random_seed"
value: "random_seed"
}
input_map {
key: "return_log_probs"
value: "return_log_probs"
}
input_map {
key: "beam_width"
value: "beam_width"
@@ -298,6 +318,14 @@ ensemble_scheduling {
output_map {
key: "sequence_length"
value: "_SEQUENCE_LENGTH"
},
output_map {
key: "cum_log_probs"
value: "_CUM_LOG_PROBS"
}
output_map {
key: "output_log_probs"
value: "_OUTPUT_LOG_PROBS"
}
},
{
@@ -307,6 +335,14 @@ ensemble_scheduling {
key: "TOKENS_BATCH"
value: "_TOKENS_BATCH"
}
input_map {
key: "CUM_LOG_PROBS"
value: "_CUM_LOG_PROBS"
}
input_map {
key: "OUTPUT_LOG_PROBS"
value: "_OUTPUT_LOG_PROBS"
}
input_map {
key: "SEQUENCE_LENGTH"
value: "_SEQUENCE_LENGTH"
@@ -315,6 +351,14 @@ ensemble_scheduling {
key: "OUTPUT"
value: "text_output"
}
output_map {
key: "OUT_OUTPUT_LOG_PROBS"
value: "output_log_probs"
}
output_map {
key: "OUT_CUM_LOG_PROBS"
value: "cum_log_probs"
}
}
]
}
19 changes: 17 additions & 2 deletions all_models/inflight_batcher_llm/postprocessing/1/model.py
@@ -113,6 +113,14 @@ def execute(self, requests):
sequence_lengths = pb_utils.get_input_tensor_by_name(
request, 'SEQUENCE_LENGTH').as_numpy()

# Get cum log probs
cum_log_probs = pb_utils.get_input_tensor_by_name(
request, 'CUM_LOG_PROBS').as_numpy()

# Get output log probs
output_log_probs = pb_utils.get_input_tensor_by_name(
request, 'OUTPUT_LOG_PROBS').as_numpy()

# Reshape Input
# tokens_batch = tokens_batch.reshape([-1, tokens_batch.shape[0]])
# tokens_batch = tokens_batch.T
@@ -126,15 +134,22 @@
'OUTPUT',
np.array(outputs).astype(self.output_dtype))

out_cum_log_probs = pb_utils.Tensor('OUT_CUM_LOG_PROBS',
cum_log_probs)

out_output_log_probs = pb_utils.Tensor('OUT_OUTPUT_LOG_PROBS',
output_log_probs)

# Create InferenceResponse. You can set an error here in case
# there was a problem with handling this inference request.
# Below is an example of how you can set errors in inference
# response:
#
# pb_utils.InferenceResponse(
# output_tensors=..., TritonError("An error occurred"))
inference_response = pb_utils.InferenceResponse(
output_tensors=[output_tensor])
inference_response = pb_utils.InferenceResponse(output_tensors=[
output_tensor, out_cum_log_probs, out_output_log_probs
])
responses.append(inference_response)

# You should return a list of pb_utils.InferenceResponse. Length