Description
We are facing high latency on some Triton Inference Server pods, seemingly at random.
Triton Information
Triton Inference Server version: 2.59.0
Node: L4 GPUs - g2-standard-8 (we are using ~1000 machines at peak across all services)
Driver Version: 535.230.02
CUDA Version: 12.9
GKE version: 1.31.8-gke.1045000
We are running only 1 pod per node.
Are you using the Triton container or did you build it yourself? We are using the open-source container.
Expected behaviour: We expect roughly the same latency across all pods, but random pods go latent (i.e. show high latency, as in the graph below). Deleting the pod does not help: the replacement pod scheduled on the same node goes latent again.
In the graph below, the pod ending in pmc64 had consistently high latency. The pods ending in dbpc9 and pj8v7 also had high latency initially (probably due to warmup), but they later settled back to the baseline latency of ~10 ms. The pmc64 pod, however, stayed at the higher latency.
Ultimately we had to cordon the GKE node completely to fix this.
How can we debug this high latency? There are also no errors in the Triton Inference Server logs (FYI, we are using the default logging).
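One thing we could do (suggestions welcome on whether this makes sense) is split the server-side latency into queue vs. compute time using Triton's Prometheus metrics and compare a healthy pod with a latent one. Below is a minimal sketch of what we have in mind, assuming the default metrics port 8002 is reachable; the pod IPs are placeholders:

```python
# Minimal sketch: compare average queue vs. compute time per request between
# a healthy pod and a latent pod via Triton's Prometheus metrics endpoint.
# Assumes the default metrics port 8002 is exposed; the IPs below are placeholders.
import urllib.request

PODS = {
    "healthy": "http://10.0.0.11:8002/metrics",  # placeholder IP
    "latent": "http://10.0.0.12:8002/metrics",   # placeholder IP
}

# Cumulative per-model counters exposed by Triton (microseconds / request counts).
METRICS = [
    "nv_inference_request_duration_us",
    "nv_inference_queue_duration_us",
    "nv_inference_compute_input_duration_us",
    "nv_inference_compute_infer_duration_us",
    "nv_inference_compute_output_duration_us",
    "nv_inference_request_success",
]

def scrape(url):
    text = urllib.request.urlopen(url, timeout=5).read().decode()
    values = {}
    for line in text.splitlines():
        for name in METRICS:
            # Metric lines look like: name{model="...",version="..."} <value>
            if line.startswith(name + "{"):
                values[name] = values.get(name, 0.0) + float(line.rsplit(" ", 1)[1])
    return values

for label, url in PODS.items():
    v = scrape(url)
    n = max(v.get("nv_inference_request_success", 0.0), 1.0)
    print(f"--- {label} ({url}) ---")
    for name in METRICS[:-1]:
        # Average microseconds per successful request since server start.
        print(f"{name}: {v.get(name, 0.0) / n:.1f} us/request")
```

If the extra time shows up under compute rather than queueing, that would point us more towards the GPU/node; if it is mostly queue time, more towards batching/load on that particular pod.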
We have done the following debugging, but have not been able to figure out whether it is a hardware issue or a server issue:
- We logged into the Triton container and ran
dcgmi diag -r memory -p "memory.minimum_allocation_percentage=20" -j
This shows that all the tests passed.
- We tried running the DCGM gather-logs script as mentioned in this page, but the script fails with:
"cp: cannot stat '/var/log/nvidia-dcgm/*': No such file or directory"
Any idea how to fix this? Does the DCGM installed in the Triton Inference Server container support running this gather-logs script?
- From the node, we were able to run nvidia-bug-report.sh. We don't see any specific errors apart from the following:
Oct 07 05:58:47 gke-k8s-datascience--np-dsci-ml-g2-st-bb69837e-4b5x gcfsd[4690]: time="2025-10-07T05:58:47.960410379Z" level=warning msg="Prefetching for image failed" error="getting prefetch image report for imageName=\"gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e\" failed: rpc error: code = FailedPrecondition desc = Precondition check failed." imageName="gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e" module=filesystem_lib
[ 0.997701] GPT:41943039 != 209715199
[ 0.998658] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 0.999713] GPT: Use GNU Parted to correct GPT errors.
Could these be the cause?
Any help on how to debug this further is greatly appreciated, as we are facing this issue randomly across all our services and we have to manually cordon the node every time.
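In case it helps: we are also considering checking the affected GPU for clock throttling over NVML before cordoning the node. A minimal sketch, assuming nvidia-ml-py (pynvml) is available where it runs; we have not run this yet:

```python
# Minimal sketch: check whether the L4 on the suspect node is being throttled.
# Assumes nvidia-ml-py (pynvml) is installed; not something we have run yet.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single-GPU node (gpus: [0])

sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
max_sm_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
throttle = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)

print(f"SM clock: {sm_clock} / {max_sm_clock} MHz")
print(f"GPU temp: {temp} C, GPU util: {util.gpu}%, mem util: {util.memory}%")
print(f"Throttle reason bitmask: {hex(throttle)}")

# Decode a few common throttle reasons (bitmask constants from NVML).
reasons = {
    "HW slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "SW power cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
}
for name, bit in reasons.items():
    if throttle & bit:
        print(f"Active throttle reason: {name}")

pynvml.nvmlShutdown()
```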
Model backend: TensorRT
config.pbtxt:
name: "dl_inference_model "
platform: "tensorrt_plan"
backend: "tensorrt"
runtime: ""
version_policy {
latest {
num_versions: 1
}
}
max_batch_size: 240
input [
{
name: "input__0"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [128]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__1"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [128]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__2"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [187]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__3"
data_type: TYPE_INT32
format: FORMAT_NONE
dims: [51]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__4"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [35]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [3]
reshape {
shape: [3]
}
label_filename: ""
is_shape_tensor: false
}
]
batch_input: []
batch_output: []
optimization {
priority: PRIORITY_DEFAULT
input_pinned_memory {
enable: true
}
output_pinned_memory {
enable: true
}
gather_kernel_buffer_threshold: 0
eager_batching: false
}
dynamic_batching {
preferred_batch_size: [240]
max_queue_delay_microseconds: 0
preserve_ordering: false
priority_levels: 0
default_priority_level: 0
priority_queue_policy: {}
}
instance_group [
{
name: "pdp_organic_multi_task_relv4_w_rt_hasp_3_task__ads"
kind: KIND_GPU
count: 35
gpus: [0]
secondary_devices: []
profile: []
passive: false
host_policy: ""
}
]
default_model_filename: "model.plan"
cc_model_filenames: {}
metric_tags: {}
parameters: {}
model_warmup [
{
name: "warmup_sample"
count: 50
batch_size: 240
inputs {
key: "input__0"
value: {
data_type: TYPE_FP16
dims: [128]
random_data: true
}
}
inputs {
key: "input__1"
value: {
data_type: TYPE_FP16
dims: [128]
random_data: true
}
}
inputs {
key: "input__2"
value: {
data_type: TYPE_FP16
dims: [187]
random_data: true
}
}
inputs {
key: "input__3"
value: {
data_type: TYPE_INT32
dims: [51]
random_data: true
}
}
inputs {
key: "input__4"
value: {
data_type: TYPE_FP16
dims: [35]
random_data: true
}
}
}
]
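For completeness, this is roughly how we plan to time a single warmup-shaped request directly against one pod IP (bypassing the load balancer) to separate server-side latency from network/routing effects. A minimal sketch, assuming the HTTP endpoint on port 8000 and the tritonclient package; the pod IP is a placeholder and the integer values for input__3 are dummies:

```python
# Minimal sketch: time requests with the same shapes as the warmup sample
# directly against a single pod (placeholder IP), bypassing the load balancer.
# Assumes `pip install tritonclient[http] numpy` and the HTTP endpoint on port 8000.
import time
import numpy as np
import tritonclient.http as httpclient

POD_URL = "10.0.0.12:8000"        # placeholder pod IP
MODEL = "dl_inference_model"      # model name as in config.pbtxt
BATCH = 240

client = httpclient.InferenceServerClient(url=POD_URL)

# (name, shape, numpy dtype, Triton dtype) matching config.pbtxt.
specs = [
    ("input__0", [BATCH, 128], np.float16, "FP16"),
    ("input__1", [BATCH, 128], np.float16, "FP16"),
    ("input__2", [BATCH, 187], np.float16, "FP16"),
    ("input__3", [BATCH, 51], np.int32, "INT32"),
    ("input__4", [BATCH, 35], np.float16, "FP16"),
]

inputs = []
for name, shape, np_dtype, triton_dtype in specs:
    inp = httpclient.InferInput(name, shape, triton_dtype)
    if np_dtype is np.int32:
        # Placeholder integer values; real requests would use valid ids.
        inp.set_data_from_numpy(np.random.randint(0, 10, size=shape, dtype=np.int32))
    else:
        inp.set_data_from_numpy(np.random.rand(*shape).astype(np_dtype))
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("output__0")]

# A few timed requests; compare against a pod that is at the ~10 ms baseline.
for i in range(5):
    start = time.perf_counter()
    result = client.infer(MODEL, inputs, outputs=outputs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"request {i}: {elapsed_ms:.1f} ms, output shape {result.as_numpy('output__0').shape}")
```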