
Facing consistently high latency on some Triton Inference Server containers on GPUs at random #8451

@jayakommuru

Description

Facing high latency on some Triton Inference Server pods at random.

Triton Information
Triton Inference Server version: 2.59.0
Node: L4 GPUs - g2-standard-8 (we are using ~1000 machines at peak across all services)
Driver Version: 535.230.02
CUDA Version: 12.9
GKE version: 1.31.8-gke.1045000
We are running only one pod per node.

Are you using the Triton container or did you build it yourself? We are using the open-source container.

Expected behaviour: We expect the same latency across all pods, but some pods randomly become latent (high latency, as shown below). When we delete such a pod, the replacement pod scheduled on the same node becomes latent again.

In the graph below, the pod ending in pmc64 had consistently high latency. The pods ending in dbpc9 and pj8v7 also had high latency initially (probably due to warmup), but they later settled to the baseline latency of ~10 ms. The pmc64 pod, however, remained consistently slower.

[Image: per-pod latency graph]

Ultimately we had to cordon the GKE node completely to fix this.

How can we go ahead and debug this high latency? There are no errors in the Triton Inference Server logs either (FYI, we are using the default logging).

We have done the following debugging but were not able to figure out whether it is a hardware issue or a server issue:

  1. We logged into the Triton container and ran dcgmi diag -r memory -p "memory.minimum_allocation_percentage=20" -j; this shows that all the tests passed.
  2. We tried running the DCGM gather-logs script as mentioned in this page, but the script fails with the error: "cp: cannot stat '/var/log/nvidia-dcgm/*': No such file or directory". Any idea how to fix this? Does the DCGM installed in the Triton Inference Server container support running this gather-logs script?
  3. From the node, we were able to run nvidia-bug-report.sh. We don't see any specific errors apart from the following:
Oct 07 05:58:47 gke-k8s-datascience--np-dsci-ml-g2-st-bb69837e-4b5x gcfsd[4690]: time="2025-10-07T05:58:47.960410379Z" level=warning msg="Prefetching for image failed" error="getting prefetch image report for imageName=\"gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e\" failed: rpc error: code = FailedPrecondition desc = Precondition check failed." imageName="gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e" module=filesystem_lib
[    0.997701] GPT:41943039 != 209715199
[    0.998658] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    0.999713] GPT: Use GNU Parted to correct GPT errors.

Can these be the cause?
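
To help narrow down hardware vs. server, one check that might be worth adding on the affected node is polling NVML for clock throttling. Below is a minimal, unverified sketch, assuming the nvidia-ml-py (pynvml) package is available and that GPU index 0 is the device Triton is using:

```python
# Minimal sketch: report SM clock, temperature, and active throttle reasons for GPU 0.
# Assumes the nvidia-ml-py (pynvml) package is installed; GPU index 0 is an assumption.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)      # current SM clock (MHz)
max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # rated max SM clock (MHz)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)          # bitmask of active reasons

print(f"SM clock: {sm_clock}/{max_clock} MHz, temperature: {temp} C")
for label, bit in [
    ("HW slowdown", pynvml.nvmlClocksThrottleReasonHwSlowdown),
    ("HW thermal slowdown", pynvml.nvmlClocksThrottleReasonHwThermalSlowdown),
    ("HW power brake", pynvml.nvmlClocksThrottleReasonHwPowerBrakeSlowdown),
    ("SW thermal slowdown", pynvml.nvmlClocksThrottleReasonSwThermalSlowdown),
    ("SW power cap", pynvml.nvmlClocksThrottleReasonSwPowerCap),
]:
    if reasons & bit:
        print(f"Throttling active: {label}")

pynvml.nvmlShutdown()
```

If the SM clock on the slow node sits well below the rated maximum, or a thermal/power throttle reason is active while healthy nodes run at full clock, that would point to a node/hardware problem rather than Triton.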

Any help on how to debug this further is greatly appreciated, as we are facing this issue randomly across all our services and have to manually cordon the node every time.
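
For context on what a comparison could look like, here is a sketch that pulls per-stage timings (queue vs. compute) from Triton's statistics extension on a slow pod and a healthy pod. It assumes tritonclient[http] is installed, the default HTTP port 8000 is reachable, and the model is registered as dl_inference_model (the pod addresses are placeholders):

```python
# Sketch: compare average per-stage latency (queue vs. compute) between two pods
# using Triton's statistics extension. Pod addresses are placeholders.
import tritonclient.http as httpclient

MODEL = "dl_inference_model"  # model name as registered in Triton (see config.pbtxt below)

def stage_averages_ms(url):
    client = httpclient.InferenceServerClient(url=url)
    stats = client.get_inference_statistics(model_name=MODEL)
    s = stats["model_stats"][0]["inference_stats"]
    result = {}
    for stage in ("queue", "compute_input", "compute_infer", "compute_output"):
        count = int(s[stage]["count"])
        total_ns = int(s[stage]["ns"])
        result[stage] = total_ns / count / 1e6 if count else 0.0  # average ms per request
    return result

for label, url in [("slow pod", "SLOW_POD_IP:8000"), ("healthy pod", "HEALTHY_POD_IP:8000")]:
    print(label, stage_averages_ms(url))
```

If the extra time on the slow pod shows up under compute_infer rather than queue, that points at the GPU/node itself; if it is mostly queue time, it points at batching, scheduling, or request load on that pod.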

Model backend: TensorRT

config.pbtxt:
name: "dl_inference_model "
platform: "tensorrt_plan"
backend: "tensorrt"
runtime: ""
version_policy {
  latest {
    num_versions: 1
  }
}
max_batch_size: 240
input [
  {
    name: "input__0"
    data_type: TYPE_FP16
    format: FORMAT_NONE
    dims: [128]
    is_shape_tensor: false
    allow_ragged_batch: false
    optional: false
  },
  {
    name: "input__1"
    data_type: TYPE_FP16
    format: FORMAT_NONE
    dims: [128]
    is_shape_tensor: false
    allow_ragged_batch: false
    optional: false
  },
  {
    name: "input__2"
    data_type: TYPE_FP16
    format: FORMAT_NONE
    dims: [187]
    is_shape_tensor: false
    allow_ragged_batch: false
    optional: false
  },
  {
    name: "input__3"
    data_type: TYPE_INT32
    format: FORMAT_NONE
    dims: [51]
    is_shape_tensor: false
    allow_ragged_batch: false
    optional: false
  },
  {
    name: "input__4"
    data_type: TYPE_FP16
    format: FORMAT_NONE
    dims: [35]
    is_shape_tensor: false
    allow_ragged_batch: false
    optional: false
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP16
    dims: [3]
    reshape {
      shape: [3]
    }
    label_filename: ""
    is_shape_tensor: false
  }
]
batch_input: []
batch_output: []
optimization {
  priority: PRIORITY_DEFAULT
  input_pinned_memory {
    enable: true
  }
  output_pinned_memory {
    enable: true
  }
  gather_kernel_buffer_threshold: 0
  eager_batching: false
}
dynamic_batching {
  preferred_batch_size: [240]
  max_queue_delay_microseconds: 0
  preserve_ordering: false
  priority_levels: 0
  default_priority_level: 0
  priority_queue_policy: {}
}
instance_group [
  {
    name: "pdp_organic_multi_task_relv4_w_rt_hasp_3_task__ads"
    kind: KIND_GPU
    count: 35
    gpus: [0]
    secondary_devices: []
    profile: []
    passive: false
    host_policy: ""
  }
]
default_model_filename: "model.plan"
cc_model_filenames: {}
metric_tags: {}
parameters: {}
model_warmup [
  {
    name: "warmup_sample"
    count: 50
    batch_size: 240
    inputs {
      key: "input__0"
      value: {
        data_type: TYPE_FP16
        dims: [128]
        random_data: true
      }
    }
    inputs {
      key: "input__1"
      value: {
        data_type: TYPE_FP16
        dims: [128]
        random_data: true
      }
    }
    inputs {
      key: "input__2"
      value: {
        data_type: TYPE_FP16
        dims: [187]
        random_data: true
      }
    }
    inputs {
      key: "input__3"
      value: {
        data_type: TYPE_INT32
        dims: [51]
        random_data: true
      }
    }
    inputs {
      key: "input__4"
      value: {
        data_type: TYPE_FP16
        dims: [35]
        random_data: true
      }
    }
  }
]
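
For reference, a minimal client sketch that sends one batch matching the shapes above and times it end to end; it could be run against a suspect pod and a healthy pod with identical inputs. Input names, dtypes, and dims are taken from the config; the URL is a placeholder, and tritonclient[http] plus numpy are assumed to be installed:

```python
# Sketch: send one batch matching the config above and time the request end to end.
# Assumes tritonclient[http] and numpy are installed; the URL is a placeholder.
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="SUSPECT_POD_IP:8000")

BATCH = 240  # matches max_batch_size / preferred_batch_size above
specs = [
    ("input__0", (BATCH, 128), np.float16, "FP16"),
    ("input__1", (BATCH, 128), np.float16, "FP16"),
    ("input__2", (BATCH, 187), np.float16, "FP16"),
    ("input__3", (BATCH, 51),  np.int32,   "INT32"),
    ("input__4", (BATCH, 35),  np.float16, "FP16"),
]

inputs = []
for name, shape, np_dtype, triton_dtype in specs:
    if np_dtype is np.int32:
        data = np.random.randint(0, 10, size=shape, dtype=np_dtype)  # arbitrary int range
    else:
        data = np.random.rand(*shape).astype(np_dtype)
    inp = httpclient.InferInput(name, list(shape), triton_dtype)
    inp.set_data_from_numpy(data)
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("output__0")]

start = time.perf_counter()
result = client.infer(model_name="dl_inference_model", inputs=inputs, outputs=outputs)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"end-to-end latency: {elapsed_ms:.2f} ms, output shape: {result.as_numpy('output__0').shape}")
```

If the gap shows up even for this single isolated request, the node itself is slow; if a single request is fast and only production traffic is slow, the problem is more likely batching or concurrency on that pod.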
