Description
We are facing high latency on some Triton Inference Server pods, seemingly at random.
Triton Information
Triton Inference Server version: 2.59.0
Node: L4 GPUs - g2-standard-8 (we are using ~1000 machines at peak across all services)
Driver Version: 535.230.02
CUDA Version: 12.9
GKE version: 1.31.8-gke.1045000
We are running only 1 pod per node.
Are you using the Triton container or did you build it yourself? We are using the open-source container.
Expected behaviour: We expect roughly the same latency across all pods, but random pods go latent (i.e. show high latency, as in the graph below). Deleting the pod does not help: the replacement pod scheduled on the same node goes latent again.
In the graph below, the pod ending in pmc64 had consistently high latency. The pods ending in dbpc9 and pj8v7 also had high latency initially (probably due to warmup), but they later settled back to the baseline latency of ~10 ms. The pmc64 pod, however, stayed at the higher latency.
Ultimately we had to cordon the GKE node completely to fix this.
How can we debug this high latency? There are also no errors in the Triton Inference Server logs (FYI, we are using the default logging).
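One thing we could do (suggestions welcome on whether this makes sense) is split the server-side latency into queue vs. compute time using Triton's Prometheus metrics and compare a healthy pod with a latent one. Below is a minimal sketch of what we have in mind, assuming the default metrics port 8002 is reachable; the pod IPs are placeholders:

```python
# Minimal sketch: compare average queue vs. compute time per request between
# a healthy pod and a latent pod via Triton's Prometheus metrics endpoint.
# Assumes the default metrics port 8002 is exposed; the IPs below are placeholders.
import urllib.request

PODS = {
    "healthy": "http://10.0.0.11:8002/metrics",  # placeholder IP
    "latent": "http://10.0.0.12:8002/metrics",   # placeholder IP
}

# Cumulative per-model counters exposed by Triton (microseconds / request counts).
METRICS = [
    "nv_inference_request_duration_us",
    "nv_inference_queue_duration_us",
    "nv_inference_compute_input_duration_us",
    "nv_inference_compute_infer_duration_us",
    "nv_inference_compute_output_duration_us",
    "nv_inference_request_success",
]

def scrape(url):
    text = urllib.request.urlopen(url, timeout=5).read().decode()
    values = {}
    for line in text.splitlines():
        for name in METRICS:
            # Metric lines look like: name{model="...",version="..."} <value>
            if line.startswith(name + "{"):
                values[name] = values.get(name, 0.0) + float(line.rsplit(" ", 1)[1])
    return values

for label, url in PODS.items():
    v = scrape(url)
    n = max(v.get("nv_inference_request_success", 0.0), 1.0)
    print(f"--- {label} ({url}) ---")
    for name in METRICS[:-1]:
        # Average microseconds per successful request since server start.
        print(f"{name}: {v.get(name, 0.0) / n:.1f} us/request")
```

If the extra time shows up under compute rather than queueing, that would point us more towards the GPU/node; if it is mostly queue time, more towards batching/load on that particular pod.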
We have done the following debugging, but have not been able to figure out whether it is a hardware issue or a server issue:
- We logged into the Triton container and ran
dcgmi diag -r memory -p "memory.minimum_allocation_percentage=20" -j
This shows that all the tests passed.
- We tried running the DCGM gather-logs script as mentioned in this page, but the script fails with:
"cp: cannot stat '/var/log/nvidia-dcgm/*': No such file or directory"
Any idea how to fix this? Does the DCGM installed in the Triton Inference Server container support running this gather-logs script?
- From the node, we were able to run nvidia-bug-report.sh. We don't see any specific errors apart from the following:
Oct 07 05:58:47 gke-k8s-datascience--np-dsci-ml-g2-st-bb69837e-4b5x gcfsd[4690]: time="2025-10-07T05:58:47.960410379Z" level=warning msg="Prefetching for image failed" error="getting prefetch image report for imageName=\"gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e\" failed: rpc error: code = FailedPrecondition desc = Precondition check failed." imageName="gke.gcr.io/nvidia-gpu-device-plugin@sha256:6bdbad6ab76e4c4889cfc2f31e116bb38827c152b4e93797e165bcf1f3a1932e" module=filesystem_lib
[ 0.997701] GPT:41943039 != 209715199
[ 0.998658] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 0.999713] GPT: Use GNU Parted to correct GPT errors.
Could these be the cause?
Any help on how to debug this further is greatly appreciated, as we are facing this issue randomly across all our services and we have to manually cordon the node every time.
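In case it helps: we are also considering checking the affected GPU for clock throttling over NVML before cordoning the node. A minimal sketch, assuming nvidia-ml-py (pynvml) is available where it runs; we have not run this yet:

```python
# Minimal sketch: check whether the L4 on the suspect node is being throttled.
# Assumes nvidia-ml-py (pynvml) is installed; not something we have run yet.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single-GPU node (gpus: [0])

sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
max_sm_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
throttle = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)

print(f"SM clock: {sm_clock} / {max_sm_clock} MHz")
print(f"GPU temp: {temp} C, GPU util: {util.gpu}%, mem util: {util.memory}%")
print(f"Throttle reason bitmask: {hex(throttle)}")

# Decode a few common throttle reasons (bitmask constants from NVML).
reasons = {
    "HW slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "SW power cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
}
for name, bit in reasons.items():
    if throttle & bit:
        print(f"Active throttle reason: {name}")

pynvml.nvmlShutdown()
```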
Model backend: TensorRT
config.pbtxt:
name: "dl_inference_model "
platform: "tensorrt_plan"
backend: "tensorrt"
runtime: ""
version_policy {
latest {
num_versions: 1
}
}
max_batch_size: 240
input [
{
name: "input__0"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [128]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__1"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [128]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__2"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [187]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__3"
data_type: TYPE_INT32
format: FORMAT_NONE
dims: [51]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
},
{
name: "input__4"
data_type: TYPE_FP16
format: FORMAT_NONE
dims: [35]
is_shape_tensor: false
allow_ragged_batch: false
optional: false
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [3]
reshape {
shape: [3]
}
label_filename: ""
is_shape_tensor: false
}
]
batch_input: []
batch_output: []
optimization {
priority: PRIORITY_DEFAULT
input_pinned_memory {
enable: true
}
output_pinned_memory {
enable: true
}
gather_kernel_buffer_threshold: 0
eager_batching: false
}
dynamic_batching {
preferred_batch_size: [240]
max_queue_delay_microseconds: 0
preserve_ordering: false
priority_levels: 0
default_priority_level: 0
priority_queue_policy: {}
}
instance_group [
{
name: "pdp_organic_multi_task_relv4_w_rt_hasp_3_task__ads"
kind: KIND_GPU
count: 35
gpus: [0]
secondary_devices: []
profile: []
passive: false
host_policy: ""
}
]
default_model_filename: "model.plan"
cc_model_filenames: {}
metric_tags: {}
parameters: {}
model_warmup [
{
name: "warmup_sample"
count: 50
batch_size: 240
inputs {
key: "input__0"
value: {
data_type: TYPE_FP16
dims: [128]
random_data: true
}
}
inputs {
key: "input__1"
value: {
data_type: TYPE_FP16
dims: [128]
random_data: true
}
}
inputs {
key: "input__2"
value: {
data_type: TYPE_FP16
dims: [187]
random_data: true
}
}
inputs {
key: "input__3"
value: {
data_type: TYPE_INT32
dims: [51]
random_data: true
}
}
inputs {
key: "input__4"
value: {
data_type: TYPE_FP16
dims: [35]
random_data: true
}
}
}
]
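For completeness, this is roughly how we plan to time a single warmup-shaped request directly against one pod IP (bypassing the load balancer) to separate server-side latency from network/routing effects. A minimal sketch, assuming the HTTP endpoint on port 8000 and the tritonclient package; the pod IP is a placeholder and the integer values for input__3 are dummies:

```python
# Minimal sketch: time requests with the same shapes as the warmup sample
# directly against a single pod (placeholder IP), bypassing the load balancer.
# Assumes `pip install tritonclient[http] numpy` and the HTTP endpoint on port 8000.
import time
import numpy as np
import tritonclient.http as httpclient

POD_URL = "10.0.0.12:8000"        # placeholder pod IP
MODEL = "dl_inference_model"      # model name as in config.pbtxt
BATCH = 240

client = httpclient.InferenceServerClient(url=POD_URL)

# (name, shape, numpy dtype, Triton dtype) matching config.pbtxt.
specs = [
    ("input__0", [BATCH, 128], np.float16, "FP16"),
    ("input__1", [BATCH, 128], np.float16, "FP16"),
    ("input__2", [BATCH, 187], np.float16, "FP16"),
    ("input__3", [BATCH, 51], np.int32, "INT32"),
    ("input__4", [BATCH, 35], np.float16, "FP16"),
]

inputs = []
for name, shape, np_dtype, triton_dtype in specs:
    inp = httpclient.InferInput(name, shape, triton_dtype)
    if np_dtype is np.int32:
        # Placeholder integer values; real requests would use valid ids.
        inp.set_data_from_numpy(np.random.randint(0, 10, size=shape, dtype=np.int32))
    else:
        inp.set_data_from_numpy(np.random.rand(*shape).astype(np_dtype))
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("output__0")]

# A few timed requests; compare against a pod that is at the ~10 ms baseline.
for i in range(5):
    start = time.perf_counter()
    result = client.infer(MODEL, inputs, outputs=outputs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"request {i}: {elapsed_ms:.1f} ms, output shape {result.as_numpy('output__0').shape}")
```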