DCGM_FI_DEV_GPU_UTIL abnormal point #418
Comments
Thank you for reporting the issue. Please troubleshoot it in your environment using the command line: we need to understand whether the issue is on the dcgm-exporter side or the DCGM side. Thank you in advance.
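For example (a sketch; it assumes dcgmi is available inside the exporter pod or on the host, and uses field ID 203 for DCGM_FI_DEV_GPU_UTIL), the raw field can be watched directly from DCGM to see whether out-of-range values appear before the exporter is involved:

# Watch raw GPU utilization (field 203) once per 60 s, for 60 iterations
dcgmi dmon -e 203 -d 60000 -c 60

If dcgmi itself reports values above 100 here, the problem is on the DCGM/driver side rather than in the exporter.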
@nvvfedorov I'm observing the same issue. It is quite rare: tracking a single GPU UUID, I noticed 1 weird sample where DCGM_FI_DEV_GPU_UTIL was higher than 100, against more than 176k samples with the expected value (i.e. a value in [0, 100]). I'm sampling every 60s. Some of the observed values higher than 100 (these come from different GPUs; each line is a unique sample):
These are the only weird samples out of ~13 million unique samples.
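For anyone trying to isolate these samples, a query along the following lines can surface them (a sketch; the Prometheus address and the 30d lookback are assumptions, adjust them to your setup and retention):

# List series whose utilization ever exceeded 100 over the lookback window
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=max_over_time(DCGM_FI_DEV_GPU_UTIL[30d]) > 100'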
What is the version?
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
What happened?
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="550.90.07",Hostname="node-3",UUID="GPU-xxxxx-35bc-edfc-df1d-b3d3145daba0",container="pytorch",device="nvidia2",gpu="2",instance="10.10.11.11:9400",job="gpu-metrics",kubernetes_node="node-3",modelName="NVIDIA H100 80GB HBM3",namespace="zlm",pod="pytorchjob-worker-2"} 113522
There are abnormal points as shown above. I don't know what causes this phenomenon. How can I troubleshoot it?
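One way to cross-check is to sample utilization directly through nvidia-smi/NVML on the affected node and compare it with the exported metric (a sketch; the 60 s interval is an assumption, match it to your scrape interval):

# Log utilization straight from the driver, one sample per 60 s
nvidia-smi --query-gpu=timestamp,index,uuid,utilization.gpu --format=csv -l 60

Values reported here should stay within 0-100; if the exporter still shows spikes like 113522 while this log does not, the anomaly is being introduced somewhere in the DCGM/exporter path.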
What did you expect to happen?
Values in the 0 to 100 range.
What is the GPU model?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 65C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 55C P0 654W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 56C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 65C P0 652W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 66C P0 634W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 54C P0 630W / 700W | 62732MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 65C P0 663W / 700W | 62824MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 53C P0 626W / 700W | 62748MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
What is the environment?
Kubernetes pod
How did you deploy the dcgm-exporter and what is the configuration?
Deployed as a DaemonSet in Kubernetes.
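For reference, the DaemonSet spec can be dumped and attached with something like the following (the namespace and resource name below are assumptions):

# Dump the exporter DaemonSet manifest for attachment to the issue
kubectl -n gpu-monitoring get daemonset dcgm-exporter -o yaml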
How to reproduce the issue?
No response
Anything else we need to know?
No response