DCGM_FI_DEV_GPU_UTIL abnormal point #418
Comments
Thank you for reporting the issue. Please troubleshoot it in your environment using the command line: we need to understand whether the issue is on the dcgm-exporter side or the DCGM side. Thank you in advance.
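For example (a sketch; it assumes dcgmi is available inside the exporter pod or on the host, and uses field ID 203 for DCGM_FI_DEV_GPU_UTIL), the raw field can be watched directly from DCGM to see whether out-of-range values appear before the exporter is involved:

# Watch raw GPU utilization (field 203) once per 60 s, for 60 iterations
dcgmi dmon -e 203 -d 60000 -c 60

If dcgmi itself reports values above 100 here, the problem is on the DCGM/driver side rather than in the exporter.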
@nvvfedorov I'm observing the same issue. It is quite rare: tracking a single GPU UUID, I noticed 1 weird sample where DCGM_FI_DEV_GPU_UTIL was higher than 100, against more than 176k samples with the expected value (i.e. a value in [0, 100]). I'm sampling every 60s. Some of the observed values higher than 100 (these come from different GPUs; each line is a unique sample):
These are the only weird samples out of ~13 million unique samples.
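For anyone trying to isolate these samples, a query along the following lines can surface them (a sketch; the Prometheus address and the 30d lookback are assumptions, adjust them to your setup and retention):

# List series whose utilization ever exceeded 100 over the lookback window
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=max_over_time(DCGM_FI_DEV_GPU_UTIL[30d]) > 100'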
What is the version?
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
What happened?
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="550.90.07",Hostname="node-3",UUID="GPU-xxxxx-35bc-edfc-df1d-b3d3145daba0",container="pytorch",device="nvidia2",gpu="2",instance="10.10.11.11:9400",job="gpu-metrics",kubernetes_node="node-3",modelName="NVIDIA H100 80GB HBM3",namespace="zlm",pod="pytorchjob-worker-2"} 113522
There are abnormal points as shown above. I don't know what causes this phenomenon. How can I troubleshoot it?
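One way to cross-check is to sample utilization directly through nvidia-smi/NVML on the affected node and compare it with the exported metric (a sketch; the 60 s interval is an assumption, match it to your scrape interval):

# Log utilization straight from the driver, one sample per 60 s
nvidia-smi --query-gpu=timestamp,index,uuid,utilization.gpu --format=csv -l 60

Values reported here should stay within 0-100; if the exporter still shows spikes like 113522 while this log does not, the anomaly is being introduced somewhere in the DCGM/exporter path.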
What did you expect to happen?
Values in the 0 to 100 range.
What is the GPU model?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 65C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 55C P0 654W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 56C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 65C P0 652W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 66C P0 634W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 54C P0 630W / 700W | 62732MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 65C P0 663W / 700W | 62824MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 53C P0 626W / 700W | 62748MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
What is the environment?
Kubernetes pod
How did you deploy the dcgm-exporter and what is the configuration?
Deployed as a DaemonSet in Kubernetes.
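For reference, the DaemonSet spec can be dumped and attached with something like the following (the namespace and resource name below are assumptions):

# Dump the exporter DaemonSet manifest for attachment to the issue
kubectl -n gpu-monitoring get daemonset dcgm-exporter -o yaml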
How to reproduce the issue?
No response
Anything else we need to know?
No response