
DCGM is not getting loaded #422

Open
Pryz opened this issue Nov 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments


Pryz commented Nov 19, 2024

What is the version?

3.3.9-3.6.1-ubuntu22.04

What happened?

Hi there,

We are deploying the exporter version 3.3.9-3.6.1-ubuntu22.04 in Docker on ECS. The task is configured with CAP_SYS_ADMIN, PID host, and has access to all GPUs.

The logs indicate that the DCGM module is not loaded, even though, if I understand correctly, the exporter is supposed to use it via embedded mode. Here are the logs:

time="2024-11-19T18:32:35Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-11-19T18:32:35Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2024-11-19T18:32:35Z" level=warning msg="Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-11-19T18:32:35Z" level=warning msg="Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"

This is the last Docker configuration I tried:

docker run --privileged --cap-add SYS_ADMIN --runtime=nvidia -v /proc:/proc --gpus all --pid=host --cap-add=SYS_ADMIN --gpus all --rm -p 9401:9400 -e DCGM_EXPORTER_ENABLE_DCGM_LOG='true' -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_MIG_CONFIG_DEVICES=all -e NVIDIA_MIG_MONITOR_DEVICES=all nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
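
For reference, the flags above contain some duplicates; a functionally equivalent, deduplicated form of the same command (same image tag; --privileged already grants SYS_ADMIN, so the --cap-add flags are redundant) would be:

docker run --privileged --runtime=nvidia --gpus all --pid=host -v /proc:/proc --rm -p 9401:9400 \
  -e DCGM_EXPORTER_ENABLE_DCGM_LOG='true' \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  -e NVIDIA_MIG_MONITOR_DEVICES=all \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04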

Any recommendation on where to go from there?

What did you expect to happen?

I am expecting to collect all DCGM metrics.

What is the GPU model?

NVIDIA L4

What is the environment?

Running in AWS ECS.

NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5

How did you deploy the dcgm-exporter and what is the configuration?

Running in Docker via an ECS daemon.

How to reproduce the issue?

No response

Anything else we need to know?

No response

Pryz added the bug label Nov 19, 2024
nvvfedorov (Collaborator) commented Nov 19, 2024

@Pryz, for troubleshooting, please do the following:

  1. Provide the output of the following command; we need to see your GPU model and configuration:

docker run --gpus all --cap-add SYS_ADMIN --entrypoint=bash nvcr.io/nvidia/cloud-native/dcgm:3.3.9-1-ubuntu22.04 -c nvidia-smi

  2. Enable DCGM debug logging by setting the following environment variables (see the example invocation below):

DCGM_EXPORTER_ENABLE_DCGM_LOG=true
DCGM_EXPORTER_DCGM_LOG_LEVEL=DEBUG

This will help us see what the GPU model is and why DCGM doesn't load the profiling module.
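
For example, a minimal sketch of an exporter run with debug logging enabled (image tag, flags, and port mapping assumed from the command in your report):

docker run --privileged --runtime=nvidia --gpus all --pid=host --rm -p 9401:9400 \
  -e DCGM_EXPORTER_ENABLE_DCGM_LOG=true \
  -e DCGM_EXPORTER_DCGM_LOG_LEVEL=DEBUG \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04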

Pryz (Author) commented Nov 20, 2024

Sure thing. Thanks @nvvfedorov. Here is the info:

Tue Nov 19 23:56:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   24C    P8             13W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

output.log

Pryz (Author) commented Nov 21, 2024

@nvvfedorov any other info I can provide to help?
