Missing NVSwitch Bandwidth (RX / TX) in dcgm-exporter #446

Open
suranchoi opened this issue Jan 24, 2025 · 2 comments
Labels
question Further information is requested

Comments

@suranchoi

Ask your question

Version

  • GPU : H200
  • Server : XD 670
  • GPU Driver : 565.57
  • dcgm-exporter : 3.3.8 (build binary file)

Hello, I’m running a custom server (not a DGX) that has 8x NVIDIA H100 GPUs connected via NVSwitch.
I’m using dcgm-exporter to monitor GPU metrics. To verify the NVSwitch bandwidth, I ran NCCL tests and also trained a model using DDP. I can see NVLink traffic clearly, but the NVSwitch traffic metrics remain at 0.

I also have dcgmi installed, which runs but doesn’t appear to expose any NVSwitch-specific data.

Could you clarify whether dcgm-exporter (or DCGM in general) supports NVSwitch metrics on non-DGX servers? If so, are there any extra steps or configurations needed to enable this? If not, is there another recommended approach or tool to measure NVSwitch traffic on a non-DGX system?

@suranchoi suranchoi added the question Further information is requested label Jan 24, 2025
@glowkey
Collaborator

glowkey commented Jan 24, 2025

Which metrics are you monitoring? Can you attach the output of the exporter? Is the libnvidia-nscq library installed?
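
For the last question, a minimal sketch of one way to check whether the NSCQ library is visible to the dynamic loader on the host that runs dcgm-exporter. It assumes a Linux system where the library is registered under the name `nvidia-nscq`; the helper name `nscq_status` is hypothetical and is not part of DCGM or dcgm-exporter.

```python
import ctypes
import ctypes.util


def nscq_status(find=ctypes.util.find_library, load=ctypes.CDLL):
    """Return (library_name, loadable) for the NSCQ shared library.

    `find` and `load` are injectable so the check can be exercised
    without the library being present.
    """
    # find_library("nvidia-nscq") looks for libnvidia-nscq.so* (assumed name).
    name = find("nvidia-nscq")
    if name is None:
        return None, False
    try:
        load(name)  # Confirms the loader can actually open the library.
    except OSError:
        return name, False
    return name, True
```

Calling `nscq_status()` on the affected host returns `(None, False)` when the loader cannot find the library at all, and `(name, False)` when it is found but fails to load. Neither result says anything about whether the library version matches the driver.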

@suranchoi
Author

The metrics are "DCGM_FI_DEV_NVSWITCH_THROUGHPUT_TX/RX" or "DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX/RX" (I don't remember exactly which).

This is a customer environment, so I cannot attach the output.

Also, the libnvidia-nscq library (matching the GPU driver version) is installed.
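
When the raw exporter output cannot be shared, a short summary of it may still help. The sketch below assumes the output has been saved as Prometheus text format and reports, for each metric whose name starts with `DCGM_FI_DEV_NVSWITCH_`, whether every sample is zero. This separates "exported but always 0" from "not exported at all" (an empty result). The function names are hypothetical.

```python
import re

# Prefix shared by the metric names quoted in this thread.
NVSWITCH_PREFIX = "DCGM_FI_DEV_NVSWITCH_"

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(?P<value>\S+)"
)


def nvswitch_samples(text):
    """Collect sample values per NVSwitch metric name."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # Skip blanks, HELP and TYPE lines.
            continue
        match = _SAMPLE.match(line)
        if match and match.group("name").startswith(NVSWITCH_PREFIX):
            value = float(match.group("value"))
            samples.setdefault(match.group("name"), []).append(value)
    return samples


def summarize(text):
    """Map each NVSwitch metric to 'all zero' or 'nonzero'."""
    return {
        name: "nonzero" if any(values) else "all zero"
        for name, values in nvswitch_samples(text).items()
    }
```

Passing the saved text to `summarize` gives a small dictionary that can be pasted into the issue without exposing hostnames or other labels from the customer environment.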
