Hello, I’m running a custom server (not a DGX) that has 8x NVIDIA H100 GPUs connected via NVSwitch.
I’m using dcgm-exporter to monitor GPU metrics. To verify the NVSwitch bandwidth, I ran NCCL tests and also trained a model using DDP. The NVLink traffic metrics clearly show activity, but the NVSwitch traffic metrics remain at 0.
I also have dcgmi installed, which runs but doesn’t appear to expose any NVSwitch-specific data.
Could you clarify whether dcgm-exporter (or DCGM in general) supports NVSwitch metrics on non-DGX servers? If so, are there any extra steps or configurations needed to enable this? If not, is there another recommended approach or tool to measure NVSwitch traffic on a non-DGX system?