Is this a duplicate?
Area
Not sure

Is your feature request related to a problem? Please describe.
We have encountered situations where the performance of certain GPUs in the cluster degrades significantly, but these issues cannot be reproduced by simply restarting the tasks. This makes it difficult to accurately identify which GPUs are problematic. If there were a way to promptly query and list the GPU kernel tasks that have been submitted by the CPU but are still queued for execution, it would greatly reduce the time required for fault diagnosis.

Describe the solution you'd like
Similar to how nvidia-smi provides real-time GPU status, we are looking for a lightweight tool or API that can query and list the current execution queue of a GPU, including pending kernel tasks.

Describe alternatives you've considered
No response

Additional context
No response
Replies: 1 comment
Hi @Ind1x1, thanks for reaching out 👋 Unfortunately, we do not have any public API to hook into the kernel queue. Even if we did, I am not sure how it would actually help you debug faulty GPUs, which is what you are really concerned about.

In case you are not already aware, NVIDIA DCGM is designed for exactly this kind of use case and can provide GPU diagnostics. See the docs here: https://docs.nvidia.com/datacenter/dcgm/latest/index.html. The DCGM team can be reached on GitHub here: https://github.com/NVIDIA/DCGM.
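As a stopgap until DCGM is set up, a rough first pass at spotting a degraded GPU could look like the sketch below. It is not a DCGM feature and it does not inspect the kernel queue; it only parses the CSV output of `nvidia-smi --query-gpu` and flags GPUs that are busy while their SM clock sits far below the maximum. The helper names (`parse_gpu_rows`, `flag_slow_gpus`, `query_gpus`) and the thresholds (`min_clock_ratio`, `busy_util`) are assumptions made for illustration and would need tuning for a real cluster.

```python
import csv
import io
import subprocess

# Fields requested from `nvidia-smi --query-gpu`, in output column order.
QUERY_FIELDS = [
    "index",
    "name",
    "utilization.gpu",
    "temperature.gpu",
    "clocks.sm",
    "clocks.max.sm",
]


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, util, temp, sm, sm_max = (v.strip() for v in rec)
        try:
            rows.append({
                "index": int(index),
                "name": name,
                "util": int(util),
                "temp": int(temp),
                "sm_clock": int(sm),
                "sm_clock_max": int(sm_max),
            })
        except ValueError:
            continue  # a field was reported as unavailable, e.g. "[N/A]"
    return rows


def flag_slow_gpus(rows, min_clock_ratio=0.5, busy_util=50):
    """Hypothetical heuristic: busy GPUs running far below their max SM clock."""
    return [
        r["index"]
        for r in rows
        if r["util"] >= busy_util
        and r["sm_clock"] < min_clock_ratio * r["sm_clock_max"]
    ]


def query_gpus():
    """Run nvidia-smi on the local node and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)
```

On a node with the NVIDIA driver installed, `flag_slow_gpus(query_gpus())` returns the indices of suspect GPUs. Treat any hit as a hint to run proper diagnostics on that device, not as a verdict.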