Commit 2075c4f

docs: add GPU Operator production troubleshooting runbook
The guide covers scenarios such as:

- GPUs not detected on nodes
- GPU workloads stuck in Pending
- driver daemonset failures
- device plugin initialization issues
- missing DCGM exporter metrics
- MIG configuration problems

It also includes example debugging commands and sample outputs to help operators quickly recognize common failure patterns.

The content is based on past operational troubleshooting experiences running GPU workloads on Kubernetes clusters using the GPU Operator.

1 parent 64c66c5 commit 2075c4f

1 file changed: +1421 -0 lines changed
