Commit 2075c4f
docs: add GPU Operator production troubleshooting runbook

The guide covers scenarios such as:
- GPUs not detected on nodes
- GPU workloads stuck in Pending
- driver daemonset failures
- device plugin initialization issues
- missing DCGM exporter metrics
- MIG configuration problems
It also includes example debugging commands and sample outputs to help
operators quickly recognize common failure patterns.

The content is based on operational experience troubleshooting GPU workloads
on Kubernetes clusters that use the GPU Operator.

1 parent 64c66c5
1 file changed, 1421 insertions(+), 0 deletions(-)