Commit 4832201

docs: adding MIG debug, inspired by /issues/2155
Signed-off-by: framsouza <fram.souza14@gmail.com>
1 parent: 73b60a9

File tree

1 file changed: +244 −9 lines changed

docs/troubleshooting/gpu-operator-production-runbook.md

Lines changed: 244 additions & 9 deletions
@@ -29,11 +29,12 @@ Written for production and high-scale environments, with real commands and repre

10. [ClusterPolicy and Operand State Issues](#10-symptom-clusterpolicy-and-operand-state-issues)
11. [Driver Upgrade Failures](#11-symptom-driver-upgrade-failures)
12. [GPU Feature Discovery Issues](#12-symptom-gpu-feature-discovery-issues)
13. [Operator Pods Timeout on MIG-Enabled Nodes](#13-symptom-operator-pods-fail-with-context-deadline-exceeded-on-mig-enabled-nodes)
14. [Using the GPU Operator with Host-Installed Drivers](#14-using-the-gpu-operator-with-host-installed-drivers)
15. [High-Scale and Multi-Node Debugging](#15-high-scale-and-multi-node-debugging)
16. [Recommended Debugging Flow](#16-recommended-debugging-flow)
17. [Minimal Command Bundle for Incident Triage](#17-minimal-command-bundle-for-incident-triage)
18. [Useful Node Labels Reference](#18-useful-node-labels-reference)

---
3940

@@ -1200,7 +1201,241 @@ nfd-worker-j8m3n 1/1 Running 0 5d

---

## 13. Symptom: Operator Pods Fail with Context Deadline Exceeded on MIG-Enabled Nodes

When MIG is configured on GPUs with many instances (e.g., `1g.23gb` on B200, which creates up to 8 instances per GPU), operator pods such as `gpu-feature-discovery`, `nvidia-device-plugin`, or `nvidia-operator-validator` may fail with `context deadline exceeded` errors during startup.
### Root cause

The NVIDIA kernel driver uses internal locks to serialize access when processes query GPU state through NVML or `nvidia-smi`. With MIG enabled, each MIG instance is a separate device handle, multiplying the amount of work per query.

While containerd is running, the `nvidia-container-runtime` plugin holds NVML handles to all GPU devices. This creates lock contention at the kernel driver level: any concurrent `nvidia-smi` or NVML call must wait for the lock.

On a node with many MIG instances, `nvidia-smi` execution time increases significantly. For example, on B200 GPUs with the `1g.23gb` MIG profile:

- With containerd **stopped**: `nvidia-smi` completes in ~5 seconds
- With containerd **running**: `nvidia-smi` takes ~44 seconds

When the node starts and all GPU Operator pods are scheduled simultaneously, they all query the driver at once, creating a "query storm" that pushes response times beyond the configured timeouts.
### Diagnose the issue

Measure `nvidia-smi` execution time on the affected node:

```bash
time nvidia-smi
```

If this takes more than 30 seconds, the node is likely affected by driver lock contention.
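A single measurement can mislead if the node happens to be quiet at that moment. A small sketch that repeats the timing and flags slow runs against the 30-second rule of thumb above (the run count and threshold are arbitrary choices, not operator defaults):

```shell
#!/usr/bin/env bash
# Repeat the nvidia-smi timing check and flag runs that exceed the
# 30-second rule of thumb. Run count and threshold are arbitrary;
# tune them for your fleet.
THRESHOLD_SECONDS=30
RUNS=3

is_slow() {
  # true when a measured duration (whole seconds) exceeds the threshold
  [ "$1" -gt "$THRESHOLD_SECONDS" ]
}

for i in $(seq 1 "$RUNS"); do
  start=$(date +%s)
  nvidia-smi > /dev/null 2>&1   # only the wall time matters here
  end=$(date +%s)
  dur=$((end - start))
  echo "run $i: ${dur}s"
  if is_slow "$dur"; then
    echo "run $i exceeded ${THRESHOLD_SECONDS}s: likely driver lock contention"
  fi
done
```

Consistently slow runs point at contention rather than a one-off stall; compare against a run taken with containerd stopped if you can afford the disruption.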
Check for pods in `CrashLoopBackOff` or with `context deadline exceeded` in their logs:

```bash
kubectl get pods -n gpu-operator -o wide | grep -v Running
kubectl logs -n gpu-operator -l app=gpu-feature-discovery --tail=50
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50
```

Look for messages like:

```
context deadline exceeded
nvidia-smi failed
GPU resources are not discovered by the node
```
### Where the timeouts are defined

The driver container startup probe runs `nvidia-smi` directly. The probe script is defined in `assets/state-driver/0400_configmap.yaml`:

```sh
if ! nvidia-smi; then
    echo "nvidia-smi failed"
    exit 1
fi
```

The probe timeout defaults are set in `internal/state/driver.go` (`getDefaultStartupProbe()`):

- `TimeoutSeconds: 60` — each probe attempt must complete within 60 seconds
- `PeriodSeconds: 10` — the probe runs every 10 seconds
- `FailureThreshold: 120` — the pod is killed after 120 consecutive failures
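Multiplying these defaults out shows how long the driver container is actually given before the kubelet kills it; a quick sanity-check of the arithmetic:

```shell
# Worst-case startup window implied by the defaults above: the pod is
# killed only after FailureThreshold consecutive failed probes,
# spaced PeriodSeconds apart.
PERIOD_SECONDS=10
FAILURE_THRESHOLD=120
window=$((PERIOD_SECONDS * FAILURE_THRESHOLD))
echo "startup probe window: ${window}s (~$((window / 60)) minutes)"
# prints: startup probe window: 1200s (~20 minutes)
```

So the driver container itself has roughly 20 minutes to come up, which is why it is typically the downstream components with shorter or hard-coded timeouts that fail first.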
When the startup probe succeeds, it writes `/run/nvidia/validations/.driver-ctr-ready`. Other components (GFD, device plugin, DCGM, MIG manager) have init containers that poll for this file every 5 seconds with no upper timeout:

```yaml
args: ["until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done"]
```

The operator validator has a hard-coded GPU resource discovery timeout of 150 seconds (30 retries × 5 seconds), defined in `cmd/nvidia-validator/main.go`:

```go
gpuResourceDiscoveryWaitRetries     = 30
gpuResourceDiscoveryIntervalSeconds = 5
```

If the device plugin hasn't registered MIG resources within 2.5 minutes (because it is also waiting on slow NVML calls), the validator fails.
### Workarounds

**Increase startup probe timeouts via ClusterPolicy:**

The `ClusterPolicy` CRD exposes probe configuration on the driver spec (`api/nvidia/v1/clusterpolicy_types.go`, `ContainerProbeSpec`):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    startupProbe:
      initialDelaySeconds: 120
      timeoutSeconds: 120
      periodSeconds: 15
      failureThreshold: 180
```
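If you prefer not to edit the full ClusterPolicy manifest, the same override can be expressed as a merge patch; a sketch, assuming the default `cluster-policy` resource name:

```shell
# Merge-patch only the driver startup probe; the operator reconciles
# the driver DaemonSet with the new values. Guarded so the payload is
# still shown when kubectl is unavailable.
PATCH='{"spec":{"driver":{"startupProbe":{"initialDelaySeconds":120,"timeoutSeconds":120,"periodSeconds":15,"failureThreshold":180}}}}'
if command -v kubectl >/dev/null 2>&1; then
  kubectl patch clusterpolicy cluster-policy --type merge -p "$PATCH"
else
  echo "kubectl not found; patch payload: $PATCH"
fi
```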
**Stagger operator component startup:**

Temporarily disable components on the node, let the driver and toolkit initialize first, then enable the rest:

```bash
# Disable components initially
kubectl label node <gpu-node> nvidia.com/gpu.deploy.gpu-feature-discovery=false --overwrite
kubectl label node <gpu-node> nvidia.com/gpu.deploy.device-plugin=false --overwrite

# Wait for driver and toolkit pods to be Running
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -w

# Then enable components one at a time
kubectl label node <gpu-node> nvidia.com/gpu.deploy.device-plugin=true --overwrite
# Wait for the device plugin to be Running, then:
kubectl label node <gpu-node> nvidia.com/gpu.deploy.gpu-feature-discovery=true --overwrite
```
**Staged MIG application (if MIG is managed outside the operator):**

1. `kubectl cordon <node>`
2. `systemctl stop kubelet && systemctl stop containerd`
3. Apply the MIG configuration
4. `systemctl start containerd && systemctl start kubelet`
5. `kubectl uncordon <node>`

This avoids the concurrent pod startup storm, since pods come up sequentially after the node rejoins.
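The five steps above can be wrapped in one script run on the node itself. A sketch with a dry-run default so it can be reviewed before use; the node name and MIG profile ID are placeholders, and the `nvidia-smi mig` call is just one example of applying MIG outside the operator:

```shell
#!/usr/bin/env bash
# Staged MIG application: cordon, stop the container stack, apply MIG,
# restart, uncordon. DRY_RUN=1 (the default here) prints commands
# instead of executing them; set DRY_RUN=0 to actually run.
set -euo pipefail
NODE="${NODE:-gpu-node-01}"      # placeholder node name
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run kubectl cordon "$NODE"
run systemctl stop kubelet
run systemctl stop containerd
run nvidia-smi mig -cgi 19 -C    # illustrative profile ID; use your MIG config
run systemctl start containerd
run systemctl start kubelet
run kubectl uncordon "$NODE"
```

Keeping kubelet and containerd down while MIG is reconfigured is what prevents the query storm described above.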
### Common causes and resolutions

| Cause | Resolution |
|---|---|
| Many MIG instances causing slow `nvidia-smi` | Increase the startup probe `timeoutSeconds` in ClusterPolicy |
| All operator pods starting simultaneously | Stagger component startup using node labels |
| Hard-coded 150s validator timeout too short | Apply the MIG config before starting kubelet (staged approach) |
| containerd + MIG lock contention | Cordon the node, stop services, configure MIG, restart |

---
## 14. Using the GPU Operator with Host-Installed Drivers

The GPU Operator does not require managing the NVIDIA driver. If the NVIDIA driver is already installed directly on your nodes (e.g., via a package manager, Base Command Manager, or a pre-built machine image), you can still use the operator for all the other components: container toolkit, device plugin, GPU feature discovery, DCGM, DCGM exporter, MIG manager, and the operator validator.
### Configuration

Disable the driver component in the `ClusterPolicy`:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: false
  # All other components remain enabled by default
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  migManager:
    enabled: true
  gfd:
    enabled: true
```

Or via Helm:

```bash
helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false
```
### What you get

With `driver.enabled=false`, the operator skips the driver DaemonSet but still deploys:

- **nvidia-container-toolkit** — configures the container runtime (containerd/CRI-O) to expose GPUs inside containers
- **nvidia-device-plugin** — registers GPU resources with the Kubernetes scheduler (`nvidia.com/gpu` or `nvidia.com/mig-*`)
- **gpu-feature-discovery** — labels nodes with GPU properties (model, memory, compute capability, MIG status)
- **dcgm / dcgm-exporter** — GPU health monitoring and Prometheus metrics
- **nvidia-mig-manager** — manages the MIG partition lifecycle
- **nvidia-operator-validator** — validates that the full stack is functional
### Prerequisites

When using host-installed drivers, ensure that:

1. The NVIDIA kernel module is loaded (`lsmod | grep nvidia`)
2. `nvidia-smi` works on the host
3. The driver version is compatible with the operator version (check the [GPU Operator compatibility matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html))
4. The NVIDIA device files exist under `/dev` (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, etc.)
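The first, second, and fourth checks are easy to script as a preflight; a sketch that only warns (the version compatibility check still needs the matrix linked above):

```shell
#!/usr/bin/env bash
# Host-driver preflight: print a warning for each prerequisite that is
# not met. Does not abort, so it is safe in node bootstrap scripts.
fail=0
warn() { echo "WARN: $*"; fail=1; }

lsmod 2>/dev/null | grep -q '^nvidia' || warn "nvidia kernel module not loaded"
command -v nvidia-smi >/dev/null 2>&1 || warn "nvidia-smi not found on PATH"
[ -e /dev/nvidiactl ] || warn "/dev/nvidiactl missing"
[ -e /dev/nvidia-uvm ] || warn "/dev/nvidia-uvm missing"

if [ "$fail" -eq 0 ]; then
  echo "host driver preflight: OK"
else
  echo "host driver preflight: issues found (see warnings above)"
fi
```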
### Verify the setup

```bash
# Confirm the operator did not deploy its own driver DaemonSet
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
# Expected: no pods (the driver DaemonSet is not deployed)

# Verify the toolkit detects the host driver
kubectl logs -n gpu-operator -l app=nvidia-container-toolkit-daemonset --tail=20

# Confirm the device plugin registered GPU resources
kubectl get node <gpu-node> -o json | jq '.status.capacity | with_entries(select(.key | startswith("nvidia.com")))'
```

Expected output:

```json
{
  "nvidia.com/gpu": "8"
}
```

Or with MIG enabled:

```json
{
  "nvidia.com/mig-1g.23gb": "8"
}
```
### Troubleshooting

If operator components fail with host-installed drivers, check:

| Symptom | Check |
|---|---|
| Toolkit pod stuck or failing | `nvidia-smi` works on the host; driver device files exist under `/dev` |
| Device plugin shows 0 GPUs | Toolkit pod is running; the runtime is correctly configured (`/etc/nvidia-container-runtime/config.toml`) |
| Validator init container stuck on `driver-validation` | Host driver is loaded and functional; `/run/nvidia/driver` is accessible |

---
## 15. High-Scale and Multi-Node Debugging

In clusters with hundreds of GPU nodes, targeted debugging is essential.

@@ -1324,7 +1559,7 @@ gpu-node-04 NVIDIA-L4

---

## 16. Recommended Debugging Flow

```
GPU workload failing or pending
@@ -1368,7 +1603,7 @@ Check node allocatable GPUs ─────────────────

---

## 17. Minimal Command Bundle for Incident Triage

Copy and run these commands in the first 5 minutes of any GPU incident. They cover the operator pod overview, ClusterPolicy state, per-node GPU resources, workload events, component logs in dependency order, and recent events.
@@ -1396,7 +1631,7 @@ kubectl get events -n gpu-operator --sort-by=.lastTimestamp | tail -30

---

## 18. Useful Node Labels Reference

Labels automatically managed by the GPU Operator and GPU Feature Discovery: