## 13. Symptom: Operator Pods Fail with Context Deadline Exceeded on MIG-Enabled Nodes
When MIG is configured on GPUs with many instances (e.g., `1g.23gb` on B200, which creates up to 8 instances per GPU), operator pods such as `gpu-feature-discovery`, `nvidia-device-plugin`, or `nvidia-operator-validator` may fail with `context deadline exceeded` errors during startup.
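To see how many MIG devices the node actually exposes (the multiplier behind the slow queries), list them on the host. `nvidia-smi -L` is a standard flag; the counts mentioned in the comments are illustrative:

```shell
# List physical GPUs and their MIG instances; with the 1g.23gb profile on
# B200, expect up to 8 MIG devices per GPU in this output.
nvidia-smi -L

# Count only the MIG instance lines:
nvidia-smi -L | grep -c MIG
```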
### Root cause
The NVIDIA kernel driver uses internal locks to serialize access when processes query GPU state through NVML or `nvidia-smi`. With MIG enabled, each MIG instance is a separate device handle, multiplying the amount of work per query.
When containerd is running, the `nvidia-container-runtime` plugin holds NVML handles to all GPU devices. This creates lock contention at the kernel driver level: any concurrent `nvidia-smi` or NVML call must wait for the lock.
On a node with many MIG instances, this causes `nvidia-smi` execution time to increase significantly. For example, on B200 GPUs with `1g.23gb` MIG profile:
- With containerd **stopped**: `nvidia-smi` completes in ~5 seconds
- With containerd **running**: `nvidia-smi` takes ~44 seconds
When the node starts and all GPU operator pods are scheduled simultaneously, they all query the driver at the same time — creating a "query storm" that pushes response times beyond the configured timeouts.
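One way to confirm the contention is to reproduce the stopped-vs-running comparison above yourself. This is disruptive (stopping containerd kills pod supervision on the node), so only run it on a node you have drained:

```shell
# Time nvidia-smi with containerd running, then stopped.
# On an affected node the first run takes tens of seconds, the second a few.
time nvidia-smi > /dev/null

sudo systemctl stop containerd
time nvidia-smi > /dev/null
sudo systemctl start containerd
```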
### Diagnose the issue
Measure `nvidia-smi` execution time on the affected node:
```bash
time nvidia-smi
```
If this takes more than 30 seconds, the node is likely affected by driver lock contention.
Check for pods in `CrashLoopBackOff` or with `context deadline exceeded` in their logs:
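A hedged way to spot affected pods with standard kubectl (the namespace and label are the common defaults for a Helm install; adjust for your environment):

```shell
# Operator pods that are not running cleanly
kubectl get pods -n gpu-operator --field-selector=status.phase!=Running
kubectl get pods -n gpu-operator | grep -i crash

# Search recent validator logs for the timeout error
kubectl logs -n gpu-operator -l app=nvidia-operator-validator --tail=100 \
  | grep -i "context deadline exceeded"
```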
### Understand the timeout chain

The driver container startup probe runs `nvidia-smi` directly. The probe script is defined in `assets/state-driver/0400_configmap.yaml`:
```sh
if ! nvidia-smi; then
    echo "nvidia-smi failed"
    exit 1
fi
```
The probe timeout defaults are set in `internal/state/driver.go` (`getDefaultStartupProbe()`):
- `TimeoutSeconds: 60` — each probe attempt must complete within 60 seconds
- `PeriodSeconds: 10` — probe runs every 10 seconds
- `FailureThreshold: 120` — pod is killed after 120 consecutive failures
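Taken together, these defaults bound how long the driver container gets to come up before it is restarted. A quick back-of-the-envelope check (pure shell arithmetic over the default values above; the real window can stretch further when each attempt runs up against its 60-second timeout):

```shell
timeout_seconds=60
period_seconds=10
failure_threshold=120

# Lower bound on the startup window: one attempt per period, threshold attempts
max_window=$((period_seconds * failure_threshold))
echo "driver startup window: at least ${max_window}s ($((max_window / 60)) minutes)"
```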
When the startup probe succeeds, it writes `/run/nvidia/validations/.driver-ctr-ready`. Other components (GFD, device-plugin, DCGM, MIG manager) have init containers that poll for this file every 5 seconds with no upper timeout:
```yaml
args: ["until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done"]
```
The operator validator has a hard-coded GPU resource discovery timeout of 150 seconds (30 retries x 5 seconds), defined in `cmd/nvidia-validator/main.go`:
```go
gpuResourceDiscoveryWaitRetries     = 30
gpuResourceDiscoveryIntervalSeconds = 5
```
If the device-plugin hasn't registered MIG resources within 2.5 minutes (because it is also waiting on slow NVML calls), the validator fails.
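Putting the numbers together shows how little headroom the budget leaves. With the ~44-second `nvidia-smi` wall time measured earlier, only a handful of slow NVML round trips fit inside the validator's window:

```shell
retries=30
interval=5
budget=$((retries * interval))   # 150 s total discovery budget

measured=44                      # example nvidia-smi wall time from this guide
echo "budget=${budget}s, ~$((budget / measured)) slow queries fit before the validator fails"
```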
### Workarounds
**Increase startup probe timeouts via ClusterPolicy:**
The `ClusterPolicy` CRD exposes probe configuration on the driver spec (`api/nvidia/v1/clusterpolicy_types.go`, `ContainerProbeSpec`):
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    startupProbe:
      initialDelaySeconds: 120
      timeoutSeconds: 120
      periodSeconds: 15
      failureThreshold: 180
```
**Stagger operator component startup:**
Temporarily disable components on the node, let the driver/toolkit initialize first, then enable the rest:
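One hedged way to do the staggering, assuming the component toggles shown in the ClusterPolicy examples in this guide (field names can differ between operator versions, so verify against your CRD first):

```shell
# Disable the heavier components so driver + toolkit can initialize alone
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec":{"devicePlugin":{"enabled":false},"dcgmExporter":{"enabled":false},"migManager":{"enabled":false}}}'

# Once the driver pod is Ready on the node, re-enable the rest
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec":{"devicePlugin":{"enabled":true},"dcgmExporter":{"enabled":true},"migManager":{"enabled":true}}}'
```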
## 14. Using the GPU Operator with Host-Installed Drivers
The GPU Operator does not have to manage the NVIDIA driver. If the driver is already installed directly on your nodes (e.g., via a package manager, Base Command Manager, or a pre-built machine image), you can still use the operator for all the other components: container toolkit, device plugin, GPU feature discovery, DCGM, DCGM exporter, MIG manager, and the operator validator.
### Configuration
Disable the driver component in the `ClusterPolicy`:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: false
  # All other components remain enabled by default
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  migManager:
    enabled: true
  gfd:
    enabled: true
```
Or via Helm:
```bash
helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false
```
### What you get
With `driver.enabled=false`, the operator skips the driver DaemonSet but still deploys:
- **nvidia-container-toolkit** — configures the container runtime (containerd/CRI-O) to expose GPUs inside containers
1381
+
- **nvidia-device-plugin** — registers GPU resources with the Kubernetes scheduler (`nvidia.com/gpu` or `nvidia.com/mig-*`)
- **nvidia-operator-validator** — validates the full stack is functional
### Prerequisites
When using host-installed drivers, ensure:
1. The NVIDIA kernel module is loaded (`lsmod | grep nvidia`)
2. `nvidia-smi` works on the host
3. The driver version is compatible with the operator version (check the [GPU Operator compatibility matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html))
4. The NVIDIA device files exist under `/dev` (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, etc.)
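The checks above can be scripted. A minimal sketch to run on the host before disabling the operator's driver management (device file names follow the defaults listed above; add any others your workloads need):

```shell
# Verify host-installed driver prerequisites
lsmod | grep -q '^nvidia'      || echo "FAIL: nvidia kernel module not loaded"
nvidia-smi > /dev/null 2>&1    || echo "FAIL: nvidia-smi not working on the host"
[ -e /dev/nvidiactl ]          || echo "FAIL: /dev/nvidiactl missing"
[ -e /dev/nvidia-uvm ]         || echo "FAIL: /dev/nvidia-uvm missing"
[ -e /dev/nvidia0 ]            || echo "FAIL: /dev/nvidia0 missing"
```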
### Verify the setup
```bash
# Check that the driver is detected even though it's not managed by the operator
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
# Expected: no pods (driver DaemonSet is not deployed)
```

Copy and run these commands in the first 5 minutes of any GPU incident. They cover operator pod overview, ClusterPolicy state, per-node GPU resources, workload events, component logs in dependency order, and recent events.