diff --git a/gpu-operator/troubleshooting.rst b/gpu-operator/troubleshooting.rst
index d9097d14d..9fadab98d 100644
--- a/gpu-operator/troubleshooting.rst
+++ b/gpu-operator/troubleshooting.rst
@@ -20,6 +20,337 @@ Troubleshooting the NVIDIA GPU Operator
 #######################################
 
+This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.
+
+If you are facing an issue that is not covered by this page, please file an issue in the
+`NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator>`_.
+
+
+***********************************
+GPU Operator pods are stuck in Init
+***********************************
+
+.. rubric:: Observation
+   :class: h4
+
+The output from ``kubectl get pods -n gpu-operator`` shows something like the following:
+
+.. code-block:: console
+
+   gpu-feature-discovery-tmblp                0/1   Init:0/1   0   11m
+   nvidia-container-toolkit-daemonset-mqzwq   0/1   Init:0/1   0   2m
+   nvidia-dcgm-exporter-qpxxl                 0/1   Init:0/1   0   8m32s
+   nvidia-device-plugin-daemonset-tl9k7       0/1   Init:0/1   0   11m
+   nvidia-operator-validator-th4w7            0/1   Init:0/4   0   10m
+   nvidia-driver-daemonset-4rtiu              0/2   Running    3   12m
+
+.. rubric:: Root Cause
+   :class: h4
+
+This most likely indicates an issue with the nvidia-driver-daemonset.
+The operand pods only come up after the driver daemonset and container toolkit pods start successfully.
+
+.. rubric:: Action
+   :class: h4
+
+1. **Check the driver daemonset pod logs:**
+
+   - To retrieve the main driver container logs:
+
+     .. code-block:: console
+
+        kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr
+
+   - If you see ``Init:Error`` in the ``kubectl`` output, retrieve the k8s-driver-manager logs:
+
+     .. code-block:: console
+
+        kubectl logs -n gpu-operator nvidia-driver-daemonset-p97x5 -c k8s-driver-manager
+
+2. **Check the dmesg logs**
+
+   - ``dmesg`` displays the messages generated by the Linux kernel and helps detect issues with loading the GPU driver modules, especially when the driver daemonset logs are not informative.
+   - You can retrieve the ``dmesg`` output either with ``kubectl exec`` or by running ``dmesg`` directly on the host.
+
+     Using ``kubectl exec``:
+
+     .. code-block:: console
+
+        kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- dmesg
+
+     Running ``dmesg`` in your host terminal:
+
+     .. code-block:: console
+
+        sudo dmesg
+
+     **TIP**: You can also grep for ``NVRM`` or ``Xid`` to view the messages emitted by the driver's kernel module.
+
+     .. code-block:: console
+
+        sudo dmesg | grep -i NVRM
+
+     OR
+
+     .. code-block:: console
+
+        sudo dmesg | grep -i Xid
+
+3. **Ensure that the driver daemonset has internet access to download deb/rpm packages at runtime:**
+
+   - Check your Kubernetes cluster's VPC, security group, and DNS settings.
+   - Consider opening a shell inside the container and testing internet connectivity with a simple ``ping`` command.
+
+*************************************
+No runtime for "nvidia" is configured
+*************************************
+
+.. rubric:: Observation
+   :class: h4
+
+When you run ``kubectl describe`` on one of the GPU Operator pods, you see an error like:
+
+.. code-block:: console
+
+   Warning  FailedCreatePodSandBox  2m37s (x94 over 22m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
+
+.. rubric:: Root Cause
+   :class: h4
+
+This means that the ``RuntimeClass`` is unable to find a runtime handler named "nvidia" in your container runtime's configuration.
+The runtime handler is added by the NVIDIA Container Toolkit, so this error usually points to startup issues with the nvidia-container-toolkit pod.
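+
+You can confirm this on the node before working through the steps below.
+The following is only a sketch for a containerd-based node; the configuration section names vary with the containerd version, and the check differs for CRI-O.
+
+.. code-block:: console
+
+   # Dump the merged containerd configuration and look for the "nvidia"
+   # runtime handler that the NVIDIA Container Toolkit is expected to add.
+   containerd config dump | grep -i -A 3 'runtimes.nvidia'
+
+   # A configured node prints an entry similar to:
+   #   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+   #     runtime_type = "io.containerd.runc.v2"
+   # No output means the handler has not been added to the runtime configuration.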
+
+.. rubric:: Action
+   :class: h4
+
+1. **Check the nvidia-container-toolkit logs**
+
+   - To retrieve the toolkit pod logs:
+
+     .. code-block:: console
+
+        kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-2rhwg -c nvidia-container-toolkit-ctr
+
+2. **Check the driver daemonset logs**
+
+   - Ensure the driver daemonset is up and running. Refer to :ref:`GPU Operator pods are stuck in Init`.
+
+3. **Review the container runtime configuration TOML**
+
+   - CRI-O and containerd are the two main container runtimes supported by the toolkit. View the runtime configuration file and verify that the "nvidia" container runtime handler exists.
+   - Here are some ways to retrieve the container runtime configuration:
+
+     - If you use containerd, run the ``containerd config dump`` command to retrieve the active containerd configuration.
+     - If you use CRI-O, run the ``crio status config`` command to retrieve the active CRI-O configuration.
+
+*****************************************************************************
+Operator validator pods crashing with "error code system not yet initialized"
+*****************************************************************************
+
+When the operator validator pods crash with this error, it most likely points to a GPU node that is NVSwitch-based and requires NVIDIA Fabric Manager to be installed.
+NVSwitch-based systems, such as NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be set up after the GPU driver is installed.
+Learn more about Fabric Manager in the `Fabric Manager User Guide <https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html>`_.
+
+.. rubric:: Action
+   :class: h4
+
+1. **Run nvidia-smi -q**
+
+   - If you are using the GPU driver daemonset, execute into the driver container and run ``nvidia-smi -q``:
+
+     .. code-block:: console
+
+        kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- nvidia-smi -q
+
+   - ``nvidia-smi -q`` displays a verbose output with all the attributes of a GPU.
+   - If you see the following in the ``nvidia-smi -q`` output, then nvidia-fabricmanager needs to be installed:
+
+     .. code-block:: console
+
+        Fabric
+            State        : In Progress
+            Status       : N/A
+            CliqueId     : N/A
+            ClusterUUID  : N/A
+
+     Note: If your driver is preinstalled on the host system, run ``nvidia-smi -q`` in your host's shell terminal.
+
+2. **Refer to the nvidia-driver-daemonset logs**
+
+   - The driver daemonset contains the logic to detect NVSwitches and install ``nvidia-fabricmanager`` when they are found.
+   - Check the driver daemonset logs to confirm whether the NVSwitch devices were detected and whether ``nvidia-fabricmanager`` was installed successfully.
+
+3. **Check the Fabric Manager logs**
+
+   - If the operator validator pods are still crashing even though Fabric Manager is installed, inspect the Fabric Manager logs.
+   - If the GPU driver daemonset is deployed, execute into the driver container and run ``cat /var/log/fabricmanager.log``:
+
+     .. code-block:: console
+
+        kubectl exec -n gpu-operator -it nvidia-driver-daemonset-p97x5 -c nvidia-driver-ctr -- cat /var/log/fabricmanager.log
+
+   - If you are using a host-installed driver, SSH into the host and run ``cat /var/log/fabricmanager.log``.
+
+*************************************************************************
+GPU Feature Discovery crashing with CreateContainerError/CrashLoopBackoff
+*************************************************************************
+
+When the GPU Feature Discovery pods start crashing and you see the error below in the ``kubectl describe`` output, the root cause is likely a driver or hardware issue.
+
+.. code-block:: console
+
+   ....
+   ....
+   Containers:
+     gpu-feature-discovery:
+       Container ID:  containerd://947879d0f2a3e3a11187c3435c2e13f1d8962540b8853cebb409eaa47f661c34
+       Image:         nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
+       Image ID:      nvcr.io/nvidia/gpu-feature-discovery@sha256:84ce86490d0d313ed6517f2ac3a271e1179d7478d86c772da3846727d7feddc3
+       Port:          <none>
+       Host Port:     <none>
+       State:         Waiting
+         Reason:      CrashLoopBackOff
+       Last State:    Terminated
+         Reason:      StartError
+         Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out: unknown
+
+.. rubric:: Action
+   :class: h4
+
+1. **Check dmesg logs**
+
+   - ``dmesg`` can be used to surface issues stemming from the GPU driver or hardware.
+   - You can narrow your search by grepping for ``NVRM`` or ``Xid`` in the ``dmesg`` output, for example ``sudo dmesg | grep -i NVRM`` or ``sudo dmesg | grep -i Xid``.
+   - If the output contains something like the snippet below, it is likely a GPU driver or hardware issue.
+
+     .. code-block:: console
+
+        # dmesg | grep -i xid
+        NVRM: Xid (PCI:0000:ca:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
+
+     This message indicates an Xid error with code 79. For more information on Xid errors and their error codes, refer to the `Xid Errors <https://docs.nvidia.com/deploy/xid-errors/index.html>`_ page.
+
+2. **Check nvidia-device-plugin-daemonset logs**
+
+   - The ``nvidia-device-plugin`` has a health checker module that periodically monitors the NVML event stream for Xid errors and marks a GPU as unhealthy when an Xid error is reported against it.
+   - Retrieve the ``nvidia-device-plugin-daemonset`` pod logs:
+
+     .. code-block:: console
+
+        kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-9bmvc -c nvidia-device-plugin
+
+   - If there are Xid errors, the device plugin logs look something like:
+
+     .. code-block:: console
+
+        XidCriticalError: Xid=48 on Device=GPU-e3dbf294-2783-f38b-4274-5bc836df5be1; marking device as unhealthy.
+
+        'nvidia.com/gpu' device marked unhealthy: GPU-e3dbf294-2783-f38b-4274-5bc836df5be1
+
+**************************************************
+GPU Node does not have the expected number of GPUs
+**************************************************
+
+When inspecting your GPU node, you may not see the expected number of "Allocatable" GPUs advertised on the node.
+
+For example, given a GPU node with 8 GPUs, its ``kubectl describe`` output may look like the snippet below:
+
+.. code-block:: console
+
+   Name:               gpu-node-1
+   Roles:              worker
+   ......
+   ......
+   Addresses:
+     InternalIP:  10.158.144.58
+     Hostname:    gpu-node-1
+   Capacity:
+     cpu:                96
+     ephemeral-storage:  106935552Ki
+     hugepages-1Gi:      0
+     hugepages-2Mi:      0
+     memory:             527422416Ki
+     nvidia.com/gpu:     7
+     pods:               110
+   Allocatable:
+     cpu:                96
+     ephemeral-storage:  98551804561
+     hugepages-1Gi:      0
+     hugepages-2Mi:      0
+     memory:             527320016Ki
+     nvidia.com/gpu:     7
+     pods:               110
+   ....
+   ....
+
+The node above advertises only 7 allocatable GPU devices when we expect it to advertise 8.
+
+.. rubric:: Action
+   :class: h4
+
+1. Check for any Xid errors in the ``nvidia-device-plugin-daemonset`` pod logs. If an Xid error is raised for a GPU,
+   the device plugin automatically marks the GPU as unhealthy and removes it from the list of "Allocatable" GPUs.
+   Here are some example device-plugin logs in the event of an Xid error:
+
+   .. code-block:: console
+
+      I0624 22:58:05.486593       1 health.go:159] Processing event {Device:{Handle:0x7f7597647848} EventType:8 EventData:109 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
+      I0624 22:58:05.486697       1 health.go:185] XidCriticalError: Xid=79 on Device=GPU-adb24b25-1db1-436e-d958-ddee5da83d07; marking device as unhealthy.
+      I0624 22:58:05.486727       1 server.go:276] 'nvidia.com/gpu' device marked unhealthy: GPU-adb24b25-1db1-436e-d958-ddee5da83d07
+
+2. You can also check for Xid errors in the GPU node's ``dmesg`` logs.
+
+   .. code-block:: console
+
+      sudo dmesg | grep -i xid
+
+3. For more information on Xid error codes and how to resolve them, refer to the `Xid Errors <https://docs.nvidia.com/deploy/xid-errors/index.html>`_ page.
+
+*******************************************
+DCGM Exporter pods go into CrashLoopBackoff
+*******************************************
+
+By default, the GPU Operator deploys only the ``dcgm-exporter`` and disables the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a DCGM instance locally.
+If, however, ``dcgm`` is enabled and deployed as a separate pod, the ``dcgm-exporter`` attempts to connect to the ``dcgm`` pod through a Kubernetes service.
+If the cluster networking settings aren't applied correctly, you will likely see error messages like the following in the ``dcgm-exporter`` logs:
+
+.. code-block:: console
+
+   time="2025-06-25T20:09:25Z" level=info msg="Attempting to connect to remote hostengine at nvidia-dcgm:5555"
+   time="2025-06-25T20:09:30Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()
+       /usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()
+       /go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:283 +0x3d\npanic({0x18b42c0?, 0x2a8d3e0?})
+       /usr/local/go/src/runtime/panic.go:770
+
+.. rubric:: Action
+   :class: h4
+
+1. If you have ``NetworkPolicies`` set up, ensure that they allow the dcgm-exporter pod to communicate with the dcgm pod.
+2. Ensure that no security groups or network firewall settings prevent pod-to-pod traffic, whether intra-node or inter-node.
+
+***************************************
+GPU driver upgrades are not progressing
+***************************************
+
+After you initiate a cluster-wide driver upgrade, some driver pods may not be updated to the desired version, and this state can persist for a long period of time.
+
+.. code-block:: console
+
+   $ kubectl get daemonsets -n gpu-operator nvidia-driver-daemonset
+   NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                       AGE
+   nvidia-driver-daemonset   4         4         4       3            4           nvidia.com/gpu.deploy.driver=true   14d
+
+.. 
rubric:: Action + :class: h4 + +1. Check for any nodes that have the ``upgrade-failed`` label. + + .. code-block:: console + + kubectl get nodes -l nvidia.com/gpu-driver-upgrade-state=upgrade-failed + +2. Check the driver daemonset pod logs in these nodes +3. If the driver daemonset pod logs aren't informative, check the node's ``dmesg`` +4. Once the issue is resolved, you can re-label the node with the command below: + + .. code-block:: console + + kubectl label node "nvidia.com/gpu-driver-upgrade-state=upgrade-required" + +5. If the driver upgrade is still stuck, delete the driver pod on the node. + **************************************************************** Pods stuck in Pending state in mixed MIG + full GPU environments **************************************************************** @@ -97,7 +428,7 @@ On each node, run the following commands to prevent loading the ``nouveau`` Linu $ sudo init 6 -************************************* +************************************* No GPU Driver or Operand Pods Running ************************************* @@ -127,7 +458,6 @@ The ``NoSchedule`` taint prevents the Operator from deploying the GPU Driver and Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets. - ************************************* GPU Operator Pods Stuck in Crash Loop ************************************* @@ -172,7 +502,6 @@ can get stuck in a crash loop. PIDPressure False Tue, 26 Dec 2023 14:01:31 +0000 Tue, 12 Dec 2023 19:47:47 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Tue, 26 Dec 2023 14:01:31 +0000 Thu, 14 Dec 2023 19:15:13 +0000 KubeletReady kubelet is posting ready status - .. rubric:: Root Cause :class: h4 @@ -183,7 +512,7 @@ The memory resource limit for the GPU Operator is too low for the cluster size. Increase the memory request and limit for the GPU Operator pod: -- Set the memory request to a value that matches the average memory consumption over an large time window. +- Set the memory request to a value that matches the average memory consumption over a large time window. - Set the memory limit to match the spikes in memory consumption that occur occasionally. #. Increase the memory resource limit for the GPU Operator pod: @@ -203,22 +532,17 @@ Increase the memory request and limit for the GPU Operator pod: Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop. - -************************************************ infoROM is corrupted (nvidia-smi return code 14) -************************************************ - +================================================ .. rubric:: Issue :class: h4 The nvidia-operator-validator pod fails and nvidia-driver-daemonsets fails as well. - .. rubric:: Observation :class: h4 - The output from the driver validation container indicates that the infoROM is corrupt: .. code-block:: console @@ -260,37 +584,32 @@ The return values for the ``nvidia-smi`` command are listed below. Return code reflects whether the operation succeeded or failed and what was the reason of failure. 
-      · Return code 0 - Success
-
-      · Return code 2 - A supplied argument or flag is invalid
-      · Return code 3 - The requested operation is not available on target device
-      · Return code 4 - The current user does not have permission to access this device or perform this operation
-      · Return code 6 - A query to find an object was unsuccessful
-      · Return code 8 - A device's external power cables are not properly attached
-      · Return code 9 - NVIDIA driver is not loaded
-      · Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
-      · Return code 12 - NVML Shared Library couldn't be found or loaded
-      · Return code 13 - Local version of NVML doesn't implement this function
-      · Return code 14 - infoROM is corrupted
-      · Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
-      · Return code 255 - Other error or internal driver error occurred
-
+      · Return code 0 - Success
+      · Return code 2 - A supplied argument or flag is invalid
+      · Return code 3 - The requested operation is not available on target device
+      · Return code 4 - The current user does not have permission to access this device or perform this operation
+      · Return code 6 - A query to find an object was unsuccessful
+      · Return code 8 - A device's external power cables are not properly attached
+      · Return code 9 - NVIDIA driver is not loaded
+      · Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
+      · Return code 12 - NVML Shared Library couldn't be found or loaded
+      · Return code 13 - Local version of NVML doesn't implement this function
+      · Return code 14 - infoROM is corrupted
+      · Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
+      · Return code 255 - Other error or internal driver error occurred
 
 .. rubric:: Root Cause
    :class: h4
 
-The ``nvidi-smi`` command should return a success code (return code 0) for the driver-validator container to pass and GPU operator to successfully deploy driver pod on the node.
+The ``nvidia-smi`` command should return a success code (return code 0) for the driver-validator container to pass and for the GPU Operator to successfully deploy the driver pod on the node.
 
 .. rubric:: Action
    :class: h4
 
 Replace the faulty GPU.
 
-
-*********************
 EFI + Secure Boot
-*********************
-
+=================
 .. rubric:: Issue
    :class: h4
 
 GPU Driver pod fails to deploy.
@@ -300,9 +619,23 @@ GPU Driver pod fails to deploy.
 .. rubric:: Root Cause
    :class: h4
 
-EFI Secure Boot is currently not supported with GPU Operator
+EFI Secure Boot is currently not supported with the GPU Operator.
 
 .. rubric:: Action
    :class: h4
 
 Disable EFI Secure Boot on the server.
+
+File an issue
+=============
+
+If you are facing a GPU Operator or operand issue that is not documented in this guide, you can run the ``must-gather`` utility to prepare a bug report.
+
+.. code-block:: console
+
+   curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
+   chmod +x must-gather.sh
+   ./must-gather.sh
+
+The utility collects the information from your cluster that is needed for diagnosing and debugging issues.
+The final output is an archive file that contains the manifests and logs of all the components managed by the GPU Operator.
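+
+When you open the issue, it can also help to include the Operator version and basic cluster state alongside the must-gather archive.
+The commands below are only a sketch of the kind of supplementary information that is typically useful; adjust the namespace and release name to match your installation.
+
+.. code-block:: console
+
+   # GPU Operator and operand pod status
+   kubectl get pods -n gpu-operator -o wide
+
+   # ClusterPolicy spec, which records how the Operator is configured
+   kubectl get clusterpolicies.nvidia.com -o yaml
+
+   # Node details and, if the Operator was installed with Helm, the chart version
+   kubectl get nodes -o wide
+   helm list -n gpu-operator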