# Troubleshooting Guide

This section covers common deployment and runtime issues observed during Intel® AI for Enterprise Inference setup, along with step-by-step resolutions.

**Issues:**
 1. [Missing Default User](#1-ansible-deployment-failure--missing-default-user)
 2. [Authorization or sudo Password Failure](#2-authorization-or-sudo-password-failure)
 3. [Configuration Mismatch (Wrong Parameters)](#3-configuration-mismatch-wrong-parameters)
 4. [Kubernetes Cluster Not Reachable](#4-kubernetes-cluster-not-reachable)
 5. [Habana Device Plugin CrashLoopBackOff](#5-habana-device-plugin-crashloopbackoff)
 6. [Model Pods Remain in "Pending" State](#6-model-pods-remain-in-pending-state)
 7. [Models' Output is Garbled and/or Model Pods Failing](#7-models-output-is-garbled-andor-model-pods-failing)
 8. [Model Deployment Failure with Padding-aware Scheduling](#8-model-deployment-failure-with-padding-aware-scheduling)
 9. [Inference Stack Deploy Keycloak System Error](#9-inference-stack-deploy-keycloak-system-error)
 10. [Kubernetes Pods Failing with "Disk Pressure"](#10-kubernetes-pods-failing-with-disk-pressure)
 11. [Hugging Face Authentication Failure](#11-hugging-face-authentication-failure)
 12. [Docker Image Pull Failure](#12-docker-image-pull-failure)
 13. [Triton Package Compatibility Issue](#13-triton-package-compatibility-issue)
---

### 1. Ansible Deployment Failure — Missing Default User

```bash
TASK [download : Prep_download | Create staging directory on remote node]
fatal: [master1]: FAILED! => {"msg": "chown failed: failed to look up user ubuntu"}
```

**Cause:**

The default Ansible user `ubuntu` does not exist on your system.

**Fix:**

Many cloud images create the `ubuntu` user by default, but your system may not have it. Edit the inventory file to change the Ansible user to your own user:
```bash
vi inventory/hosts.yaml
```

Update `ansible_user` to the user that owns the Enterprise Inference installation; in the example below, simply `user`:

```yaml
all:
  hosts:
    master1:
      ansible_connection: local
      ansible_user: user
      ansible_become: true
```
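As a quick sanity check after editing, the sketch below confirms that the account named in the inventory actually exists on the host. The `awk` extraction is an assumption about the file layout, and a throwaway copy of the file (using `root` so the example runs anywhere) stands in for the real inventory; point the `awk` at `inventory/hosts.yaml` for a real check.

```bash
# Sketch: pull ansible_user out of an inventory file and confirm that
# account exists on this host. The temporary file is for illustration only.
tmp=$(mktemp -d)
cat > "$tmp/hosts.yaml" <<'EOF'
all:
  hosts:
    master1:
      ansible_connection: local
      ansible_user: root
      ansible_become: true
EOF
user=$(awk '/ansible_user:/ {print $2; exit}' "$tmp/hosts.yaml")
if id "$user" >/dev/null 2>&1; then
  echo "user '$user' exists"
else
  echo "user '$user' missing: update ansible_user in inventory/hosts.yaml"
fi
rm -r "$tmp"
```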

---

### 2. Authorization or sudo Password Failure

Deployment fails with authorization or privilege escalation issues.

**Fix:**

Two options:
1. Each time, just before executing inference-stack-deploy.sh, run `sudo echo sudoing` and enter your sudo password. This normally keeps your sudo authorization cached for the duration of the inference-stack-deploy.sh run.
2. Add the `--ask-become-pass` parameter in the inference-stack-deploy.sh script. Specifically, append this flag after `--become-user=root` in the `ansible-playbook` command of `run_reset_playbook()` and `run_fresh_install_playbook()` (lines 821 and 865). Note that the script will then wait for input of your sudo password each time it is run.
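For option 2, the edit amounts to appending the flag after `--become-user=root`. The sed below demonstrates the substitution on a sample line; the sample is illustrative, not the script's exact command, and the same substitution applies to both playbook functions.

```bash
# Sketch: add --ask-become-pass after --become-user=root on a stand-in line.
line='ansible-playbook -i "$INVENTORY" playbook.yml --become --become-user=root'
patched=$(printf '%s\n' "$line" | sed 's/--become-user=root/--become-user=root --ask-become-pass/')
echo "$patched"
```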

---

### 3. Configuration Mismatch (Wrong Parameters)

Deployment fails due to incorrect or missing configuration values.

**Fix:**
Before re-running deployment, verify and update your inference-config.cfg. These values must match your actual deployment environment.
```bash
cluster_url=api.example.com                  # <-- Replace with your cluster URL
cert_file=~/certs/cert.pem
key_file=~/certs/key.pem
keycloak_client_id=my-client-id              # <-- Replace with your Keycloak client ID
keycloak_admin_user=your-keycloak-admin-user # <-- Replace with your Keycloak admin username
keycloak_admin_password=changeme             # <-- Replace with your Keycloak admin password
vault_pass_code=place-holder-123
deploy_kubernetes_fresh=on
deploy_ingress_controller=on
deploy_keycloak_apisix=on
deploy_genai_gateway=off
deploy_observability=off
deploy_llm_models=on
deploy_ceph=off
deploy_istio=off
```
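Placeholder values left in the file are an easy miss. A small pre-flight sketch like the one below can flag them before a re-run; the key list and placeholder patterns are assumptions, and the temporary sample file stands in for your real inference-config.cfg.

```bash
# Sketch: flag required keys that are missing or still at placeholder values.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
cluster_url=api.example.com
keycloak_client_id=my-client-id
keycloak_admin_user=your-keycloak-admin-user
keycloak_admin_password=changeme
EOF
status=ok
for key in cluster_url keycloak_client_id keycloak_admin_user keycloak_admin_password; do
  val=$(sed -n "s/^${key}=//p" "$cfg")
  case "$val" in
    ""|changeme|your-*|place-holder-*) echo "check ${key}: '${val}'"; status=needs-edit ;;
    *) echo "${key} looks set" ;;
  esac
done
echo "result: $status"
rm -f "$cfg"
```

With the sample contents above, the check reports the admin user and password as still needing edits.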

---

### 4. Kubernetes Cluster Not Reachable

Deployment shows "cluster not reachable" or kubectl command failures.

**Possible Causes & Fixes:**

 - **Cause:** Sudo authorization is not cached

 - **Fix:** Prior to executing inference-stack-deploy.sh, run any sudo command, such as `sudo echo sudoing`. This caches your credentials for the duration of the inference-stack-deploy.sh run.

 - **Cause:** Ansible was uninstalled

 - **Fix:** Reinstall manually:

```bash
sudo apt update
sudo apt install -y ansible
```

 - **Cause:** Kubernetes configuration mismatch

 - **Fix:** Ensure `~/.kube/config` exists and the context points to the correct cluster.

 - **Cause:** Sudo is stripping the kubectl path from the environment, so kubectl is not found.

 - **Fix:** Ensure that the sudoers file includes the path `/usr/local/bin` in the `secure_path` variable. See the user-guide prerequisites for details.
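The last point can be checked mechanically. The snippet below tests a sudoers-style line for the kubectl directory; the sample line is an assumption standing in for the real file, which you would read with `sudo grep secure_path /etc/sudoers`.

```bash
# Sketch: verify /usr/local/bin appears in a secure_path definition.
line='Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"'
case "$line" in
  */usr/local/bin*) result="kubectl path present" ;;
  *) result="add /usr/local/bin to secure_path via visudo" ;;
esac
echo "$result"
```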

---

### 5. Habana Device Plugin CrashLoopBackOff

```bash
habana-ai-device-plugin-ds-* CrashLoopBackOff
ERROR: failed detecting Habana's devices on the system: get device name: no habana devices on the system
```

**Cause:**
The device plugin is unable to detect the Gaudi 3 PCIe cards.

**Fix:**
Update your Habana device plugin version. Version 1.22.1-6 is recommended.

```bash
kubectl set image pod/habana-ai-device-plugin-ds-tjbch \
  habana-ai-device-plugin=vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin:1.22.1-6
```

**Verification:**

```bash
kubectl get pods -A
```

Note: Ensure the habana-ai-device-plugin status changes to Running.

Additional checks:

| Check | Command |
| --- | --- |
| Driver/NIC versions | `hl-smi` |
| Runtime version | `dpkg -l` |
| Kubernetes health | `kubectl get nodes -o wide` |
| Device plugin logs | `kubectl logs -n habana-ai-operator <device-plugin-pod>` |

---

### 6. Model Pods Remain in "Pending" State

Problem: After the inference stack is deployed, model pods remain in the "Pending" state and do not progress to "Running", as shown here:

```bash
user@master1:~/Enterprise-Inference/core$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
keycloak-0                                    1/1     Running   0          15m
keycloak-postgresql-0                         1/1     Running   0          15m
vllm-deepkseek-r1-qwen-32b-64b885895f-dh566   0/1     Pending   0          10m
vllm-llama-8b-786d7678ff-6fr6l                0/1     Pending   0          10m
```

This can occur if the habana-ai-operator pod does not recognize that the Gaudi devices are allocatable. To check whether this is the cause, execute the following command:

```bash
kubectl describe node master1
```

Look for the "Capacity" and "Allocatable" sections as below, and ensure that both list the correct number of habana.ai/gaudi devices for your hardware.

```bash
Capacity:
  habana.ai/gaudi:  8
Allocatable:
  habana.ai/gaudi:  8
```

If the "Allocatable" section shows zero (0), your pods will remain in the Pending state.
To resolve this, execute the following command to restart the device plugin so it re-registers the devices:

```bash
kubectl rollout restart ds habana-ai-device-plugin-ds -n habana-ai-operator
```

If the "rollout restart" does not resolve the issue, a system reboot often fixes it.
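The allocatable count can also be read directly with a jsonpath query instead of scanning the full `describe` output; the node name here is taken from the example above, and the backslash escapes the dot inside the `habana.ai/gaudi` resource key.

```bash
# Sketch: print only the allocatable Gaudi count for the node (needs a live cluster).
kubectl get node master1 -o jsonpath='{.status.allocatable.habana\.ai/gaudi}{"\n"}'
```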

---

### 7. Models' Output is Garbled and/or Model Pods Failing

IOMMU passthrough is required for Gaudi 3 on **Ubuntu 24.04.2/22.04.5 with Linux kernel 6.8**, and models can produce garbled output or fail if this setting is not applied. Skip this section if a different OS or kernel version is used.

To enable IOMMU passthrough:
1. Add `GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt intel_iommu=on"` to `/etc/default/grub`.
2. Run `sudo update-grub`.
3. Reboot the system.
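After the reboot, the flags can be verified on the kernel command line. In this sketch a sample `/proc/cmdline` value stands in for the real one; on an actual system read it with `cmdline=$(cat /proc/cmdline)`.

```bash
# Sketch: check that both IOMMU flags made it onto the kernel command line.
cmdline='BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro iommu=pt intel_iommu=on'
ok=yes
for flag in iommu=pt intel_iommu=on; do
  case " $cmdline " in
    *" $flag "*) ;;                       # flag present
    *) ok=no; echo "missing: $flag" ;;    # flag absent: grub change did not apply
  esac
done
echo "iommu passthrough enabled: $ok"
```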

---

### 8. Model Deployment Failure with Padding-aware scheduling

**Error:** Padding-aware scheduling currently does not work with chunked prefill

**Cause:** This issue occurs when the `--use-padding-aware-scheduling` flag is enabled while deploying a vLLM model on Habana Gaudi 3.
The current vLLM version (v0.9.0.1+Gaudi-1.22.0) does not support using padding-aware scheduling together with chunked prefill.

**Fix:** If your workload doesn't require padding-aware scheduling, you can disable it to allow deployment to proceed.

Edit your `gaudi3-values.yaml` file. Locate and remove the following flag from the vLLM startup command:
```bash
--use-padding-aware-scheduling
```

Redeploy the vLLM Helm chart:
```bash
helm upgrade --install vllm-llama-8b ./core/helm-charts/vllm \
  --values ./core/helm-charts/vllm/gaudi3-values.yaml
```

Confirm the pod starts successfully:
```bash
kubectl get pods
kubectl logs -f <vllm-pod-name>
```

---

### 9. Inference Stack Deploy Keycloak System Error

**Error:** TASK \[Deploy Keycloak System\] FAILED! ... "Failure when executing Helm command ... response status code 429: toomanyrequests: You have reached your unauthenticated pull rate limit."

**Cause:** This error was seen when attempting a redeployment (running inference-stack-deploy.sh, menu "1) Provision Enterprise Inference Cluster") while the Keycloak service is already installed and inference-config.cfg has "deploy_keycloak_apisix=on".

**Fix:** Update inference-config.cfg to change "deploy_keycloak_apisix=on" to "deploy_keycloak_apisix=off" and rerun inference-stack-deploy.sh.
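The change can also be applied with a one-line sed. The sketch below runs it against a temporary copy; point the command at your real inference-config.cfg instead.

```bash
# Sketch: flip deploy_keycloak_apisix from on to off (on a throwaway copy).
cfg=$(mktemp)
echo 'deploy_keycloak_apisix=on' > "$cfg"
sed -i 's/^deploy_keycloak_apisix=on$/deploy_keycloak_apisix=off/' "$cfg"
new=$(cat "$cfg")
echo "$new"
rm -f "$cfg"
```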

---

### 10. Kubernetes pods failing with "disk pressure"

If pods are hanging in the "Pending" state or in CrashLoopBackOff with "disk pressure" messages when examining logs (`kubectl logs <pod>` or `kubectl describe pod <pod>`), you may be short of space on a required filesystem. The Enterprise Inference standard installation uses /opt/local-path-provisioner for local model storage. Ensure this location has sufficient space allocated. It is recommended that you undeploy any failing models, allocate more space to the local-path provisioner, then redeploy your models.
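To check the headroom, `df` can be pointed at the provisioner directory. The path is taken from the standard installation described above; the sketch falls back to `/` when the directory does not exist yet (for example, before first deployment).

```bash
# Sketch: report free space where the local-path provisioner keeps model data.
dir=/opt/local-path-provisioner
[ -d "$dir" ] || dir=/
out=$(df -h "$dir")
echo "$out"
```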

---

### 11. Hugging Face Authentication Failure

**Error:** Deployment fails or hangs when running inference-stack-deploy.sh or while deploying models with the command below:

```bash
su "${USERNAME}" -c "cd /home/${USERNAME}/Enterprise-Inference/core && echo -e '1\n${MODELS}\nyes' | bash ./inference-stack-deploy.sh --models '${MODELS}' --cpu-or-gpu '${GPU_TYPE}' --hugging-face-token ${HUGGINGFACE_TOKEN}"
```

**Cause:** The Hugging Face token passed via --hugging-face-token does not match the token stored in inference-config.cfg, or the token has expired or been revoked.

**Fix:**

1. Check that the Hugging Face token has the required permissions for the model being deployed.
2. Check whether the token has expired. If so, generate a new token, update inference-config.cfg, and rerun the deployment.
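A token can be checked before redeploying by calling the public Hugging Face whoami endpoint. This is a sketch that requires network access; export `HF_TOKEN` with the token under test first.

```bash
# Sketch: HTTP 200 means the token is valid; 401 means invalid or expired.
code=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2)
if [ "$code" = "200" ]; then
  echo "token valid"
else
  echo "token invalid or expired (HTTP $code)"
fi
```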

---

### 12. Docker Image Pull Failure

**Error:** During deployment, the image download task fails and retries multiple times:
```bash
TASK [download : Download_container | Download image if required]
FAILED - RETRYING: [master1]: Download_container | Download image if required
```

**Cause:** Docker Hub enforces pull rate limits for unauthenticated users.
When multiple images are pulled during Enterprise Inference deployment, the limit may be exceeded, causing HTTP 429 Too Many Requests.

This commonly occurs when:

- Re-running deployments multiple times
- Deploying on fresh nodes without Docker authentication
- Pulling multiple images in quick succession

**Fix:**

Verify the issue with a manual pull test:
```bash
sudo ctr -n k8s.io images pull docker.io/library/registry:2.8.1
```

If this fails with 429 Too Many Requests, Docker Hub rate limiting is confirmed.

**Option A — Authenticate to Docker Hub**

Log in to Docker Hub so containerd can pull images with higher limits:
```bash
sudo docker login
```
Enter your Docker Hub username and password (or access token).

After login, retry the image pull:
```bash
sudo ctr -n k8s.io images pull docker.io/kubernetesui/metrics-scraper:v1.0.8
```

**Option B — Wait for Rate Limit Reset**

Docker Hub rate limits typically reset after a few hours. Wait 2–4 hours, then retry the deployment or image pull.
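Whichever option you choose, Docker's documented rate-limit check shows where you stand. The `ratelimitpreview/test` image is Docker's dedicated endpoint for this purpose; the sketch requires network access, and the `sed` extraction of the token is an assumption about the JSON shape.

```bash
# Sketch: fetch an anonymous pull token, then read the ratelimit-* headers.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
curl -s --head -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i 'ratelimit'
```

Look for `ratelimit-limit` and `ratelimit-remaining` in the output; a missing or zero remaining count confirms the throttle.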

---

### 13. Triton Package Compatibility Issue

**Error:**
During model deployment, the inference service may fail to start and worker processes may exit unexpectedly with an error similar to:

> RuntimeError: Worker failed with error *module `triton` has no attribute `next_power_of_2`*.

**Cause:**
This issue is caused by a compatibility mismatch between the Triton package and the vLLM execution path used during model deployment. It commonly occurs when deploying models using vLLM with default parameters, when Triton is present but does not fully support the required execution path, or when deployments target CPU or accelerator-based platforms (including Gaudi) without platform-specific tuning. As a result, vLLM workers fail during initialization and the inference engine does not reach a ready state.

**Fix:**
Apply the Intel-recommended environment variables and command-line parameters during model deployment to ensure vLLM uses a compatible execution path.

**Environment Variables (YAML):**
```yaml
VLLM_CPU_KVCACHE_SPACE: "40"
VLLM_RPC_TIMEOUT: "100000"
VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
VLLM_CPU_SGL_KERNEL: "1"
HF_HUB_DISABLE_XET: "1"
```

**Extra Command Arguments (YAML list):**
```yaml
- "--block-size"
- "128"
- "--dtype"
- "bfloat16"
- "--distributed_executor_backend"
- "mp"
- "--enable_chunked_prefill"
- "--enforce-eager"
- "--max-model-len"
- "33024"
- "--max-num-batched-tokens"
- "2048"
- "--max-num-seqs"
- "256"
```

**Notes:**
Tensor parallelism and pipeline parallelism are determined dynamically based on the deployment configuration:

```yaml
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
```

**Result:**
After applying the recommended parameters, model deployment completes successfully and the inference service starts without worker initialization failures.
|
0 commit comments