Skip to content

Latest commit

 

History

History
110 lines (86 loc) · 4.04 KB

File metadata and controls

110 lines (86 loc) · 4.04 KB

Gaudi Node Requirements and Setup Guide

This guide helps verify and automatically install the latest firmware and driver version for Habana Gaudi nodes in your Kubernetes or Standalone Environment.

What You Need

  • Intel® Gaudi® cards installed in your system
  • Linux operating system
  • Internet connection
  • Root/sudo privileges

Step 1: Check Firmware Version

hl-smi -L | grep SPI

You'll see something like:

Firmware [SPI] Version : Preboot version hl-gaudi2-1.20.0-fw-58.0.0-sec-9 (Jan 16 2025 - 17:51:04)
For visual assistance, refer to the following snapshot for Firmware version:

AI Inference Firmware Snapshot

Step 2: Check Driver Version

Use the following commands to check the required driver version installed on your Gaudi nodes:

hl-smi 

You'll see something like:

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.20.0-fw-58.1.1.1          |
| Driver Version:                                     1.20.0-bd87f71          |
| Nic Driver Version:                                 1.20.0-e4fe12d          |
|-------------------------------+----------------------+----------------------+
For visual assistance, refer to the following snapshot for Driver version:

AI Inference Driver Snapshot

Step 3: Check Runtime Version

dpkg -l | grep habanalabs-container-runtime

You'll see something like:

ii  habanalabs-container-runtime  1.20.0-543  HABANA container runtime

Updating Your System (Automated)

The platform now includes an automated firmware and driver update script that handles the complete upgrade process safely and efficiently.

Automated Installation/Upgrade Process

Navigate to the scripts directory and run the automated firmware update script:

cd core/scripts/
chmod +x firmware-update.sh

# Upgrade to a specific version (e.g., 1.21.1)
./firmware-update.sh 1.21.1

# Force downgrade if needed (e.g., from 1.21.1 to 1.20.0)  
./firmware-update.sh 1.20.0 --force

What the Automation Does

The script automatically handles:

  • Version validation - Prevents unnecessary upgrades
  • Driver uninstallation - Safely removes existing installation
  • Driver installation - Downloads and installs target version
  • Firmware upgrade - Updates card firmware using hl-fw-loader
  • Kernel module management - Version-aware loading/unloading
  • Container runtime setup - Installs matching runtime version
  • Pre/post validation - Shows version information before and after

Manual Verification (Optional)

After running the automated script, you can verify the installation:

For Kubernetes Users

After the automated upgrade, check that your cards are properly recognized:

kubectl get nodes
kubectl describe node node-name

Look for:

Capacity:  
  habana.ai/gaudi:    8  
Allocatable: 
  habana.ai/gaudi:    8

If the numbers don't match, run:

kubectl rollout restart ds habana-ai-device-plugin-ds -n habana-ai-operator

For detailed documentation, refer to the official guide: Intel® Gaudi® Software Installation Documentation

For automation script details: See Firmware Update Script Documentation

Note: It is recommended to reboot the device after firmware/driver updates to ensure complete initialization. The automation script provides comprehensive version validation before and after the upgrade process. If for some reason rebooting the device is not possible, please try to restart habana device plugin as documented above if Model pods go to pending due to workload nodes not getting allocatable "habana.ai/gaudi" equal to capacity as documented above.