Open
Description
Problem
FlashInfer allreduce fusion workspace creation failed with cudaErrorInsufficientDriver (CUDA error 35) on all SCI H200 CI runners, even on driver 575.57.08. The error message was misleading: the driver version was not the cause.
Root Cause
The /dev/nvidia-caps-imex-channels/channel0 device node was never created. The GPU driver registers the nvidia-caps-imex-channels character device (visible in /proc/devices), but the device nodes themselves must be created manually via mknod. Without them, SymmDeviceMemory allocation fails with cudaErrorInsufficientDriver.
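To tell this failure mode apart from a genuinely outdated driver, it may help to check both halves of the condition described above: whether the character device is registered, and whether the node exists. The following is a minimal sketch, not a vetted tool. The helper names `imex_major` and `check_imex_channel` and the status strings are hypothetical; both paths are parameters so the check can be pointed at other files.

```shell
# Hypothetical helper: print the major number registered for the IMEX
# channel character device, or nothing if the driver has not registered it.
imex_major() {
    devices_file="${1:-/proc/devices}"
    # Exact match on the name column avoids matching other nvidia-caps entries.
    awk '$2 == "nvidia-caps-imex-channels" { print $1 }' "$devices_file"
}

# Hypothetical helper: report which part of the setup is missing.
check_imex_channel() {
    devices_file="${1:-/proc/devices}"
    node="${2:-/dev/nvidia-caps-imex-channels/channel0}"
    major="$(imex_major "$devices_file")"
    if [ -z "$major" ]; then
        echo "driver-not-registered"
    elif [ ! -c "$node" ]; then
        # Device is registered but the node is absent: the case in this issue.
        echo "node-missing major=$major"
    else
        echo "ok major=$major"
    fi
}
```

Calling `check_imex_channel` with no arguments on a runner inspects the real `/proc/devices` and `/dev/nvidia-caps-imex-channels/channel0`.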
Fix Applied (2026-03-13)
Created IMEX channel device nodes on n04, n05, n06:
major=$(cat /proc/devices | grep nvidia-caps-imex-channels | awk '{print $1}')
sudo mkdir -p /dev/nvidia-caps-imex-channels
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c $major 0
sudo chmod 666 /dev/nvidia-caps-imex-channels/channel0
Note: These device nodes don't survive reboots. For persistence, add options nvidia NVreg_CreateImexChannel0=1 to /etc/modprobe.d/nvidia.conf.
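The persistence step in the note could be applied idempotently along these lines. This is a sketch under assumptions: the function name `ensure_imex_option` and its status strings are hypothetical, the file must be writable by the caller (root for the real path), and the option only takes effect the next time the nvidia module is loaded. Depending on the distribution, the initramfs may also need to be regenerated.

```shell
# Hypothetical helper: append the module option to a modprobe config file
# unless an identical line is already present.
ensure_imex_option() {
    conf="${1:-/etc/modprobe.d/nvidia.conf}"
    line='options nvidia NVreg_CreateImexChannel0=1'
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if [ -f "$conf" ] && grep -qxF "$line" "$conf"; then
        echo "already-present"
    else
        printf '%s\n' "$line" >> "$conf"
        echo "added"
    fi
}
```

Running it a second time against the same file leaves the file unchanged, which makes it safe to include in a node provisioning script.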
Status
- n04 — done
- n05 — done
- n06 — done
- n07 — offline, will be configured when brought back
- n08 — offline, will be configured when brought back
- Make persistent across reboots (NVreg_CreateImexChannel0=1)