FlashInfer allreduce fusion disabled on SCI H200 CI runners: missing IMEX channels #20500

@alisonshao

Problem

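FlashInfer allreduce fusion workspace creation failed with cudaErrorInsufficientDriver (CUDA error 35) on all SCI H200 CI runners, including those running driver 575.57.08. The error is misleading: it suggests the driver is too old for the CUDA runtime, but the actual cause was missing IMEX channel device nodes.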

Root Cause

The /dev/nvidia-caps-imex-channels/channel0 device node was never created on the runners. The GPU driver registers the nvidia-caps-imex-channels character device (visible in /proc/devices), but it did not create the device nodes themselves; they had to be created manually via mknod. Without them, SymmDeviceMemory allocation fails with cudaErrorInsufficientDriver.
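The condition described above (driver registered, node absent) can be checked with a small script. This is a minimal sketch, not an official NVIDIA tool: the function name check_imex_channel and its output strings are hypothetical, and both paths are parameters only so the check can be exercised against test files.

```shell
#!/usr/bin/env bash
# Sketch: report whether IMEX channel0 looks usable on this host.
# check_imex_channel is a hypothetical helper name, not part of any NVIDIA tool.
# Arguments (both optional, overridable for testing):
#   $1  path to the devices table   (default: /proc/devices)
#   $2  directory holding the nodes (default: /dev/nvidia-caps-imex-channels)
check_imex_channel() {
    local devices_file="${1:-/proc/devices}"
    local channel_dir="${2:-/dev/nvidia-caps-imex-channels}"
    local major

    # Match the device name exactly in column 2 and print its major number.
    major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1; exit}' "$devices_file")

    if [ -z "$major" ]; then
        # The driver has not registered the character device at all.
        echo "driver-not-registered"
        return 2
    fi
    # Existence check only; it does not verify the node's type or major number.
    if [ ! -e "$channel_dir/channel0" ]; then
        echo "node-missing major=$major"
        return 1
    fi
    echo "ok major=$major"
    return 0
}
```

Calling check_imex_channel with no arguments inspects the live system. A result of node-missing corresponds to the state the runners were in before the fix.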

Fix Applied (2026-03-13)

Created IMEX channel device nodes on n04, n05, n06:

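# Look up the major number the driver registered for the IMEX channels device.
major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
sudo mkdir -p /dev/nvidia-caps-imex-channels
# Create channel0 as a character device (minor 0) and make it world read/writable.
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
sudo chmod 666 /dev/nvidia-caps-imex-channels/channel0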

Note: These device nodes do not survive a reboot. To make the fix persistent, add the line options nvidia NVreg_CreateImexChannel0=1 to /etc/modprobe.d/nvidia.conf so that the driver creates channel0 when the module loads.
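The persistence step could be scripted so that it is safe to run more than once. The following is a sketch under assumptions: ensure_imex_option is a hypothetical helper name, and the config path is a parameter so the function can be tried on a scratch file before touching /etc.

```shell
#!/usr/bin/env bash
# Sketch: idempotently add the IMEX channel option to a modprobe config file.
# ensure_imex_option is a hypothetical helper name.
#   $1  config file to edit (default: /etc/modprobe.d/nvidia.conf)
ensure_imex_option() {
    local conf_file="${1:-/etc/modprobe.d/nvidia.conf}"
    local option_line="options nvidia NVreg_CreateImexChannel0=1"

    # grep flags: -q quiet, -x whole-line match, -F fixed string (no regex).
    if [ -f "$conf_file" ] && grep -qxF "$option_line" "$conf_file"; then
        echo "already-present"
        return 0
    fi

    # Append rather than overwrite so existing options in the file are kept.
    printf '%s\n' "$option_line" >> "$conf_file"
    echo "added"
}
```

Writing to the default path requires root. The option only takes effect the next time the nvidia kernel module is loaded, so the manually created node is still needed until the runner is rebooted.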

Status

  • n04 — done
  • n05 — done
  • n06 — done
  • n07 — offline, will be configured when brought back
  • n08 — offline, will be configured when brought back
  • To do: make persistent across reboots (NVreg_CreateImexChannel0=1)