Run toolkit validation in operand init containers#2264

Draft
rajathagasthya wants to merge 1 commit into main from fix/init-container-nvidia-module-check

Conversation

@rajathagasthya rajathagasthya commented Apr 2, 2026

Description

The toolkit-validation init containers in operand DaemonSets previously
polled for a toolkit-ready sentinel file on the host. During driver reinstall
cycles, operands can (under circumstances not yet identified) find a
stale toolkit-ready file from a previous cycle and pass the init gate
even though nvidia-smi would actually fail.

Replace the shell-based sentinel file check with an nvidia-validator
check. This runs nvidia-smi through the toolkit runtime wrapper and
retries until it succeeds, validating both toolkit injection and driver
module readiness without depending on host sentinel files.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

coveralls commented Apr 2, 2026

Coverage Status

coverage: 27.711% (remained the same)
when pulling a9c8b4f on fix/init-container-nvidia-module-check
into 652724d on main.

rajathagasthya force-pushed the fix/init-container-nvidia-module-check branch 2 times, most recently from b578788 to ce2ef9a, on April 3, 2026 17:56
rajathagasthya changed the title from "Prevent operand pods from starting before nvidia driver is loaded" to "Run toolkit validation in operand init containers" on Apr 3, 2026
rajathagasthya force-pushed the fix/init-container-nvidia-module-check branch from ce2ef9a to 624fa74 on April 3, 2026 17:57
@rajathagasthya rajathagasthya modified the milestone: v26.7 Apr 3, 2026
The toolkit-validation init containers in operand DaemonSets previously
polled for a toolkit-ready sentinel file on the host. Operands can
(under circumstances not yet identified) find a stale toolkit-ready file
from a previous cycle and pass the init gate even though nvidia-smi would
actually fail.

Replace the shell-based sentinel file check with an nvidia-validator
check. This runs nvidia-smi through the toolkit runtime wrapper and
retries until it succeeds, validating both toolkit injection and driver
module readiness without depending on host sentinel files.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
rajathagasthya force-pushed the fix/init-container-nvidia-module-check branch from 624fa74 to a9c8b4f on April 3, 2026 19:42

2 participants