[codex] fix gpu toolkit cdi preflight detection #5529

Open
HOYALIM wants to merge 1 commit into NVIDIA:main from HOYALIM:codex/issue-5489-gpu-toolkit-preflight

Conversation

HOYALIM commented Jun 17, 2026
Contributor

Summary

  • Adds Linux lspci fallback GPU detection when nvidia-smi is unavailable, so preflight still recognizes NVIDIA PCI GPU hardware.
  • Keeps the existing install_nvidia_container_toolkit remediation active for missing toolkit plus missing/invalid NVIDIA CDI spec.
  • Treats a CDI guard opt-out as explicit CPU-only (GPU-off) intent rather than as a GPU state that detection disabled automatically.
  • Adds regression coverage for the exact issue path and the cached resume preflight guard.

Validation

  • NODE_PATH=/Users/holim/code/NemoClaw/node_modules /Users/holim/code/NemoClaw/node_modules/.bin/vitest run src/lib/onboard/preflight-cdi.test.ts src/lib/onboard/machine/handlers/preflight.test.ts
  • NODE_PATH=/Users/holim/code/NemoClaw/node_modules /Users/holim/code/NemoClaw/node_modules/.bin/tsc -p tsconfig.src.json --noEmit
  • git diff --check

Notes: #5489 currently has an assignee, so this PR is intentionally limited to the missing preflight detection/remediation path.

Fixes #5489

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved NVIDIA GPU detection on Linux by trying nvidia-smi first and falling back to lspci when needed.
    • Refined preflight/resume GPU passthrough opt-out logic (including noGpu and sandbox GPU mode 0) and updated CDI GPU spec guarding accordingly.
    • When Docker is reachable on Linux, runtime is now normalized to docker when it was previously unknown.
  • Tests

    • Added/expanded CDI and preflight resume scenarios, including cases where nvidia-ctk is unavailable and expected blocking remediation/actions are verified.
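
The runtime normalization listed under Bug Fixes above can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical helper name (`normalizeContainerRuntimeSketch`) and a plain string for the runtime value; the actual code in `assessHost` may be structured differently.

```typescript
// Illustrative sketch only: the function name and signature are assumptions,
// not the PR's code.
function normalizeContainerRuntimeSketch(
  platform: string,
  dockerReachable: boolean,
  runtime: string,
): string {
  // Only an undetermined runtime is rewritten; a known runtime is left alone.
  if (platform === "linux" && dockerReachable && runtime === "unknown") {
    return "docker";
  }
  return runtime;
}
```

The point of the guard is that a reachable Docker daemon is treated as sufficient evidence of the runtime only when nothing more specific was detected.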

copy-pr-bot Bot commented Jun 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 17, 2026
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4641be50-3c16-4407-9273-e6f6334d70ec

📥 Commits

Reviewing files that changed from the base of the PR and between 31dfdb2 and 40b779c.

📒 Files selected for processing (5)
  • src/lib/onboard.ts
  • src/lib/onboard/machine/handlers/preflight.test.ts
  • src/lib/onboard/machine/handlers/preflight.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.ts

📝 Walkthrough

Fixes preflight skipping install_nvidia_container_toolkit remediation when nvidia-smi is absent. detectNvidiaGpu gains an lspci -nn PCI scan fallback for non-WSL Linux. The GPU passthrough opt-out condition is unified across preflight.ts, onboard.ts, and the handler to use sandboxGpuConfig.mode === "0" instead of the derived sandboxGpuEnabled boolean. Tests are added and adjusted accordingly.
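
As a rough illustration of the fallback order described above, here is a minimal sketch. All names (`RunCapture`, `DetectOptsSketch`, `hasNvidiaPciGpu`, `detectNvidiaGpuSketch`) are hypothetical, and the matching rule (PCI display class `03xx` plus NVIDIA's vendor ID `10de`) is an assumption about what counts as an "NVIDIA controller entry"; the PR's actual `detectNvidiaGpu` may match differently.

```typescript
// Illustrative sketch only: names and matching rules are assumptions,
// not the PR's code.

/** Runs a command and returns its stdout, or null if it is missing or fails. */
type RunCapture = (command: string) => string | null;

interface DetectOptsSketch {
  platform: string; // e.g. "linux", "darwin"
  isWsl: boolean;
  run: RunCapture;
}

// PCI class 03xx is a display controller; 10de is NVIDIA's PCI vendor ID.
// Requiring both skips NVIDIA audio functions (class 0403) on the same card.
const NVIDIA_DISPLAY_RE = /\[03[0-9a-f]{2}\]:.*\[10de:[0-9a-f]{4}\]/i;

function hasNvidiaPciGpu(lspciOutput: string): boolean {
  return lspciOutput.split("\n").some((line) => NVIDIA_DISPLAY_RE.test(line));
}

function detectNvidiaGpuSketch(opts: DetectOptsSketch): boolean {
  // Preferred probe: nvidia-smi lists GPUs directly when the driver is present.
  const smi = opts.run("nvidia-smi -L");
  if (smi !== null && smi.trim() !== "") return true;

  // Per the walkthrough, the PCI fallback applies only on non-WSL Linux.
  if (opts.platform !== "linux" || opts.isWsl) return false;
  const lspci = opts.run("lspci -nn");
  return lspci !== null && hasNvidiaPciGpu(lspci);
}
```

Injecting the command runner keeps the sketch testable without a GPU, which is similar in spirit to the optional `commandExistsImpl` hook mentioned in the change summary.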

Changes

GPU Detection and CDI Opt-Out Fix

| Layer | File(s) | Summary |
|---|---|---|
| detectNvidiaGpu lspci fallback and assessHost wiring | src/lib/onboard/preflight.ts | detectNvidiaGpu is refactored to accept a structured opts object with platform, isWsl, and optional commandExistsImpl. When nvidia-smi -L is unavailable or yields no output on non-WSL Linux, it falls back to scanning lspci -nn for NVIDIA controller entries. assessHost computes Linux release and /proc/version details earlier to derive isWslHost before capability probing, then threads isWsl and commandExistsImpl into the updated detectNvidiaGpu call. An earlier normalization block forces containerRuntime to "docker" on Linux when dockerReachable is true and runtime remains "unknown". |
| Opt-out condition unified to mode === "0" | src/lib/onboard/machine/handlers/preflight.ts, src/lib/onboard.ts | resumeOptedOutGpuPassthrough in the resume+cached handler path is redefined to check noGpu, effectiveSandboxGpuFlag === "disable", or resumeSandboxGpuConfig.mode === "0", replacing the prior composite of gpuRequested / session?.gpuPassthrough === false / !resumeSandboxGpuConfig.sandboxGpuEnabled. preflight() in onboard.ts applies the same mode === "0" logic when computing explicitlyOptedOutGpuPassthrough passed to assertCdiNvidiaGpuSpecPresent, replacing the prior !sandboxGpuConfig.sandboxGpuEnabled check. JSDoc comments document the CDI NVIDIA GPU spec validation guard, unsupported-container-runtime rejection, and the purpose of the preflight step. |
| CDI and handler preflight tests | src/lib/onboard/preflight-cdi.test.ts, src/lib/onboard/machine/handlers/preflight.test.ts | New assessHost — CDI Vitest case stubs host probing to emulate NVIDIA PCI hardware with CDI spec directories configured but nvidia-ctk unavailable; asserts assessHost flags cdiNvidiaGpuSpecMissing as true and nvidiaContainerToolkitInstalled as false, then verifies planHostRemediation emits a blocking install_nvidia_container_toolkit action with apt-get install -y nvidia-container-toolkit and nvidia-ctk cdi generate / nvidia-ctk cdi list commands. Handler preflight resume tests updated to expect assertCdiNvidiaGpuSpecPresent second argument as false (instead of true), and host-GPU-platform resume test adjusted to include false as an argument before the "jetson" platform string. Test helpers documented with JSDoc. |
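
The remediation path exercised by the new CDI test can be pictured with the sketch below. It is an assumption-laden illustration: `HostAssessmentSketch`, `RemediationActionSketch`, `planToolkitRemediationSketch`, and the `nvidiaGpuDetected` field are hypothetical stand-ins, and the command strings are only the fragments named in the summary (the real plan may add flags, `sudo`, or extra steps).

```typescript
// Illustrative sketch only: types and function are hypothetical stand-ins
// for assessHost / planHostRemediation, whose real shapes are not shown here.
interface HostAssessmentSketch {
  nvidiaGpuDetected: boolean; // hypothetical field name
  nvidiaContainerToolkitInstalled: boolean;
  cdiNvidiaGpuSpecMissing: boolean;
}

interface RemediationActionSketch {
  kind: string;
  blocking: boolean;
  commands: string[];
}

function planToolkitRemediationSketch(
  host: HostAssessmentSketch,
): RemediationActionSketch[] {
  // Remediation is planned only when NVIDIA hardware is present, the toolkit
  // is absent, and the CDI spec is missing or invalid.
  const needsToolkit =
    host.nvidiaGpuDetected &&
    !host.nvidiaContainerToolkitInstalled &&
    host.cdiNvidiaGpuSpecMissing;
  if (!needsToolkit) return [];

  return [
    {
      kind: "install_nvidia_container_toolkit",
      blocking: true,
      // Command fragments as named in the summary above.
      commands: [
        "apt-get install -y nvidia-container-toolkit",
        "nvidia-ctk cdi generate",
        "nvidia-ctk cdi list",
      ],
    },
  ];
}
```

Under this reading, the bug in #5489 corresponds to the detection input being false whenever nvidia-smi was missing, so the plan came back empty even though the hardware was there.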

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

bug-fix, area: onboarding, area: sandbox, v0.0.66

Suggested reviewers

  • prekshivyas
  • cv

Poem

🐇 When smi goes dark, the GPU hides its face,
But lspci leaps in to scan every trace!
mode === "0" now anchors the opt-out just right,
No more skipped remediations in the dead of night.
The rabbit hops on — CDI specs gleaming bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[codex] fix gpu toolkit cdi preflight detection' directly summarizes the main change: fixing GPU toolkit CDI preflight detection logic. |
| Linked Issues check | ✅ Passed | Changes fully address issue #5489 by implementing lspci fallback GPU detection and correcting CDI opt-out handling to emit toolkit remediation when toolkit is absent and CDI is configured. |
| Out of Scope Changes check | ✅ Passed | All modifications directly support the fix: GPU detection via lspci fallback, CDI guard flag computation, opt-out logic adjustments, and comprehensive test coverage for the specific issue path. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

HOYALIM force-pushed the codex/issue-5489-gpu-toolkit-preflight branch 2 times, most recently from c36dfb5 to f246231, on June 17, 2026 06:41
HOYALIM marked this pull request as ready for review on June 17, 2026 06:42
Copilot AI review requested due to automatic review settings on June 17, 2026 06:42

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves host GPU preflight detection and aligns the CDI guard behavior with “explicit opt-out” semantics, including a new test case for PCI-based NVIDIA detection.

Changes:

  • Extend NVIDIA GPU detection to fall back to lspci on Linux when nvidia-smi is unavailable.
  • Move WSL detection earlier in assessHost and pass WSL/platform context into the GPU probe.
  • Update preflight handler logic + tests to keep the CDI guard active on resume unless GPU was explicitly disabled.
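
To make the "explicit opt-out" distinction concrete, here is a minimal sketch of the predicate as the change summary describes it (`noGpu`, `effectiveSandboxGpuFlag === "disable"`, or `mode === "0"`). The interfaces, the function names, and the `"auto"` mode value are assumptions for illustration; `legacyOptedOutSketch` mirrors only the prior `!sandboxGpuEnabled` check attributed to `onboard.ts`.

```typescript
// Illustrative sketch only: interfaces, function names, and the "auto" mode
// value are assumptions, not the PR's code.
interface SandboxGpuConfigSketch {
  mode: string; // "0" means GPU explicitly off
  sandboxGpuEnabled: boolean; // derived value; presumably false when no GPU was detected
}

interface OptOutInputsSketch {
  noGpu: boolean;
  effectiveSandboxGpuFlag?: string;
  sandboxGpuConfig: SandboxGpuConfigSketch;
}

// New rule: only an explicit user choice counts as opting out of the CDI guard.
function explicitlyOptedOutSketch(inputs: OptOutInputsSketch): boolean {
  return (
    inputs.noGpu ||
    inputs.effectiveSandboxGpuFlag === "disable" ||
    inputs.sandboxGpuConfig.mode === "0"
  );
}

// Prior rule, for contrast: keyed on the derived boolean.
function legacyOptedOutSketch(inputs: OptOutInputsSketch): boolean {
  return !inputs.sandboxGpuConfig.sandboxGpuEnabled;
}

// A host in an automatic mode where the GPU was not enabled by detection.
const autoModeUndetected: OptOutInputsSketch = {
  noGpu: false,
  sandboxGpuConfig: { mode: "auto", sandboxGpuEnabled: false },
};
```

For `autoModeUndetected`, the legacy rule reports an opt-out (so the guard would be skipped), while the new rule does not, which is the behavior difference the updated resume tests check by expecting `false` as the second argument to `assertCdiNvidiaGpuSpecPresent`.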

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/lib/onboard/preflight.ts | Adds PCI fallback for NVIDIA detection and computes isWslHost earlier for downstream checks. |
| src/lib/onboard/preflight-cdi.test.ts | Adds coverage for toolkit bootstrap planning when PCI detects NVIDIA but nvidia-smi/nvidia-ctk are missing. |
| src/lib/onboard/machine/handlers/preflight.ts | Refines “opted out” logic on resume to reflect explicit disable/flags. |
| src/lib/onboard/machine/handlers/preflight.test.ts | Updates expectations + adds regression coverage for resume behavior with auto GPU mode. |
| src/lib/onboard.ts | Renames and refines the “explicit opt-out” passthrough flag used by the CDI guard. |

Comment thread src/lib/onboard/preflight.ts Outdated
Comment thread src/lib/onboard/preflight-cdi.test.ts
HOYALIM force-pushed the codex/issue-5489-gpu-toolkit-preflight branch from f246231 to 8c0cc12 on June 17, 2026 06:54

HOYALIM commented Jun 18, 2026
Contributor Author

@coderabbitai resume
@coderabbitai review

coderabbitai Bot commented Jun 18, 2026
Contributor

@HOYALIM: Sure! I'll resume reviews for this PR and kick off a full review now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

HOYALIM commented Jun 18, 2026
Contributor Author

@coderabbitai full review

coderabbitai Bot commented Jun 18, 2026
Contributor
✅ Action performed

Full review finished.

coderabbitai Bot left a comment
Contributor

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

1626-1654: Run the recommended onboard E2E matrix for this behavioral guard change.

Given this touches preflight/CDI gating logic in src/lib/onboard.ts, please run the listed onboarding E2E jobs (especially cloud-e2e, sandbox-operations-e2e, channels-stop-start-e2e, and issue-3600-gpu-proof-optional-e2e) before merge.

As per coding guidelines, changes in src/lib/onboard.ts should be validated with the specified nightly E2E job set.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 1626 - 1654, Before merging the changes to
the preflight function in src/lib/onboard.ts which modify the CDI/GPU gating
logic, run the complete E2E test matrix including the cloud-e2e,
sandbox-operations-e2e, channels-stop-start-e2e, and
issue-3600-gpu-proof-optional-e2e jobs to validate the behavioral guard changes
to the onboarding flow. This testing is required per coding guidelines for any
modifications to src/lib/onboard.ts that touch preflight validation logic.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f97d1410-4b17-4826-b58c-b6a81cf5cc33

📥 Commits

Reviewing files that changed from the base of the PR and between 1baae3c and bea9b95.

📒 Files selected for processing (5)
  • src/lib/onboard.ts
  • src/lib/onboard/machine/handlers/preflight.test.ts
  • src/lib/onboard/machine/handlers/preflight.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.ts

HOYALIM commented Jun 18, 2026
Contributor Author

@coderabbitai summary

coderabbitai Bot commented Jun 18, 2026
Contributor
✅ Action performed

Summary regeneration triggered.

HOYALIM force-pushed the codex/issue-5489-gpu-toolkit-preflight branch from bea9b95 to 40b779c on June 18, 2026 01:33

HOYALIM commented Jun 18, 2026
Contributor Author

@coderabbitai full review

coderabbitai Bot commented Jun 18, 2026
Contributor
✅ Action performed

Full review finished.

Development

Successfully merging this pull request may close these issues.

[Linux][Onboard] preflight skips install_nvidia_container_toolkit remediation when toolkit absent, Docker CDI configured, and nvidia-smi unavailable

2 participants