Skip to content

fix(onboard): warn on Docker GPU sandbox DNS failure#5539

Open
HwangJohn wants to merge 3 commits into
NVIDIA:mainfrom
HwangJohn:fix/5520-docker-gpu-dns-probe
Open

fix(onboard): warn on Docker GPU sandbox DNS failure#5539
HwangJohn wants to merge 3 commits into
NVIDIA:mainfrom
HwangJohn:fix/5520-docker-gpu-dns-probe

Conversation

@HwangJohn

@HwangJohn HwangJohn commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a best-effort DNS probe after Docker GPU sandbox recreation so users get a clear warning when the patched sandbox cannot resolve external domains. The probe is diagnostic only and does not switch network modes, inject DNS settings, or fail onboarding by itself.

Related Issue

Refs #5520

Changes

  • Probe DNS from the recreated Docker GPU sandbox after patching.
  • Emit a warning with the opt-in host-network workaround when DNS is confirmed broken.
  • Add NEMOCLAW_DOCKER_GPU_PATCH_DNS_PROBE=0 as a probe opt-out.
  • Document the Docker-driver GPU DNS troubleshooting path.
  • Add regression tests for success, warning, opt-out, and host-network interactions.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

Validated locally on Windows and on DGX Spark/Linux. npm run docs completed with 0 errors; Fern reported the existing 2 warnings.

  • PR description includes the DCO sign-off declaration and every commit appears as Verified in GitHub
  • Git hooks passed during commit and push, or npx prek run --from-ref main --to-ref HEAD passes
  • Targeted tests pass for changed behavior
  • Full npm test passes (broad runtime changes only)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: HwangJohn angelic805@gmail.com

Summary by CodeRabbit

  • Documentation

    • Added guidance for Docker GPU hosts where recreated GPU sandboxes resolve host aliases but fail to reach external domains, including getent hosts diagnostics and remediation steps (fix Docker DNS and recreate sandbox, or temporarily enable host networking).
  • New Features

    • Added an optional post-recreate external DNS probe for GPU-enabled containers and a warning printer with remediation messaging.
    • Introduced environment controls to enable/disable the probe.
  • Tests

    • Expanded coverage for probe success/failure/skip behavior and warning output, plus improved Windows path normalization for CDI spec detection checks.

Signed-off-by: HwangJohn <angelic805@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 880919c9-6fd5-4c27-add0-720ba09695b8

📥 Commits

Reviewing files that changed from the base of the PR and between 2fab485 and a54ec6f.

📒 Files selected for processing (1)
  • src/lib/onboard/docker-gpu-patch.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-gpu-patch.ts

📝 Walkthrough

Walkthrough

Adds an optional post-recreate external DNS probe to the Docker GPU sandbox patch path. After recreating the container, probeDockerGpuExternalDns runs getent hosts example.com inside the new container and returns a discriminated union result; printDockerGpuExternalDnsWarning emits remediation guidance on failure. A new environment variable disables the probe. Tests validate probe behavior, integration into the recreate flow, and warning output. Troubleshooting docs explain the failure mode and remediation paths.

Changes

Docker GPU sandbox post-recreate external DNS probe

Layer / File(s) Summary
DNS probe contract, implementation, and recreate-flow integration
src/lib/onboard/docker-gpu-patch.ts
Exports DOCKER_GPU_PATCH_DNS_PROBE_ENV constant, DockerGpuExternalDnsProbeResult union type with four branches (resolved, failed, unavailable, skipped), helper function dockerGpuPatchDnsProbeDisabled, and exported functions probeDockerGpuExternalDns (executes getent hosts inside the container via docker run exec, maps exit code and output to the union type, skippable via environment variable) and printDockerGpuExternalDnsWarning (emits four-line remediation message on DNS failure); integrates the probe into recreateOpenShellDockerSandboxWithGpu immediately after newContainerId is determined. Adds documentation comments to all exported helper functions.
Tests: DNS probe, warning printer, recreate-path, CDI path normalization
src/lib/onboard/docker-gpu-patch.test.ts
Adds imports for new DNS probe exports, introduces posixPathForTest helper for Windows path normalization in CDI detection test fixtures, adds unit tests verifying probe failure detection with parsed output details, probe skip when DOCKER_GPU_PATCH_DNS_PROBE_ENV=0, and warning printer output format; adds integration-style recreate-path test asserting that a failed DNS probe emits a warning without automatically injecting --network host into the container run.
Troubleshooting documentation for GPU sandbox external DNS failure
docs/reference/troubleshooting.mdx
Adds a new subsection documenting the getent hosts diagnostic, the Docker embedded DNS forwarding failure mode, two remediation paths (fixing Docker daemon DNS and recreating the sandbox, or opting into host networking with NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host during onboarding), and the NEMOCLAW_DOCKER_GPU_PATCH_DNS_PROBE=0 knob to skip the post-recreate DNS probe.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested labels

bug-fix, area: onboarding, area: networking, area: sandbox

Suggested reviewers

  • cv
  • prekshivyas

Poem

🐇 Hop, hop — the sandbox is born anew,
But can it reach the web? Let's peek and view!
getent hosts whispers from inside the cage,
If DNS fails, a warning takes the stage.
Set the env flag and skip the fuss,
The rabbit keeps the GPU dreams robust! 🌐✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: adding DNS failure warnings for Docker GPU sandbox. It clearly reflects the primary objective and is concise.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: HwangJohn <angelic805@gmail.com>
Signed-off-by: HwangJohn <angelic805@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant