
CON-1516: Remove vCPU gate from self-test preflight #413

Draft
jjziets wants to merge 8 commits into vast-ai:master from jjziets:CON-1516-remove-vcpu-self-test-preflight

jjziets commented Jun 8, 2026

Contributor

Summary

  • remove the self-test preflight hard gate based on offer["cpu_cores"] >= 2 * num_gpus
  • leave physical CPU validation to the self-test runtime image, which can inspect visible physical cores
  • reduce noisy CLI-side self-test requirement output by removing the old vCPU/core preflight diagnostic from the local preflight report
  • map 5001/udp when launching the self-test image and externally probe it after TCP /progress is reachable
  • add distinct UDP diagnostics for missing 5001/udp mapping and for “TCP works, UDP echo failed”
  • update the legacy vast.py copy and regression coverage
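For context, the removed gate can be sketched as below. Only the expression `offer["cpu_cores"] >= 2 * num_gpus` comes from this PR description; the function name and the sample offer are hypothetical and are not taken from the CLI source.

```python
def legacy_vcpu_gate(offer: dict, num_gpus: int) -> bool:
    """Hypothetical reconstruction of the removed preflight check.

    Returns True when the offer reports at least two CPU cores per GPU.
    """
    return offer["cpu_cores"] >= 2 * num_gpus


# Illustrative offer (made up): 12 reported cores on an 8-GPU machine.
sample_offer = {"cpu_cores": 12}

# Under the old hard gate this machine would have been rejected at preflight.
# With the gate removed, the decision is left to the self-test runtime image.
rejected_by_old_gate = not legacy_vcpu_gate(sample_offer, num_gpus=8)
```

Because the gate ran on the CLI side against offer metadata, it could only see the advertised core count. Per the summary above, the runtime image is the component that can inspect visible physical cores.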

Scope Note

  • This PR is the CLI side of physical CPU-core preflight alignment plus UDP self-test verification.
  • The paired self-test image responder is in vast-ai/self-test#6. The CLI expects that image to echo UDP probes on 5001/udp.
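A minimal sketch of what an external UDP echo probe and the two distinct diagnostics could look like, using only the Python standard library. The function names, the payload, and the diagnostic codes `udp_port_not_mapped` and `udp_echo_failed` are assumptions made for illustration; the actual names in the CLI may differ.

```python
import socket
from typing import Optional


def probe_udp_echo(host: str, port: int, payload: bytes = b"self-test-probe",
                   timeout: float = 2.0) -> bool:
    """Send one datagram and report whether the same bytes are echoed back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _addr = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts and ICMP "port unreachable" errors alike.
            return False
        return data == payload


def udp_diagnostic(mapped_port: Optional[int], tcp_ok: bool,
                   udp_ok: bool) -> Optional[str]:
    """Pick a diagnostic code (hypothetical names) for the UDP check."""
    if mapped_port is None:
        # The instance never exposed an external mapping for 5001/udp.
        return "udp_port_not_mapped"
    if tcp_ok and not udp_ok:
        # TCP /progress is reachable, but the UDP echo did not come back.
        return "udp_echo_failed"
    return None
```

Separating the two cases keeps a missing port mapping from being reported as a network failure, which matches the intent described in the summary.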

Validation

  • ./.venv/bin/python -m py_compile vast.py vastai/cli/commands/machines.py vastai/cli/self_test/runtime_diagnostics.py
  • ./.venv/bin/python -m pytest tests/cli/test_machines_commands.py -q — 51 passed, 1 existing pytest config warning

Dogfood

  • Paired self-test image staging build was triggered from vast-ai/self-test#6 with tag prefix self-test-udp-dogfood-cuda-.
  • Image run: https://github.com/vast-ai/self-test/actions/runs/28094691046
  • Validation jobs passed for CUDA 11.8, 12.8, 13.0, and 13.3. Image build/push jobs are still in progress as of the PR update.
  • Once the staging tag is available, the host-verification dogfood command should use --test-image vastai/test:self-test-udp-dogfood-cuda-12.8 or the matching CUDA tag.
  • The local client-image dogfood harness has also been updated to request/probe 5001/udp for paid client rentals, but that is not host verification.

robballantyne left a comment

Contributor

Looks good. I will approve without requesting changes once the test image is available.

jjziets commented Jun 25, 2026

Contributor Author

CON-1531 follow-up pushed in 63f2752.

What changed:

  • Broadened self-test progress-stage parsing for the current image output (ResNet18 test on all GPUs, ECC test on all GPUs, NCCL distributed test with N GPUs, stress-ng and gpu-burn tests simultaneously).
  • Improved the offline runtime summary so it covers instances going offline before or during runtime.
  • Split repeated status polling/API failures into instance_status_poll_failed instead of folding them into startup timeout.
  • Cleaned up startup-timeout wording for hosts.
  • Surfaced cleanup_failed if the workload passes but the temporary paid instance cannot be destroyed.
  • Added tests for current image stage lines, status-poll timeout classification, and cleanup failure.
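The broadened stage parsing could be sketched roughly as follows. The four stage phrases are the ones quoted in the first bullet above; the stage keys, the `parse_stage` helper, and the assumption that `N` is a decimal GPU count are hypothetical.

```python
import re
from typing import Optional

# (stage key, pattern) pairs; the keys are illustrative, not the CLI's own.
STAGE_PATTERNS = [
    ("resnet18", re.compile(r"ResNet18 test on all GPUs", re.IGNORECASE)),
    ("ecc", re.compile(r"ECC test on all GPUs", re.IGNORECASE)),
    ("nccl", re.compile(r"NCCL distributed test with \d+ GPUs?", re.IGNORECASE)),
    ("stress_gpu_burn",
     re.compile(r"stress-ng and gpu-burn tests simultaneously", re.IGNORECASE)),
]


def parse_stage(line: str) -> Optional[str]:
    """Return the stage key for a progress line, or None if unrecognized."""
    for key, pattern in STAGE_PATTERNS:
        if pattern.search(line):
            return key
    return None
```

Using `search` instead of an exact match lets the parser tolerate prefixes such as timestamps or "Running ..." around the stage phrase, which is one way to stay robust when the image output wording shifts slightly.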

Validation:

  • ./.venv/bin/python -m pytest tests/cli -q -> 313 passed, 1 existing config warning.
  • Paid dogfood with latest PR image vastai/test:self-test-udp-dogfood-cuda-12.8:
    • Green machine 35008: passed TCP, UDP, system requirements, ResNet18, ECC, NCCL, stress/gpu-burn; cleanup succeeded.
    • Red machine 141071: preflight caught low reliability/upload; forced runtime failed with progress_endpoint_unreachable; support bundle printed and cleanup succeeded.

Local evidence/artifacts:

  • /Users/hanneszietsman/VastAi/CON-1531-self-test-error-flow-map.md
  • /Users/hanneszietsman/VastAi/dogfood-captures/host-self-test-latest-pr/20260625T090214Z.tar.gz

jjziets force-pushed the CON-1516-remove-vcpu-self-test-preflight branch from 63f2752 to f737010 on June 25, 2026 09:25
jjziets commented Jun 25, 2026

Contributor Author

Final CON-1531 diagnostic follow-up SHA is now f737010 after amending in the catch-all unexpected_error path.

Additional tightening in the final amend:

  • Added first-class unexpected_error catalog copy and runtime diagnostic rendering for otherwise unhandled CLI exceptions.
  • Added preflight_checks stage marking before threshold checks run.
  • Added regression coverage for unexpected exception shaping/redaction.
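As an illustration of the catch-all path, the sketch below shapes an otherwise unhandled exception into an `unexpected_error` diagnostic and redacts secret-looking values from its message. The dictionary fields, the `redact` helper, and the redaction pattern are assumptions; only the `unexpected_error` code and the `preflight_checks` stage name come from this comment.

```python
import re

# Assumed pattern: values following api_key= or token= are treated as secrets.
_SECRET = re.compile(r"(api[_-]?key|token)=([^\s&]+)", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace secret-looking values with a fixed placeholder."""
    return _SECRET.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def shape_unexpected_error(exc: BaseException, stage: str) -> dict:
    """Build a diagnostic record (hypothetical shape) for an unhandled error."""
    return {
        "code": "unexpected_error",
        "stage": stage,
        "exception_type": type(exc).__name__,
        "message": redact(str(exc)),
    }


def run_with_catch_all(step, stage: str):
    """Run one step; return (result, None) or (None, diagnostic)."""
    try:
        return step(), None
    except Exception as exc:  # deliberate catch-all for the final fallback
        return None, shape_unexpected_error(exc, stage)
```

Marking the stage before the step runs, as the `preflight_checks` bullet describes, means the diagnostic can say where the failure occurred even when the exception itself carries no context.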

Latest local validation:

  • ./.venv/bin/python -m pytest tests/cli/test_runtime_diagnostics.py tests/cli/test_self_test_support_bundle.py tests/cli/test_machines_commands.py -q -> 84 passed, 1 existing config warning.
  • ./.venv/bin/python -m pytest tests/cli -q -> 314 passed, 1 existing config warning.


3 participants