feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances#983
feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances#983Arifuzzamanjoy wants to merge 17 commits into
Conversation
Lightheartdevs
left a comment
There was a problem hiding this comment.
Appreciate the scope discipline — keeping the entire toolkit in resources/p2p-gpu/ with no core modifications is the right call. The architecture (lib/phases/subcommands split) mirrors the main installer cleanly.
Blocking issues:
-
Philosophy drift from CLAUDE.md. The project's error-handling contract is explicit: "Never
|| trueor2>/dev/null. No silent swallowing." The file header insetup.shclaims# Design: aligned with DreamServer CLAUDE.md — Let It Crash > KISS > Pure Functions > SOLIDbut the implementation uses|| warn "... (non-fatal)"~20 times and|| trueas a fallback.warn-and-continue is silent swallowing — it just logs first. Please either:- Let these errors crash (the philosophy the header claims), or
- Remove the CLAUDE.md alignment claim from the header and justify the non-fatal pattern on a case-by-case basis.
-
chmod a+rwXas documented fallback. World-writable is a correctness and security smell on a shared remote host. Vast.ai boxes are ephemeral but still multi-tenant on the physical hardware. Setgid + POSIX ACLs should be the only path; thechmod a+rwXfallback should fail hard rather than degrade. -
integration-smokeCI failure. Needs to go green before merge — even if it's a pre-existing flake, confirm it's unrelated to this PR. -
New CI workflow (
.github/workflows/p2p-gpu.yml, +182 lines) adds maintenance surface for what's positioned as a resources-tier add-on. Consider whether this is justified for code that, by its location inresources/, isn't part of the supported install path. If yes, add a maintainer comment explaining why it warrants first-class CI.
Non-blocking observations:
- The
service-hints.yaml+ per-provider quirk handling is genuinely useful and I'd love to see this land. - 28 provider quirks is a lot to maintain — a way for the community to contribute new hints without PRs against core would help long-term.
Happy to re-review once the philosophy/permissions issues are addressed. Thanks for the submission — this is a real gap in the project and the architecture is sound.
|
|
Audit follow-up: not merge-ready. The p2p GPU toolkit is self-contained and shell syntax passed, but the advertised NVIDIA driver/library mismatch repair is not actually reachable: callers use |
|
fixed. repair_nvml_mismatch now uses detect_nvml_mismatch && status=0 || status=$? at both capture sites — exit status preserved explicitly under set -e. regression test added covering status 1 (mismatch → repair attempted) and status 2 (detection failed → repair skipped). CI step included. |
|
Re-audit update: the specific NVML mismatch exit-status issues from the last audit look fixed now. The branch no longer uses the old I still cannot merge this branch yet:
So: good progress on the NVML bug, but still needs cleanup and maintainer sign-off on the non-fatal/world-writable permission tradeoff before merge. |
also caught a live deployment bug — .env missing/root-owned after partial upstream install was killing compose. seeds from .env.example now, generates all required secrets, hard-fails on permission errors. tested on live instances across different tiers on vast.ai. ready for re-review. |
Lightheartdevs
left a comment
There was a problem hiding this comment.
Thanks for the follow-up fixes. I re-audited this against current main, including a local merge simulation, bash syntax checks, the NVML regression test, and targeted searches through the installer paths. I still can't merge this yet.
Findings:
-
resources/README.md:1replaces the existing top-level Resources index with the P2P GPU guide. That file currently catalogsmulti-agent/,products/,research/,cookbooks/,dev/, etc.; this PR effectively deletes that index and duplicates content that already belongs inresources/p2p-gpu/README.md. Please leave the existing index intact and add a smallp2p-gpu/entry/link instead. -
git diff --check origin/main...HEADstill fails, now on trailing whitespace inresources/p2p-gpu/lib/networking.sh:307,resources/p2p-gpu/lib/networking.sh:317, andresources/p2p-gpu/lib/networking.sh:325. This needs to be clean before merge. -
The ACL-only permission story is not actually fail-fast on ACL application failure.
resources/p2p-gpu/README.md:121says ACLs are required and hard-fail if unavailable, butresources/p2p-gpu/lib/permissions.sh:49,resources/p2p-gpu/lib/permissions.sh:50,resources/p2p-gpu/lib/permissions.sh:79, andresources/p2p-gpu/lib/permissions.sh:80continue withwarnwhensetfaclitself fails. That means a host withsetfaclinstalled but an ACL-incompatible mount, bad UID mapping, or partial ACL failure can continue into service startup with broken shared-data permissions. For the shared data paths that replace the old world-writable fallback, failedsetfacl/ownership repair should stop the install or be very narrowly justified. -
resources/p2p-gpu/phases/01-dependencies.sh:42-46killsunattended-upgrades,apt-get, anddpkgwithkillall -9whenever the dpkg frontend lock is held. That can leave the package database mid-transaction, and it fights the saferDPkg::Lock::Timeouthandling used immediately below. Please avoid killingdpkg/apt-get; wait for the lock, kill only a verified stale unattended-upgrades process if absolutely necessary, and fail with a clear recovery message if the lock does not clear.
Also noticed a doc consistency issue while reviewing: resources/README.md:120 still says chmod a+rwX is a documented fallback and resources/README.md:141 still shows || true, which conflicts with the latest claim that there is no world-writable fallback and no || true path. Fixing the top-level README replacement should probably remove this duplicate stale copy entirely.
Checks run locally:
git diff --check origin/main...HEADfails as noted abovebash -noverresources/p2p-gpu/**/*.shpassesbash resources/p2p-gpu/tests/test-nvml-mismatch.shpasses- local
git merge --no-commit --no-ff origin/maincompleted without conflicts, then the merge simulation was aborted
Lightheartdevs
left a comment
There was a problem hiding this comment.
Deep re-audit of current head 222c33e: not merge-ready, and I do not think this should be direct-fixed and merged in this pass.
What I verified:
bash -noverresources/p2p-gpu/**/*.shpasses.bash resources/p2p-gpu/tests/test-nvml-mismatch.shpasses.- A merge-tree check against current
origin/maindid not show textual conflicts.
Blocking issues still present:
-
git diff --check origin/main...HEADstill fails on trailing whitespace:resources/p2p-gpu/lib/networking.sh:307resources/p2p-gpu/lib/networking.sh:317resources/p2p-gpu/lib/networking.sh:325
-
resources/README.mdis still replaced by the P2P GPU guide. Currentmainuses this file as the top-level Resources index for multi-agent docs, products, research, cookbooks, tools, blog, docs, dev, and legacy. This PR should preserve that index and add a smallp2p-gpu/entry instead of replacing the file. -
The ACL hard-fail guarantee is still not true. The README/header says ACLs are required, but ACL application can still warn and continue in the shared-data paths, for example:
resources/p2p-gpu/lib/permissions.sh:49-50resources/p2p-gpu/lib/permissions.sh:79-80- service-specific ACL fixes later in the same file also continue after
setfaclfailures.
-
resources/p2p-gpu/phases/01-dependencies.sh:42-46still useskillall -9 unattended-upgrades apt-get dpkgwhen the dpkg frontend lock is held, then treatsdpkg --configure -afailure as non-fatal. That can leave the package DB in a worse state and conflicts with the saferDPkg::Lock::Timeoutpath already used below. -
This is a large new root-executed operational toolkit plus a new workflow. The idea is valuable, but the remaining work is not just mechanical cleanup; it needs a maintainer decision on fail-fast vs rented-instance resilience, plus live Vast.ai/GPU validation after those policy choices are made.
Recommendation: do not merge this PR as-is. I would split the safe pieces out first: preserve resources/README.md, remove the whitespace, make ACL setup truly fail-fast where promised, and replace the package-lock killing with bounded wait/recovery guidance. Then re-test on live rented GPU hardware before reconsidering.
|
all four blocking items from the last two reviews addressed and verified on live vast.ai hardware. 1. trailing whitespace (
|
ef44fbf to
c99f2a7
Compare
|
71e0e21 retired resources/ — only conflict is resources/README.md (modify/delete), rest merges clean. before I rebase, two candidates for the new home:
ruled out: installer/ (tauri gui), extensions/library/ (service extensions, not deploy infra). holding the rebase until you confirm. |
|
Decision: put Rationale:
Rebase target:
The prior review concerns still apply after the move: ACL behavior should remain fail-fast where promised, no world-writable fallback should come back, and package-lock recovery should avoid killing |
Per reviewer decision: p2p-gpu is a deployment-target installer, not a utility script. Relocate from dream-server/scripts/p2p-gpu/ to dream-server/installers/p2p-gpu/ alongside macos/, windows/, and mobile/. Changes: - git mv dream-server/scripts/p2p-gpu -> dream-server/installers/p2p-gpu - Update 9 file headers (# Part of: path) - Update .github/workflows/p2p-gpu.yml path filters and commands - Add docs pointer in dream-server/docs/README.md No functional code changes. All source paths use SCRIPT_DIR via BASH_SOURCE[0] and are location-agnostic.
…h timeout 1. Phase 05: add _ensure_persona_file() - creates data/persona/SOUL.md before compose starts. If missing, Docker mounts it as a directory, crashing hermes with "not a directory". Tries upstream renderer (build-installation-context.py), falls back to template copy, then minimal placeholder as last resort. 2. services.sh:552: add || warn to _ensure_host_agent_running call - return 1 under set -euo pipefail was killing the entire setup script when the agent failed to start. Same class of bug as the _wait_for_dpkg_lock fix. 3. services.sh: increase agent health check timeout from 8s to 20s - agent initialization (config load, service scan, HTTP bind) takes 10-20s on cold starts. The old 8s timeout caused false-negative health failures even when the agent was starting normally.
1. Phase 05: rmdir -> rm -rf for SOUL.md directory removal; Docker creates non-empty dir stubs that rmdir cannot remove. 2. Phase 05: add _ensure_mount_files() for models.ini to avoid directory-vs-file bind-mount bugs. 3. Phase 09: add _verify_model_file() to ensure GGUF exists or fall back to any .gguf before compose.
…bound variable error Under set -euo pipefail, kernel_version and lib_version must be initialized before being tested with [[ -n $var ]]. They are conditionally assigned in if blocks, so if neither condition matches, the variable remains unbound. Initialize both to empty strings on function entry.
654ac05 to
c24175c
Compare
|
all done. moved to dream-server/installers/p2p-gpu/, rebased onto current main. |
|
re-added p2p-gpu.yml at the new installers path — runs bash -n + nvml regression. |
…he stack On compose failure, parse failed services from stderr (manifest error / pull access denied) and bring up the rest with --no-deps. Partial bring-up is intentionally non-fatal and logs PARTIAL BRING-UP with a count. Status-aware grep, no || true.
|
resilience layer: one missing image no longer takes down the whole stack. |
|
multi-gpu OOM: cap ctx/batch by per-gpu vram (not total) so cuda0 stops overcommitting; model weight split per-gpu in headroom calc. verified on 4X and 8X multi-gpu |
Updated setup resources with new tutorial video and presentation slides.
- fix .env mode to 0660 (README + env_set comment) and broken CLAUDE.md link - dedup /dev/shm row, add NVML row, correct installer-timeout wording, refresh tree
What
One-command DreamServer deployment on peer-to-peer GPU marketplaces (Vast.ai). Handles 28 known provider quirks — root user rejection, Docker socket permissions, NVIDIA/AMD toolkit setup, model bootstrapping, multi-GPU topology, reverse proxy, and SSH tunnel generation.
Where
Everything lives in
resources/p2p-gpu/— fully self-contained, no modifications to core DreamServer code or extension manifests.