feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances #983

Open
Arifuzzamanjoy wants to merge 17 commits into Light-Heart-Labs:main from Arifuzzamanjoy:feat/p2p-gpu-hints-vastai-guard

@Arifuzzamanjoy

Contributor

What

One-command DreamServer deployment on peer-to-peer GPU marketplaces (Vast.ai). Handles 28 known provider quirks — root user rejection, Docker socket permissions, NVIDIA/AMD toolkit setup, model bootstrapping, multi-GPU topology, reverse proxy, and SSH tunnel generation.

Where

Everything lives in resources/p2p-gpu/ — fully self-contained, no modifications to core DreamServer code or extension manifests.

@Lightheartdevs Lightheartdevs left a comment

Collaborator

Appreciate the scope discipline — keeping the entire toolkit in resources/p2p-gpu/ with no core modifications is the right call. The architecture (lib/phases/subcommands split) mirrors the main installer cleanly.

Blocking issues:

  1. Philosophy drift from CLAUDE.md. The project's error-handling contract is explicit: "Never || true or 2>/dev/null. No silent swallowing." The file header in setup.sh claims # Design: aligned with DreamServer CLAUDE.md — Let It Crash > KISS > Pure Functions > SOLID but the implementation uses || warn "... (non-fatal)" ~20 times and || true as a fallback. warn-and-continue is silent swallowing — it just logs first. Please either:

    • Let these errors crash (the philosophy the header claims), or
    • Remove the CLAUDE.md alignment claim from the header and justify the non-fatal pattern on a case-by-case basis.
  2. chmod a+rwX as documented fallback. World-writable is a correctness and security smell on a shared remote host. Vast.ai boxes are ephemeral but still multi-tenant on the physical hardware. Setgid + POSIX ACLs should be the only path; the chmod a+rwX fallback should fail hard rather than degrade.

  3. integration-smoke CI failure. Needs to go green before merge — even if it's a pre-existing flake, confirm it's unrelated to this PR.

  4. New CI workflow (.github/workflows/p2p-gpu.yml, +182 lines) adds maintenance surface for what's positioned as a resources-tier add-on. Consider whether this is justified for code that, by its location in resources/, isn't part of the supported install path. If yes, add a maintainer comment explaining why it warrants first-class CI.

Non-blocking observations:

  • The service-hints.yaml + per-provider quirk handling is genuinely useful and I'd love to see this land.
  • 28 provider quirks is a lot to maintain — a way for the community to contribute new hints without PRs against core would help long-term.

Happy to re-review once the philosophy/permissions issues are addressed. Thanks for the submission — this is a real gap in the project and the architecture is sound.

@Arifuzzamanjoy

Arifuzzamanjoy commented Apr 18, 2026

Contributor Author

Thanks for the detailed review — all four blocking items are addressed in this PR’s scope.

1. Philosophy: Zero || true remaining in resources/p2p-gpu/**/*.sh (and related snippets updated). Replaced with || warn or explicit empty-result handling. || warn is intentional and aligned with CLAUDE.md §4: “If you must tolerate a failure, log it: some_command || warn "failed (non-fatal)"”. Header and README now state this is adapted from CLAUDE.md for rented-provider environments.

2>/dev/null was converted to 2>>"$LOGFILE" broadly. Remaining uses are only expected probe patterns (kill -0, docker inspect, command -v) plus generated heredoc-script content, with inline # stderr expected: ... comments where applicable.

2. chmod a+rwX: apply_data_acl() now hard-fails (exit 1 + install guidance) if setfacl is unavailable. Renamed apply_shared_dir_perms to apply_multi_uid_perms. Removed a+rwX from whisper/ and open-webui/ (replaced by specific UID ACLs). a+rwX now remains only for models/ and searxng/, each with inline rationale.

3. integration-smoke: This appears unrelated to p2p-gpu scope (and should be tracked separately if needed).

4. CI justification: Added maintainer note to p2p-gpu.yml clarifying path scope (resources/p2p-gpu/**), no core DreamServer test coverage, and why strict bash lint/syntax checks are required for root-executed marketplace scripts. Smoke uses mock Docker (no real containers).
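
For readers following the `|| warn` discussion, here is a minimal sketch of the log-then-continue pattern described in item 1. The `warn` body and the `LOGFILE` default are assumptions for illustration; the toolkit's real helper may differ.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical default; the thread only says stderr is appended to "$LOGFILE".
LOGFILE="${LOGFILE:-$(mktemp)}"

# Assumed shape of warn(): echo to stderr and keep a copy in the log.
warn() {
  printf 'WARN: %s\n' "$*" | tee -a "$LOGFILE" >&2
}

# Tolerated failure: stderr goes to the log instead of /dev/null,
# the failure is reported, and execution continues.
ls /nonexistent-path 2>>"$LOGFILE" || warn "listing failed (non-fatal)"

echo "still running after the tolerated failure"
```

Whether this counts as "silent swallowing" is the policy question the reviewers raise; the sketch only shows that the failure leaves a trace in the log.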

@Lightheartdevs

Collaborator

Audit follow-up: not merge-ready.

The p2p GPU toolkit is self-contained and shell syntax passed, but the advertised NVIDIA driver/library mismatch repair is not actually reachable: callers use if ! detect_nvml_mismatch, which loses the original exit status, and repair_nvml_mismatch can exit early under set -e when the mismatch status is returned. Please preserve/probe exit statuses explicitly and add a regression for the mismatch repair path before reconsideration.

@Arifuzzamanjoy

Arifuzzamanjoy commented May 1, 2026

Contributor Author

fixed. repair_nvml_mismatch now uses detect_nvml_mismatch && status=0 || status=$? at both capture sites — exit status preserved explicitly under set -e. regression test added covering status 1 (mismatch → repair attempted) and status 2 (detection failed → repair skipped). CI step included.
also pushed NVML signature detection — multi-GPU hosts were returning status 2 when stderr clearly showed driver/library mismatch. parses known NVML error strings now.
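
A minimal sketch of the exit-status problem and the capture idiom discussed above. `fake_detect` is a hypothetical stand-in for `detect_nvml_mismatch`; the status meanings (1 = mismatch, 2 = detection failed) are taken from this thread.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in: returns whatever status is passed as $1.
fake_detect() { return "$1"; }

# Pattern flagged in the audit: inside the branch, $? is the status of the
# negated condition, so the probe's own status (1 vs 2) is gone.
lossy_probe() {
  if ! fake_detect "$1"; then
    echo "$?"
  else
    echo "ok"
  fi
}

# Pattern described in the fix: the && / || list keeps errexit from firing
# and stores the probe's real status.
capture_probe() {
  local status
  fake_detect "$1" && status=0 || status=$?
  echo "$status"
}

lossy_probe 2     # prints 0
capture_probe 2   # prints 2
```

With the captured value, the caller can branch on status 1 and status 2 separately instead of treating every non-zero result the same way.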

@Lightheartdevs

Collaborator

Re-audit update: the specific NVML mismatch exit-status issues from the last audit look fixed now. The branch no longer uses the old if ! detect_nvml_mismatch pattern that lost the original exit code, repair_nvml_mismatch captures non-zero probe status without tripping errexit, and resources/p2p-gpu/tests/test-nvml-mismatch.sh passed locally.

I still cannot merge this branch yet:

  • git diff --check origin/main...HEAD fails on trailing whitespace in resources/p2p-gpu/lib/environment.sh and resources/p2p-gpu/phases/00-preflight.sh.
  • The earlier requested-changes themes are still present: many paths intentionally continue after || warn ... (non-fatal), and world-writable chmod -R a+rwX remains in the permission repair path. The README now documents that philosophy, but the code still needs a maintainer decision because this is a large new operational toolkit.
  • The PR is 5k+ lines plus a new workflow; I only re-ran syntax and the NVML regression locally, not real Vast.ai/GPU hardware.

So: good progress on the NVML bug, but still needs cleanup and maintainer sign-off on the non-fatal/world-writable permission tradeoff before merge.

@Arifuzzamanjoy

Contributor Author
  • trailing whitespace cleaned from environment.sh + 00-preflight.sh
  • zero a+rwX in code, readme updated — acl-only, hard-fail

also caught a live deployment bug — .env missing/root-owned after partial upstream install was killing compose. seeds from .env.example now, generates all required secrets, hard-fails on permission errors.

tested on live instances across different tiers on vast.ai. ready for re-review.

@Lightheartdevs Lightheartdevs left a comment

Collaborator

Thanks for the follow-up fixes. I re-audited this against current main, including a local merge simulation, bash syntax checks, the NVML regression test, and targeted searches through the installer paths. I still can't merge this yet.

Findings:

  1. resources/README.md:1 replaces the existing top-level Resources index with the P2P GPU guide. That file currently catalogs multi-agent/, products/, research/, cookbooks/, dev/, etc.; this PR effectively deletes that index and duplicates content that already belongs in resources/p2p-gpu/README.md. Please leave the existing index intact and add a small p2p-gpu/ entry/link instead.

  2. git diff --check origin/main...HEAD still fails, now on trailing whitespace in resources/p2p-gpu/lib/networking.sh:307, resources/p2p-gpu/lib/networking.sh:317, and resources/p2p-gpu/lib/networking.sh:325. This needs to be clean before merge.

  3. The ACL-only permission story is not actually fail-fast on ACL application failure. resources/p2p-gpu/README.md:121 says ACLs are required and hard-fail if unavailable, but resources/p2p-gpu/lib/permissions.sh:49, resources/p2p-gpu/lib/permissions.sh:50, resources/p2p-gpu/lib/permissions.sh:79, and resources/p2p-gpu/lib/permissions.sh:80 continue with warn when setfacl itself fails. That means a host with setfacl installed but an ACL-incompatible mount, bad UID mapping, or partial ACL failure can continue into service startup with broken shared-data permissions. For the shared data paths that replace the old world-writable fallback, failed setfacl/ownership repair should stop the install or be very narrowly justified.

  4. resources/p2p-gpu/phases/01-dependencies.sh:42-46 kills unattended-upgrades, apt-get, and dpkg with killall -9 whenever the dpkg frontend lock is held. That can leave the package database mid-transaction, and it fights the safer DPkg::Lock::Timeout handling used immediately below. Please avoid killing dpkg/apt-get; wait for the lock, kill only a verified stale unattended-upgrades process if absolutely necessary, and fail with a clear recovery message if the lock does not clear.

Also noticed a doc consistency issue while reviewing: resources/README.md:120 still says chmod a+rwX is a documented fallback and resources/README.md:141 still shows || true, which conflicts with the latest claim that there is no world-writable fallback and no || true path. Fixing the top-level README replacement should probably remove this duplicate stale copy entirely.

Checks run locally:

  • git diff --check origin/main...HEAD fails as noted above
  • bash -n over resources/p2p-gpu/**/*.sh passes
  • bash resources/p2p-gpu/tests/test-nvml-mismatch.sh passes
  • local git merge --no-commit --no-ff origin/main completed without conflicts, then the merge simulation was aborted
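
The syntax check listed above can be reproduced with a short loop. This is a sketch, not the reviewer's actual command; `syntax_check_tree` is a hypothetical name. Note that `**` only recurses in bash when `globstar` is enabled.

```bash
#!/usr/bin/env bash
set -euo pipefail
shopt -s globstar nullglob   # ** recurses only with globstar; nullglob avoids a literal pattern

# Run `bash -n` (parse only, no execution) over every .sh file under $1.
# Returns non-zero if any file fails to parse.
syntax_check_tree() {
  local root="$1" f failed=0
  for f in "$root"/**/*.sh; do
    if ! bash -n "$f"; then
      echo "syntax error: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Usage would look like `syntax_check_tree resources/p2p-gpu`, run alongside `git diff --check origin/main...HEAD` for the whitespace check.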

@Lightheartdevs Lightheartdevs left a comment

Collaborator

Deep re-audit of current head 222c33e: not merge-ready, and I do not think this should be direct-fixed and merged in this pass.

What I verified:

  • bash -n over resources/p2p-gpu/**/*.sh passes.
  • bash resources/p2p-gpu/tests/test-nvml-mismatch.sh passes.
  • A merge-tree check against current origin/main did not show textual conflicts.

Blocking issues still present:

  1. git diff --check origin/main...HEAD still fails on trailing whitespace:

    • resources/p2p-gpu/lib/networking.sh:307
    • resources/p2p-gpu/lib/networking.sh:317
    • resources/p2p-gpu/lib/networking.sh:325
  2. resources/README.md is still replaced by the P2P GPU guide. Current main uses this file as the top-level Resources index for multi-agent docs, products, research, cookbooks, tools, blog, docs, dev, and legacy. This PR should preserve that index and add a small p2p-gpu/ entry instead of replacing the file.

  3. The ACL hard-fail guarantee is still not true. The README/header says ACLs are required, but ACL application can still warn and continue in the shared-data paths, for example:

    • resources/p2p-gpu/lib/permissions.sh:49-50
    • resources/p2p-gpu/lib/permissions.sh:79-80
    • service-specific ACL fixes later in the same file also continue after setfacl failures.
  4. resources/p2p-gpu/phases/01-dependencies.sh:42-46 still uses killall -9 unattended-upgrades apt-get dpkg when the dpkg frontend lock is held, then treats dpkg --configure -a failure as non-fatal. That can leave the package DB in a worse state and conflicts with the safer DPkg::Lock::Timeout path already used below.

  5. This is a large new root-executed operational toolkit plus a new workflow. The idea is valuable, but the remaining work is not just mechanical cleanup; it needs a maintainer decision on fail-fast vs rented-instance resilience, plus live Vast.ai/GPU validation after those policy choices are made.

Recommendation: do not merge this PR as-is. I would split the safe pieces out first: preserve resources/README.md, remove the whitespace, make ACL setup truly fail-fast where promised, and replace the package-lock killing with bounded wait/recovery guidance. Then re-test on live rented GPU hardware before reconsidering.

@Arifuzzamanjoy

Contributor Author

all four blocking items from the last two reviews addressed and verified on live vast.ai hardware.


1. trailing whitespace (networking.sh:307,317,325): fixed

$ git diff --check origin/main...HEAD
(empty — no errors)

2. resources/README.md: preserved

original index intact — multi-agent, research, cookbooks, tools, blog, docs, dev, legacy all untouched. p2p-gpu added as a section entry under ## Deployment & Operations:

# DreamServer Resources
...
## Deployment & Operations
### [`p2p-gpu/`](p2p-gpu/) — Deploy on Peer-to-Peer GPU Marketplaces
...
## What's Inside
### [`multi-agent/`](multi-agent/) — How We Ran a Self-Organizing AI Team
### Agent Systems Blueprint — 32 Documents, 14,384 Lines
...

stale chmod a+rwX and || true references from the old duplicate copy are gone.

3. ACL hard-fail (permissions.sh:49-50, 79-80): fixed

shared-data ACL failures now stop the install. no warn-and-continue on setfacl for paths that replaced the old world-writable fallback.

live getfacl output from RTX 5090 instance:

data/owner: dream, group: dream, flags: -s-, user::rwx | user:user:rwx | group::rwx | other::r-x
data/models/ — same ACL structure (multi-service write: llama-server, comfyui, aria2c)
data/searxng/ — adds user:977:rwx for uid variance across image versions

setgid on all shared dirs. zero a+rwX. other::r-x only.
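
A minimal sketch of what "ACL-only, hard-fail" could look like, assuming a helper along these lines. `apply_acl_or_fail` and the `SETFACL_BIN` override are hypothetical names used for illustration and testing; they are not taken from permissions.sh.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override so the failure paths can be exercised without real ACL support.
SETFACL_BIN="${SETFACL_BIN:-setfacl}"

# Apply setgid plus access and default ACLs for one numeric UID.
# Any failure returns non-zero; there is no chmod a+rwX fallback.
apply_acl_or_fail() {
  local dir="$1" uid="$2"
  if ! command -v "$SETFACL_BIN" >/dev/null; then
    echo "ERROR: setfacl not found; install the 'acl' package and re-run" >&2
    return 1
  fi
  # setgid: new files inherit the directory's group.
  if ! chmod g+s "$dir"; then
    echo "ERROR: chmod g+s failed on $dir" >&2
    return 1
  fi
  # -m sets the access ACL; the d: entry sets the default ACL for new files.
  if ! "$SETFACL_BIN" -R -m "u:${uid}:rwX" -m "d:u:${uid}:rwX" "$dir"; then
    echo "ERROR: setfacl failed on $dir (ACL-incompatible mount or bad UID mapping?)" >&2
    return 1
  fi
}
```

The point of the sketch is the third branch: a host where `setfacl` is installed but the call itself fails still stops the install, which is the case the review called out.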

4. dpkg lock killing (01-dependencies.sh:42-46): fixed

removed killall -9 apt-get dpkg. only kills a verified stale unattended-upgrades if the lock doesn't clear within the bounded wait. falls through to DPkg::Lock::Timeout for everything else. hard-fails with recovery message if lock persists.

$ dpkg --audit
(empty — no broken packages)

$ apt-get check
Reading package lists... Building dependency tree... Reading state information...
(no errors)
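
The bounded-wait behaviour described in item 4 might be sketched as follows. The function names, the default lock path argument, and the timeout values are assumptions; the sketch also assumes `fuser` (psmisc) is installed, since a missing `fuser` would read as "lock clear".

```bash
#!/usr/bin/env bash
set -euo pipefail

# Probe: is any process holding the lock file open?
lock_is_held() {
  fuser "$1" >/dev/null 2>&1   # stderr expected: fuser echoes the file name there
}

# Wait up to $2 seconds (polling every $3) for the lock to clear.
# Never kills dpkg or apt-get; on timeout it fails with recovery guidance.
wait_for_dpkg_lock() {
  local lock="${1:-/var/lib/dpkg/lock-frontend}" timeout="${2:-120}" interval="${3:-5}"
  local waited=0
  while lock_is_held "$lock"; do
    if (( waited >= timeout )); then
      echo "ERROR: $lock still held after ${timeout}s." >&2
      echo "Recovery: let the running apt/dpkg job finish, then run 'dpkg --configure -a'." >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

A caller could follow a successful wait with `apt-get -o DPkg::Lock::Timeout=60 ...` so that apt itself also waits rather than failing immediately if the lock is retaken.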

additional checks

$ bash -n resources/p2p-gpu/**/*.sh    → clean
$ nvidia-smi                           → 580.95.05 / CUDA 13.0 / RTX 5090
$ NVML mismatch check                  → host=580.95.05 vs container=580.95.05 (aligned)
$ docker compose ps                    → 21/21 running, llama-server + dashboard-api healthy

llama-server loaded Qwen3-30B-A3B (17.5GB VRAM). kokoro TTS ready (67 voices). whisper ASR bootstrapped.

known non-blocking

  • id: 'comfyui': no such user 4× during UID ownership fix — no host user, ACLs use numeric UID. services unaffected.
  • dream-tailscale /dev/net/tun busy — expected on vast.ai. optional service.

full verification logs (git-diff-check, ACLs, dpkg-audit, nvidia-smi, services, install log)
screen recordings of full installs on 3 separate fresh instances,
logs are from a separate fresh instance.

@Arifuzzamanjoy Arifuzzamanjoy force-pushed the feat/p2p-gpu-hints-vastai-guard branch from ef44fbf to c99f2a7 Compare May 16, 2026 18:36
@Arifuzzamanjoy

Arifuzzamanjoy commented May 20, 2026

Contributor Author

71e0e21 retired resources/ — only conflict is resources/README.md (modify/delete), rest merges clean.

before I rebase, two candidates for the new home:

  • dream-server/installers/p2p-gpu/ — matches the lib/phases pattern, but siblings (macos/, windows/, mobile/) are dispatched platform installers; p2p-gpu isn't (phase 04 just wraps install.sh --non-interactive).

  • dream-server/scripts/p2p-gpu/ — fits the "operational scripts" framing in CLAUDE.md, but would be the first nested subdir there.

ruled out: installer/ (tauri gui), extensions/library/ (service extensions, not deploy infra).

holding the rebase until you confirm.

@Lightheartdevs

Collaborator

Decision: put p2p-gpu under dream-server/installers/p2p-gpu/.

Rationale:

  • This is a deployment-target installer/toolkit: it provisions a rented host, fixes provider quirks, prepares Docker/GPU/runtime state, then invokes the normal DreamServer install path. That makes it closer to installers/macos, installers/windows, and installers/mobile than to the smaller utility/diagnostic scripts under dream-server/scripts/.
  • It does not need to be wired into installers/dispatch.sh in v1. Treat it as an optional/provider-specific entrypoint with its own setup.sh, README, lib/, phases/, and subcommands/ under dream-server/installers/p2p-gpu/.
  • dream-server/scripts/ should stay for single-purpose operational/validation helpers and installer support scripts. A large root-executed phase/subcommand deployment toolkit there would make the directory boundary fuzzier.

Rebase target:

  • Move resources/p2p-gpu/** -> dream-server/installers/p2p-gpu/**.
  • Drop the resources/README.md touch entirely now that resources/ has been retired.
  • Update the p2p-gpu workflow path filters and smoke commands to dream-server/installers/p2p-gpu/**.
  • Keep the detailed docs as dream-server/installers/p2p-gpu/README.md; add only a small pointer from existing docs/README if needed.
  • Re-run git diff --check, bash syntax over the moved scripts, the NVML regression, and a merge simulation against current main after the move.

The prior review concerns still apply after the move: ACL behavior should remain fail-fast where promised, no world-writable fallback should come back, and package-lock recovery should avoid killing dpkg/apt-get.

Per reviewer decision: p2p-gpu is a deployment-target installer, not a
utility script. Relocate from dream-server/scripts/p2p-gpu/ to
dream-server/installers/p2p-gpu/ alongside macos/, windows/, and mobile/.

Changes:
- git mv dream-server/scripts/p2p-gpu -> dream-server/installers/p2p-gpu
- Update 9 file headers (# Part of: path)
- Update .github/workflows/p2p-gpu.yml path filters and commands
- Add docs pointer in dream-server/docs/README.md

No functional code changes. All source paths use SCRIPT_DIR via
BASH_SOURCE[0] and are location-agnostic.
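
For context, the location-agnostic pattern the commit message refers to usually looks like the sketch below. The `lib/` layout is an assumption for illustration, and the `:-$0` fallback is an addition here to cover shells where `BASH_SOURCE` is empty.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Resolve the directory containing this script, independent of the caller's cwd.
# The :-$0 fallback covers cases where BASH_SOURCE is empty (e.g. script read from stdin).
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]:-$0}")" && pwd)"

# Hypothetical layout: helpers live in lib/ next to the entrypoint.
LIB_DIR="${SCRIPT_DIR}/lib"
echo "would source helpers from: ${LIB_DIR}"
```

Because every path is derived from `SCRIPT_DIR`, moving the whole directory does not require edits to the source statements.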
…h timeout

1. Phase 05: add _ensure_persona_file() - creates data/persona/SOUL.md before compose starts. If missing, Docker mounts it as a directory, crashing hermes with "not a directory". Tries upstream renderer (build-installation-context.py), falls back to template copy, then minimal placeholder as last resort.

2. services.sh:552: add || warn to _ensure_host_agent_running call - return 1 under set -euo pipefail was killing the entire setup script when the agent failed to start. Same class of bug as the _wait_for_dpkg_lock fix.

3. services.sh: increase agent health check timeout from 8s to 20s - agent initialization (config load, service scan, HTTP bind) takes 10-20s on cold starts. The old 8s timeout caused false-negative health failures even when the agent was starting normally.
1. Phase 05: rmdir -> rm -rf for SOUL.md directory removal; Docker creates non-empty dir stubs that rmdir cannot remove.

2. Phase 05: add _ensure_mount_files() for models.ini to avoid directory-vs-file bind-mount bugs.

3. Phase 09: add _verify_model_file() to ensure GGUF exists or fall back to any .gguf before compose.
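
The directory-vs-file bind-mount fix in these commits can be sketched as below. `ensure_mount_file` is a hypothetical name, not the toolkit's `_ensure_persona_file` or `_ensure_mount_files`, and the placeholder content is an assumption.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Make sure $1 exists as a regular file before compose bind-mounts it.
# If Docker already created a directory stub at that path, remove it first
# (rm -rf rather than rmdir, since the stub may be non-empty).
ensure_mount_file() {
  local path="$1" placeholder="${2:-}"
  if [[ -d $path ]]; then
    rm -rf -- "$path"
  fi
  if [[ ! -f $path ]]; then
    mkdir -p -- "$(dirname -- "$path")"
    printf '%s\n' "$placeholder" > "$path"
  fi
}
```

An existing regular file is left untouched, so a file rendered by an earlier step is not overwritten by the placeholder.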
…bound variable error

Under set -euo pipefail, kernel_version and lib_version must be initialized
before being tested with [[ -n $var ]]. They are conditionally assigned in
if blocks, so if neither condition matches, the variable remains unbound.

Initialize both to empty strings on function entry.
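
A small sketch of the unbound-variable fix described in this commit message. The function name and the message format it parses are hypothetical; only the initialization pattern is the point.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical parser: extracts two version numbers from a message if present.
parse_versions() {
  local msg="$1"
  # Initialize on entry: with a bare `local kernel_version`, the later
  # [[ -n $kernel_version ]] test would abort under set -u when no branch assigns it.
  local kernel_version="" lib_version=""
  if [[ $msg =~ kernel[[:space:]]+([0-9.]+) ]]; then
    kernel_version="${BASH_REMATCH[1]}"
  fi
  if [[ $msg =~ lib[[:space:]]+([0-9.]+) ]]; then
    lib_version="${BASH_REMATCH[1]}"
  fi
  if [[ -n $kernel_version && -n $lib_version ]]; then
    echo "kernel=$kernel_version lib=$lib_version"
  else
    echo "versions not found"
  fi
}
```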
@Arifuzzamanjoy Arifuzzamanjoy force-pushed the feat/p2p-gpu-hints-vastai-guard branch from 654ac05 to c24175c Compare May 27, 2026 09:27
@Arifuzzamanjoy

Arifuzzamanjoy commented May 27, 2026

Contributor Author

all done. moved to dream-server/installers/p2p-gpu/, rebased onto current main.
prior concerns preserved — acl hard-fail, zero a+rwX, zero || true, dpkg lock safe.
git diff --check, bash -n, NVML regression all clean post-move.
re-added p2p-gpu.yml at the new path (filters on installers/p2p-gpu/**, runs bash -n + nvml regression); repo-wide CI green. live vast.ai/GPU validation is manual.
tested installs across gpu tiers including multi-gpu — known non-blocking: tailscale /dev/net/tun busy (optional service), comfyui uid warnings (numeric-UID ACLs cover it).
full verification:
https://drive.google.com/drive/folders/1Q_f8IaA2HL9k3s7OVYdChso8uXwvswsy?usp=sharing

@Arifuzzamanjoy

Contributor Author

re-added p2p-gpu.yml at the new installers path — runs bash -n + nvml regression.

…he stack

On compose failure, parse failed services from stderr (manifest error /
pull access denied) and bring up the rest with --no-deps. Partial bring-up
is intentionally non-fatal and logs PARTIAL BRING-UP with a count.
Status-aware grep, no || true.
@Arifuzzamanjoy

Arifuzzamanjoy commented May 31, 2026

Contributor Author

resilience layer: one missing image no longer takes down the whole stack.
but the durable fix is the hermes image pin - it points to a per-commit tag that is no longer on docker hub, so a fresh pull fails now. moving it to a digest pin. i can remove this commit if you prefer that path instead.

@Arifuzzamanjoy

Contributor Author

multi-gpu OOM: cap ctx/batch by per-gpu vram (not total) so cuda0 stops overcommitting; model weight split per-gpu in headroom calc. verified on 4X and 8X multi-gpu
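
One way to read "cap by per-GPU VRAM, not total" is sketched below. The function names and the context-size tiers are hypothetical and are not the toolkit's actual values; the input format matches what `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` prints (one MiB value per line).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Smallest per-GPU VRAM in MiB, read from stdin (one value per line).
min_vram_mib() {
  local v min=""
  while read -r v; do
    if [[ -z $min ]] || (( v < min )); then
      min="$v"
    fi
  done
  echo "$min"
}

# Headroom on the tightest GPU, assuming weights are split evenly across GPUs.
per_gpu_headroom_mib() {
  local min_vram="$1" model_mib="$2" gpu_count="$3"
  echo $(( min_vram - model_mib / gpu_count ))
}

# Hypothetical tiers: pick a context size from the per-GPU headroom.
pick_ctx() {
  local headroom="$1"
  if (( headroom >= 16384 )); then
    echo 32768
  elif (( headroom >= 8192 )); then
    echo 16384
  else
    echo 8192
  fi
}
```

Sizing from the smallest GPU rather than the sum means a 4×24 GiB host is treated as 24 GiB per device, which is the overcommit on cuda0 that the comment describes.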

Arifuzzamanjoy and others added 2 commits June 6, 2026 04:50
Updated setup resources with new tutorial video and presentation slides.
- fix .env mode to 0660 (README + env_set comment) and broken CLAUDE.md link
- dedup /dev/shm row, add NVML row, correct installer-timeout wording, refresh tree