feat(llamacpp): add rpc_servers to enable multi-machine GPU mesh #6
Open
Scooter-DeJean wants to merge 3 commits into MettaMazza:main from
Conversation
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model loading from a fast SSD onto a local
GPU, but it is not enough for genuinely slow loads:
- >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
any GPU work begins)
- models split across multiple backends via llama.cpp RPC, where layer
transfer over the network adds tens of seconds even on a LAN
- models that need long KV cache allocation on constrained GPUs
When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.
This commit makes the retry budget configurable via a new toml field:
[general]
provider_health_check_retries = 240 # 4-minute budget
The default is 60 via a default-fn, matching the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior (sketched below).
Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
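For reference, a minimal sketch of the default-fn deserialization pattern described above (the struct and function names are illustrative, not the actual ErnOS config layout):

```rust
use serde::Deserialize;

// Matches the legacy hardcoded retry budget.
fn default_health_check_retries() -> u32 {
    60
}

#[derive(Deserialize)]
struct GeneralConfig {
    /// Number of 1s health-check attempts before startup is declared failed.
    /// A missing field deserializes to 60, i.e. pre-patch behavior.
    #[serde(default = "default_health_check_retries")]
    provider_health_check_retries: u32,
}
```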
llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:
- auto-shrinks the context to fit available VRAM (which can still take
60-90s to allocate), or
- OOMs and refuses to start.
Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.
This commit adds an opt-in toml field:
[llamacpp]
context_length = 32768 # bounds KV cache; 0 = auto-detect (default)
The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.
Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
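A sketch of how the field could map onto the flag; `context_args` is a hypothetical helper, not the actual shape of build_server_args:

```rust
/// Translate the opt-in context_length field into llama-server's `-c` argument.
/// 0 keeps today's `-c 0` (auto-detect) behavior bit-for-bit.
fn context_args(context_length: u32) -> Vec<String> {
    vec!["-c".to_string(), context_length.to_string()]
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn default_zero_produces_c_zero() {
        assert_eq!(context_args(0), vec!["-c", "0"]);
    }

    #[test]
    fn explicit_value_is_passed_through() {
        assert_eq!(context_args(32_768), vec!["-c", "32768"]);
    }
}
```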
llama.cpp ships first-class RPC support: a small `rpc-server` binary runs
on a remote host, exposes its GPU(s) over TCP, and `llama-server --rpc
HOST:PORT` distributes model layers across the local GPU and any listed
remote backends. This means a user with two modest GPUs across two
machines (e.g. an 11 GB card + a 16 GB card) can run a 27B-32B-class model
that wouldn't fit on either alone.
ErnOS's LlamaCppProvider::build_server_args doesn't pass `--rpc`. There is
no toml field for it. Users wanting multi-GPU mesh inference have to fork.
This commit adds an opt-in toml field:
[llamacpp]
rpc_servers = "10.0.0.5:50052,10.0.0.6:50052"
When None (default), no `--rpc` flag is emitted — bit-for-bit identical to
current behavior. The empty-string case (`Some("")`) is also treated as
off, so an operator who unsets the value via TOML doesn't end up emitting
a malformed `--rpc ` (no value) that llama-server rejects.
Tests cover: None -> no --rpc, Some("") -> no --rpc, Some(value) -> --rpc
emitted with the value verbatim.
Security note: llama.cpp's rpc-server is unauthenticated and unencrypted.
The doc comment on the field flags this; operators should only enable
rpc_servers over a private network (LAN, Tailscale, WireGuard, etc.).
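A sketch of the guard, with `rpc_args` as a hypothetical stand-in for the relevant branch of build_server_args:

```rust
/// None and the empty string both emit nothing, so an operator who clears
/// rpc_servers in the toml never produces a malformed `--rpc` with no value.
/// Any non-empty value is passed through verbatim.
fn rpc_args(rpc_servers: Option<&str>) -> Vec<String> {
    match rpc_servers {
        Some(list) if !list.trim().is_empty() => {
            vec!["--rpc".to_string(), list.to_string()]
        }
        _ => Vec::new(),
    }
}

// rpc_args(None)                                  -> []
// rpc_args(Some(""))                              -> []
// rpc_args(Some("10.0.0.5:50052,10.0.0.6:50052")) -> ["--rpc", "10.0.0.5:50052,10.0.0.6:50052"]
```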
Owner
🔴 REJECTED — Governance Violation
Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)
Evidence: This PR pushes
Required before re-review:
The new code itself (rpc_servers field, empty-string guard, §13.4 security warning, edge-case tests) is solid. The file it lands in is not. Fix the file, resubmit.
Note: This PR is stacked on #5, which has the same issue. Both need the file split.
Adds a `[llamacpp] rpc_servers = "host:port,..."` toml field. When set, llama-server distributes layers across this node's GPU AND the listed remote rpc-server endpoints — enables 27B-32B-class models on combined two-GPU meshes. Default is None; the empty string is treated as None to avoid emitting `--rpc` with no value (which llama-server rejects).

Tests embedded, covering: None → no `--rpc`, Some("") → no `--rpc`, Some(value) → `--rpc <value>` emitted verbatim.

Per your #dev-general feedback, this adds a `tracing::warn!` at server startup whenever rpc_servers is set (sketched below), surfacing the unauthenticated/unencrypted security posture of llama.cpp's rpc-server (§13.4 — your new section). Operators see the warning even if they never read the field's doc comment.

Third of three stacked PRs. The diff currently shows PR-A's and PR-B's commits too because of cross-fork base limitations — it collapses to just this commit after A and B merge.
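A sketch of what that startup warning could look like; the function name, argument plumbing, and message text are assumptions for illustration:

```rust
/// Warn once at startup when rpc_servers is set to a non-empty value.
fn warn_if_rpc_enabled(rpc_servers: Option<&str>) {
    if let Some(list) = rpc_servers.filter(|s| !s.trim().is_empty()) {
        tracing::warn!(
            rpc_servers = %list,
            "llama.cpp rpc-server is unauthenticated and unencrypted; only enable \
             rpc_servers over a trusted private network (LAN, Tailscale, WireGuard)"
        );
    }
}
```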