feat(llamacpp): add rpc_servers to enable multi-machine GPU mesh #6
Open
Scooter-DeJean wants to merge 3 commits into MettaMazza:main from
Conversation
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model loading from a fast SSD onto a local
GPU, but it is not enough for genuinely slow loads:
- >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
any GPU work begins)
- models split across multiple backends via llama.cpp RPC, where layer
transfer over the network adds tens of seconds even on a LAN
- models that need long KV cache allocation on constrained GPUs
When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.
This commit makes the retry budget configurable via a new toml field:
[general]
provider_health_check_retries = 240 # 4-minute budget
The default is 60 via a default-fn, matching the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior (sketched below).
Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
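For reference, a minimal sketch of the default-fn deserialization pattern described above (the struct and function names are illustrative, not the actual ErnOS config layout):

```rust
use serde::Deserialize;

// Matches the legacy hardcoded retry budget.
fn default_health_check_retries() -> u32 {
    60
}

#[derive(Deserialize)]
struct GeneralConfig {
    /// Number of 1s health-check attempts before startup is declared failed.
    /// A missing field deserializes to 60, i.e. pre-patch behavior.
    #[serde(default = "default_health_check_retries")]
    provider_health_check_retries: u32,
}
```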
llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:
- auto-shrinks the context to fit available VRAM (which can still take
60-90s to allocate), or
- OOMs and refuses to start.
Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.
This commit adds an opt-in toml field:
[llamacpp]
context_length = 32768 # bounds KV cache; 0 = auto-detect (default)
The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.
Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
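A sketch of how the field could map onto the flag; `context_args` is a hypothetical helper, not the actual shape of build_server_args:

```rust
/// Translate the opt-in context_length field into llama-server's `-c` argument.
/// 0 keeps today's `-c 0` (auto-detect) behavior bit-for-bit.
fn context_args(context_length: u32) -> Vec<String> {
    vec!["-c".to_string(), context_length.to_string()]
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn default_zero_produces_c_zero() {
        assert_eq!(context_args(0), vec!["-c", "0"]);
    }

    #[test]
    fn explicit_value_is_passed_through() {
        assert_eq!(context_args(32_768), vec!["-c", "32768"]);
    }
}
```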
llama.cpp ships first-class RPC support: a small `rpc-server` binary runs
on a remote host, exposes its GPU(s) over TCP, and `llama-server --rpc
HOST:PORT` distributes model layers across the local GPU and any listed
remote backends. This means a user with two modest GPUs across two
machines (e.g. an 11 GB card + a 16 GB card) can run a 27B-32B-class model
that wouldn't fit on either alone.
ErnOS's LlamaCppProvider::build_server_args doesn't pass `--rpc`. There is
no toml field for it. Users wanting multi-GPU mesh inference have to fork.
This commit adds an opt-in toml field:
[llamacpp]
rpc_servers = "10.0.0.5:50052,10.0.0.6:50052"
When None (default), no `--rpc` flag is emitted — bit-for-bit identical to
current behavior. The empty-string case (`Some("")`) is also treated as
off, so an operator who unsets the value via TOML doesn't end up emitting
a malformed `--rpc ` (no value) that llama-server rejects.
Tests cover: None -> no --rpc, Some("") -> no --rpc, Some(value) -> --rpc
emitted with the value verbatim.
Security note: llama.cpp's rpc-server is unauthenticated and unencrypted.
The doc comment on the field flags this; operators should only enable
rpc_servers over a private network (LAN, Tailscale, WireGuard, etc.).
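A sketch of the guard, with `rpc_args` as a hypothetical stand-in for the relevant branch of build_server_args:

```rust
/// None and the empty string both emit nothing, so an operator who clears
/// rpc_servers in the toml never produces a malformed `--rpc` with no value.
/// Any non-empty value is passed through verbatim.
fn rpc_args(rpc_servers: Option<&str>) -> Vec<String> {
    match rpc_servers {
        Some(list) if !list.trim().is_empty() => {
            vec!["--rpc".to_string(), list.to_string()]
        }
        _ => Vec::new(),
    }
}

// rpc_args(None)                                  -> []
// rpc_args(Some(""))                              -> []
// rpc_args(Some("10.0.0.5:50052,10.0.0.6:50052")) -> ["--rpc", "10.0.0.5:50052,10.0.0.6:50052"]
```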
Owner
🔴 REJECTED — Governance Violation
Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)
Evidence: This PR pushes
Required before re-review:
The new code itself (rpc_servers field, empty-string guard, §13.4 security warning, edge-case tests) is solid. The file it lands in is not. Fix the file, resubmit.
Note: This PR is stacked on #5, which has the same issue. Both need the file split.
Adds a `[llamacpp] rpc_servers = "host:port,..."` toml field. When set, llama-server distributes layers across this node's GPU AND the listed remote rpc-server endpoints — enables 27B-32B-class models on combined two-GPU meshes. Default is None; the empty string is treated as None to avoid emitting `--rpc` with no value (which llama-server rejects).

Tests embedded, covering: None → no `--rpc`, Some("") → no `--rpc`, Some(value) → `--rpc <value>` emitted verbatim.

Per your #dev-general feedback, this adds a `tracing::warn!` at server startup whenever rpc_servers is set (sketched below), surfacing the unauthenticated/unencrypted security posture of llama.cpp's rpc-server (§13.4 — your new section). Operators see the warning even if they never read the field's doc comment.

Third of three stacked PRs. The diff currently shows PR-A's and PR-B's commits too because of cross-fork base limitations — it collapses to just this commit after A and B merge.
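A sketch of what that startup warning could look like; the function name, argument plumbing, and message text are assumptions for illustration:

```rust
/// Warn once at startup when rpc_servers is set to a non-empty value.
fn warn_if_rpc_enabled(rpc_servers: Option<&str>) {
    if let Some(list) = rpc_servers.filter(|s| !s.trim().is_empty()) {
        tracing::warn!(
            rpc_servers = %list,
            "llama.cpp rpc-server is unauthenticated and unencrypted; only enable \
             rpc_servers over a trusted private network (LAN, Tailscale, WireGuard)"
        );
    }
}
```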