
feat(llamacpp): add rpc_servers to enable multi-machine GPU mesh #6

Open
Scooter-DeJean wants to merge 3 commits into MettaMazza:main from Scooter-DeJean:feat/rpc-servers

Conversation

@Scooter-DeJean
Contributor

Adds an `[llamacpp]` `rpc_servers = "host:port,..."` TOML field. When set, llama-server distributes model layers across this node's GPU and the listed remote rpc-server endpoints, enabling 27B-32B-class models on a combined two-GPU mesh. Default is None; an empty string is treated as None to avoid emitting --rpc with no value (which llama-server rejects).

Embedded tests cover: None → no --rpc, Some("") → no --rpc, Some(value) → --rpc <value> emitted verbatim.

Per your #dev-general feedback, this adds a tracing::warn! at server startup whenever rpc_servers is set, surfacing the unauthenticated/unencrypted security posture of llama.cpp's rpc-server (§13.4, your new section). Operators see the warning even if they never read the field's doc comment.

Third of three stacked PRs. The diff currently shows PR-A's and PR-B's commits too because of cross-fork base limitations; it collapses to just this commit once A and B merge.

The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a fast SSD-backed GPU, but is not
enough for genuinely slow loads:

  - >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
    any GPU work begins)
  - models split across multiple backends via llama.cpp RPC, where layer
    transfer over the network adds tens of seconds even on a LAN
  - models that need long KV cache allocation on constrained GPUs

When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.

This commit makes the retry budget configurable via a new toml field:

    [general]
    provider_health_check_retries = 240   # 4-minute budget

Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.

Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
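The default-fn pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not ErnOS's actual code: the field name matches the TOML above, but `GeneralConfig`'s shape, `wait_until_healthy`, and the probe closure are assumptions (the real config struct would carry a serde `#[serde(default = "...")]` attribute pointing at the default-fn).

```rust
// Hypothetical sketch of the configurable health-check budget.
struct GeneralConfig {
    provider_health_check_retries: u32,
}

// Default-fn mirroring the legacy hardcoded value, so tomls without the
// field keep pre-patch behavior.
fn default_health_check_retries() -> u32 {
    60
}

impl Default for GeneralConfig {
    fn default() -> Self {
        Self {
            provider_health_check_retries: default_health_check_retries(),
        }
    }
}

// Polls `probe` once per attempt; the real loop sleeps 1s between tries.
fn wait_until_healthy(cfg: &GeneralConfig, mut probe: impl FnMut() -> bool) -> bool {
    (0..cfg.provider_health_check_retries).any(|_| probe())
}

fn main() {
    // Missing field / default: the legacy 60-attempt budget.
    assert_eq!(GeneralConfig::default().provider_health_check_retries, 60);

    // A provider that only becomes healthy on the 100th probe needs the
    // larger explicit budget from the toml example above.
    let slow = GeneralConfig { provider_health_check_retries: 240 };
    let mut attempts = 0u32;
    assert!(wait_until_healthy(&slow, || { attempts += 1; attempts >= 100 }));
}
```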

llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:

  - auto-shrinks the context to fit available VRAM (which can still take
    60-90s to allocate), or
  - OOMs and refuses to start.

Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.

This commit adds an opt-in toml field:

    [llamacpp]
    context_length = 32768   # bounds KV cache; 0 = auto-detect (default)

The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.

Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
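The emission logic is small enough to sketch. `context_args` is a hypothetical helper name, not ErnOS's actual `build_server_args` internals; only the `-c` flag semantics come from the commit message above.

```rust
// Minimal sketch of the `-c` emission described above.
fn context_args(context_length: u64) -> Vec<String> {
    // 0 (the default) passes through as `-c 0`, which llama.cpp treats as
    // "use the model's n_ctx_train" -- bit-for-bit the old behavior.
    vec!["-c".to_string(), context_length.to_string()]
}

fn main() {
    assert_eq!(context_args(0), vec!["-c", "0"]);         // default: auto-detect
    assert_eq!(context_args(32768), vec!["-c", "32768"]); // bounded KV cache
}
```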

llama.cpp ships first-class RPC support: a small `rpc-server` binary runs
on a remote host, exposes its GPU(s) over TCP, and `llama-server --rpc
HOST:PORT` distributes model layers across the local GPU and any listed
remote backends. This means a user with two modest GPUs across two
machines (e.g. an 11 GB card + a 16 GB card) can run a 27B-32B-class model
that wouldn't fit on either alone.

ErnOS's LlamaCppProvider::build_server_args doesn't pass `--rpc`. There is
no toml field for it. Users wanting multi-GPU mesh inference have to fork.

This commit adds an opt-in toml field:

    [llamacpp]
    rpc_servers = "10.0.0.5:50052,10.0.0.6:50052"

When None (default), no `--rpc` flag is emitted — bit-for-bit identical to
current behavior. The empty-string case (`Some("")`) is also treated as
off, so an operator who unsets the value via TOML doesn't end up emitting
a malformed `--rpc ` (no value) that llama-server rejects.

Tests cover: None -> no --rpc, Some("") -> no --rpc, Some(value) -> --rpc
emitted with the value verbatim.

Security note: llama.cpp's rpc-server is unauthenticated and unencrypted.
The doc comment on the field flags this; operators should only enable
rpc_servers over a private network (LAN, Tailscale, WireGuard, etc.).
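The three cases the tests cover can be sketched in a few lines. `rpc_args` is a hypothetical helper, not the real provider code; the real implementation would also fire the tracing::warn! about the unauthenticated transport where noted.

```rust
// Sketch of the None / empty-string / value cases described above.
fn rpc_args(rpc_servers: Option<&str>) -> Vec<String> {
    match rpc_servers {
        // Some("") is treated as off, so a malformed `--rpc` with no
        // value (which llama-server rejects) is never emitted.
        Some(v) if !v.is_empty() => {
            // Real provider: tracing::warn! about the unauthenticated,
            // unencrypted rpc-server transport fires here.
            vec!["--rpc".to_string(), v.to_string()]
        }
        _ => Vec::new(),
    }
}

fn main() {
    assert!(rpc_args(None).is_empty());
    assert!(rpc_args(Some("")).is_empty());
    assert_eq!(
        rpc_args(Some("10.0.0.5:50052,10.0.0.6:50052")),
        vec!["--rpc", "10.0.0.5:50052,10.0.0.6:50052"]
    );
}
```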
@MettaMazza
Owner

🔴 REJECTED — Governance Violation

Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)

Evidence: This PR pushes src/provider/llamacpp.rs to 611 lines — 111 lines over the §1.1 limit of 500. §11 states: "A PR is immediately rejected if it contains any of the following. No discussion, no exceptions."

Required before re-review:

  • Split llamacpp.rs as part of this PR. The file must be under 500 lines after merge. Extract tests into llamacpp_tests.rs, move build_server_args into a server_args.rs submodule, or both.

The new code itself (rpc_servers field, empty-string guard, §13.4 security warning, edge-case tests) is solid. The file it lands in is not. Fix the file, resubmit.

Note: This PR is stacked on #5 which has the same issue. Both need the file split.
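One possible shape for the requested split, following the reviewer's suggested names (`server_args`, `llamacpp_tests.rs`); the exact layout and the `build_server_args` signature are assumptions. In the real tree `server_args` would live in its own file under src/provider/; it is shown inline here so the sketch compiles standalone.

```rust
// Hypothetical module split keeping llamacpp.rs under the 500-line limit.
// In llamacpp.rs this would be `mod server_args;` resolving to
// src/provider/server_args.rs, with tests pulled out via
// `#[cfg(test)] #[path = "llamacpp_tests.rs"] mod tests;`.
mod server_args {
    // Arg construction moved out of llamacpp.rs; signature is illustrative.
    pub fn build_server_args(model_path: &str) -> Vec<String> {
        vec!["-m".to_string(), model_path.to_string()]
    }
}
pub use server_args::build_server_args;

fn main() {
    assert_eq!(build_server_args("model.gguf"), vec!["-m", "model.gguf"]);
}
```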
