feat(llamacpp): add explicit context_length override #5
Open
Scooter-DeJean wants to merge 2 commits into MettaMazza:main from
Conversation
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a fast SSD-backed GPU, but is not
enough for genuinely slow loads:
- >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
any GPU work begins)
- models split across multiple backends via llama.cpp RPC, where layer
transfer over the network adds tens of seconds even on a LAN
- models that need long KV cache allocation on constrained GPUs
When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.
This commit makes the retry budget configurable via a new toml field:
[general]
provider_health_check_retries = 240 # 4-minute budget
Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.
Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
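The retry budget described above can be sketched as a plain loop. This is an illustrative sketch, not the actual ErnOS main.rs: the function name, signature, and the injected `sleep` hook are assumptions, and only the error text mirrors the PR description.

```rust
use std::time::Duration;

// Hypothetical health-check loop with a configurable retry budget. The
// `sleep` hook is injected so the sketch can run instantly in a demo.
fn wait_for_provider(
    retries: u32,
    interval: Duration,
    mut check: impl FnMut() -> bool,
    sleep: impl Fn(Duration),
) -> Result<u32, String> {
    for attempt in 1..=retries {
        if check() {
            return Ok(attempt); // healthy: report which attempt succeeded
        }
        sleep(interval);
    }
    Err(format!(
        "Provider failed health check after {retries} attempts. Is the server running?"
    ))
}

fn main() {
    let no_sleep = |_: Duration| {}; // stub so the demo finishes instantly

    // Simulate a slow load that becomes healthy on the 90th probe.
    let mut n = 0;
    let legacy = wait_for_provider(60, Duration::from_secs(1), || { n += 1; n >= 90 }, no_sleep);
    assert!(legacy.is_err()); // legacy 60-attempt budget trips mid-load

    let mut n = 0;
    let tuned = wait_for_provider(240, Duration::from_secs(1), || { n += 1; n >= 90 }, no_sleep);
    assert_eq!(tuned, Ok(90)); // provider_health_check_retries = 240 rides it out
    println!("ok");
}
```

With the field set to 240 the same slow load succeeds on attempt 90 instead of surfacing the misleading "Is the server running?" error.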
llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:
- auto-shrinks the context to fit available VRAM (which can still take
60-90s to allocate), or
- OOMs and refuses to start.
Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.
This commit adds an opt-in toml field:
[llamacpp]
context_length = 32768 # bounds KV cache; 0 = auto-detect (default)
The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.
Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
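The mapping from the toml field to the llama-server flag could look like the following sketch. `context_args` is a hypothetical helper, not the actual ErnOS spawn code; the `-c` semantics (0 = use the model's n_ctx_train) are llama.cpp's own, as described above.

```rust
// Hypothetical helper that turns the config field into llama-server args.
fn context_args(context_length: u32) -> Vec<String> {
    // 0 is the serde default and reproduces the legacy hardcoded `-c 0`
    // exactly; any non-zero value bounds the KV-cache allocation instead.
    vec!["-c".to_string(), context_length.to_string()]
}

fn main() {
    assert_eq!(context_args(0), vec!["-c", "0"]);         // default: unchanged behavior
    assert_eq!(context_args(32768), vec!["-c", "32768"]); // opt-in KV-cache bound
    println!("ok");
}
```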
Owner
🔴 REJECTED — Governance Violation
Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)
Evidence: This PR adds 40 lines to
Required before re-review:
The new code itself (config field, serde default, tracing log, tests) is clean. The file it lands in is not. Fix the file, resubmit.
Adds a `[llamacpp] context_length = N` toml field. Default 0 plumbs through to `llama-server -c 0` (auto-derive from GGUF) — bit-for-bit identical to current behavior. Operators set a non-zero value when hardware plus the model's advertised n_ctx_train combine to exceed the KV cache budget (Qwen3.5/3.6 at 256K is the canonical case). Tests are embedded covering default-produces-`-c 0` and explicit-value-overrides.
Per your #dev-general feedback this also adds a `tracing::info!` when the override fires, so the §2.7 "behavior changes are never silent" requirement is satisfied at runtime — operators see the override loudly in logs at server-startup time.
Second of three stacked PRs. The diff currently shows PR-A's commit too because of cross-fork base limitations; it collapses to just this commit after PR-A merges.
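The "log only when the override fires" shape can be sketched without the tracing crate. This stand-in uses a plain formatter instead of the real `tracing::info!`, and all names here are hypothetical, not the PR's actual code.

```rust
// Hypothetical startup log: silent at the default, loud on an override,
// which is what satisfies the "behavior changes are never silent" rule.
fn override_log(context_length: u32) -> Option<String> {
    (context_length != 0)
        .then(|| format!("llamacpp: context_length override active, passing -c {context_length}"))
}

fn main() {
    assert!(override_log(0).is_none()); // default 0: behavior unchanged, no log
    let msg = override_log(32768).unwrap();
    assert!(msg.contains("-c 32768")); // override: announced at server startup
    println!("{msg}");
}
```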