feat(llamacpp): add explicit context_length override #5
Open
Scooter-DeJean wants to merge 2 commits into MettaMazza:main from
Conversation
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a fast SSD-backed GPU, but is not
enough for genuinely slow loads:
- >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
any GPU work begins)
- models split across multiple backends via llama.cpp RPC, where layer
transfer over the network adds tens of seconds even on a LAN
- models that need long KV cache allocation on constrained GPUs
When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.
This commit makes the retry budget configurable via a new toml field:
[general]
provider_health_check_retries = 240 # 4-minute budget
Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.
Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
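The retry budget described above can be sketched as a plain loop. This is an illustrative sketch, not the actual ErnOS main.rs: the function name, signature, and the injected `sleep` hook are assumptions, and only the error text mirrors the PR description.

```rust
use std::time::Duration;

// Hypothetical health-check loop with a configurable retry budget. The
// `sleep` hook is injected so the sketch can run instantly in a demo.
fn wait_for_provider(
    retries: u32,
    interval: Duration,
    mut check: impl FnMut() -> bool,
    sleep: impl Fn(Duration),
) -> Result<u32, String> {
    for attempt in 1..=retries {
        if check() {
            return Ok(attempt); // healthy: report which attempt succeeded
        }
        sleep(interval);
    }
    Err(format!(
        "Provider failed health check after {retries} attempts. Is the server running?"
    ))
}

fn main() {
    let no_sleep = |_: Duration| {}; // stub so the demo finishes instantly

    // Simulate a slow load that becomes healthy on the 90th probe.
    let mut n = 0;
    let legacy = wait_for_provider(60, Duration::from_secs(1), || { n += 1; n >= 90 }, no_sleep);
    assert!(legacy.is_err()); // legacy 60-attempt budget trips mid-load

    let mut n = 0;
    let tuned = wait_for_provider(240, Duration::from_secs(1), || { n += 1; n >= 90 }, no_sleep);
    assert_eq!(tuned, Ok(90)); // provider_health_check_retries = 240 rides it out
    println!("ok");
}
```

With the field set to 240 the same slow load succeeds on attempt 90 instead of surfacing the misleading "Is the server running?" error.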
llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:
- auto-shrinks the context to fit available VRAM (which can still take
60-90s to allocate), or
- OOMs and refuses to start.
Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.
This commit adds an opt-in toml field:
[llamacpp]
context_length = 32768 # bounds KV cache; 0 = auto-detect (default)
The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.
Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
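The mapping from the toml field to the llama-server flag could look like the following sketch. `context_args` is a hypothetical helper, not the actual ErnOS spawn code; the `-c` semantics (0 = use the model's n_ctx_train) are llama.cpp's own, as described above.

```rust
// Hypothetical helper that turns the config field into llama-server args.
fn context_args(context_length: u32) -> Vec<String> {
    // 0 is the serde default and reproduces the legacy hardcoded `-c 0`
    // exactly; any non-zero value bounds the KV-cache allocation instead.
    vec!["-c".to_string(), context_length.to_string()]
}

fn main() {
    assert_eq!(context_args(0), vec!["-c", "0"]);         // default: unchanged behavior
    assert_eq!(context_args(32768), vec!["-c", "32768"]); // opt-in KV-cache bound
    println!("ok");
}
```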
Owner
🔴 REJECTED — Governance Violation
Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)
Evidence: This PR adds 40 lines to
Required before re-review:
The new code itself (config field, serde default, tracing log, tests) is clean. The file it lands in is not. Fix the file, resubmit.
Adds a `[llamacpp] context_length = N` toml field. Default 0 plumbs through to `llama-server -c 0` (auto-derive from GGUF) — bit-for-bit identical to current behavior. Operators set a non-zero value when hardware plus the model's advertised n_ctx_train combine to exceed the KV cache budget (Qwen3.5/3.6 at 256K is the canonical case). Tests are embedded covering default-produces-`-c 0` and explicit-value-overrides.
Per your #dev-general feedback this also adds a `tracing::info!` when the override fires, so the §2.7 "behavior changes are never silent" requirement is satisfied at runtime — operators see the override loudly in logs at server-startup time.
Second of three stacked PRs. The diff currently shows PR-A's commit too because of cross-fork base limitations; it collapses to just this commit after PR-A merges.
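The "log only when the override fires" shape can be sketched without the tracing crate. This stand-in uses a plain formatter instead of the real `tracing::info!`, and all names here are hypothetical, not the PR's actual code.

```rust
// Hypothetical startup log: silent at the default, loud on an override,
// which is what satisfies the "behavior changes are never silent" rule.
fn override_log(context_length: u32) -> Option<String> {
    (context_length != 0)
        .then(|| format!("llamacpp: context_length override active, passing -c {context_length}"))
}

fn main() {
    assert!(override_log(0).is_none()); // default 0: behavior unchanged, no log
    let msg = override_log(32768).unwrap();
    assert!(msg.contains("-c 32768")); // override: announced at server startup
    println!("{msg}");
}
```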