
feat(llamacpp): add explicit context_length override #5

Open

Scooter-DeJean wants to merge 2 commits into MettaMazza:main from Scooter-DeJean:feat/context-length

Conversation

@Scooter-DeJean
Contributor

Adds an `[llamacpp] context_length = N` TOML field. The default of 0 plumbs through to llama-server `-c 0` (auto-derive from the GGUF), bit-for-bit identical to current behavior. Operators set a non-zero value when the hardware and the model's advertised n_ctx_train combine to exceed the KV cache budget (Qwen3.5/3.6 at 256K is the canonical case).

Tests are embedded, covering the default producing `-c 0` and an explicit value overriding it.

Per your #dev-general feedback this also adds a `tracing::info!` when the override fires, so the §2.7 "behavior changes are never silent" requirement is satisfied at runtime: operators see the override loudly in the logs at server startup.
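
For illustration, a minimal sketch of how the field and the startup log might hang together, assuming a serde-deserialized config struct; `LlamaCppConfig` and `log_context_override` are placeholder names, not the actual ErnOS types:

    // Sketch only: names other than `context_length` are illustrative.
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    pub struct LlamaCppConfig {
        /// 0 = let llama-server auto-derive the context from the GGUF (`-c 0`).
        #[serde(default)]
        pub context_length: u32,
    }

    /// Logs loudly (§2.7) whenever the override is in effect.
    pub fn log_context_override(cfg: &LlamaCppConfig) {
        if cfg.context_length != 0 {
            tracing::info!(
                context_length = cfg.context_length,
                "overriding llama-server context length via [llamacpp].context_length"
            );
        }
    }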

Second of three stacked PRs. The diff currently shows PR-A's commit too because of cross-fork base limitations; it collapses to just this commit once PR-A merges.

The provider startup health check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a GPU box with a fast SSD, but it is
not enough for genuinely slow loads:

  - >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
    any GPU work begins)
  - models split across multiple backends via llama.cpp RPC, where layer
    transfer over the network adds tens of seconds even on a LAN
  - models that need long KV cache allocation on constrained GPUs

When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.

This commit makes the retry budget configurable via a new toml field:

    [general]
    provider_health_check_retries = 240   # 4-minute budget

Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.

Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
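
For reference, a rough sketch of what the config-driven retry budget could look like; `GeneralConfig`, `wait_for_provider`, and `provider_is_healthy` are placeholder names and not the real main.rs code, and the tokio/serde/tracing wiring is assumed:

    // Sketch only: names other than `provider_health_check_retries` are placeholders.
    use serde::Deserialize;
    use std::time::Duration;

    fn default_health_check_retries() -> u32 {
        60 // matches the legacy hardcoded budget
    }

    #[derive(Debug, Deserialize)]
    pub struct GeneralConfig {
        /// Number of 1s health-check attempts before giving up on the provider.
        #[serde(default = "default_health_check_retries")]
        pub provider_health_check_retries: u32,
    }

    async fn wait_for_provider(cfg: &GeneralConfig) -> Result<(), String> {
        for attempt in 1..=cfg.provider_health_check_retries {
            if provider_is_healthy().await {
                return Ok(());
            }
            tracing::debug!(attempt, "provider not healthy yet; retrying in 1s");
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        Err(format!(
            "Provider failed health check after {} attempts. Is the server running?",
            cfg.provider_health_check_retries
        ))
    }

    async fn provider_is_healthy() -> bool {
        // Placeholder: the real check would hit the provider's health endpoint.
        false
    }

Because the default comes from a default-fn rather than a required field, a TOML file that omits `provider_health_check_retries` deserializes to the same 60-attempt budget as before the patch.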
llama.cpp interprets the existing hardcoded `-c 0` as "use the model's
n_ctx_train." Modern hybrid-reasoning model families (Qwen3.5/3.6, several
recent Mistral releases) advertise 256K-1M training contexts. On VRAM-
limited GPUs, llama.cpp's KV-cache allocator either:

  - auto-shrinks the context to fit available VRAM (which can still take
    60-90s to allocate), or
  - OOMs and refuses to start.

Either way, ErnOS's startup health check times out before llama-server is
ready, the supervisor logs "embedding/inference server spawned but not
healthy," and the operator is left with a useless dashboard.

This commit adds an opt-in toml field:

    [llamacpp]
    context_length = 32768   # bounds KV cache; 0 = auto-detect (default)

The default value 0 produces `-c 0` — bit-for-bit identical to current
behavior. Operators only set context_length when their hardware plus the
model's advertised training context combine to exceed the GPU's KV cache
budget.

Tests cover: default produces `-c 0`, explicit value produces `-c <value>`.
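
A sketch of the argument-building side under the same assumption; `build_server_args` is the function named in the review below, but its signature here and the test shape are illustrative guesses, not the actual code:

    // Sketch only: everything except `build_server_args` and `context_length`
    // is illustrative.
    fn build_server_args(context_length: u32, model_path: &str) -> Vec<String> {
        vec![
            "-m".to_string(),
            model_path.to_string(),
            // 0 keeps llama-server's auto-derive-from-GGUF behavior;
            // any other value bounds the KV cache explicitly.
            "-c".to_string(),
            context_length.to_string(),
        ]
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        fn ctx_arg(args: &[String]) -> &str {
            let pos = args.iter().position(|a| a == "-c").expect("-c flag present");
            &args[pos + 1]
        }

        #[test]
        fn default_produces_c_zero() {
            assert_eq!(ctx_arg(&build_server_args(0, "model.gguf")), "0");
        }

        #[test]
        fn explicit_value_overrides() {
            assert_eq!(ctx_arg(&build_server_args(32768, "model.gguf")), "32768");
        }
    }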
@MettaMazza
Owner

🔴 REJECTED — Governance Violation

Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)

Evidence: This PR adds 40 lines to src/provider/llamacpp.rs, pushing it from 518 → 557 lines. §1.1 mandates a maximum of 500 lines per file. §11 states: "A PR is immediately rejected if it contains any of the following. No discussion, no exceptions."

Required before re-review:

  • Split llamacpp.rs as part of this PR (not a follow-up). Extract test code into llamacpp_tests.rs or split build_server_args into a server_args.rs submodule — whichever gets the file under 500 lines.

The new code itself (config field, serde default, tracing log, tests) is clean. The file it lands in is not. Fix the file, resubmit.
