This note defines a small admin API for loading, unloading, and inspecting models at runtime.
The goal is to avoid editing local.json and restarting the service for routine model management.
It is intentionally a v1 design:
- live runtime control only
- no automatic writes back to settings.json or local.json
- no arbitrary model definitions via API
- no force unload
- no background job system for model loads
Current reality note:
- this admin API is implemented and is the live control plane used by the workbench
- runtime-only load overrides are implemented for llama_cpp, exllamav3, vllm, vllm_serve, and llama_server
- llama_server load/unload starts and stops a managed native llama-server subprocess; binary path, model path, library path, mmproj, draft model path, host, port, and extra native args stay in model config in v1
- vllm_serve load/unload starts and stops a managed local vllm serve subprocess; binary path, target model id/path, library path, environment, host, port, API key, and extra CLI args stay in model config in v1
- the live load_constraints payload remains the source of truth for UI controls when this note and implementation details drift
- Purpose
- Core Concepts
- State Semantics
- Request Behavior By Runtime State
- TranslateGemma Request Notes
- Endpoints
- Unload And In-Flight Requests
The current service merges settings.json and local.json into one effective config, then loads enabled models at startup.
The admin API adds a separate live control plane on top of that merged config:
- the merged config tells us which models are known to the service
- the live runtime state tells us which of those models are currently loaded
That distinction must stay explicit in both the API and the UI.
A configured model definition comes from the merged settings.json + local.json payload.
This is static process input. It includes fields such as:
- model_path
- backend
- device
- prompt_format
- backend-specific settings
- enabled
This definition is not modified by the admin API in v1.
Each configured model also has a live runtime state inside the process.
Allowed states:
- unloaded
- loading
- loaded
- unloading
- failed
These states are runtime-only and may differ from the original enabled value in config.
unloaded:
- the model exists in merged config
- no runtime is currently loaded
- inference requests for this model are rejected
- the model may be loaded through the admin API

loading:
- a runtime load has started but is not complete yet
- inference requests for this model are rejected
- duplicate load requests should be treated as idempotent and return current state

loaded:
- a runtime exists and may serve inference requests
- the model may be unloaded through the admin API

unloading:
- no new inference requests are accepted for this model
- in-flight requests are allowed to finish
- once in-flight requests reach zero, runtime resources are released

failed:
- the last load attempt failed
- last_error should be retained for inspection
- the model may be loaded again through the admin API
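The state descriptions above imply a small transition table. The following is a minimal sketch of that table, not the service's implementation; the names RuntimeState, ALLOWED_TRANSITIONS, and can_transition are hypothetical and chosen only for illustration.

```python
from enum import Enum


class RuntimeState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    FAILED = "failed"


# Transitions read off the state descriptions in this note (assumed table).
ALLOWED_TRANSITIONS = {
    RuntimeState.UNLOADED: {RuntimeState.LOADING},
    RuntimeState.LOADING: {RuntimeState.LOADED, RuntimeState.FAILED},
    RuntimeState.LOADED: {RuntimeState.UNLOADING},
    RuntimeState.UNLOADING: {RuntimeState.UNLOADED},
    RuntimeState.FAILED: {RuntimeState.LOADING},
}


def can_transition(current: RuntimeState, target: RuntimeState) -> bool:
    """Return True when moving from current to target is a listed transition."""
    return target in ALLOWED_TRANSITIONS[current]
```

Keeping the table in one place makes it easy to check that no code path jumps straight from loaded to unloaded, which would skip the drain step.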
For POST /v1/responses:
- loaded: accept
- unloaded: reject
- loading: reject
- unloading: reject
- failed: reject
The error should be explicit and machine-readable.
Suggested error codes:
- unknown_model
- model_not_loaded
- model_loading
- model_unloading
- model_failed
- thinking_unsupported
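One way to keep rejections machine-readable is a single lookup from runtime state to error code. The pairing below is an assumption inferred from the code names; this note does not spell out which state maps to which code, and rejection_code is a hypothetical helper.

```python
from typing import Optional

# Assumed pairing of runtime state to suggested error code.
REJECTION_CODES = {
    "unloaded": "model_not_loaded",
    "loading": "model_loading",
    "unloading": "model_unloading",
    "failed": "model_failed",
}


def rejection_code(runtime_state: Optional[str]) -> Optional[str]:
    """Return an error code, or None when the inference request may proceed.

    A runtime_state of None stands for a model name missing from merged config.
    """
    if runtime_state is None:
        return "unknown_model"
    if runtime_state == "loaded":
        return None
    return REJECTION_CODES[runtime_state]
```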
Requests may optionally include thinking: "default" | "enabled" | "disabled".
default preserves the model configuration. enabled and disabled are
accepted only when the selected model advertises those values in
capabilities.thinking_modes; otherwise the request is rejected with
400 thinking_unsupported.
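The thinking rule can be sketched as a small check against the model's advertised capabilities. This is an illustrative sketch; check_thinking is a hypothetical helper, and it assumes values outside the three documented ones were already rejected by schema validation.

```python
from typing import List, Optional, Tuple


def check_thinking(
    requested: Optional[str], thinking_modes: List[str]
) -> Optional[Tuple[int, str]]:
    """Return (status, code) when the request must be rejected, else None."""
    # An omitted value and "default" both preserve the model configuration.
    if requested is None or requested == "default":
        return None
    # "enabled" / "disabled" are accepted only when the model advertises them.
    if requested in thinking_modes:
        return None
    return (400, "thinking_unsupported")
```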
llama_cpp models configured with prompt_format: "translategemma_template" use the official structured TranslateGemma request shape internally. They remain single-turn text models and continue to report capabilities.multi_turn: false.
Known-source requests should include both source_lang_code and target_lang_code.
Mixed-source requests may omit source_lang_code, or set it to "auto" or "mixed", while still providing target_lang_code. In that mode, the runtime keeps the structured TranslateGemma path, uses an internal valid source-language fallback, and prepends a short instruction asking the model to detect the source language per segment. This supports payloads where one input contains multiple source languages; it is not a raw Gemma prompt/tokenizer path.
Returns all known models from merged config together with their live runtime state.
This endpoint is the main UI source of truth.
Suggested response shape:
{
"models": [
{
"name": "google_gemma-4-E2B-it-Q8_0-gguf",
"resolved_backend": "llama_cpp",
"configured_enabled": true,
"runtime_state": "loaded",
"is_loaded": true,
"replicas": 3,
"replica_max": 4,
"loaded_replicas": 3,
"inflight_requests": 0,
"queue_depth": 0,
"runtime_inflight": 0,
"configured_target_inflight": 1,
"effective_target_inflight": 1,
"last_error": null,
"vram_estimate_mib": 57200,
"vram_estimate_replica_count": 3,
"vram_estimate_source": "model_artifact_size",
"load_constraints": {
"gguf_n_ctx": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"gguf_flash_attn": {
"kind": "enum",
"default": "auto",
"allowed_values": ["on", "off", "auto"],
"examples": ["auto", "on", "off"]
},
"gguf_type_k": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
},
"gguf_type_v": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
}
},
"load_recommendations": {
"gguf_cache_type_pairs": {
"kind": "pair_presets",
"fields": ["gguf_type_k", "gguf_type_v"],
"recommended_pairs": [
{
"label": "f16/f16",
"gguf_type_k": "f16",
"gguf_type_v": "f16"
},
{
"label": "q8_0/q8_0",
"gguf_type_k": "q8_0",
"gguf_type_v": "q8_0"
},
{
"label": "q4_0/q4_0",
"gguf_type_k": "q4_0",
"gguf_type_v": "q4_0"
}
],
"notes": [
"Service-curated presets for GGUF cache types.",
"Prefer symmetric GGUF K/V pairs by default; asymmetric pairs may reduce or disable GPU offload in upstream llama.cpp."
]
}
},
"load_override": {},
"capabilities": {
"modalities": ["text"],
"multi_turn": true,
"thinking_modes": ["default"]
},
"definition": {
"model_path": "/home/gunnar/models/google_gemma-4-E2B-it-Q8_0/google_gemma-4-E2B-it-Q8_0.gguf",
"backend": "llama_cpp",
"prompt_format": "gemma4_template",
"enabled": true,
"replicas": 3,
"replica_max": 4,
"target_inflight": 1,
"gguf_n_gpu_layers": -1,
"gguf_n_ctx": 4096,
"gguf_flash_attn": "auto",
"gguf_type_k": null,
"gguf_type_v": null
}
}
]
}

Notes:
- configured_enabled reports what the merged config says
- runtime_state reports the live process state
- replicas reports the current effective replica count for the admin row
- definition.replicas reports the configured default replica count
- loaded_replicas reports how many replicas of the public model are currently loaded
- queue_depth is the public-model queue depth inside the scheduler
- runtime_inflight is aggregate inflight work across loaded replicas of the public model
- configured_target_inflight is the configured per-replica inflight target
- effective_target_inflight is the per-replica scheduler target actually in effect after capability clamping
- vram_estimate_mib is an approximate per-model VRAM estimate
- vram_estimate_replica_count is the replica count that the VRAM estimate was measured or derived for
- vram_estimate_source is one of observed_load_delta, model_artifact_size, or unavailable
- capabilities.modalities lists which input modalities the model accepts (["text"] or ["text", "image"]); a UI can use it to decide whether to allow image input for a model
- capabilities.multi_turn reports whether the model accepts a multi-turn messages array on POST /v1/responses; this is true for llama_server, vllm, and vllm_serve models and for supported text-only llama_cpp chat prompt formats (generic, mistral_template, qwen3_template, gemma4_template), but remains false for llama_cpp translategemma_template
- capabilities.thinking_modes lists accepted values for request-level thinking; models without a safe per-request control report only ["default"], while supported vLLM Gemma4/Qwen3, llama_cpp Gemma4, ExLlamaV3 Gemma4/Qwen3, CT2 Qwen3, and configured remote models report ["default", "enabled", "disabled"]
- load_constraints describes backend-specific live-load fields for UI controls
- load_recommendations describes service-curated recommended presets and pairings for UI defaults
- load_override reports the runtime-only override currently active on a loaded model
- definition contains common model fields plus only the fields relevant to the resolved backend
For UI work, load_constraints is the source of truth for which live-load controls should be shown for a model.
Rules:
- if a field is absent from load_constraints, the UI should treat that field as unsupported for that model
- for kind: "integer", the UI should use minimum and step directly for numeric inputs or sliders
- for kind: "enum", the UI should use allowed_values directly for a constrained select or segmented control
- for kind: "string_or_null", the UI should use a text input or a constrained select if the frontend chooses to offer known values
- if a default is present in load_constraints, the UI may use it as the concrete runtime default when both definition and load_override resolve to null
- load_constraints is derived from the resolved backend, not from whether the model is currently loaded or unloaded
- when this document and upstream backend docs differ, the UI should follow the live load_constraints payload returned by the service
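A frontend could turn these rules into a small dispatcher over the constraint kind. The sketch below is illustrative only: control_for and the widget names are hypothetical, and the fallback for kinds the rules do not mention (for example float) is an assumption.

```python
from typing import Optional


def control_for(field: str, load_constraints: dict) -> Optional[dict]:
    """Describe the UI control for one live-load field, or None if unsupported."""
    constraint = load_constraints.get(field)
    if constraint is None:
        # Absent from load_constraints: unsupported for this model.
        return None
    kind = constraint["kind"]
    if kind == "integer":
        return {
            "widget": "number",
            "min": constraint.get("minimum"),
            "max": constraint.get("maximum"),
            "step": constraint.get("step"),
        }
    if kind == "enum":
        return {"widget": "select", "options": list(constraint["allowed_values"])}
    if kind == "string_or_null" and constraint.get("allowed_values"):
        # Offer known values, with None standing for "no explicit value".
        return {"widget": "select", "options": [None, *constraint["allowed_values"]]}
    # Assumed fallback for free-form strings and kinds not covered by the rules.
    return {"widget": "text"}
```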
For UI work, load_recommendations is the source of truth for which presets the service recommends surfacing first.
Rules:
- load_recommendations is optional and additive; it does not replace load_constraints
- fields listed in recommended_pairs must still be validated against load_constraints
- the UI should treat these presets as convenience defaults, not as an exhaustive list of allowed values
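Validating presets against the constraints can be sketched as a filter that keeps only pairs whose every field is present in load_constraints and carries an allowed value. The helper names value_allowed and usable_pairs are hypothetical, and treating any kind ending in _or_null as nullable is an assumption based on the kinds shown in this note.

```python
def value_allowed(constraint: dict, value) -> bool:
    """Check one preset value against one load_constraints entry."""
    if value is None:
        # Assumption: only "*_or_null" kinds accept null.
        return constraint.get("kind", "").endswith("_or_null")
    allowed = constraint.get("allowed_values")
    return allowed is None or value in allowed


def usable_pairs(load_constraints: dict, preset: dict) -> list:
    """Return the recommended pairs that pass load_constraints for every field."""
    usable = []
    for pair in preset["recommended_pairs"]:
        ok = all(
            name in load_constraints
            and name in pair
            and value_allowed(load_constraints[name], pair[name])
            for name in preset["fields"]
        )
        if ok:
            usable.append(pair)
    return usable
```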
For UI state, definition and load_override should be interpreted together:
- definition is the configured value from merged config
- load_override is a runtime-only sparse patch
- the effective loaded value is computed by applying load_override over definition
- key presence in load_override matters, even when the value is null
This means the UI should not use truthiness to merge values.
Correct merge rule:
if key exists in load_override:
    effective_value = load_override[key]
else:
    effective_value = definition[key]
This matters in particular for exllama_cache_quant.
Example:
{
"load_override": {
"exllama_cache_quant": null
},
"definition": {
"exllama_cache_quant": "8,8"
}
}

In this case, the effective loaded value is null.
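The merge rule and the exllama_cache_quant example above can be checked with a few lines of Python. This is a sketch of the rule as stated, with hypothetical function names; the second function shows the truthiness-based merge that the note warns against.

```python
def effective_value(definition: dict, load_override: dict, key: str):
    """Apply the sparse override by key presence, not by truthiness."""
    if key in load_override:
        return load_override[key]
    return definition.get(key)


def truthy_merge(definition: dict, load_override: dict, key: str):
    """Incorrect merge: an explicit null override is silently ignored."""
    return load_override.get(key) or definition.get(key)


definition = {"exllama_cache_quant": "8,8"}
load_override = {"exllama_cache_quant": None}

correct = effective_value(definition, load_override, "exllama_cache_quant")
wrong = truthy_merge(definition, load_override, "exllama_cache_quant")
```

Here correct is None (fp16 cache), while wrong falls back to "8,8" and would misreport the loaded cache setting.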
For llama_cpp GGUF cache fields, the UI may interpret the effective cache type value as:
- field absent in load_override and absent or null in definition: use load_constraints.<field>.default, currently "f16"
- effective value null: use load_constraints.<field>.default, currently "f16"
- effective value "q8_0": q8_0
- effective value "q4_0": q4_0
For ExLlamaV3, the UI may interpret the effective quant value as:
- field absent in load_override and absent or null in definition: fp16
- effective value null: fp16
- effective value "8": k=8, v=8
- effective value "8,4": k=8, v=4
The API does not currently return separate k_bits and v_bits fields.
The UI should parse exllama_cache_quant itself when it wants to display separate K/V values.
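A parser for the two documented string forms might look like the sketch below. parse_cache_quant is a hypothetical helper; it does not range-check the bit widths, which the service is expected to validate on load.

```python
from typing import Optional, Tuple


def parse_cache_quant(value: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (k_bits, v_bits), or None when the value means the fp16 cache."""
    if value is None:
        return None
    parts = value.split(",")
    if len(parts) == 1:
        # "<bits>": the same quantization for K and V.
        bits = int(parts[0])
        return (bits, bits)
    if len(parts) == 2:
        # "<k_bits>,<v_bits>": separate K and V quantization.
        return (int(parts[0]), int(parts[1]))
    raise ValueError(f"unexpected exllama_cache_quant value: {value!r}")
```

Non-numeric input such as "fp16" raises ValueError from int(), so a UI can fall back to showing the raw string.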
llama_cpp GGUF:
{
"gguf_n_ctx": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"gguf_flash_attn": {
"kind": "enum",
"default": "auto",
"allowed_values": ["on", "off", "auto"],
"examples": ["auto", "on", "off"]
},
"gguf_type_k": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
},
"gguf_type_v": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
}
}

GGUF recommended presets:
{
"gguf_cache_type_pairs": {
"kind": "pair_presets",
"fields": ["gguf_type_k", "gguf_type_v"],
"recommended_pairs": [
{
"label": "f16/f16",
"gguf_type_k": "f16",
"gguf_type_v": "f16"
},
{
"label": "q8_0/q8_0",
"gguf_type_k": "q8_0",
"gguf_type_v": "q8_0"
},
{
"label": "q4_0/q4_0",
"gguf_type_k": "q4_0",
"gguf_type_v": "q4_0"
}
]
}
}

ExLlamaV3:
{
"exllama_cache_size": {
"kind": "integer",
"minimum": 256,
"step": 256
},
"exllama_max_rq_tokens": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"exllama_cache_k_bits": {
"kind": "integer_or_null",
"minimum": 2,
"maximum": 8,
"default": null,
"null_means": "fp16",
"allowed_values": [2, 3, 4, 5, 6, 7, 8]
},
"exllama_cache_v_bits": {
"kind": "integer_or_null",
"minimum": 2,
"maximum": 8,
"default": null,
"null_means": "fp16",
"allowed_values": [2, 3, 4, 5, 6, 7, 8]
},
"exllama_cache_quant": {
"kind": "string_or_null",
"format": "<bits>|<k_bits>,<v_bits>"
}
}

vLLM and vLLM Serve:
{
"vllm_max_model_len": {
"kind": "integer",
"minimum": 256,
"step": 256
},
"vllm_kv_cache_dtype": {
"kind": "enum",
"default": "auto",
"allowed_values": ["auto", "fp8", "fp8_e4m3", "fp8_e5m2"],
"examples": ["auto", "fp8"]
},
"vllm_kv_cache_memory_bytes": {
"kind": "integer",
"minimum": 268435456,
"step": 268435456,
"unit": "bytes",
"display_unit": "mib"
},
"vllm_max_pixels": {
"kind": "integer",
"minimum": 200704,
"step": 200704,
"unit": "pixels"
},
"vllm_speculative_method": {
"kind": "string_or_null",
"format": "vllm_speculative_method",
"default": null,
"examples": ["mtp", "draft_model", "mlp_speculator"]
},
"vllm_speculative_model": {
"kind": "string_or_null",
"format": "hf_id_or_local_path",
"default": null,
"examples": ["google/gemma-4-26B-A4B-it-assistant"]
},
"vllm_num_speculative_tokens": {
"kind": "integer",
"minimum": 1,
"step": 1,
"default": 1
}
}

llama-server:
{
"llama_server_n_ctx": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"llama_server_image_max_tokens": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"llama_server_spec_type": {
"kind": "enum",
"default": "draft-mtp",
"allowed_values": ["draft-mtp"],
"examples": ["draft-mtp"]
},
"llama_server_spec_draft_n_max": {
"kind": "integer",
"minimum": 1,
"maximum": 6,
"step": 1,
"default": 2
},
"llama_server_spec_draft_p_min": {
"kind": "float",
"minimum": 0.0,
"maximum": 1.0,
"default": 0.0
}
}

ExLlamaV3 recommended presets:
{
"exllama_cache_bit_pairs": {
"kind": "pair_presets",
"fields": ["exllama_cache_k_bits", "exllama_cache_v_bits"],
"recommended_pairs": [
{
"label": "fp16",
"exllama_cache_k_bits": null,
"exllama_cache_v_bits": null
},
{
"label": "8/8",
"exllama_cache_k_bits": 8,
"exllama_cache_v_bits": 8
},
{
"label": "8/4",
"exllama_cache_k_bits": 8,
"exllama_cache_v_bits": 4
}
]
}
}

CT2, openai_remote, and stub:
{}

Returns current GPU memory usage (from nvidia-smi) and per-model VRAM estimates.
Suggested response shape:
{
"gpus": [
{
"index": 0,
"name": "NVIDIA RTX PRO 6000 Blackwell Workstation Edition",
"used_mib": 75603,
"total_mib": 97887,
"used_over_total": "75603MiB / 97887MiB"
}
],
"models": [
{
"name": "google_gemma-4-E2B-it-Q8_0-gguf",
"runtime_state": "loaded",
"is_loaded": true,
"configured_target_inflight": 1,
"effective_target_inflight": 1,
"vram_estimate_mib": 12500,
"vram_estimate_replica_count": 3,
"vram_estimate_source": "model_artifact_size"
},
{
"name": "mistral-small-3.2-24b-instruct-2506-gguf",
"runtime_state": "unloaded",
"is_loaded": false,
"configured_target_inflight": 1,
"effective_target_inflight": 1,
"vram_estimate_mib": 16800,
"vram_estimate_replica_count": 1,
"vram_estimate_source": "model_artifact_size"
}
],
"error": null
}

Notes:
- used_over_total matches the compact view you typically read from nvidia-smi
- vram_estimate_mib for unloaded models is still an estimate, not a reservation
- vram_estimate_replica_count tells the caller how many replicas that estimate corresponds to
- if nvidia-smi is unavailable, gpus can be empty and error will explain why
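The gpus array could be produced by querying nvidia-smi in CSV mode and parsing the rows. This is a sketch under assumptions: the note does not say how the service invokes nvidia-smi, parse_gpu_rows and read_gpus are hypothetical names, and rows where the driver reports a non-numeric memory value are not handled.

```python
import subprocess

# Standard nvidia-smi query flags; nounits yields plain MiB integers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> list:
    """Turn 'index, name, used, total' CSV rows into gpus entries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        # Split from the right so a comma inside the GPU name is tolerated.
        name, used, total = rest.rsplit(",", 2)
        used_mib, total_mib = int(used), int(total)
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": used_mib,
            "total_mib": total_mib,
            "used_over_total": f"{used_mib}MiB / {total_mib}MiB",
        })
    return gpus


def read_gpus() -> dict:
    """Return gpus plus error; gpus is empty when nvidia-smi is unavailable."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return {"gpus": [], "error": str(exc)}
    return {"gpus": parse_gpu_rows(result.stdout), "error": None}
```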
Loads one model that already exists in merged config.
Rules:
- 404 if model_name is unknown
- 200 if the model is already loaded or loading
- transition unloaded -> loading -> loaded
- transition failed -> loading -> loaded
- if load fails, transition to failed and retain last_error
- an optional request body may provide replicas for this load, but only while the model is unloaded or failed
- an optional request body may provide temporary backend-specific load overrides for this one live load
Supported load override fields:
- public model: replicas
- llama_cpp: gguf_n_ctx, gguf_flash_attn, gguf_type_k, gguf_type_v
- ExLlamaV3: exllama_cache_size, exllama_cache_quant, exllama_cache_k_bits, exllama_cache_v_bits, exllama_max_rq_tokens
- vLLM and vLLM Serve: vllm_max_model_len, vllm_kv_cache_dtype, vllm_kv_cache_memory_bytes, vllm_max_pixels, vllm_speculative_method, vllm_speculative_model, vllm_num_speculative_tokens
- llama-server: llama_server_n_ctx, llama_server_image_max_tokens, llama_server_spec_type, llama_server_spec_draft_n_max, llama_server_spec_draft_p_min
Example load bodies:
{
"replicas": 3
}

{
"gguf_n_ctx": 8192
}

{
"gguf_n_ctx": 16384
}

{
"gguf_n_ctx": 32768
}

{
"gguf_n_ctx": 32768,
"gguf_flash_attn": "auto",
"gguf_type_k": "q8_0",
"gguf_type_v": "q4_0"
}

{
"exllama_cache_size": 32768,
"exllama_cache_quant": null,
"exllama_max_rq_tokens": 32768
}

{
"exllama_cache_size": 32768,
"exllama_cache_quant": "8,8",
"exllama_max_rq_tokens": 32768
}

{
"exllama_cache_size": 32768,
"exllama_cache_quant": "8,4",
"exllama_max_rq_tokens": 32768
}

{
"exllama_cache_size": 32768,
"exllama_cache_k_bits": 8,
"exllama_cache_v_bits": 4,
"exllama_max_rq_tokens": 32768
}

{
"vllm_max_model_len": 16384,
"vllm_kv_cache_dtype": "fp8",
"vllm_kv_cache_memory_bytes": 2147483648,
"vllm_max_pixels": 4014080,
"vllm_speculative_method": "mtp",
"vllm_speculative_model": "google/gemma-4-26B-A4B-it-assistant",
"vllm_num_speculative_tokens": 1
}

{
"llama_server_n_ctx": 4096,
"llama_server_image_max_tokens": 512,
"llama_server_spec_type": "draft-mtp",
"llama_server_spec_draft_n_max": 4,
"llama_server_spec_draft_p_min": 0.25
}

vLLM and vLLM Serve load override notes:
- vllm_max_model_len is the per-load context length.
- vllm_kv_cache_dtype quantizes the KV cache; allowed UI values are auto, fp8, fp8_e4m3, fp8_e5m2. The service accepts any dtype string vLLM supports.
- vllm_kv_cache_memory_bytes sets an absolute KV cache size in bytes. It is machine-independent and overrides vllm_gpu_memory_utilization for KV sizing. The load_constraints entry carries unit: "bytes" and display_unit: "mib" so the UI can present it in MiB.
- Prefer keeping configured vllm_gpu_memory_utilization very low and controlling load-time cache budget with vllm_kv_cache_memory_bytes, otherwise vLLM may reserve most free VRAM.
- vllm_max_pixels caps the vision-token budget per image for vision-language models; it is merged into the model's vllm_mm_processor_kwargs as max_pixels.
- vllm_speculative_method selects the vLLM speculative path for this load, for example mtp, draft_model, or mlp_speculator. null disables the configured speculative path for that load.
- vllm_speculative_model is the assistant/draft/speculator checkpoint or local path passed through vLLM's speculative_config.model. For Gemma 4 MTP this is the Gemma 4 assistant checkpoint, not a generic smaller draft model.
- vllm_num_speculative_tokens maps to vLLM speculative_config.num_speculative_tokens.
- For vllm_serve, target model id/path, binary path, library path, environment, host, port, API key, and extra CLI args are configured in the model definition, not overridden through the admin load body in v1.
- Loading a vllm_serve model starts a local vllm serve subprocess. Unloading terminates that subprocess, so VRAM is released by the server process rather than by Python object cleanup alone.
llama-server load override notes:
- llama_server_n_ctx maps to the native llama-server -c / --ctx-size flag for this load.
- llama_server_image_max_tokens maps to native --image-max-tokens and controls the per-image vision token budget.
- llama_server_spec_type currently accepts only "draft-mtp" or null.
- llama_server_spec_draft_n_max maps to native --spec-draft-n-max; v1 constrains it to 1..6.
- llama_server_spec_draft_p_min maps to native --spec-draft-p-min; v1 constrains it to 0.0..1.0.
- Model path, binary path, library path, mmproj, draft model path, GPU layers, flash attention, reasoning, host, port, API key, and extra native args are configured in the model definition, not overridden through the admin load body in v1.
- Loading a llama_server model starts a local llama-server subprocess. Unloading terminates that subprocess, so VRAM is released by the native server process rather than by Python object cleanup alone.
exllama_cache_quant format:
- omitted or null: fp16 KV cache
- "<bits>": same quantization for K and V, for example "8"
- "<k_bits>,<v_bits>": separate K/V quantization, for example "8,4"
exllama_cache_k_bits and exllama_cache_v_bits format:
- both omitted: do not override the current configured value
- both null: reset to fp16 KV cache
- both integers from 2 through 8: override K and V separately
- they must be provided together
- they cannot be combined with exllama_cache_quant in the same load request
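The K/V bit rules above can be expressed as one validation function over the load body. This is a sketch of the stated rules, not the service's validator; check_cache_bits is a hypothetical name, and the returned strings are illustrative messages for the 400 invalid_load_request case.

```python
from typing import Optional

_MISSING = object()  # distinguishes an omitted key from an explicit null


def check_cache_bits(body: dict) -> Optional[str]:
    """Return an error message when the K/V bit fields break the rules, else None."""
    k = body.get("exllama_cache_k_bits", _MISSING)
    v = body.get("exllama_cache_v_bits", _MISSING)
    if k is _MISSING and v is _MISSING:
        return None  # both omitted: keep the configured value
    if k is _MISSING or v is _MISSING:
        return "exllama_cache_k_bits and exllama_cache_v_bits must be provided together"
    if "exllama_cache_quant" in body:
        return "exllama_cache_quant cannot be combined with k_bits/v_bits"
    if k is None and v is None:
        return None  # both null: reset to fp16 KV cache
    for bits in (k, v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(bits, bool) or not isinstance(bits, int) or not 2 <= bits <= 8:
            return "k_bits and v_bits must both be null or integers from 2 through 8"
    return None
```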
gguf_type_k and gguf_type_v format:
- omitted or null: use the runtime default cache type
- "<ggml_type_name>": a GGML cache type name, for example "f16", "q8_0", or "q4_0"
gguf_flash_attn format:
- omitted: do not override the current configured value
- "on": force Flash Attention on
- "off": force Flash Attention off
- "auto": use the runtime auto mode
Upstream references for these backend-specific value sets:
- llama_cpp GGUF cache allowed_values and default f16 are based on the official llama.cpp server docs: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
- ExLlamaV3 k_bits / v_bits allowed range 2..8 is based on the official ExLlamaV3 README, which documents 2-8 bit cache quantization: https://github.com/turboderp-org/exllamav3
- The conservative GGUF preset guidance above is informed by upstream llama.cpp reports that asymmetric K/V pairs can disable GPU offload in some setups: ggml-org/llama.cpp#20866
These overrides are runtime-only:
- they do not modify settings.json
- they do not modify local.json
- they should be surfaced separately from the configured definition in admin responses
- replicas in the load request does not modify definition.replicas; it only selects the replica count for that one live load
Suggested response shape:
{
"name": "google_gemma-4-E2B-it-Q8_0-gguf",
"resolved_backend": "llama_cpp",
"configured_enabled": false,
"runtime_state": "loaded",
"is_loaded": true,
"replicas": 3,
"replica_max": 4,
"loaded_replicas": 3,
"inflight_requests": 0,
"queue_depth": 0,
"runtime_inflight": 0,
"configured_target_inflight": 1,
"effective_target_inflight": 1,
"last_error": null,
"vram_estimate_mib": 12340,
"vram_estimate_replica_count": 3,
"vram_estimate_source": "observed_load_delta",
"load_constraints": {
"gguf_n_ctx": {
"kind": "integer",
"minimum": 1,
"step": 1
},
"gguf_type_k": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
},
"gguf_type_v": {
"kind": "string_or_null",
"format": "ggml_type_name",
"default": "f16",
"allowed_values": ["f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"],
"examples": ["f16", "q8_0", "q4_0"]
}
},
"load_recommendations": {
"gguf_cache_type_pairs": {
"kind": "pair_presets",
"fields": ["gguf_type_k", "gguf_type_v"],
"recommended_pairs": [
{
"label": "f16/f16",
"gguf_type_k": "f16",
"gguf_type_v": "f16"
},
{
"label": "q8_0/q8_0",
"gguf_type_k": "q8_0",
"gguf_type_v": "q8_0"
},
{
"label": "q4_0/q4_0",
"gguf_type_k": "q4_0",
"gguf_type_v": "q4_0"
}
]
}
},
"load_override": {
"gguf_n_ctx": 32768,
"gguf_flash_attn": "auto",
"gguf_type_k": "q8_0",
"gguf_type_v": "q4_0"
},
"definition": {
"model_path": "/home/gunnar/models/google_gemma-4-E2B-it-Q8_0/google_gemma-4-E2B-it-Q8_0.gguf",
"backend": "llama_cpp",
"prompt_format": "gemma4_template",
"enabled": false,
"replicas": 3,
"replica_max": 4,
"target_inflight": 1,
"gguf_n_gpu_layers": -1,
"gguf_n_ctx": 4096,
"gguf_flash_attn": "auto",
"gguf_type_k": null,
"gguf_type_v": null
}
}

Notes:
- loading is allowed for configured models even when configured_enabled is false
- after a successful load, vram_estimate_source may switch to observed_load_delta if a GPU delta could be measured during load
- when replicas is provided, the model load is aggregate and all-or-nothing for that selected replica count
- replicas and load overrides may only be changed while the model is unloaded or failed
Validation behavior:
- 422 means the request body failed schema validation before runtime logic ran
  Examples:
  - gguf_n_ctx: 0
  - exllama_cache_size: 0
  - exllama_max_rq_tokens: 0
- 400 with code: "invalid_load_request" means the body was structurally valid, but the values were invalid for the resolved backend or runtime rules
  Examples:
  - gguf_type_k: "q8-0"
  - gguf_type_k: "foo"
  - sending only exllama_cache_k_bits without exllama_cache_v_bits
  - combining exllama_cache_quant with exllama_cache_k_bits / exllama_cache_v_bits
  - exllama_cache_size: 8000
  - exllama_cache_quant: "fp16"
  - llama_server_spec_type: "medusa"
  - llama_server_spec_draft_n_max: 7
  - llama_server_spec_draft_p_min: 1.5
  - sending ExLlamaV3-only fields to a llama_cpp model
  - sending llama-server-only fields to a vLLM model
  - sending load overrides while the model is already loaded, without unloading it first
- 409 still applies for runtime state conflicts such as loading or unloading transitions
  Examples:
  - loading a model while it is already unloading
Gracefully unloads one currently loaded model.
Rules:
- 404 if model_name is unknown
- 200 if already unloaded
- 200 if already unloading
- transition loaded -> unloading -> unloaded
- unload stops all loaded replicas of the public model
- new inference requests are rejected once unloading starts
- in-flight requests are allowed to finish before resources are released
Suggested response shape:
{
"name": "google_gemma-4-E2B-it-Q8_0-gguf",
"resolved_backend": "llama_cpp",
"configured_enabled": false,
"runtime_state": "unloaded",
"is_loaded": false,
"replicas": 3,
"replica_max": 4,
"loaded_replicas": 0,
"inflight_requests": 0,
"queue_depth": 0,
"runtime_inflight": 0,
"configured_target_inflight": 1,
"effective_target_inflight": 1,
"last_error": null,
"vram_estimate_mib": 12340,
"vram_estimate_replica_count": 4,
"vram_estimate_source": "observed_load_delta",
"definition": {
"model_path": "/home/gunnar/models/google_gemma-4-E2B-it-Q8_0/google_gemma-4-E2B-it-Q8_0.gguf",
"backend": "llama_cpp",
"device": "cuda",
"prompt_format": "gemma4_template",
"enabled": false,
"replicas": 3,
"replica_max": 4,
"target_inflight": 1
}
}

Unload must be graceful in v1.
That means:
- mark the model as unloading
- reject new inference requests for that model
- wait for in-flight requests to finish
- release runtime references and backend resources
- mark the model as unloaded
The service should track inflight_requests per model so unload can wait safely.
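The graceful unload sequence can be sketched with a per-model in-flight counter guarded by a condition variable. This is a minimal illustration under assumptions, not the service's code: ModelSlot and its methods are hypothetical, and release stands in for whatever frees the backend runtime.

```python
import threading
from typing import Callable


class ModelSlot:
    """Tracks runtime state and in-flight requests for one model."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self.state = "loaded"
        self.inflight_requests = 0

    def try_begin_request(self) -> bool:
        """Admit a request only while the model is loaded."""
        with self._cond:
            if self.state != "loaded":
                return False
            self.inflight_requests += 1
            return True

    def end_request(self) -> None:
        """Mark one request finished and wake a waiting unload."""
        with self._cond:
            self.inflight_requests -= 1
            self._cond.notify_all()

    def unload(self, release: Callable[[], None]) -> None:
        """Stop admissions, drain in-flight work, then release resources."""
        with self._cond:
            if self.state != "loaded":
                return  # already unloading or unloaded: nothing to do
            self.state = "unloading"
            # wait_for releases the lock while waiting, so requests can finish.
            self._cond.wait_for(lambda: self.inflight_requests == 0)
        release()
        with self._cond:
            self.state = "unloaded"
```

A request handler would call try_begin_request before inference and end_request in a finally block, so the counter stays accurate even when a request fails.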
This admin API should stay compatible with a future external scheduler.
The intended split is:
- scheduler owns external pending queues
- runtime owns backend execution state
That implies the following unload behavior:
- queued but not yet submitted requests: cancel
- already submitted or actively running requests: let them drain in v1
So even after a scheduler exists, unload should not mean "kill active GPU work immediately".
It should mean:
- stop new admissions
- cancel scheduler-owned queued work
- drain already submitted runtime work
- then unload the model
For v1, "successful unload" should mean:
- no runtime remains registered for the model
- no new requests can reach that runtime
- no in-flight requests remain
- backend-owned objects are dereferenced
- memory becomes reusable for later loads
The implementation should not promise that every allocator reports zero immediately after unload.
The guarantee is functional reuse, not cosmetic memory counters.
This API is intended to support a UI, so the implementation should expose:
- stable response models
- OpenAPI descriptions on every admin endpoint
- clear descriptions of runtime states
- clear error codes for rejected inference requests
The UI should be able to render:
- configured vs loaded state
- current lifecycle state
- last load error
This v1 note does not define:
- writes back to config files
- force unload
- active GPU job cancellation
- retry policy for failed model loads