
Commit 202a784

feat: Add local model runner support (ollama, vllm, llama.cpp) (#5)
* feat: Add local model runner support (ollama, vllm, llama.cpp)

  Adds infrastructure for spawning and managing local LLM runners:

  - **fetch.rs**: Generic file fetching for local paths and remote URLs, with caching support and huggingface:// URL translation
  - **ollama.rs**: Modelfile parsing and generation, with parameter merging for runtime configuration
  - **vllm.rs**: vLLM CLI argument generation for OpenAI-compatible server deployment
  - **llamacpp.rs**: llama.cpp server CLI argument generation with parameter aliasing
  - **runner.rs**: RunnerManager for spawning/stopping local model processes with graceful shutdown
  - **ModelConfig**: Unified model configuration with a runner field (external, ollama, vllm, llama-cpp), replacing the old ModelDefinition enum (backward compatible)
  - **context**: Architecture nodes can specify a deployment context override for remote deployment of local runners

  Includes test fixture: tinyllama.Modelfile for Modelfile testing.

* Add runner management endpoints and remote deployment tests

  - Add /v1/runners endpoints for spawn, list, and stop operations
  - Extend AppState to include an optional SharedRunnerManager
  - Add with_runner_manager() builder for worker mode
  - Create integration tests simulating control-plane-to-worker dispatch
  - Tests verify multi-instance coordination for remote deployment

  This enables the control plane to dispatch spawn commands to workers representing remote clusters, as discussed in the context override design.

* Add Docker runner support and improve vLLM integration

  Docker support:
  - Add docker.rs module with DockerConfig and RegistryConfig structs
  - Support image-based and Dockerfile-based container deployment
  - Handle custom registries with authentication
  - Map model parameters to container environment variables
  - Support GPU passthrough, volumes, network modes, and IPC settings
  - Add extra_args for vLLM-specific flags (--swap-space, --tool-call-parser)

  vLLM improvements:
  - Add is_vllm_installed() and is_cuda_available() checks
  - Add HuggingFace token detection from the HF_TOKEN env var
  - Detect gated models requiring authentication
  - Generate proper environment variables for the vLLM process
  - Add warnings for missing CUDA or HF token

  Example Docker config matching the user's DGX setup:

      {
        "runner": "docker",
        "source": "RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4",
        "docker": {
          "image": "dgx-vllm:cutlass-nvfp4",
          "network": "host",
          "gpus": "all",
          "ipc": "host",
          "volumes": ["${HOME}/.cache/huggingface:/root/.cache/huggingface"],
          "extra_args": "--swap-space 32 --tool-call-parser hermes"
        },
        "parameters": {
          "tensor_parallel_size": 1,
          "max_model_len": 131072
        }
      }

* Refactor extra_args to structured map for cleaner composability

  - Change extra_args from a string to HashMap<String, Value>
  - Add extra_args_to_string() to convert the map to CLI format at runtime
  - Support bool flags (true = include, false = omit), numbers, strings, and arrays
  - Arrays repeat the flag for each value (useful for --stop tokens)

  Example:

      "extra_args": {
        "swap-space": 32,
        "tool-call-parser": "hermes",
        "enable-auto-tool-choice": true
      }

  Becomes:

      --swap-space 32 --tool-call-parser hermes --enable-auto-tool-choice

  This cleans up the rough edges of CLI tools and makes composition files more readable and programmatically manipulable.
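The conversion rules in the last commit are specified precisely enough to sketch. The following is an illustrative Rust reimplementation of the described behavior, not the commit's actual extra_args_to_string(); it additionally sorts keys for deterministic output, which the commit does not promise:

```rust
use serde_json::Value;
use std::collections::HashMap;

/// Sketch of the map-to-CLI conversion described above:
/// bool true => bare flag, bool false => omitted entirely,
/// arrays repeat the flag once per element, scalars append their value.
fn extra_args_to_string(extra: &HashMap<String, Value>) -> String {
    // Sort keys so output is deterministic (an assumption of this sketch).
    let mut entries: Vec<(&String, &Value)> = extra.iter().collect();
    entries.sort_by(|a, b| a.0.cmp(b.0));

    let mut parts: Vec<String> = Vec::new();
    for (key, value) in entries {
        match value {
            Value::Bool(true) => parts.push(format!("--{key}")),
            Value::Bool(false) => {} // false means: omit the flag
            Value::Array(items) => {
                // Repeat the flag for each value, e.g. multiple --stop tokens.
                for item in items {
                    parts.push(format!("--{key} {}", render(item)));
                }
            }
            other => parts.push(format!("--{key} {}", render(other))),
        }
    }
    parts.join(" ")
}

/// Render a scalar without JSON quoting around strings.
fn render(v: &Value) -> String {
    match v {
        Value::String(s) => s.clone(),
        other => other.to_string(),
    }
}
```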
1 parent c9c078e commit 202a784

17 files changed

Lines changed: 3352 additions & 145 deletions

Cargo.lock

Lines changed: 12 additions & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 0 deletions
```diff
@@ -50,6 +50,9 @@ serde_yaml = "0.9"
 # Home directory detection
 dirs = "5"
 
+# Hashing for cache keys
+sha2 = "0.10"
+
 # System hostname
 hostname = "0.4"
```
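The new sha2 dependency backs the cache keys used by fetch.rs. A minimal sketch of how a model source might be hashed into a filename-safe key, assuming sha2 0.10's Digest API; the function name cache_key is hypothetical, since the commit only states that sha2 is used for hashing cache keys:

```rust
use sha2::{Digest, Sha256};

/// Hypothetical helper (not in the commit): derive a stable cache key
/// from a model source URL by hex-encoding its SHA-256 digest.
fn cache_key(source_url: &str) -> String {
    let digest = Sha256::digest(source_url.as_bytes());
    digest.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    // Example input uses the huggingface:// scheme mentioned for fetch.rs.
    let key = cache_key("huggingface://TinyLlama/TinyLlama-1.1B-Chat-v1.0");
    println!("{key}");
}
```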

example-composition.json

Lines changed: 0 additions & 108 deletions
This file was deleted.

src/config/architecture.rs

Lines changed: 20 additions & 0 deletions
```diff
@@ -35,6 +35,11 @@ pub struct ArchitectureNode {
     /// WebSocket URL for "ws" adapter
     pub url: Option<String>,
 
+    /// Deployment context override (default: localhost)
+    /// Use to deploy local runners to remote nodes
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub context: Option<String>,
+
     #[serde(rename = "extra-options", default)]
     pub extra_options: HashMap<String, serde_json::Value>,
 }
@@ -163,8 +168,23 @@
             use_case: None,
             condition: None,
             url: None,
+            context: None,
             extra_options: HashMap::new(),
         };
         assert_eq!(node.effective_bind_addr(), "0.0.0.0");
     }
+
+    #[test]
+    fn test_parse_node_with_context() {
+        let json = r#"{
+            "name": "remote-model",
+            "layer": 1,
+            "model": "llama-vllm",
+            "adapter": "openai-api",
+            "context": "gpu-cluster"
+        }"#;
+
+        let node: ArchitectureNode = serde_json::from_str(json).unwrap();
+        assert_eq!(node.context, Some("gpu-cluster".to_string()));
+    }
 }
```
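A brief sketch of how a caller might consume the new field, following the doc comment's stated localhost default; effective_context is a hypothetical helper, not part of this commit:

```rust
// Hypothetical helper (not in the commit): resolve the deployment context
// for a node, falling back to the documented default of localhost.
fn effective_context(node: &ArchitectureNode) -> &str {
    node.context.as_deref().unwrap_or("localhost")
}
```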
