@tbraun96 (Contributor)

Summary

Adds infrastructure for spawning and managing local LLM runners with a unified model configuration system.

New Modules

  • fetch.rs: Generic file fetching for local paths and remote URLs with SHA256-based caching and huggingface:// URL support (source classification sketched after this list)
  • ollama.rs: Modelfile parsing/generation with parameter merging for runtime configuration, using an SBIO pattern (pure parsing, I/O wrappers)
  • vllm.rs: vLLM CLI argument generation for OpenAI-compatible deployment
  • llamacpp.rs: llama.cpp server CLI arg generation with parameter aliasing
  • runner.rs: RunnerManager for spawning/stopping local processes with graceful shutdown
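
As a rough illustration of the source handling in fetch.rs, here is a minimal sketch of the local/remote/huggingface classification; the SourceKind enum and classify_source names are hypothetical, but the three categories come from this PR.

// Hypothetical sketch of fetch.rs source classification (names are illustrative).
#[derive(Debug, PartialEq)]
enum SourceKind {
    Local,       // plain filesystem path
    Remote,      // http:// or https:// URL
    HuggingFace, // huggingface:// reference resolved via the hub
}

fn classify_source(source: &str) -> SourceKind {
    if source.starts_with("huggingface://") {
        SourceKind::HuggingFace
    } else if source.starts_with("http://") || source.starts_with("https://") {
        SourceKind::Remote
    } else {
        SourceKind::Local
    }
}

fn main() {
    assert_eq!(classify_source("huggingface://org/repo"), SourceKind::HuggingFace);
    assert_eq!(classify_source("./models/tinyllama.gguf"), SourceKind::Local);
}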

Model Configuration Changes

New unified ModelConfig replaces the old ModelDefinition enum (backward compatible):

{
  "models": {
    "my-model": {
      "runner": "ollama",
      "interface": "openai-api",
      "source": "tinyllama:1.1b",
      "parameters": { "temperature": 0.7 }
    }
  }
}

Runner types: external, ollama, vllm, llama-cpp, docker
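
For orientation, a minimal serde sketch of how such a config might be modeled in Rust; field names mirror the JSON above, while the types and defaults are assumptions rather than the actual ModelConfig definition.

use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_json::Value;

// Sketch only: one plausible serde model of the unified config shown above.
#[derive(Debug, Serialize, Deserialize)]
struct ModelConfig {
    runner: RunnerType,
    interface: String,                  // e.g. "openai-api"
    source: String,                     // tag, path, URL, or huggingface:// reference
    #[serde(default)]
    parameters: HashMap<String, Value>, // e.g. { "temperature": 0.7 }
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "kebab-case")]    // "external", "vllm", "llama-cpp", ...
enum RunnerType {
    External,
    Ollama,
    Vllm,
    LlamaCpp,
    Docker,
}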

Architecture Context

Nodes can specify a context for remote deployment of local runners:

{
  "name": "gpu-model",
  "layer": 1,
  "adapter": "openai-api",
  "context": "gpu-cluster"
}

Test plan

  • All 228 unit tests pass
  • Modelfile parsing/generation roundtrip tests
  • CLI argument generation tests for vllm/llamacpp
  • Path classification tests (local/remote/huggingface)
  • Test fixture: tests/fixtures/tinyllama.Modelfile

Runner API Endpoints

  • Add /v1/runners endpoints for spawn, list, and stop operations
  • Extend AppState to include an optional SharedRunnerManager
  • Add a with_runner_manager() builder for worker mode
  • Create integration tests simulating control-plane-to-worker dispatch
  • Verify multi-instance coordination for remote deployment

This enables the control plane to dispatch spawn commands to workers representing remote clusters, as discussed in the context override design.
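
A minimal sketch of the state wiring this describes, assuming SharedRunnerManager is an Arc-wrapped handle (the real definition lives in runner.rs and may differ):

use std::sync::Arc;
use tokio::sync::Mutex;

// Placeholder for the RunnerManager implemented in runner.rs.
struct RunnerManager;

// The shared-handle name comes from this PR; Arc<Mutex<..>> is an assumption.
type SharedRunnerManager = Arc<Mutex<RunnerManager>>;

#[derive(Default)]
struct AppState {
    // Optional so control-plane instances can run without a local manager.
    runner_manager: Option<SharedRunnerManager>,
}

impl AppState {
    // Builder used in worker mode; the /v1/runners handlers check for Some(..)
    // and reject (or forward) spawn requests when no manager is attached.
    fn with_runner_manager(mut self, manager: SharedRunnerManager) -> Self {
        self.runner_manager = Some(manager);
        self
    }
}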

Docker Support

  • Add docker.rs module with DockerConfig and RegistryConfig structs
  • Support image-based and Dockerfile-based container deployment
  • Handle custom registries with authentication
  • Map model parameters to container environment variables
  • Support GPU passthrough, volumes, network modes, and IPC settings
  • Add extra_args for vLLM-specific flags (--swap-space, --tool-call-parser)

vLLM Improvements

  • Add is_vllm_installed() and is_cuda_available() checks
  • Add HuggingFace token detection from the HF_TOKEN env var
  • Detect gated models requiring authentication
  • Generate proper environment variables for the vLLM process
  • Add warnings for missing CUDA or HF token
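
A hedged sketch of those preflight checks; the detection strategies below (probing the Python module and nvidia-smi) are assumptions, not necessarily what vllm.rs does.

use std::env;
use std::process::Command;

fn is_vllm_installed() -> bool {
    // Sketch: treat "python3 -c 'import vllm'" succeeding as installed.
    Command::new("python3")
        .args(["-c", "import vllm"])
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn is_cuda_available() -> bool {
    // Sketch: treat a successful nvidia-smi run as "CUDA present".
    Command::new("nvidia-smi")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false)
}

fn hf_token() -> Option<String> {
    env::var("HF_TOKEN").ok().filter(|t| !t.is_empty())
}

fn preflight_warnings(model_is_gated: bool) -> Vec<String> {
    let mut warnings = Vec::new();
    if !is_cuda_available() {
        warnings.push("CUDA not detected; vLLM will likely fail to start".into());
    }
    if model_is_gated && hf_token().is_none() {
        warnings.push("model appears to be gated but HF_TOKEN is not set".into());
    }
    warnings
}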

Example Docker config matching a user's DGX setup:
{
  "runner": "docker",
  "source": "RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4",
  "docker": {
    "image": "dgx-vllm:cutlass-nvfp4",
    "network": "host",
    "gpus": "all",
    "ipc": "host",
    "volumes": ["${HOME}/.cache/huggingface:/root/.cache/huggingface"],
    "extra_args": "--swap-space 32 --tool-call-parser hermes"
  },
  "parameters": {
    "tensor_parallel_size": 1,
    "max_model_len": 131072
  }
}
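
As a sketch of how docker.rs might translate that configuration into docker run arguments (flag mapping mirrors the fields above; the --rm/-d flags are assumptions, and registry auth plus the container command with extra_args are omitted):

// Sketch only: field names mirror the JSON config; everything else is assumed.
struct DockerConfig {
    image: String,
    network: Option<String>,
    gpus: Option<String>,
    ipc: Option<String>,
    volumes: Vec<String>,
}

fn docker_run_args(cfg: &DockerConfig, env: &[(String, String)]) -> Vec<String> {
    let mut args = vec!["run".to_string(), "--rm".to_string(), "-d".to_string()];
    if let Some(net) = &cfg.network {
        args.extend(["--network".into(), net.clone()]);
    }
    if let Some(gpus) = &cfg.gpus {
        args.extend(["--gpus".into(), gpus.clone()]);
    }
    if let Some(ipc) = &cfg.ipc {
        args.extend(["--ipc".into(), ipc.clone()]);
    }
    for vol in &cfg.volumes {
        args.extend(["-v".into(), vol.clone()]);
    }
    // Model parameters become environment variables inside the container.
    for (key, value) in env {
        args.extend(["-e".into(), format!("{key}={value}")]);
    }
    args.push(cfg.image.clone());
    args
}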

extra_args Changes

  • Change extra_args from a string to HashMap<String, Value>
  • Add extra_args_to_string() to convert the map to CLI format at runtime
  • Support bool flags (true = include, false = omit), numbers, strings, and arrays
  • Arrays repeat the flag for each value (useful for --stop tokens)

Example:
  "extra_args": {
    "swap-space": 32,
    "tool-call-parser": "hermes",
    "enable-auto-tool-choice": true
  }

Becomes: "--swap-space 32 --tool-call-parser hermes --enable-auto-tool-choice"
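
For reference, a sketch of those conversion rules; the actual extra_args_to_string() may differ in details such as quoting, and HashMap iteration means flag order is unspecified.

use std::collections::HashMap;
use serde_json::Value;

fn extra_args_to_string(extra: &HashMap<String, Value>) -> String {
    let mut parts: Vec<String> = Vec::new();
    for (key, value) in extra {
        let flag = format!("--{key}");
        match value {
            // true => bare flag, false (or null) => omit entirely
            Value::Bool(true) => parts.push(flag),
            Value::Bool(false) | Value::Null => {}
            // arrays repeat the flag once per element (useful for --stop tokens)
            Value::Array(items) => {
                for item in items {
                    parts.push(format!("{flag} {}", scalar(item)));
                }
            }
            other => parts.push(format!("{flag} {}", scalar(other))),
        }
    }
    parts.join(" ")
}

fn scalar(v: &Value) -> String {
    match v {
        Value::String(s) => s.clone(), // bare string, not Value's quoted JSON form
        other => other.to_string(),
    }
}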

This cleans up the rough edges of CLI tools and makes composition files more readable and easier to manipulate programmatically.
tbraun96 merged commit 202a784 into master on Dec 30, 2025; 7 checks passed.