@tbraun96 (Contributor)

Summary

Adds infrastructure for spawning and managing local LLM runners with a unified model configuration system.

New Modules

  • fetch.rs: Generic file fetching for local paths and remote URLs with SHA256-based caching and huggingface:// URL support (source classification sketched after this list)
  • ollama.rs: Modelfile parsing/generation with parameter merging for runtime configuration, using an SBIO pattern (pure parsing, I/O wrappers)
  • vllm.rs: vLLM CLI argument generation for OpenAI-compatible deployment
  • llamacpp.rs: llama.cpp server CLI arg generation with parameter aliasing
  • runner.rs: RunnerManager for spawning/stopping local processes with graceful shutdown
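
As a rough illustration of the source handling in fetch.rs, here is a minimal sketch of the local/remote/huggingface classification; the SourceKind enum and classify_source names are hypothetical, but the three categories come from this PR.

// Hypothetical sketch of fetch.rs source classification (names are illustrative).
#[derive(Debug, PartialEq)]
enum SourceKind {
    Local,       // plain filesystem path
    Remote,      // http:// or https:// URL
    HuggingFace, // huggingface:// reference resolved via the hub
}

fn classify_source(source: &str) -> SourceKind {
    if source.starts_with("huggingface://") {
        SourceKind::HuggingFace
    } else if source.starts_with("http://") || source.starts_with("https://") {
        SourceKind::Remote
    } else {
        SourceKind::Local
    }
}

fn main() {
    assert_eq!(classify_source("huggingface://org/repo"), SourceKind::HuggingFace);
    assert_eq!(classify_source("./models/tinyllama.gguf"), SourceKind::Local);
}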

Model Configuration Changes

New unified ModelConfig replaces the old ModelDefinition enum (backward compatible):

{
  "models": {
    "my-model": {
      "runner": "ollama",
      "interface": "openai-api",
      "source": "tinyllama:1.1b",
      "parameters": { "temperature": 0.7 }
    }
  }
}

Runner types: external, ollama, vllm, llama-cpp, docker
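
For orientation, a minimal serde sketch of how such a config might be modeled in Rust; field names mirror the JSON above, while the types and defaults are assumptions rather than the actual ModelConfig definition.

use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_json::Value;

// Sketch only: one plausible serde model of the unified config shown above.
#[derive(Debug, Serialize, Deserialize)]
struct ModelConfig {
    runner: RunnerType,
    interface: String,                  // e.g. "openai-api"
    source: String,                     // tag, path, URL, or huggingface:// reference
    #[serde(default)]
    parameters: HashMap<String, Value>, // e.g. { "temperature": 0.7 }
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "kebab-case")]    // "external", "vllm", "llama-cpp", ...
enum RunnerType {
    External,
    Ollama,
    Vllm,
    LlamaCpp,
    Docker,
}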

Architecture Context

Nodes can specify a context for remote deployment of local runners:

{
  "name": "gpu-model",
  "layer": 1,
  "adapter": "openai-api",
  "context": "gpu-cluster"
}

Test plan

  • All 228 unit tests pass
  • Modelfile parsing/generation roundtrip tests
  • CLI argument generation tests for vllm/llamacpp
  • Path classification tests (local/remote/huggingface)
  • Test fixture: tests/fixtures/tinyllama.Modelfile

Runner API Endpoints

  • Add /v1/runners endpoints for spawn, list, and stop operations
  • Extend AppState to include an optional SharedRunnerManager
  • Add a with_runner_manager() builder for worker mode
  • Create integration tests simulating control-plane-to-worker dispatch
  • Verify multi-instance coordination for remote deployment

This enables the control plane to dispatch spawn commands to workers representing remote clusters, as discussed in the context override design.
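
A minimal sketch of the state wiring this describes, assuming SharedRunnerManager is an Arc-wrapped handle (the real definition lives in runner.rs and may differ):

use std::sync::Arc;
use tokio::sync::Mutex;

// Placeholder for the RunnerManager implemented in runner.rs.
struct RunnerManager;

// The shared-handle name comes from this PR; Arc<Mutex<..>> is an assumption.
type SharedRunnerManager = Arc<Mutex<RunnerManager>>;

#[derive(Default)]
struct AppState {
    // Optional so control-plane instances can run without a local manager.
    runner_manager: Option<SharedRunnerManager>,
}

impl AppState {
    // Builder used in worker mode; the /v1/runners handlers check for Some(..)
    // and reject (or forward) spawn requests when no manager is attached.
    fn with_runner_manager(mut self, manager: SharedRunnerManager) -> Self {
        self.runner_manager = Some(manager);
        self
    }
}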

Docker Support

  • Add docker.rs module with DockerConfig and RegistryConfig structs
  • Support image-based and Dockerfile-based container deployment
  • Handle custom registries with authentication
  • Map model parameters to container environment variables
  • Support GPU passthrough, volumes, network modes, and IPC settings
  • Add extra_args for vLLM-specific flags (--swap-space, --tool-call-parser)

vLLM Improvements

  • Add is_vllm_installed() and is_cuda_available() checks
  • Add HuggingFace token detection from the HF_TOKEN env var
  • Detect gated models requiring authentication
  • Generate proper environment variables for the vLLM process
  • Add warnings for missing CUDA or HF token
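
A hedged sketch of those preflight checks; the detection strategies below (probing the Python module and nvidia-smi) are assumptions, not necessarily what vllm.rs does.

use std::env;
use std::process::Command;

fn is_vllm_installed() -> bool {
    // Sketch: treat "python3 -c 'import vllm'" succeeding as installed.
    Command::new("python3")
        .args(["-c", "import vllm"])
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn is_cuda_available() -> bool {
    // Sketch: treat a successful nvidia-smi run as "CUDA present".
    Command::new("nvidia-smi")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false)
}

fn hf_token() -> Option<String> {
    env::var("HF_TOKEN").ok().filter(|t| !t.is_empty())
}

fn preflight_warnings(model_is_gated: bool) -> Vec<String> {
    let mut warnings = Vec::new();
    if !is_cuda_available() {
        warnings.push("CUDA not detected; vLLM will likely fail to start".into());
    }
    if model_is_gated && hf_token().is_none() {
        warnings.push("model appears to be gated but HF_TOKEN is not set".into());
    }
    warnings
}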

Example Docker config matching a user's DGX setup:
{
  "runner": "docker",
  "source": "RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4",
  "docker": {
    "image": "dgx-vllm:cutlass-nvfp4",
    "network": "host",
    "gpus": "all",
    "ipc": "host",
    "volumes": ["${HOME}/.cache/huggingface:/root/.cache/huggingface"],
    "extra_args": "--swap-space 32 --tool-call-parser hermes"
  },
  "parameters": {
    "tensor_parallel_size": 1,
    "max_model_len": 131072
  }
}
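
As a sketch of how docker.rs might translate that configuration into docker run arguments (flag mapping mirrors the fields above; the --rm/-d flags are assumptions, and registry auth plus the container command with extra_args are omitted):

// Sketch only: field names mirror the JSON config; everything else is assumed.
struct DockerConfig {
    image: String,
    network: Option<String>,
    gpus: Option<String>,
    ipc: Option<String>,
    volumes: Vec<String>,
}

fn docker_run_args(cfg: &DockerConfig, env: &[(String, String)]) -> Vec<String> {
    let mut args = vec!["run".to_string(), "--rm".to_string(), "-d".to_string()];
    if let Some(net) = &cfg.network {
        args.extend(["--network".into(), net.clone()]);
    }
    if let Some(gpus) = &cfg.gpus {
        args.extend(["--gpus".into(), gpus.clone()]);
    }
    if let Some(ipc) = &cfg.ipc {
        args.extend(["--ipc".into(), ipc.clone()]);
    }
    for vol in &cfg.volumes {
        args.extend(["-v".into(), vol.clone()]);
    }
    // Model parameters become environment variables inside the container.
    for (key, value) in env {
        args.extend(["-e".into(), format!("{key}={value}")]);
    }
    args.push(cfg.image.clone());
    args
}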

extra_args Changes

  • Change extra_args from a string to HashMap<String, Value>
  • Add extra_args_to_string() to convert the map to CLI format at runtime
  • Support bool flags (true = include, false = omit), numbers, strings, and arrays
  • Arrays repeat the flag for each value (useful for --stop tokens)

Example:
  "extra_args": {
    "swap-space": 32,
    "tool-call-parser": "hermes",
    "enable-auto-tool-choice": true
  }

Becomes: "--swap-space 32 --tool-call-parser hermes --enable-auto-tool-choice"
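
For reference, a sketch of those conversion rules; the actual extra_args_to_string() may differ in details such as quoting, and HashMap iteration means flag order is unspecified.

use std::collections::HashMap;
use serde_json::Value;

fn extra_args_to_string(extra: &HashMap<String, Value>) -> String {
    let mut parts: Vec<String> = Vec::new();
    for (key, value) in extra {
        let flag = format!("--{key}");
        match value {
            // true => bare flag, false (or null) => omit entirely
            Value::Bool(true) => parts.push(flag),
            Value::Bool(false) | Value::Null => {}
            // arrays repeat the flag once per element (useful for --stop tokens)
            Value::Array(items) => {
                for item in items {
                    parts.push(format!("{flag} {}", scalar(item)));
                }
            }
            other => parts.push(format!("{flag} {}", scalar(other))),
        }
    }
    parts.join(" ")
}

fn scalar(v: &Value) -> String {
    match v {
        Value::String(s) => s.clone(), // bare string, not Value's quoted JSON form
        other => other.to_string(),
    }
}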

This cleans up the rough edges of CLI tools and makes composition files more readable and easier to manipulate programmatically.
tbraun96 merged commit 202a784 into master on Dec 30, 2025; 7 checks passed.