@tbraun96 (Contributor)

Summary

This PR transforms llmnet into a Kubernetes-like orchestration platform for LLM pipelines. It introduces:

  • Pipeline resource - The deployable unit (like a K8s Deployment) with replicas, health checks, and rollout strategies
  • Node management - Cluster member tracking with capacity and heartbeats
  • Control plane API - REST endpoints for managing pipelines, nodes, and namespaces
  • kubectl-like CLI - Familiar subcommands: deploy, get, scale, delete, context
  • Context management - Switch between local and remote clusters (like kubeconfig)

New CLI Commands

# Control Plane
llmnet serve --control-plane           # Start control plane on :8181

# Pipeline Management
llmnet deploy pipeline.yaml            # Deploy pipeline
llmnet get pipelines                   # List pipelines
llmnet scale my-pipeline --replicas 3  # Scale
llmnet delete pipeline my-pipeline     # Delete

# Node Management  
llmnet get nodes                       # List nodes

# Context Management
llmnet context list                    # List contexts
llmnet context add remote --url http://10.0.0.1:8181
llmnet context use remote              # Switch context

# Status & Validation
llmnet status                          # Cluster status
llmnet validate config.json            # Validate composition

# Legacy Mode
llmnet run config.json                 # Run local pipeline

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    LLMNet Control Plane                      │
│                      (llmnet serve)                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ API Server   │  │ Pipeline     │  │ Node Registry    │  │
│  │ :8181        │  │ Controller   │  │                  │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
   ┌───────────┐        ┌───────────┐        ┌───────────┐
   │  Node 1   │        │  Node 2   │        │  Node 3   │
   │ (worker)  │        │ (worker)  │        │ (worker)  │
   │ :8080     │        │ :8080     │        │ :8080     │
   └───────────┘        └───────────┘        └───────────┘
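
To make the control-plane/worker relationship in the diagram concrete, here is a minimal sketch of a node registry that tracks worker heartbeats. It is an illustration only, not code from this PR: the class names, the fields, and the 30-second timeout are all assumptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical registry entry; field names are illustrative."""
    name: str
    address: str           # e.g. "10.0.0.2:8080" (worker port from the diagram)
    last_heartbeat: float  # seconds on a monotonic clock


@dataclass
class NodeRegistry:
    """In-memory stand-in for the Node Registry box in the diagram."""
    heartbeat_timeout: float = 30.0  # assumed value, not taken from the PR
    nodes: dict = field(default_factory=dict)

    def register(self, name, address, now=None):
        now = time.monotonic() if now is None else now
        self.nodes[name] = NodeRecord(name, address, now)

    def heartbeat(self, name, now=None):
        """Refresh a node's timestamp; returns False for unknown nodes."""
        now = time.monotonic() if now is None else now
        node = self.nodes.get(name)
        if node is None:
            return False
        node.last_heartbeat = now
        return True

    def ready_nodes(self, now=None):
        """Names of nodes whose last heartbeat is within the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            n.name
            for n in self.nodes.values()
            if now - n.last_heartbeat <= self.heartbeat_timeout
        )
```

Passing `now` explicitly keeps the sketch deterministic in tests; a real worker would simply call `heartbeat(name)` on a timer.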

Design Philosophy

Mirrored K8s Features:

  • Declarative configuration (JSON/YAML manifests)
  • Contexts (like kubeconfig)
  • Health checks (liveness/readiness)
  • Horizontal scaling
  • Labels and selectors
  • Rollout strategies
  • Namespaces

Excluded K8s Features (too complex for LLM orchestration):

  • Complex scheduler (simple round-robin instead)
  • etcd/HA control plane
  • CRDs
  • Network policies
  • RBAC (simple API key auth instead)
  • StatefulSets/DaemonSets
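
The "simple round-robin" scheduler mentioned in the list above could look roughly like the following. This is a hedged sketch of the general technique, not the PR's implementation; `RoundRobinScheduler`, `pick`, and `place` are hypothetical names.

```python
class RoundRobinScheduler:
    """Cycle through a fixed list of node names, one replica at a time."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = list(nodes)
        self._next = 0  # index of the next node to use

    def pick(self):
        """Return the next node in rotation."""
        node = self._nodes[self._next % len(self._nodes)]
        self._next += 1
        return node

    def place(self, replicas):
        """Assign each replica to a node, continuing the rotation."""
        return [self.pick() for _ in range(replicas)]
```

Round-robin ignores node load entirely, which is what keeps it simple: placement depends only on the order of the node list and how many replicas were placed before.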

Test plan

  • All 163 existing tests pass
  • New cluster controller tests
  • Control plane API endpoint tests
  • Context management tests
  • CLI parsing tests
  • Manual testing: deploy/scale/delete workflow
  • Manual testing: multi-node cluster setup

This commit transforms llmnet into a Kubernetes-like orchestration platform
for LLM pipelines, introducing:

## Core Resources
- **Pipeline**: Deployable unit with replicas, health checks, rollout strategy
- **Node**: Cluster member with capacity tracking and heartbeats
- **Namespace**: Logical isolation for pipelines
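
As an illustration of the resource shapes listed above, the sketch below models a pipeline with replicas, a health check, a rollout strategy, and labels. Every field name and default here is an assumption made for the example; the actual manifest schema is defined by the PR's code, not by this snippet.

```python
from dataclasses import dataclass, field


@dataclass
class HealthCheck:
    """Hypothetical probe settings."""
    path: str = "/health"        # assumed endpoint
    interval_seconds: int = 10   # assumed interval


@dataclass
class Pipeline:
    """Hypothetical in-memory form of a pipeline manifest."""
    name: str
    namespace: str = "default"
    replicas: int = 1
    rollout_strategy: str = "rolling"  # assumed strategy name
    labels: dict = field(default_factory=dict)
    liveness: HealthCheck = field(default_factory=HealthCheck)

    def __post_init__(self):
        if self.replicas < 0:
            raise ValueError("replicas must be non-negative")


def matches(pipeline, selector):
    """True if every key/value pair in the selector appears in the labels."""
    return all(pipeline.labels.get(k) == v for k, v in selector.items())
```

An empty selector matches every pipeline, which mirrors how equality-based label selectors behave in Kubernetes.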

## Control Plane API
- REST endpoints for pipeline/node/namespace management
- Deploy, scale, and delete pipelines remotely
- Node registration and health monitoring
- Cluster status and statistics
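
The deploy/scale/delete operations above amount to edits on a table of pipelines keyed by namespace and name. A minimal in-memory sketch of that state, assuming hypothetical names (`PipelineStore` and its methods) and no persistence or HTTP layer:

```python
class PipelineStore:
    """Tracks desired replica counts per (namespace, name)."""

    def __init__(self):
        self._pipelines = {}  # (namespace, name) -> desired replicas

    def deploy(self, name, replicas=1, namespace="default"):
        """Create or overwrite a pipeline's desired state."""
        self._pipelines[(namespace, name)] = replicas

    def scale(self, name, replicas, namespace="default"):
        """Change the replica count of an existing pipeline."""
        key = (namespace, name)
        if key not in self._pipelines:
            raise KeyError(f"pipeline {namespace}/{name} not found")
        if replicas < 0:
            raise ValueError("replicas must be non-negative")
        self._pipelines[key] = replicas

    def delete(self, name, namespace="default"):
        """Remove a pipeline; returns True if it existed."""
        return self._pipelines.pop((namespace, name), None) is not None

    def replicas(self, name, namespace="default"):
        return self._pipelines[(namespace, name)]

    def names(self, namespace="default"):
        """Sorted pipeline names in one namespace."""
        return sorted(n for (ns, n) in self._pipelines if ns == namespace)
```

Keying on the namespace as well as the name is what gives namespaces their isolation: two pipelines with the same name can coexist in different namespaces.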

## kubectl-like CLI
- `llmnet serve --control-plane`: Run control plane server
- `llmnet deploy <config>`: Deploy pipeline to current context
- `llmnet get pipelines/nodes/namespaces`: List resources
- `llmnet scale <name> --replicas N`: Scale pipelines
- `llmnet delete pipeline/node <name>`: Delete resources
- `llmnet context use/add/list/delete`: Context management
- `llmnet status`: Cluster status
- `llmnet validate`: Validate composition files
- `llmnet run`: Legacy mode for local execution

## Context Management
- ~/.llmnet/config stores cluster contexts (like kubeconfig)
- Switch between local and remote clusters
- API key authentication support
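
A rough model of the context behaviour described above (add, use, list, delete, with an optional API key per context). The on-disk format of ~/.llmnet/config is not specified in this PR text, so the JSON layout, the built-in "local" context, and all names below are assumptions for illustration.

```python
import json


class ContextStore:
    """In-memory model of named cluster contexts with one active context."""

    def __init__(self):
        # Assumed default: a local control plane on the default port 8181.
        self.contexts = {"local": {"url": "http://localhost:8181", "api_key": None}}
        self.current = "local"

    def add(self, name, url, api_key=None):
        self.contexts[name] = {"url": url, "api_key": api_key}

    def use(self, name):
        """Switch the active context."""
        if name not in self.contexts:
            raise KeyError(f"unknown context: {name}")
        self.current = name

    def delete(self, name):
        """Remove a context; refuses to remove the active one."""
        if name == self.current:
            raise ValueError("cannot delete the active context")
        del self.contexts[name]

    def list(self):
        return sorted(self.contexts)

    def to_json(self):
        """Serialize to JSON (assumed format, for illustration only)."""
        return json.dumps({"current": self.current, "contexts": self.contexts})
```

Refusing to delete the active context is a design choice made for the sketch; it avoids leaving the CLI pointed at a cluster that no longer has an entry.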

## Design Philosophy
Mirrors K8s features that make sense for LLM orchestration:
- Declarative configuration
- Health checks (liveness/readiness)
- Horizontal scaling
- Labels and selectors
- Rollout strategies

Excludes complex K8s features not needed for LLM workloads:
- Complex scheduler (simple round-robin instead)
- etcd/HA control plane
- Network policies
- StatefulSets/DaemonSets

Default control plane port: 8181
Default worker port: 8080

Create docs/commands/ with detailed markdown documentation for each
CLI subcommand:
- serve.md - Control plane and worker server modes
- deploy.md - Pipeline deployment with namespaces
- get.md - Resource listing (pipelines, nodes, namespaces)
- delete.md - Resource removal
- scale.md - Replica count adjustment
- context.md - Multi-cluster context management
- status.md - Cluster health overview
- validate.md - Configuration validation
- run.md - Local pipeline execution
- logs.md - Pipeline log streaming (planned feature)
- README.md - Command index and quick reference

Each document includes:
- Synopsis and argument tables
- Practical examples with expected output
- Error handling and troubleshooting
- Common usage patterns
- Comparison with kubectl equivalents

- Add node metrics collection (CPU, memory, disk, GPU via sysinfo/nvml)
- Implement weighted node scoring algorithm for intelligent scheduling
- Add auto-scaling config (minReplicas, maxReplicas, utilization targets)
- Create heartbeat client for worker nodes to report metrics
- Add API endpoints for node scores and autoscaling config
- Integrate heartbeat client into worker serve mode
- Enable GPU metrics by default (--features gpu)
- Add LLMNet logo to README
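
For readers unfamiliar with the two scheduling ideas in the list above, here is a hedged sketch of a weighted node score and a replica calculation bounded by minReplicas/maxReplicas. The weights, metric keys, and function names are assumptions. The replica rule is the proportional formula used by the Kubernetes Horizontal Pod Autoscaler, shown as a plausible reference point; the commit text does not say llmnet uses this exact rule.

```python
import math

# Assumed weights per resource; not taken from the commit.
WEIGHTS = {"cpu": 0.3, "memory": 0.3, "disk": 0.1, "gpu": 0.3}


def node_score(metrics, weights=WEIGHTS):
    """Weighted free capacity in [0, 1]; higher means a better placement target.

    `metrics` maps resource name to utilization as a fraction in [0, 1].
    Missing resources are treated as idle; out-of-range values are clamped.
    """
    total = sum(weights.values())
    free = sum(
        w * (1.0 - min(max(metrics.get(k, 0.0), 0.0), 1.0))
        for k, w in weights.items()
    )
    return free / total


def desired_replicas(current, utilization_pct, target_pct, min_replicas, max_replicas):
    """Scale proportionally to utilization/target, then clamp to the bounds."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    wanted = math.ceil(current * utilization_pct / target_pct)
    return max(min_replicas, min(max_replicas, wanted))
```

A scheduler built on this would sort candidate nodes by `node_score` in descending order, while the autoscaler would periodically compare `desired_replicas(...)` with the current count.
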
@tbraun96 tbraun96 merged commit c9c078e into master Dec 28, 2025
7 checks passed
@tbraun96 tbraun96 deleted the feature/kubernetes-orchestration branch December 28, 2025 21:48