🔀 LocalRouter

Your private LLM inference hub: local hardware, rented GPUs, or managed APIs. One TUI. One proxy. Zero vendor lock-in.

Python 3.10+ · License: MIT

LocalRouter TUI - main menu

The main menu: launch endpoints, browse models, run diagnostics, all from your terminal.


Why LocalRouter?

You've got a local GPU, a Vast.ai account, maybe a Together AI key - but juggling three different CLIs, SSH tunnels, and config files is a pain. LocalRouter unifies all of that behind a single interactive TUI and a transparent local proxy (localhost:8888). Your clients never know which backend is active.

pip install localrouter        # or: git clone + pip install -e .
localrouter                    # launch the TUI

That's it. No YAML configs to write, no Docker required for local mode, no separate tunnel scripts to remember.


Features at a Glance

| Feature | What it does |
| --- | --- |
| 🖥️ Local inference | Run GGUF models on your own hardware with Vulkan, ROCm, CUDA, or CPU via llama.cpp |
| ☁️ Vast.ai rental | One-click GPU rental: pick a tier (4090 → 8×H100) → recipe → geo → launch |
| ⚡ vLLM serving | Tensor-parallel serving for 1T+ MoE models across multi-GPU clusters |
| 🤝 Together AI | Connect 229+ managed models, hot-swap providers mid-session |
| 🔀 Unified proxy | localhost:8888 routes to whichever provider is active; swap backends, not code |
| 📊 Usage tracking | Per-provider cost logging (JSONL), session totals, rate-limit monitoring |
| 🔍 Batch compare | Send the same prompt to multiple providers, see results & metrics side-by-side |
| 🧙 Launch wizard | Guided flow: GPU tier → model recipe → geo preference → offer → spin up |
| ✏️ Recipe editor | TUI-based editor for recipes, GPU tiers, and docker images; no manual TOML editing |
| 🔧 Deep diagnostics | SSH probes, download speed checks, stall detection & automatic recovery |

Quick Start

1. Install

# From PyPI (when published) or from source:
git clone https://github.com/buckster123/LocalRouter.git
cd LocalRouter
pip install -e .

# Now you have the `localrouter` command globally

2. Pick your mode

🖥️ Local mode: no API keys, no cloud, just your GPU

Point recipes at your GGUF models and go:

# Prerequisites: compiled llama.cpp + GGUF models in ~/models/
localrouter                    # TUI → Local → Launch → pick recipe

The TUI auto-discovers:

  • llama-server binaries from PATH or ~/llama.cpp/build*/bin/
  • GGUF files in ~/models/ (configurable)
  • Available backends (Vulkan, ROCm/HIP, CUDA, CPU)
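
Under the hood this is ordinary path probing. A rough, illustrative sketch of that discovery in Python - not LocalRouter's actual code, and it assumes the default paths listed above:

import shutil
from pathlib import Path

# llama-server from PATH, plus any llama.cpp build trees under the home directory
binaries = []
if (on_path := shutil.which("llama-server")):
    binaries.append(Path(on_path))
binaries += sorted(Path.home().glob("llama.cpp/build*/bin/llama-server"))

# GGUF files under ~/models/ (the configurable default)
models = sorted(Path.home().glob("models/**/*.gguf"))

print(f"{len(binaries)} llama-server binaries, {len(models)} GGUF files found")
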
☁️ Vast.ai GGUF mode: rent a GPU, get an endpoint
pip install vastai              # one-time
vastai set api-key <your-key>   # from console.vast.ai
localrouter                     # TUI → Launch → Vast GGUF

The wizard walks you through: GPU tier → model → geo → offer → launch. Use Watch to follow boot progress, Tunnel to forward the endpoint locally.

⚡ vLLM mode: tensor-parallel for massive MoE models

For models too large for llama.cpp, like DeepSeek V4 Pro (1.6T params) across 4-8 GPUs:

localrouter                     # TUI → Launch → pick a vLLM recipe

vLLM recipes auto-configure tensor parallelism, FlashInfer attention, FP8 KV cache, and chunked prefill. The launch script detects GPU count and sets everything up.

Pre-configured for DeepSeek V4 Pro and Flash on H100, H200, B200, and A100 clusters.
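
To make those settings concrete, below is roughly the kind of vllm serve invocation a vLLM recipe maps to. This is an illustrative sketch, not the contents of launch_vllm.sh; the model ID, context length, and parser come from the dsv4-pro-5xh200 recipe shown in the Recipe System section:

import os
import subprocess

# One line per visible GPU from `nvidia-smi -L`; the real launch script may detect this differently
gpu_count = len(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout.strip().splitlines())

env = dict(os.environ, VLLM_ATTENTION_BACKEND="FLASHINFER")   # FlashInfer attention

cmd = [
    "vllm", "serve", "deepseek-ai/DeepSeek-V4-Pro",   # model_id from the recipe
    "--tensor-parallel-size", str(gpu_count),         # shard across every detected GPU
    "--kv-cache-dtype", "fp8",                        # FP8 KV cache
    "--enable-chunked-prefill",                       # chunked prefill
    "--max-model-len", "393216",                      # recipe ctx
    "--reasoning-parser", "deepseek_r1",              # recipe reasoning_parser
]
subprocess.run(cmd, env=env, check=True)
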

🤝 Together AI: managed inference, 229+ models
mkdir -p ~/.vastai-gguf
cat > ~/.vastai-gguf/config.toml << 'EOF'
[provider.together]
api_key = "sk-xxxxx"
base_url = "https://api.together.ai/v1"
EOF

The TUI picks it up automatically: browse models, pin your choice, hot-swap.
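
To sanity-check the key outside the TUI, you can read that same file and hit the provider's OpenAI-compatible model listing. A small sketch (assumes the requests package, and that the endpoint exposes the standard /models route):

try:
    import tomllib                      # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib             # Python 3.10: pip install tomli
from pathlib import Path

import requests

cfg = tomllib.loads((Path.home() / ".vastai-gguf" / "config.toml").read_text())
together = cfg["provider"]["together"]

r = requests.get(
    f"{together['base_url']}/models",
    headers={"Authorization": f"Bearer {together['api_key']}"},
    timeout=30,
)
r.raise_for_status()
payload = r.json()
models = payload.get("data", payload) if isinstance(payload, dict) else payload
print(f"{len(models)} models visible")
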


The Proxy: One Endpoint to Rule Them All

Client (curl / Python / Hermes / …)
         │
         ▼
   localhost:8888          ← always the same URL
         │
    ┌────┴─────┐
    │  Proxy   │          ← routes to active provider
    └────┬─────┘
         │
   ┌─────┼──────┬──────────┐
   ▼     ▼      ▼          ▼
 Local  Vast  vLLM     Together
(llama) (GGUF)(tensor)  (managed)

OpenAI-compatible API: /v1/chat/completions, /v1/completions, health check at /health. Switch providers from the TUI; clients don't change a thing.
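
Because the proxy speaks the OpenAI wire format, any HTTP client works unmodified. A minimal sketch using the requests package; the model name below is a placeholder for whatever the active recipe serves:

import requests

# Liveness probe: the proxy answers on /health no matter which backend is active
print(requests.get("http://localhost:8888/health", timeout=5).status_code)

# Standard chat completion, routed to the active provider
resp = requests.post(
    "http://localhost:8888/v1/chat/completions",
    json={
        "model": "active-recipe-model",   # placeholder: use the model your recipe serves
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
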


Recipe System

Everything is driven by recipes.toml: 70 pre-configured recipes across 19 GPU tiers and 4 providers. Edit recipes from the TUI (Editor menu) or directly in TOML:

# Local GPU
[[recipes]]
name       = "local-qwen35-9b"
provider   = "local"
label      = "Qwen3.5-9B  Q4_K_M  (local Vulkan)"
model_path = "~/models/Qwen3.5-9B-Q4_K_M.gguf"
port       = 8100
ctx        = 32768
backend    = "vulkan"

# Rented GPU via Vast.ai (llama.cpp)
[[recipes]]
name        = "qwen36-35b-h100"
label       = "Qwen3.6-35B-A3B  Q8_0  128K ctx"
gpu         = "h100-sxm"
model_repo  = "unsloth/Qwen3.6-35B-A3B-GGUF"
model_quant = "Q8_0"
ctx         = 131072

# vLLM tensor-parallel (multi-GPU clusters)
[[recipes]]
name             = "dsv4-pro-5xh200"
provider         = "vllm"
label            = "DSv4-Pro 1.6T  FP4+FP8  384K ctx  (5×H200)"
gpu              = "h200-sxm-5x"
model_id         = "deepseek-ai/DeepSeek-V4-Pro"
ctx              = 393216
image_type       = "vllm"
kv_cache_dtype   = "fp8"
reasoning_parser = "deepseek_r1"

# Managed API
[[recipes]]
name        = "together-qwen3-32b"
provider    = "together"
label       = "Qwen3-32B (managed)"
model_id    = "Qwen/Qwen3-32B"

GPU Tiers

From consumer cards to datacenter clusters:

| Tier | VRAM | Use case |
| --- | --- | --- |
| RTX 4090 | 24 GB | Budget GGUF (Qwen 9-27B) |
| RTX 5090 | 32 GB | Sweet spot for 27-35B GGUF |
| RTX PRO 6000 | 96 GB | Large models, huge context |
| H100 SXM | 80 GB | Datacenter single-GPU |
| 2×–4× H100 | 160–320 GB | DSv4-Flash GGUF |
| 2×–5× H200 | 282–705 GB | DSv4-Flash/Pro via vLLM |
| 4×–8× B200 | 768+ GB | DSv4-Pro at full quality |

TUI Menu

| Menu item | What it does |
| --- | --- |
| Launch | Guided wizard: Local / Vast GGUF / vLLM / Together → spin up |
| Local | Manage local llama.cpp endpoints (launch / status / logs / stop) |
| Providers | Configure API keys and base URLs |
| Together | Browse Together AI models, pin a choice |
| Batch | Compare multiple providers side-by-side |
| Watch | Live boot watcher: polls status + logs until healthy |
| Diagnose | Usage stats, rate limits, SSH probes, stall detection |
| Instances | List active Vast instances, reattach |
| HF Browse | Browse HuggingFace model files, pin a quant |
| Editor | Recipes, GPU tiers, docker images: full CRUD with validation |
| Tunnel | SSH tunnel: up / status / down / logs |
| Smoke | Provider-aware smoke tests (health, completion, tools, throughput) |
| Proxy | Start/stop the local proxy on localhost:8888 |
| Destroy | Tear down current Vast instance |

LocalRouter with active H100 instance

Running with an H100 SXM on Vast.ai: tunnel up, endpoint healthy, 3 slots.


Architecture

localrouter/
├── menus/
│   ├── main.py            # Main menu loop
│   ├── local_menus.py     # Local llama.cpp management
│   ├── vast_menus.py      # Vast.ai + vLLM launch wizard
│   ├── provider_menus.py  # Provider config & switching
│   ├── editor_menus.py    # Recipe/tier/image editor TUI
│   └── tool_menus.py      # Tunnel, proxy, diagnostics
├── config.py              # Settings & recipe loading
├── recipe_editor.py       # TOML read/write, recipe CRUD, validation
├── providers.py           # Provider abstraction layer
├── proxy.py               # Transparent proxy server
├── local_endpoint.py      # Local llama-server lifecycle
├── vast_ops.py            # Vast.ai API operations
├── hf_browser.py          # HuggingFace model browser
├── cost.py                # Cost tracking & rate limits
└── helpers.py             # Shared utilities

~5,000 lines of Python. 18 modules. pip install -e . and you're done.
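
Purely as an illustration of what the provider abstraction has to capture (this is not the code in providers.py), each backend reduces to a name, an OpenAI-compatible base URL, and an optional key that the proxy switches between:

from dataclasses import dataclass

@dataclass
class Provider:
    name: str           # "local", "vast_gguf", "vllm", "together"
    base_url: str       # e.g. "http://127.0.0.1:8100/v1" or "https://api.together.ai/v1"
    api_key: str = ""   # empty for local endpoints

    @property
    def chat_endpoint(self) -> str:
        return f"{self.base_url}/chat/completions"

# The proxy forwards each request to whichever Provider is currently active
active = Provider("local", "http://127.0.0.1:8100/v1")
print(active.chat_endpoint)
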


Usage Tracking

Every completion is logged to ~/.vastai-gguf/usage.log (JSONL):

{"ts":"2026-05-02T20:15:32","provider":"vast_gguf","model":"Qwen3.6-35B-A3B-Q8_0",
 "prompt_tokens":42,"completion_tokens":128,"cost":0.0012}

Local inference logs at $0. View stats in the Diagnose screen or programmatically:

python3 -c "from usage_tracker import format_summary; print(format_summary(24))"
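
Since the log is one JSON object per line, a few lines of Python will tally it however you like - a quick manual sketch, independent of the built-in summary:

import json
from pathlib import Path

totals: dict[str, tuple[int, float]] = {}   # provider -> (request count, cost)

for line in (Path.home() / ".vastai-gguf" / "usage.log").read_text().splitlines():
    if not line.strip():
        continue
    rec = json.loads(line)
    n, cost = totals.get(rec["provider"], (0, 0.0))
    totals[rec["provider"]] = (n + 1, cost + rec.get("cost", 0.0))

for provider, (n, cost) in sorted(totals.items()):
    print(f"{provider:<12} {n:>6} requests   ${cost:.4f}")
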

Security

  • Vast.ai: launch.sh / launch_vllm.sh bind to 127.0.0.1:8000 inside the container. Access only via SSH tunnel (tools/vast_tunnel.sh) or the local proxy.
  • Local: Binds to 127.0.0.1 by default β€” never exposed to the network.
  • Config: API keys stored in ~/.vastai-gguf/config.toml (not in the repo).

Requirements

| Dependency | Required for |
| --- | --- |
| Python 3.10+ | Everything |
| questionary, rich, tomli_w | TUI + recipe editor (auto-installed) |
| vastai CLI | Vast.ai mode only |
| llama.cpp (compiled) | Local mode only |
| Together AI API key | Together mode only |
| aiohttp | Proxy server (pip install localrouter[proxy]) |

License

MIT. Do what you want with it.
