
Releases: buckster123/LocalRouter

v0.3.0 — Recipe Editor, vLLM Backend, DeepSeek V4

04 May 12:23


What's New

✏️ Recipe Editor TUI

Browse, create, edit, duplicate, and delete recipes without touching TOML files. Manage GPU tiers and docker images from the same menu. Proper TOML round-trip (tomllib + tomli_w), validation, auto-backup on save.
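
A minimal sketch of the load → edit → back up → write cycle the editor performs; the helper name, recipe layout, and field names below are illustrative, not the actual module API:

import shutil, tomllib, tomli_w
from datetime import datetime, timezone
from pathlib import Path

def save_recipe(path: Path, recipe: dict) -> None:
    """Sketch: validate, back up the existing file, then write the recipe back as TOML."""
    if "model" not in recipe:                       # minimal validation example
        raise ValueError("recipe must define a [model] table")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    shutil.copy2(path, path.with_suffix(f".toml.{stamp}.bak"))   # auto-backup on save
    with path.open("wb") as fh:
        tomli_w.dump(recipe, fh)

# Round-trip: tomllib reads, tomli_w writes back.
path = Path("recipes/example.toml")                 # illustrative path
with path.open("rb") as fh:
    recipe = tomllib.load(fh)
recipe["model"]["ctx_size"] = 32768                 # example edit (hypothetical field)
save_recipe(path, recipe)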

⚡ vLLM Serving Backend

New provider type for models too large for llama.cpp. Tensor-parallel serving across multi-GPU clusters with automatic GPU detection, FlashInfer attention, FP8 KV cache, and reasoning parser support. Based on the official vllm/vllm-openai:v0.20.1 image.
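
A hedged sketch of how launch arguments for this backend might be assembled. The flags shown (tensor parallelism, FP8 KV cache, max model length) are standard vLLM options, but the function, defaults, and model id are illustrative, not the project's actual code:

import torch

def build_vllm_args(model_id: str, max_len: int = 131072) -> list[str]:
    """Illustrative only: assemble vLLM server flags from the detected GPU count."""
    num_gpus = torch.cuda.device_count()            # automatic GPU detection
    return [
        "--model", model_id,
        "--tensor-parallel-size", str(num_gpus),    # shard across all visible GPUs
        "--kv-cache-dtype", "fp8",                  # FP8 KV cache
        "--max-model-len", str(max_len),
    ]

# FlashInfer attention is typically selected via VLLM_ATTENTION_BACKEND=FLASHINFER in the
# container environment, and a reasoning parser can be enabled with --reasoning-parser
# for models that emit thinking traces.
print(" ".join(build_vllm_args("deepseek-ai/DeepSeek-V4-Flash")))   # hypothetical model id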

🧠 DeepSeek V4 Support

  • V4-Flash (284B, 13B active): 7 GGUF recipes via llama.cpp + 2 vLLM recipes
  • V4-Pro (1.6T, 49B active): 5 vLLM recipes across datacenter clusters
  • Custom llama.cpp branch support for models with unmerged upstream PRs
  • Split-file GGUF discovery for large sharded models (see the sketch after this list)
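
For the split-file discovery item above, a minimal sketch assuming llama.cpp's usual -00001-of-000NN.gguf shard naming; the function name and refusal behavior are illustrative:

import re
from pathlib import Path

SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")

def find_first_shard(model_dir: Path) -> Path | None:
    """Return the first shard of a split GGUF, or None if the set is incomplete."""
    shards = sorted(p for p in model_dir.glob("*.gguf") if SPLIT_RE.search(p.name))
    if not shards:
        return None
    first = shards[0]
    expected = int(SPLIT_RE.search(first.name).group(2))
    if len(shards) != expected:
        return None                      # missing shards; refuse to launch
    return first                         # llama-server loads the remaining splits itself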

🖥️ Multi-GPU Cluster Tiers

19 GPU tiers (up from 10), including:

  • 2×/4× H100 SXM (160–320 GB)
  • 2×/4×/5× H200 SXM (282–705 GB)
  • 2×/4× B200 SXM (384–768 GB)
  • 8× H100 and 8× A100 clusters

📊 By the Numbers

  • 70 recipes across 4 providers (vast_gguf, vLLM, Together AI, local)
  • 19 GPU tiers from RTX 4090 to 8×B200 SXM
  • ~5,000 lines of Python across 18 modules
  • 4 docker images: prebuilt, builder, vLLM, legacy

Dependencies

  • Added tomli_w>=1.0.0 for recipe editor TOML write-back

Full Changelog

v0.2.0...v0.3.0

v0.2.0 — Modular Rewrite + 12 Bug Fixes

03 May 09:56


What's New

🏗️ Architecture Overhaul

The monolithic vast_manager.py (3,064 lines) has been split into a clean 16-module Python package (localrouter/):

localrouter/
├── config.py           # paths, TOML loading, presets
├── helpers.py          # shell wrappers, formatting
├── providers.py        # Together AI, endpoint management
├── cost.py             # cost estimation, usage tracking
├── local_endpoint.py   # llama-server lifecycle
├── vast_ops.py         # SSH diagnostics, offer browsing
├── hf_browser.py       # HuggingFace model browser
├── proxy.py            # proxy lifecycle helpers
└── menus/              # TUI menu system
    ├── main.py         # entry point, banner
    ├── provider_menus.py
    ├── local_menus.py
    ├── vast_menus.py
    └── tool_menus.py

📦 Now pip-installable

pip install -e '.[all]'
localrouter              # new CLI command
python -m localrouter    # module entry
./vast_manager.py        # backward compatible
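
The dual entry points typically work through a console-script entry in pyproject.toml plus a package-level __main__.py; a minimal sketch of the latter, assuming the entry function lives in the menus/main.py module shown above (not confirmed against the actual source):

# localrouter/__main__.py (sketch; actual contents may differ)
from localrouter.menus.main import main   # assumed entry function

if __name__ == "__main__":
    main()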

🐛 12 Bug Fixes

ID   Severity   What
C1   CRITICAL   Vast launch crash — gpu_choices undefined after provider selection refactor
C2   CRITICAL   Broken f-string in proxy status (Qwen sanitization artifact)
C3   CRITICAL   Proxy streaming broken — StreamResponse.prepare() missing request arg
C4   CRITICAL   ClientTimeout positional args (aiohttp 3.x keyword-only)
M1   MAJOR      Duplicate capture()/run() dead code removed
M2   MAJOR      menu_diagnose() no longer crashes on local/Together endpoints
M3   MAJOR      PUT/PATCH requests no longer silently converted to POST in proxy
M6   MAJOR      urllib.error import missing in usage tracker
M7   MAJOR      Per-GPU disk size defaults in vast_up.sh were dead code (always 60 GB)
M8   MAJOR      KV cache type for the 256K context preset was dead code (always q8_0, should be q4_0)
M9   MAJOR      --flash-attn on → --flash-attn (bare flag) in launch.sh
M10  MAJOR      Non-thinking mode broken — JSON word-splitting in template kwargs
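
For context on C3 and C4, the corrected aiohttp usage looks roughly like this (a sketch with placeholder URLs, not the project's actual proxy code):

import aiohttp
from aiohttp import web

# C4: construct ClientTimeout with keyword arguments under aiohttp 3.x
timeout = aiohttp.ClientTimeout(total=600, connect=30)

async def stream_upstream(request: web.Request) -> web.StreamResponse:
    """Sketch of a streaming proxy handler with the C3/C4 fixes applied."""
    resp = web.StreamResponse()
    await resp.prepare(request)            # C3: prepare() needs the incoming request
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post("http://upstream/v1/chat/completions",   # placeholder URL
                                json=await request.json()) as upstream:
            async for chunk in upstream.content.iter_chunked(4096):
                await resp.write(chunk)
    await resp.write_eof()
    return resp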

🔧 Additional Improvements

  • Proxy strips hop-by-hop headers before forwarding (see the sketch after this list)
  • Streaming error handling for client disconnect
  • UTC timestamps (was using local time with Z suffix)
  • Fixed vast_up.sh HF token handling (Qwen *** sanitization artifacts)
  • Fixed recipe labels (H200 slot count, Qwen3 vs Qwen3.5)
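
A minimal sketch of the hop-by-hop stripping mentioned in the first bullet, using the header set from RFC 7230; the helper name is illustrative:

# RFC 7230 §6.1 hop-by-hop headers, never forwarded by a proxy
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}

def forwardable_headers(headers: dict[str, str]) -> dict[str, str]:
    """Drop hop-by-hop headers (plus anything named in Connection) before proxying."""
    listed = {
        h.strip().lower()
        for h in headers.get("Connection", "").split(",") if h.strip()
    }
    return {
        k: v for k, v in headers.items()
        if k.lower() not in HOP_BY_HOP and k.lower() not in listed
    }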

Upgrade

git pull origin main
pip install -e '.[all]'
# Run as before — backward compatible