Releases: spranab/contextcache

v0.2.0 — CPU Orchestrator

03 Mar 22:38

CPU Orchestrator: Confidence-Gated Tool Routing (No GPU Required)

This release adds a standalone orchestrator service. It uses a small local model (Qwen3.5-2B, 4-bit GGUF) for fast tool routing on CPU and hands parameter extraction and response synthesis to an external LLM.

Two Products, One Server

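| Product       | Endpoint    | What it does                                  | Latency | LLM needed? |
|---------------|-------------|-----------------------------------------------|---------|-------------|
| Route-Only    | POST /route | Tool detection + confidence                   | ~500ms  | No          |
| Full Pipeline | POST /query | Route → extract params → execute → synthesize | ~3s     | Yes         |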

Highlights

  • 95-100% routing accuracy on CPU with Qwen3.5-2B (4-bit quantized)
  • ~500ms per query via KV state caching — register tools once, route instantly
  • Confidence-gated pipeline — HIGH (≥0.7): execute, LOW (0.2-0.7): verify, NO_TOOL (<0.2): conversational
  • Works with any OpenAI-compatible LLM — Ollama, Claude, OpenAI, Azure
  • Configurable thinking mode — enable LLM reasoning for deeper analysis on complex queries
  • Browser admin UI — register tools, test routing, run full pipeline queries
  • Server-side LLM credentials — per-domain or global, API keys never leave the server
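
The confidence gate in the highlights can be sketched as a small function. This is an illustration of the thresholds listed above, not the code in context_cache/orchestrator.py. The names `Gate` and `gate` are hypothetical, and the assumption that scores fall in the range 0 to 1 is ours.

```python
from enum import Enum


class Gate(Enum):
    """Pipeline action per confidence band (hypothetical names)."""

    HIGH = "execute"            # confidence >= 0.7
    LOW = "verify"              # 0.2 <= confidence < 0.7
    NO_TOOL = "conversational"  # confidence < 0.2


# Thresholds as stated in the release notes.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.2


def gate(confidence: float) -> Gate:
    """Map a routing confidence score to a pipeline action."""
    # Assumption: scores are probabilities in [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= HIGH_THRESHOLD:
        return Gate.HIGH
    if confidence >= LOW_THRESHOLD:
        return Gate.LOW
    return Gate.NO_TOOL
```

Under this reading, a score of exactly 0.7 executes the tool and a score of exactly 0.2 goes to verification, which follows from the "≥0.7" and "<0.2" bounds.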

New Files

  • context_cache/tool_router.py — Async tool router with per-domain KV state caching
  • context_cache/orchestrator.py — Confidence-gated pipeline coordinator
  • context_cache/llama_cpp_engine.py — ctypes wrapper for llama.cpp C API
  • context_cache/llama_server_engine.py — Managed llama-server subprocess
  • scripts/serve/serve_orchestrator.py — FastAPI server (port 8422)
  • scripts/serve/static/orchestrator.html — Admin web UI
  • configs/orchestrator_config.yaml — Server configuration
  • examples/retail_assistant.py — Interactive CLI demo app
  • examples/fastapi_integration.py — FastAPI wrapper pattern
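
The "register tools once, route instantly" idea behind per-domain KV state caching can be illustrated with a toy cache. This is only a sketch of the pattern and does not reflect the actual tool_router.py. The class `DomainStateCache` and its methods are hypothetical, and a SHA-256 digest stands in for the model state that a real engine would save after evaluating the tool definitions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DomainStateCache:
    """Toy per-domain cache: expensive prefix work runs once per registration."""

    _states: dict = field(default_factory=dict)
    prefill_calls: int = 0  # how many times the expensive step has run

    def _prefill(self, tools: tuple) -> str:
        # Stand-in for evaluating the tool-definition prompt with the model.
        # A real engine would store KV state here, not a hash.
        self.prefill_calls += 1
        return hashlib.sha256("\n".join(tools).encode("utf-8")).hexdigest()

    def register(self, domain: str, tools: list) -> None:
        """Compute and store the prefix state for one domain's tool set."""
        self._states[domain] = self._prefill(tuple(tools))

    def route(self, domain: str, query: str) -> tuple:
        """Reuse the stored state; only the query would need fresh evaluation."""
        state = self._states[domain]  # KeyError if the domain was never registered
        return state, query
```

Registering a domain once and routing many queries against it leaves `prefill_calls` at 1, which is the property that keeps per-query cost low.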

Quick Start

# Download the routing model (~1.5GB)
huggingface-cli download unsloth/Qwen3.5-2B-GGUF Qwen3.5-2B-Q4_K_M.gguf

# Start the orchestrator
python scripts/serve/serve_orchestrator.py --config configs/orchestrator_config.yaml

# Open admin UI: http://localhost:8422
# Register tools, route queries, test the full pipeline
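
Once the server is running, a client could call the two endpoints over HTTP. The sketch below uses only the Python standard library. The port and endpoint paths come from these notes, but the JSON body fields (`query`, `domain`) and the helper names are assumptions, so check the README for the real request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8422"  # port listed for serve_orchestrator.py


def build_request(endpoint: str, query: str, domain: str = "default") -> urllib.request.Request:
    """Build a JSON POST request for /route or /query.

    The body fields "query" and "domain" are hypothetical placeholders.
    """
    if endpoint not in ("/route", "/query"):
        raise ValueError(f"unknown endpoint: {endpoint}")
    body = json.dumps({"query": query, "domain": domain}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running orchestrator):
#   result = send(build_request("/route", "Where is my order?"))
#   print(result)
```

Using `/route` keeps the call on the local model only, while `/query` also involves the configured external LLM and takes correspondingly longer.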

Documentation

Full end-to-end guide added to README — from tool registration through consumer app integration, including working demo apps.