Releases: spranab/contextcache
v0.2.0 — CPU Orchestrator
CPU Orchestrator: Confidence-Gated Tool Routing (No GPU Required)
A new standalone orchestrator service that uses a small local model (Qwen3.5-2B, 4-bit GGUF) for fast tool routing on CPU, with an external LLM for parameter extraction and response synthesis.
Two Products, One Server
| Product | Endpoint | What it does | Latency | LLM needed? |
|---|---|---|---|---|
| Route-Only | POST /route | Tool detection + confidence | ~500ms | No |
| Full Pipeline | POST /query | Route → extract params → execute → synthesize | ~3s | Yes |
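As a rough illustration of how a consumer app might call the two endpoints, here is a minimal client sketch using only the Python standard library. The port (8422) and the paths `/route` and `/query` come from these notes; the JSON-over-POST encoding and the helper names `build_request` and `post_json` are assumptions for illustration, so check the README for the actual request schema.

```python
import json
import urllib.request

# Port 8422 is the one named in these release notes; adjust if your config differs.
BASE_URL = "http://localhost:8422"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the orchestrator.

    Assumption: the server accepts a JSON body; the real schema may differ.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, something like `post_json("/route", {"query": "Where is my order?"})` would exercise the route-only path, and the same call against `/query` with a longer timeout would exercise the full pipeline. The `query` field name here is hypothetical.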
Highlights
- 95-100% routing accuracy on CPU with Qwen3.5-2B (4-bit quantized)
- ~500ms per query via KV state caching — register tools once, route instantly
- Confidence-gated pipeline — HIGH (≥0.7): execute, LOW (0.2-0.7): verify, NO_TOOL (<0.2): conversational
- Works with any OpenAI-compatible LLM — Ollama, Claude, OpenAI, Azure
- Configurable thinking mode — enable LLM reasoning for deeper analysis on complex queries
- Browser admin UI — register tools, test routing, run full pipeline queries
- Server-side LLM credentials — per-domain or global, API keys never leave the server
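The confidence gate listed above can be sketched in a few lines. The thresholds (0.7 and 0.2) and the three outcomes are taken from the list; the names `Gate` and `gate`, and the assumption that confidence is a score in [0, 1], are illustrative and not necessarily what `context_cache/orchestrator.py` uses.

```python
from enum import Enum

# Thresholds as stated in the release notes.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.2


class Gate(Enum):
    HIGH = "execute"            # confidence >= 0.7: run the tool directly
    LOW = "verify"              # 0.2 <= confidence < 0.7: verify before running
    NO_TOOL = "conversational"  # confidence < 0.2: answer without a tool


def gate(confidence: float) -> Gate:
    """Map a routing confidence to a pipeline action."""
    # Assumption: the router reports confidence as a score in [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    if confidence >= HIGH_THRESHOLD:
        return Gate.HIGH
    if confidence >= LOW_THRESHOLD:
        return Gate.LOW
    return Gate.NO_TOOL
```

Note the boundary handling: a score of exactly 0.7 is treated as HIGH and exactly 0.2 as LOW, which matches the "≥0.7" and "<0.2" wording above.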
New Files
- context_cache/tool_router.py: Async tool router with per-domain KV state caching
- context_cache/orchestrator.py: Confidence-gated pipeline coordinator
- context_cache/llama_cpp_engine.py: ctypes wrapper for llama.cpp C API
- context_cache/llama_server_engine.py: Managed llama-server subprocess
- scripts/serve/serve_orchestrator.py: FastAPI server (port 8422)
- scripts/serve/static/orchestrator.html: Admin web UI
- configs/orchestrator_config.yaml: Server configuration
- examples/retail_assistant.py: Interactive CLI demo app
- examples/fastapi_integration.py: FastAPI wrapper pattern
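To give a feel for the "register tools once, route instantly" idea behind per-domain caching, here is a plain-Python analogy. It is not the llama.cpp KV state mechanism used by the router: `DomainStateCache`, `tools_fingerprint`, and the `build_state` callback are hypothetical names, and the cached value stands in for whatever expensive per-domain state the real router keeps.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def tools_fingerprint(tools: List[dict]) -> str:
    """Stable hash of a tool list, used to detect when re-registration is needed."""
    canonical = json.dumps(tools, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DomainStateCache:
    """Build expensive per-domain state once and reuse it for every query."""

    def __init__(self, build_state: Callable[[List[dict]], Any]) -> None:
        # build_state is a stand-in for the costly step (e.g. processing the
        # tool definitions through the model); it is an assumption here.
        self._build_state = build_state
        self._entries: Dict[str, Tuple[str, Any]] = {}

    def register(self, domain: str, tools: List[dict]) -> Any:
        """Rebuild the state only if the domain is new or its tools changed."""
        fingerprint = tools_fingerprint(tools)
        entry = self._entries.get(domain)
        if entry is None or entry[0] != fingerprint:
            entry = (fingerprint, self._build_state(tools))
            self._entries[domain] = entry
        return entry[1]

    def state_for(self, domain: str) -> Any:
        """Return the cached state for a registered domain (KeyError if unknown)."""
        return self._entries[domain][1]
```

The design point is that the per-query path only does a dictionary lookup; the costly build runs at registration time and again only when a domain's tool list changes.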
Quick Start
# Download the routing model (~1.5GB)
huggingface-cli download unsloth/Qwen3.5-2B-GGUF Qwen3.5-2B-Q4_K_M.gguf
# Start the orchestrator
python scripts/serve/serve_orchestrator.py --config configs/orchestrator_config.yaml
# Open admin UI: http://localhost:8422
# Register tools, route queries, test the full pipeline
Documentation
Full end-to-end guide added to README — from tool registration through consumer app integration, including working demo apps.