Releases · spranab/contextcache

CPU Orchestrator: Confidence-Gated Tool Routing (No GPU Required)

This release adds a standalone orchestrator service. It uses a small local model (Qwen3.5-2B, 4-bit GGUF) for fast tool routing on CPU and hands parameter extraction and response synthesis to an external LLM.

Two Products, One Server

Product	Endpoint	What it does	Latency	LLM needed?
Route-Only	`POST /route`	Tool detection + confidence	~500ms	No
Full Pipeline	`POST /query`	Route → extract params → execute → synthesize	~3s	Yes
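
As a rough illustration of how a client might call the two endpoints, the sketch below builds JSON POST requests with Python's standard library. The endpoint paths and port 8422 come from these notes. The request field names (`query`, `domain`) and the helper names `build_request`, `post_json`, and `main` are assumptions made for illustration, not the documented schema, so check the README before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8422"  # port taken from the Quick Start section


def build_request(endpoint: str, query: str, domain: str = "default") -> urllib.request.Request:
    """Build a JSON POST request for /route or /query.

    The body fields ("query", "domain") are assumed for illustration only.
    """
    body = json.dumps({"query": query, "domain": domain}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main() -> None:
    """Call both endpoints; requires a running orchestrator."""
    # Route-only: tool detection plus confidence, no external LLM involved.
    print(post_json(build_request("/route", "Where is my order?")))
    # Full pipeline: route, extract params, execute, synthesize.
    print(post_json(build_request("/query", "Where is my order?")))
```

Calling `main()` only makes sense once the server from the Quick Start is running. Splitting request construction from sending keeps the assumed payload shape in one place, which is easy to adjust once the real schema is known.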

Highlights

95-100% routing accuracy on CPU with Qwen3.5-2B (4-bit quantized)
~500ms per query via KV state caching — register tools once, route instantly
Confidence-gated pipeline — HIGH (≥0.7): execute, LOW (0.2-0.7): verify, NO_TOOL (<0.2): conversational
Works with any OpenAI-compatible LLM — Ollama, Claude, OpenAI, Azure
Configurable thinking mode — enable LLM reasoning for deeper analysis on complex queries
Browser admin UI — register tools, test routing, run full pipeline queries
Server-side LLM credentials — per-domain or global, API keys never leave the server
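
The confidence gate described in the highlights can be sketched as a small decision function. The two thresholds (0.7 and 0.2) are taken from these notes. The names `Gate` and `gate`, the assumption that confidence lies in [0, 1], and the choice to treat exactly 0.7 as HIGH and exactly 0.2 as LOW are illustrative readings of the ranges above, not the project's actual code.

```python
from enum import Enum

HIGH_THRESHOLD = 0.7  # from the release notes: HIGH is >= 0.7
LOW_THRESHOLD = 0.2   # from the release notes: NO_TOOL is < 0.2


class Gate(Enum):
    EXECUTE = "execute"                # HIGH confidence: run the tool
    VERIFY = "verify"                  # LOW confidence: verify before running
    CONVERSATIONAL = "conversational"  # NO_TOOL: answer without a tool


def gate(confidence: float) -> Gate:
    """Map a routing confidence to a pipeline action.

    Assumes confidence is a probability-like score in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= HIGH_THRESHOLD:
        return Gate.EXECUTE
    if confidence >= LOW_THRESHOLD:
        return Gate.VERIFY
    return Gate.CONVERSATIONAL
```

For example, `gate(0.85)` returns `Gate.EXECUTE` and `gate(0.05)` returns `Gate.CONVERSATIONAL`.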

New Files

context_cache/tool_router.py — Async tool router with per-domain KV state caching
context_cache/orchestrator.py — Confidence-gated pipeline coordinator
context_cache/llama_cpp_engine.py — ctypes wrapper for llama.cpp C API
context_cache/llama_server_engine.py — Managed llama-server subprocess
scripts/serve/serve_orchestrator.py — FastAPI server (port 8422)
scripts/serve/static/orchestrator.html — Admin web UI
configs/orchestrator_config.yaml — Server configuration
examples/retail_assistant.py — Interactive CLI demo app
examples/fastapi_integration.py — FastAPI wrapper pattern
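
The "register tools once, route instantly" idea behind per-domain caching can be pictured as memoizing an expensive per-domain setup step. The sketch below is a generic illustration of that pattern and not the code in `tool_router.py`: `DomainStateCache`, `build_state`, and the string-valued state are hypothetical stand-ins for whatever the router actually stores.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class DomainStateCache:
    """Build an expensive per-domain state once, then reuse it."""

    def __init__(self, build_state: Callable[[str], Awaitable[str]]) -> None:
        self._build_state = build_state            # hypothetical expensive setup
        self._states: Dict[str, str] = {}          # domain -> cached state
        self._locks: Dict[str, asyncio.Lock] = {}  # one lock per domain

    async def get(self, domain: str) -> str:
        """Return the cached state, building it on first use."""
        lock = self._locks.setdefault(domain, asyncio.Lock())
        async with lock:  # concurrent callers for one domain build only once
            if domain not in self._states:
                self._states[domain] = await self._build_state(domain)
            return self._states[domain]

    def invalidate(self, domain: str) -> None:
        """Drop a domain's state, e.g. after its tool list changes."""
        self._states.pop(domain, None)
```

The per-domain lock means two simultaneous first requests for the same domain share one build, while requests for different domains do not block each other.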

Quick Start

# Download the routing model (~1.5GB)
huggingface-cli download unsloth/Qwen3.5-2B-GGUF Qwen3.5-2B-Q4_K_M.gguf

# Start the orchestrator
python scripts/serve/serve_orchestrator.py --config configs/orchestrator_config.yaml

# Open admin UI: http://localhost:8422
# Register tools, route queries, test the full pipeline

Documentation

The README now includes a full end-to-end guide, from tool registration through consumer app integration, with working demo apps.

v0.2.0 — CPU Orchestrator
