# Semantic Router Quickstart

This quickstart walks through the minimal set of commands needed to prove that
the semantic router can classify incoming chat requests, route them through
Envoy, and receive OpenAI-compatible completions. The flow is optimized for
local laptops and uses a lightweight mock backend by default, so the entire
loop finishes in a few minutes.

## Prerequisites

- A Python environment with the project’s dependencies installed and the virtualenv activated.
- `make`, `curl`, `go`, `cargo`, `rustc`, and `python3` available in `PATH`.
- All commands below are run from the repository root.

## Step-by-Step Runbook

0. **Download router support models**

   These assets (ModernBERT classifiers, LoRA adapters, embeddings, etc.) are
   required before the router can start.

   ```bash
   make download-models
   ```

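   If you want to confirm the download before moving on, a quick spot check
   works; the `models/` path below is an assumption, so adjust it to wherever
   `make download-models` stores the assets in your checkout.

   ```bash
   # Spot-check the downloaded classifier/LoRA/embedding assets.
   # NOTE: the models/ destination is an assumption; adjust to your checkout's layout.
   ls -lh models/ 2>/dev/null || echo "models/ not found - check where make download-models placed the assets"
   ```
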
1. **Start the OpenAI-compatible backend**

   The router expects at least one endpoint that serves `/v1/chat/completions`.
   You can point to a real vLLM deployment, but the fastest option is the
   bundled mock server:

   ```bash
   pip install -r tools/mock-vllm/requirements.txt
   python -m uvicorn tools.mock_vllm.app:app --host 0.0.0.0 --port 8000
   ```

   Leave this process running; it provides instant canned responses for
   `openai/gpt-oss-20b`.

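   A quick sanity check confirms the mock is answering before you layer Envoy
   and the router on top (this assumes the mock exposes the standard
   `/v1/chat/completions` path described above):

   ```bash
   # Call the mock backend directly on port 8000; expect a canned JSON completion.
   curl -s http://127.0.0.1:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "ping"}]}'
   ```
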
2. **Launch Envoy**

   In a separate terminal, bring up the Envoy sidecar that listens on
   `http://127.0.0.1:8801/v1/*` and forwards traffic to the router’s gRPC
   ExtProc server.

   ```bash
   make run-envoy
   ```

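   Until the router from the next step is up, Envoy has nowhere to send
   traffic, so requests will fail; any HTTP status code at all simply proves
   the listener on 8801 is bound, while a connection-refused error means Envoy
   is not running yet:

   ```bash
   # Expect some HTTP status (likely an error for now); "connection refused" means Envoy is down.
   curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8801/v1/models
   ```
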
3. **Start the router with the quickstart config**

   In another terminal, run the quickstart bootstrap. Point the health probe at
   the router’s local HTTP API (port 8080) so the script does not wait on the
   Envoy endpoint.

   ```bash
   QUICKSTART_HEALTH_URL=http://127.0.0.1:8080/health \
     ./examples/quickstart/quickstart.sh --skip-download --skip-build
   ```

   Keep this process alive; Ctrl+C will stop the router.

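   At this point the whole chain is live and you can exercise it by hand. The
   bearer token and the `openai/gpt-oss-20b` model name below are assumptions
   borrowed from the eval step and the mock backend; adjust them if your
   config routes a different alias.

   ```bash
   # Confirm the router's local HTTP API reports healthy.
   curl -s http://127.0.0.1:8080/health

   # Send one chat completion through Envoy (port 8801); the router classifies
   # the prompt and forwards it to the mock backend.
   curl -s http://127.0.0.1:8801/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer sk-test" \
     -d '{
       "model": "openai/gpt-oss-20b",
       "messages": [{"role": "user", "content": "Which planet is known as the Red Planet?"}]
     }'
   ```
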
4. **Run the quick evaluation**

   With Envoy, the router, and the mock backend running, execute the benchmark
   to send a small batch of MMLU questions through the routing pipeline.

   ```bash
   OPENAI_API_KEY="sk-test" \
     ./examples/quickstart/quick-eval.sh \
       --mode router \
       --samples 5 \
       --vllm-endpoint ""
   ```

   - `--mode router` restricts the run to router-transparent requests.
   - `--vllm-endpoint ""` disables direct vLLM comparisons.

5. **Inspect the results**

   The evaluator writes all artifacts under
   `examples/quickstart/results/<timestamp>/`:

   - `raw/` – individual JSON summaries per dataset/model combination.
   - `quickstart-summary.csv` – tabular metrics (accuracy, tokens, latency).
   - `quickstart-report.md` – Markdown report suitable for sharing.

   You can re-run the evaluator with different flags (e.g., `--samples 10`,
   `--dataset arc`) and the outputs will land in fresh timestamped folders.

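   A small helper makes the newest run easy to eyeball from the shell; it
   assumes at least one completed run and the `column` utility, which ships
   with most Linux and macOS installs.

   ```bash
   # Locate the most recent timestamped results folder and render its CSV summary.
   latest="examples/quickstart/results/$(ls -t examples/quickstart/results | head -n 1)"
   ls "$latest"
   column -s, -t "$latest/quickstart-summary.csv"
   ```
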
## Switching to a Real vLLM Backend

If you prefer to exercise a real language model:

1. Replace step 1 with a real vLLM launch (or any OpenAI-compatible server);
   one possible launch command is sketched after this list.
2. Update `examples/quickstart/config-quickstart.yaml` so the `vllm_endpoints`
   block points to that service (IP, port, and model name).
3. Re-run steps 2–4. No other changes to the quickstart scripts are needed.

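As one illustration of step 1, the command below uses vLLM's built-in
OpenAI-compatible server; the model name is only a placeholder, and it assumes
vLLM is installed and the model fits on your hardware.

```bash
# Serve a real model with vLLM's OpenAI-compatible API on port 8000.
# Substitute any model you actually have access to.
vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```
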
Keep the mock server around for quick demos; swap in a full vLLM backend when
you want latency/quality signals from the actual model.