GitHub - vllm-project/semantic-router: Intelligent Mixture-of-Models Router for Efficient LLM Inference

📚 Complete Documentation | 🚀 Quick Start | 📣 Blog | 📖 Publications

Innovations ✨

Intelligent Routing 🧠

Auto-Reasoning and Auto-Selection of Models

An Mixture-of-Models (MoM) router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on Semantic Understanding of the request's intent (Complexity, Task, Tools).

This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE) which lives within a model, this system selects the best entire model for the nature of the task.

As such, the overall inference accuracy is improved by using a pool of models that are better suited for different types of tasks:

The screenshot below shows the LLM Router dashboard in Grafana.

The router is implemented in two ways:

Golang (with Rust FFI based on the candle rust ML framework)
Python Benchmarking will be conducted to determine the best implementation.

Auto-Selection of Tools

Select the tools to use based on the prompt, avoiding the use of tools that are not relevant to the prompt so as to reduce the number of prompt tokens and improve tool selection accuracy by the LLM.

Category-Specific System Prompts

Automatically inject specialized system prompts based on query classification, ensuring optimal model behavior for different domains (math, coding, business, etc.) without manual prompt engineering.

Enterprise Security 🔒

PII detection

Detect PII in the prompt, avoiding sending PII to the LLM so as to protect the privacy of the user.

Prompt guard

Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving.

Similarity Caching ⚡️

Cache the semantic representation of the prompt so as to reduce the number of prompt tokens and improve the overall inference latency.

Distributed Tracing 🔍

Comprehensive observability with OpenTelemetry distributed tracing provides fine-grained visibility into the request processing pipeline.

Open WebUI Integration 💬

To view the Chain-Of-Thought of the vLLM-SR's decision-making process, we have integrated with Open WebUI.

Quick Start 🚀

Get up and running in seconds with our interactive setup script:

bash ./scripts/quickstart.sh

This command will:

🔍 Check all prerequisites automatically
📦 Install HuggingFace CLI if needed
📥 Download all required AI models (~1.5GB)
🐳 Start all Docker services
⏳ Wait for services to become healthy
🌐 Show you all the endpoints and next steps

For detailed installation and configuration instructions, see the Complete Documentation.

What This Starts By Default

make docker-compose-up now launches the full stack including a lightweight local OpenAI-compatible model server powered by llm-katan (serving the small model Qwen/Qwen3-0.6B under the alias qwen3). The semantic router is configured to route classification & default generations to this local endpoint out-of-the-box. This gives you an entirely self-contained experience (no external API keys required) while still letting you add remote / larger models later.

Core Mode (Without Local Model)

If you only want the core semantic-router + Envoy + observability stack (and will point to external OpenAI-compatible endpoints yourself):

make docker-compose-up-core

Prerequisite Model Download (Speeds Up First Run)

The existing model bootstrap targets now also pre-download the small llm-katan model so the first docker-compose-up avoids an on-demand Hugging Face fetch.

Minimal set (fast):

make models-download-minimal

Full set:

make models-download

Both create a stamp file once Qwen/Qwen3-0.6B is present to keep subsequent runs idempotent.

Documentation 📖

For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:

👉 Complete Documentation at Read the Docs

The documentation includes:

Installation Guide - Complete setup instructions
System Architecture - Technical deep dive
Model Training - How classification models work
API Reference - Complete API documentation
Distributed Tracing - Observability and debugging guide

Community 👋

For questions, feedback, or to contribute, please join #semantic-router channel in vLLM Slack.

Community Meetings 📅

We host bi-weekly community meetings to sync up with contributors across different time zones:

First Tuesday of the month: 9:00-10:00 AM EST (accommodates US EST and Asia Pacific contributors)
- Zoom Link: https://nyu.zoom.us/j/95065349917
- Calendar Invite: https://calendar.app.google/EeP6xDgCpxte6d1eA
Third Tuesday of the month: 1:00-2:00 PM EST (accommodates US EST and California contributors)
- Zoom Link: https://nyu.zoom.us/j/98861585086
- Calendar Invite: https://calendar.app.google/oYsmt1Pu46o4gFuP8

Join us to discuss the latest developments, share ideas, and collaborate on the project!

Citation

If you find Semantic Router helpful in your research or projects, please consider citing it:

@misc{semanticrouter2025,
  title={vLLM Semantic Router},
  author={vLLM Semantic Router Team},
  year={2025},
  howpublished={\url{https://github.com/vllm-project/semantic-router}},
}

Star History 🔥

We opened the project at Aug 31, 2025. We love open source and collaboration ❤️

Name		Name	Last commit message	Last commit date
Latest commit History 510 Commits
.github		.github
bench		bench
candle-binding		candle-binding
config		config
dashboard		dashboard
deploy		deploy
e2e-tests		e2e-tests
examples		examples
scripts		scripts
src		src
tools		tools
website		website
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.prowlabels.yaml		.prowlabels.yaml
CODEOWNERS		CODEOWNERS
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
Dockerfile.extproc		Dockerfile.extproc
Dockerfile.extproc.cross		Dockerfile.extproc.cross
Dockerfile.precommit		Dockerfile.precommit
LICENSE		LICENSE
Makefile		Makefile
OWNER		OWNER
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Innovations ✨

Intelligent Routing 🧠

Auto-Reasoning and Auto-Selection of Models

Auto-Selection of Tools

Category-Specific System Prompts

Enterprise Security 🔒

PII detection

Prompt guard

Similarity Caching ⚡️

Distributed Tracing 🔍

Open WebUI Integration 💬

Quick Start 🚀

What This Starts By Default

Core Mode (Without Local Model)

Prerequisite Model Download (Speeds Up First Run)

Documentation 📖

Community 👋

Community Meetings 📅

Citation

Star History 🔥

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 33

Uh oh!

Languages

License

vllm-project/semantic-router

Folders and files

Latest commit

History

Repository files navigation

Innovations ✨

Intelligent Routing 🧠

Auto-Reasoning and Auto-Selection of Models

Auto-Selection of Tools

Category-Specific System Prompts

Enterprise Security 🔒

PII detection

Prompt guard

Similarity Caching ⚡️

Distributed Tracing 🔍

Open WebUI Integration 💬

Quick Start 🚀

What This Starts By Default

Core Mode (Without Local Model)

Prerequisite Model Download (Speeds Up First Run)

Documentation 📖

Community 👋

Community Meetings 📅

Citation

Star History 🔥

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 33

Uh oh!

Languages

Packages