Open-Source LLM Router

Route every request, with one system brain, to the best-fit model.

Unified routing across local, private, and frontier models, driven jointly by cost, latency, privacy, and safety.


Signals · 16

16 signal classes, spanning heuristic and learned detection, from knowledge-base routing to history-aware reask.

Selectors · 12

12 routing strategies, covering rules, latency heuristics, reinforcement learning, and ML-based selection.

Papers · 17

17 research papers covering routing, systems, security, and multimodality.

Quick Start

Only one officially supported local bootstrap path: copy the install command, run it, and you are in the console.

One command, and it runs locally.

The first-run path converges on a single install script that sets up the CLI and the local service workflow on macOS and Linux.

One-command install (macOS / Linux)
curl -fsSL https://vllm-semantic-router.com/zh-Hans/install.sh | bash

Installs to ~/.local/share/vllm-sr by default and writes the CLI to ~/.local/bin/vllm-sr; on Windows, use the manual pip installation described in the docs.
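If your shell cannot find the CLI after the installer finishes, the usual cause is that ~/.local/bin is not on PATH. A minimal shell-profile fragment, assuming a standard user-level install (adjust the profile file for your shell):

```shell
# Add the installer's bin directory to PATH so `vllm-sr` resolves.
# Append to ~/.bashrc or ~/.zshrc, then open a new shell (assumption:
# default install location; adjust if you changed it).
export PATH="$HOME/.local/bin:$PATH"
```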

Research

These papers form the router's underlying thinking.

From security and multimodality to orchestration and systems design, these research threads keep shaping how vLLM Semantic Router evolves.

2026 / Paper · Position Paper

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

vLLM Semantic Router Team

arXiv Technical Report

We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.

2026 / Paper · Vision Paper

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

arXiv Technical Report

We synthesize the project’s recent routing, fleet, multimodal, and governance results into the Workload-Router-Pool (WRP) architecture, connecting signal-driven routing to a full-stack inference optimization framework and outlining future research directions across workload, router, and pool design.

2026 / Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.

2026 / Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.

2026 / Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.

2026 / Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.

2026 / Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.

2026 / Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.

2026 / Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

arXiv Technical Report

We derive the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than a pure GPU generation upgrade.

2026 / Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

arXiv Technical Report

We show how probabilistic ML predicates in policy languages can silently co-fire on the same query, and implement conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.

2026 / Paper

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu

arXiv Technical Report

We extend the Semantic Router DSL from stateless, per-request routing to multi-step agent workflows, emitting verified decision nodes for orchestration frameworks, Kubernetes artifacts, YANG/NETCONF payloads, and protocol-boundary gates from a single declarative source file.

2026 / Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

arXiv Technical Report

We show that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model’s performance on persistent user-specific queries while cutting effective inference cost by 96%.

2026 / Paper · RAG Verification

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

SIGIR 2026 Industry Track

We present a real-time verification component for long-document RAG that processes contexts up to 32K tokens, balancing latency and grounding coverage so interactive systems can detect unsupported answers without falling back to truncated checks.

2025 / Paper

When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

NeurIPS - MLForSys

We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.

2025 / Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

We present a category-aware semantic cache where similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture separating in-memory HNSW search from external document storage.

2025 / Paper

Semantic Inference Routing Protocol (SIRP)

Huamin Chen, Luay Jalil

Internet Engineering Task Force (IETF)

This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.

2025 / Paper

Multi-Provider Extensions for Agentic AI Inference APIs

H. Chen, L. Jalil, N. Cocker

Internet Engineering Task Force (IETF) - Network Management Research Group

This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.

Why Routing

One request, many model choices.

Models differ in quality, cost, latency, privacy, and modality. Once you run more than one model, the hard part is no longer calling an LLM; it is routing each request to the right model system.

Capability · Cost · Privacy · Latency
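A minimal Python sketch of what constraint-aware model selection looks like in principle. The model catalog, scores, and weights below are illustrative stand-ins, not the router's actual policy or configuration format:

```python
# Toy model catalog: each entry carries the dimensions routing trades off.
# All names and numbers here are hypothetical.
MODELS = [
    {"name": "local-8b",    "cost": 0.1, "latency_ms": 120,  "private": True,  "quality": 0.70},
    {"name": "private-70b", "cost": 1.0, "latency_ms": 600,  "private": True,  "quality": 0.85},
    {"name": "frontier",    "cost": 5.0, "latency_ms": 1500, "private": False, "quality": 0.97},
]

def route(min_quality: float, require_private: bool) -> str:
    """Pick the cheapest model that satisfies the request's constraints."""
    candidates = [
        m for m in MODELS
        if m["quality"] >= min_quality and (m["private"] or not require_private)
    ]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m["cost"])["name"]

print(route(min_quality=0.6, require_private=True))   # a cheap local model suffices
print(route(min_quality=0.9, require_private=False))  # only the frontier model qualifies
```

The point of the sketch is the shape of the decision, not the numbers: constraints filter the candidate set first, and cost breaks the tie.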
Why Teams Deploy It

Handle cost, quality, and policy decisions in one routing layer.

01

Lower cost per request

Send routine traffic to more efficient models and reserve frontier reasoning models for the requests that truly need them, turning model selection into measurable ROI.

More useful output per dollar.

02

Safer model decisions

Put jailbreak, PII, and hallucination handling on the routing path so high-risk requests are stopped before generation.

Safety capabilities go directly into the request path.

03

One routing layer for all models

Coordinate local, private, and frontier models through the same routing layer, from edge deployments all the way to managed cloud.

One system across device, VPC, and cloud.

Routing Blueprint

How the System Works

Walk through signal extraction, projection coordination, decision logic, and model-routing behavior in an interactive demo.

Shannon Mapping

A structural mapping from communication theory onto the routing pipeline.

The user request is the raw source message before encoding.

Encoder · Model-Driven

Understand first, then generate.

Purpose-trained encoders first read intent, rank relevance, and identify modality, then hand the results to the generation model.

Signals · Ingress Plane

Sequence classification, token labeling, embedding retrieval, and reranking converge into a single layer of system intelligence.

SEQ_CLS · Sequence classification handles domain identification, jailbreak detection, fact-checking, and feedback routing.
TOKEN · Token labeling locates PII and high-risk spans for targeted local intervention.
EMBED · The embedding and reranking path powers semantic caching, similarity retrieval, and candidate scoring.
MOD

Multimodal

Detect text, image, and audio inputs and route them to the appropriate modality model.
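A minimal Python sketch of how such per-request signals might be composed into one routing decision. The signal names, thresholds, and pool names are hypothetical, not the router's actual policy:

```python
# Illustrative only: compose heterogeneous signals into a single decision.
# Thresholds and pool names are made up for this sketch.
def decide(signals: dict) -> str:
    # Safety signals gate the request before any model is selected.
    if signals.get("jailbreak", 0.0) > 0.5:
        return "block"
    # Token-level PII keeps the request on private infrastructure.
    if signals.get("pii", False):
        return "private-pool"
    # A high-confidence semantic-cache hit short-circuits generation.
    if signals.get("cache_similarity", 0.0) > 0.9:
        return "cache"
    # Otherwise pick a pool by complexity classification.
    return "reasoning-pool" if signals.get("complexity", 0.0) > 0.7 else "default-pool"

print(decide({"jailbreak": 0.8}))          # safety gate fires first
print(decide({"pii": True}))               # stays on private infrastructure
print(decide({"cache_similarity": 0.95}))  # answered from semantic cache
print(decide({"complexity": 0.9}))         # needs a reasoning-capable pool
```

Note the ordering: safety and privacy signals act as gates before any cost/quality trade-off is even considered, which mirrors the ingress-plane layering described above.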

Encoder pipeline (from the interactive demo): input, e.g. "Is machine learning related to AI?" → tokenizer ([CLS] Is machine learning related to AI ? [SEP]) → embeddings (token + segment + position, summed into h₀) → ×N encoder blocks (multi-head attention, add & norm, feed-forward, add & norm) → signal heads:

CLS · Sentence-level (SEQ_CLS): [CLS] → linear head → e.g. "computer science"; drives the Domain, Jailbreak, Fact-check, Feedback, and Modality signals.
BIO · Token-level (TOKEN_CLS): each token → BIO label, e.g. O O B-LOC I-LOC O; used for PII detection.
EMB · Bi-encoder (EMBEDDING): mean-pooling(h₁..hₙ) → dense vector; backs Semantic Cache, Similarity, Complexity-CL, and Jailbreak-CL.
RER · Cross-encoder (CROSS_LEARNING): [CLS] query [SEP] candidate [SEP] → score; backs Rerank and Multi-Modal.
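The EMBEDDING head's mean-pooling step is simple enough to sketch directly. The hidden states below are random stand-ins for real encoder outputs; the cosine comparison mirrors what a semantic-cache lookup would do with the pooled vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """hidden: (tokens, dim); mask: (tokens,), 1 for real tokens, 0 for padding.
    Averages only the unmasked token states into one sentence vector."""
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / m.sum()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h = rng.normal(size=(8, 16))               # 8 tokens, 16-dim hidden states (stand-in)
mask = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # last two positions are padding
q = mean_pool(h, mask)

print(cosine(q, q))  # identical vectors score ≈ 1.0
```

Masking padding before pooling matters: averaging padded positions in would drag the sentence vector toward zero and distort cache similarity scores.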
BIE

Bi-Encoder Embedding

Encode queries and candidates independently into dense vectors for similarity search and semantic caching.

XCE

Cross-Encoder Learning

Score query-candidate pairs with joint cross-attention for high-precision reranking.

CLS

Classification

Domain, jailbreak, PII, and fact-check classifiers built on an in-house BERT, covering multiple signals.

ATT

Full Attention

Bidirectional attention across tokens and sentences: full bidirectional context, no causal mask.

2DM

2DMSE

Adapt embedding layer depth and dimensionality at inference time, trading compute for accuracy on demand.

MRL

MRL

Truncate embedding vectors to any dimension without retraining, balancing accuracy and speed per request.
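Matryoshka-style (MRL) truncation can be illustrated with plain NumPy: keep the first k dimensions and re-normalize. The vectors below are random stand-ins; in practice this only preserves ranking quality when the model was trained with an MRL objective:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=768)                 # stand-in query embedding
near = q + 0.1 * rng.normal(size=768)    # a semantically "close" candidate
far = rng.normal(size=768)               # an unrelated candidate

def truncate(v: np.ndarray, k: int) -> np.ndarray:
    """MRL-style truncation: keep the first k dims, then re-normalize."""
    out = v[:k]
    return out / np.linalg.norm(out)

for k in (768, 64):
    s_near = float(truncate(q, k) @ truncate(near, k))
    s_far = float(truncate(q, k) @ truncate(far, k))
    print(f"k={k}: near={s_near:.3f} far={s_far:.3f}")
```

Even at 64 of 768 dimensions the close candidate still outscores the unrelated one here, which is the property that lets a router trade accuracy for speed per request.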

Contributors

Who is making this system real

From research to infrastructure, this project is driven by a group of people who keep shipping.

Huamin Chen · Maintainer · Distinguished Engineer @Red Hat
Xunzhuo Liu · Maintainer · Intelligent Routing @vLLM
Chen Wang · Maintainer · Senior Staff Research Scientist @IBM
Yue Zhu · Maintainer · Staff Research Scientist @IBM
Senan Zedan · Committer · R&D Manager @Red Hat
samzong · Committer · AI Infrastructure / Cloud-Native PM @DaoCloud
Liav Weiss · Committer · Software Engineer @Red Hat
Asaad Balum · Committer · Senior Software Engineer @Red Hat
Yehudit · Committer · Software Engineer @Red Hat
Noa Limoy · Committer · Software Engineer @Red Hat
Marina Koushnir · Committer · Open Source Contributor @Red Hat
JaredforReal · Committer · Software Engineer @Z.ai
Srinivas A · Committer · Software Engineer @Yokogawa
carlory · Committer · Open Source Engineer @DaoCloud
Yossi Ovadia · Committer · Senior Principal Engineer @Red Hat
Jintao Zhang · Committer · Senior Software Engineer @Kong
yuluo-yx · Committer · Individual Contributor
cryo-zd · Committer · Individual Contributor
OneZero-Y · Committer · Individual Contributor
aeft · Committer · Individual Contributor
Hao Wu · Committer · Individual Contributor
Qiping Pan · Committer · Individual Contributor


Maintainers, committers, and contributors together span research, infrastructure, and open-source collaboration.

View the team roster
Docs

Architecture, configuration, and operations: all written down.

From installation and configuration to training and operations, the whole path is gathered into one coherent documentation map.

Browse the docs
Community

Researchers and builders, in the same feedback loop.

Papers, working groups, and contributors collaborate openly around the same system and keep iterating.

Explore the community