Open-source models from the Llama, DeepSeek, and Qwen families land on Hugging Face almost daily, while proprietary foundation models such as GPT, Claude, and Gemini keep raising the bar in coding, math, and creative writing. No single model wins across the board: Claude may top coding benchmarks, yet you might still favor GPT for its cleaner, more structured explanations of coding concepts. LLM routing has therefore become an essential technique for operationalizing multiple models in an agentic application.
Most existing LLM routing systems optimize for performance on academic benchmarks like MMLU or GPQA, which don't reflect the messy, subjective, and task-specific judgments users and developers make in real-world applications. In production, it's less about benchmark scores and more about domain-specific accuracy, speed, and preference fit. That's why we built Arch-Router, a lightweight (1.5B-parameter) routing model that lets you capture your preferences in model routing decisions.
You define intuitive categories like "travel booking" or "image editing," and Arch-Router routes each query to the model you've found to work best, based on your own experience and evaluation. Unlike rigid benchmark-tuned approaches, Arch-Router is transparent, adaptable to new models, and fast, clocking in at just 50ms per routing decision, while outperforming even proprietary LLMs like Claude 3.7 Sonnet and GPT-4o in our evaluations on real conversational data.
As developers, only you truly know which LLM works best for your use case, learned through countless rounds of trial and error. Benchmarks won't reflect your real-world experience, specialized tasks, or unique expectations. Preference-aligned routing offers a new approach to LLM routing, focusing on practical, subjective preferences such as domain expertise (finance, coding, medical) or specific actions (summarization, image generation). The framework closes that gap by letting you encode your own notion of "best." You supply a routing policy that does two things:

1. Describes each route in plain language, as a domain or action you care about.
2. Maps each route to the model you trust most for that route.
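To make that concrete, here is a minimal sketch of what such a policy could look like. The field names and schema below are illustrative assumptions, not the exact Arch-Router configuration format; see our repository for the real schema.

```python
# Illustrative routing policy (field names and schema are assumptions,
# not the exact Arch-Router configuration format). Each route pairs a
# plain-language description with the model you prefer for it.
routing_policy = [
    {
        "name": "travel_booking",
        "description": "searching for or booking flights, hotels, and itineraries",
        "model": "gpt-4o",
    },
    {
        "name": "code_generation",
        "description": "writing, refactoring, or explaining source code",
        "model": "claude-3-7-sonnet",
    },
    {
        "name": "image_editing",
        "description": "editing, cropping, or retouching images",
        "model": "gemini-2.0-flash",
    },
]
```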
Arch-Router is a 1.5-billion-parameter LLM built around this preference-aligned framework. Instead of hard-coding rules or relying on a black-box router, you hand Arch-Router your routing policy and it does the rest. Despite its compact size, the model outperforms larger proprietary LLMs from the GPT-4o, Claude, and Gemini families. It is also blazing fast, delivering end-to-end routing decisions in 50ms at the median and under 75ms at p99, while competing LLMs typically spend roughly one second just to pick a route (as shown in Figure 1). The result: state-of-the-art accuracy at a fraction of the latency and deployment cost.
Arch-Router introduces two key concepts:

Domain: the high-level subject area of a request, such as finance, coding, or medical.
Action: the specific operation the user wants performed, such as summarization or image generation.
Both domain and action policies are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies your routing preferences to select the model best suited to handle the request, as shown in Figure 2.
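To illustrate the flow end to end, here is a hedged sketch of how an application could call Arch-Router and dispatch to the selected model. The endpoint, prompt format, and response handling below are assumptions for illustration only; the actual serving interface is documented in our repository.

```python
# Hypothetical end-to-end routing sketch. Assumes Arch-Router is served
# behind an OpenAI-compatible endpoint and answers with a route name;
# the real prompt and response contract may differ.
import json
from openai import OpenAI

router = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

ROUTING_POLICY = [
    {"name": "travel_booking",
     "description": "searching for or booking flights and hotels",
     "model": "gpt-4o"},
    {"name": "code_generation",
     "description": "writing or refactoring source code",
     "model": "claude-3-7-sonnet"},
]

def pick_model(query: str, policy: list[dict], default: str = "gpt-4o") -> str:
    """Ask Arch-Router which route matches the query, then map route -> model."""
    prompt = (
        "Select the single best route for the user query.\n"
        f"Routes: {json.dumps(policy)}\n"
        f"Query: {query}\n"
        "Respond with the route name only."
    )
    resp = router.chat.completions.create(
        model="katanemo/Arch-Router-1.5B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    route_name = resp.choices[0].message.content.strip()
    # Fall back to a default model if the returned route is unknown.
    return {r["name"]: r["model"] for r in policy}.get(route_name, default)

# A travel query lands on whichever model you mapped to "travel_booking".
print(pick_model("Find me a flight to Tokyo next week", ROUTING_POLICY))
```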
To see this in action, we've built a Chrome extension that allows ChatGPT Plus users to automatically route their queries to the appropriate model based on their usage preferences. Check it out!
Arch-Router is fast and accurate, choosing a model almost instantaneously (50ms) while scoring higher than the best proprietary LLMs on routing performance. It aligns with your preferences: different individuals or teams can craft their own routing policies so each query lands on the model they trust most. And it stays flexible and adaptable: see a new model you want to try, or add a task to your product? Simply update the routing policy file and use it, with no costly retraining and no pipeline rebuilding. Here are some stats:
Speed: 50ms median routing time (75ms at p99)
Accuracy: 93.06% routing accuracy on the provided benchmark
Cost: $0.00132 per routing query
Comparison*: Proprietary routers average 1000ms+ routing time, with up to $5 per routing query (GPT-4o)
This blog post only scratches the surface of what Arch-Router can do and how to use it; the full story lives in our open-source stack:
Visit our repository for implementation guides, to contribute improvements, or to report issues. We welcome community contributions that advance better LLM applications.
*Compares operational cost (price per 1M tokens), latency (average ± standard deviation in milliseconds, benchmarked via OpenRouter), and overall routing performance. The cost for Arch-Router is estimated from hosting the model on an AWS L40S instance; calculation details are given in the appendix of our routing paper.