Arch-Router: Outperforming Foundational Models in LLM Routing with a 1.5B Model
Co Tran
Machine Learning Engineer
July 9, 2025

Open-source models from the Llama, DeepSeek, and Qwen families land on Hugging Face almost daily, while proprietary foundational models such as GPT, Claude, and Gemini keep raising the bar in coding, math, and creative writing. No single model wins across the board: Claude may top coding benchmarks, yet you might still prefer GPT for its cleaner, more structured explanations of coding concepts. LLM routing has therefore become an essential technique for operationalizing multiple models in an agentic application.

Most existing LLM routing systems optimize for academic benchmarks, like MMLU or GPQA, that don't reflect the messy, subjective, task-specific judgments users and developers make in real-world applications. In practice, what matters is less a benchmark score and more things like domain-specific accuracy, speed, and preference fit. That's why we built Arch-Router, a lightweight (1.5B-parameter) routing model that lets you capture your preferences in model routing decisions.

You define intuitive categories like "travel booking" or "image editing," and Arch-Router routes each query to the model you've found to work best, based on your own experience and evaluation. Unlike rigid benchmark-tuned approaches, Arch-Router is transparent, adaptable to new models, and fast, clocking in at just 50ms per routing decision, while outperforming even proprietary LLMs like Claude 3.7 Sonnet and GPT-4o in our evaluations on real conversational data.

What is Arch-Router?

As a developer, only you truly know which LLM works best for your use case, learned through countless rounds of trial and error. Benchmarks won't reflect your real-world experience, specialized tasks, or unique expectations. Preference-aligned routing offers a new approach to LLM routing, focusing on practical, subjective preferences, such as domain expertise (finance, coding, medical) or specific actions (summarization, image generation). The framework closes that gap by letting you encode your own notion of "best." You supply a routing policy that does two things (see the sketch after this list):

  1. Breaks the query space into policies at the domain level (e.g., finance, medical) and, when needed, the finer-grained action level (e.g., “summarize,” “generate SQL”).
  2. Maps each policy to the exact model you trust for that slice of work.
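
To make this concrete, here is a minimal sketch of what such a routing policy could look like, written as a Python structure. The schema (route names, description text, and the route-to-model mapping) is illustrative rather than Arch-Router's exact configuration format; see the API documentation linked below for the real one.

  # A hypothetical preference-aligned routing policy. Each route pairs a
  # natural-language description (what the router matches a query against)
  # with the model you trust for that slice of work. All names are examples.
  routing_policy = [
      {
          "name": "finance_summarize",
          "description": "Summarize financial reports, filings, or market news.",
          "model": "gpt-4o",
      },
      {
          "name": "code_generate_sql",
          "description": "Generate or debug SQL queries from natural language.",
          "model": "claude-3-7-sonnet",
      },
      {
          "name": "travel_booking",
          "description": "Plan trips, compare flights, and book accommodations.",
          "model": "qwen2.5-72b-instruct",
      },
  ]

Note how the first route pins down both a domain (finance) and an action (summarize), while the last is a coarser domain-level route; both granularities can coexist in one policy.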

Arch-Router is a 1.5-billion-parameter model built around this preference-aligned framework. Instead of hard-coding rules or relying on a black-box router, you hand Arch-Router your routing policy and it does the rest. Despite its compact size, the model outperforms larger proprietary LLMs from the GPT-4o, Claude, and Gemini families. It is also blazing fast, delivering end-to-end routing decisions in 50ms (p50) and under 75ms (p99), while competing LLMs typically spend roughly one second just picking a route (as shown in Figure 1). The result: state-of-the-art accuracy at a fraction of the latency and deployment cost.

How does it work?

Figure 2: Preference-based routing workflow

Arch-Router introduces two key concepts:

  • Domain – the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
  • Action – the specific type of operation the user wants performed (e.g., summarization, code generation, appointment booking, translation).

Both domain and action policies are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies your routing preferences to select the model best suited to handle the request, as shown in Figure 2.
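
As a sketch of that flow, here is how you could call the router with Hugging Face transformers, reusing the routing_policy structure from the earlier sketch. The prompt shape below (serializing the policy and asking for the best route name) is an assumption for illustration; the model card documents the exact input format.

  import json
  from transformers import AutoModelForCausalLM, AutoTokenizer

  MODEL_ID = "katanemo/Arch-Router-1.5B"  # assumed repo id, from the collection

  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
  model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

  def route(query: str, policy: list) -> str:
      # Assumed prompt shape: show the routes, then ask the router to
      # reply with the name of the best-matching route for the query.
      prompt = (
          "Given these routes:\n"
          f"{json.dumps(policy, indent=2)}\n"
          f"Name the best route for this query: {query}"
      )
      inputs = tokenizer.apply_chat_template(
          [{"role": "user", "content": prompt}],
          add_generation_prompt=True,
          return_tensors="pt",
      ).to(model.device)
      output = model.generate(inputs, max_new_tokens=16)
      return tokenizer.decode(
          output[0][inputs.shape[-1]:], skip_special_tokens=True
      ).strip()

  # The returned route name resolves to your preferred model via the policy:
  route_name = route("Summarize NVDA's latest earnings call", routing_policy)
  target_model = next(r["model"] for r in routing_policy if r["name"] == route_name)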

To see this in action, we've built a Chrome extension that allows ChatGPT Plus users to automatically route their queries to the appropriate model based on their usage preferences. Check it out!


Performance

Arch-Router is fast and accurate, choosing a model almost instantaneously (50ms) while scoring higher than the best proprietary LLMs on routing performance. It aligns with your preferences: different individuals or teams can craft their own routing policies so each query lands on the model they trust most. And it stays flexible and adaptable: want to try a new model, or add a task to your product? Simply update the routing policy file and use it; no costly retraining, no pipeline rebuilding. Here are some stats:

  • Speed: 50ms median routing time (75ms at p99)
  • Accuracy: 93.06% routing accuracy on our benchmark
  • Cost: $0.00132 per routing query
  • Comparison*: proprietary routers average 1,000ms+ routing time, at up to $5 per routing query (GPT-4o)
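
That flexibility is easy to picture with the hypothetical policy sketch from earlier: adopting a newly released model is an edit to the policy, not a retraining run.

  # Append a route for a new task and point it at a new model; the router
  # picks it up from the updated policy with no retraining. Names are examples.
  routing_policy.append({
      "name": "image_editing",
      "description": "Edit, retouch, or transform images from user instructions.",
      "model": "gemini-2.0-flash",
  })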

Ready to dive deeper?


This blog post only scratches the surface of what Arch-Router can do and how to use it; the full story lives in our open-source stack:

  • Research paper - Detailed methodology, benchmarks, and ablation studies
  • Arch-Router collection - Arch-Router-1.5B on Hugging Face, with GGUF variants
  • API documentation - Complete API guides and integration examples
  • ArchGW - The proxy server (and universal data plane) for AI-native apps that supports preference-aligned routing and much more

Visit our repository for implementation guides, to contribute improvements, or to report issues. We welcome community contributions that advance better LLM applications.

*Compares operational cost (price per 1M tokens), latency (average ± standard deviation in milliseconds, benchmarked via OpenRouter), and overall routing performance. The cost for Arch-Router is estimated from hosting the model on an AWS L40S instance; the calculation details are shown in the Appendix of our routing paper.