RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs
Fernandez, Nigel, Kveton, Branislav, Rossi, Ryan A., Lan, Andrew S., Wang, Zichao
arXiv.org Artificial Intelligence
Reasoning language models (RLMs) have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance and cost tradeoff at two key levels: model size and reasoning budget, where larger models and higher reasoning budgets lead to better performance but at increased cost and latency. Recent advances in large language models (LLMs) have leveraged reinforcement learning (RL) (Shao et al., 2024) to train models to reason using chain-of-thought before generating an output. The excitement has led to a flurry of new open-source and proprietary RLMs; for example, Hugging Face already lists 2,710 RLMs as of September 17th, 2025. These models vary in size, specialize in different domains, and offer various configurations, including reasoning-effort settings that balance performance and cost. For example, OpenAI's reasoning models (OpenAI et al., 2024) offer "low", "medium", and "high" reasoning budgets for developers to choose from depending on their application. Choosing the "best" and most expensive RLM configuration with the highest reasoning budget is not the "right" choice for every query: for some simpler queries, there may exist a "worse" and cheaper RLM configuration with low or no reasoning budget that answers the query correctly, resulting in significant cost savings without sacrificing performance. Indeed, we empirically observe this phenomenon in Figure 1, where we show that over 50% of the queries from MATH-500 (Hendrycks et al., 2021c) can be solved using an RLM as small as Qwen3-0.6B with minimal reasoning budget (measured by the number of reasoning tokens). Conversely, some challenging queries require a much more capable RLM with high reasoning effort.
Strong RLMs can also "over-think", which can hurt performance even on simple queries (Su et al., 2025; Hassid et al., 2025; Hong et al., 2025; Shojaee et al., 2025; Ghosal et al., 2025). This performance-cost tradeoff presents a challenge for practitioners: how to choose the "right" RLM and its configuration [...]
Work done during an internship at Adobe.
Figure 1: Left: Our pilot study on MATH-500 (Hendrycks et al., 2021c) shows a performance differential over (RLM, reasoning budget) configurations, with the smallest RLM already solving over 50% of the queries with minimal reasoning.
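The tradeoff described above can be illustrated with a minimal sketch. This is not the routing method of the paper; it is a simple threshold rule, assuming that some per-query estimate of each configuration's success probability is already available. The `Config` class, the `route` function, the `large-rlm` model name, and all costs and probabilities are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One (RLM, reasoning budget) configuration. Costs are illustrative."""
    model: str
    budget_tokens: int
    cost: float


def route(configs, p_correct, threshold=0.8):
    """Pick the cheapest configuration predicted to solve the query.

    p_correct maps each Config to an assumed success probability for one
    query. If no configuration reaches the threshold, fall back to the
    configuration with the highest predicted success.
    """
    good_enough = [c for c in configs if p_correct[c] >= threshold]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    return max(configs, key=lambda c: p_correct[c])


# Hypothetical configurations; only Qwen3-0.6B is named in the text.
small_low = Config("Qwen3-0.6B", budget_tokens=256, cost=1.0)
small_high = Config("Qwen3-0.6B", budget_tokens=4096, cost=8.0)
large_high = Config("large-rlm", budget_tokens=4096, cost=100.0)
configs = [small_low, small_high, large_high]

# Made-up success estimates for an easy and a hard query.
easy_query = {small_low: 0.90, small_high: 0.93, large_high: 0.97}
hard_query = {small_low: 0.10, small_high: 0.30, large_high: 0.85}

easy_choice = route(configs, easy_query)
hard_choice = route(configs, hard_query)
print(easy_choice.model, easy_choice.budget_tokens)
print(hard_choice.model, hard_choice.budget_tokens)
```

Under these made-up numbers, the easy query is sent to the small model with the low budget and the hard query to the large model. The threshold controls how much predicted accuracy is traded for cost.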
Oct-2-2025
- Country:
  - North America > United States (0.28)
- Genre:
  - Research Report (0.64)
- Industry:
  - Education > Assessment & Standards (0.68)
- Technology: