RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Fernandez, Nigel, Kveton, Branislav, Rossi, Ryan A., Lan, Andrew S., Wang, Zichao

Oct-2-2025–arXiv.org Artificial Intelligence

Reasoning language models (RLMs) have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance and cost tradeoff at two key levels: model size and reasoning budget, where larger models and higher reasoning budgets lead to better performance but with increased cost and latency.

Recent advances in large language models (LLMs) have leveraged reinforcement learning (RL) (Shao et al., 2024) to train models to reason using chain-of-thought before generating an output. The excitement has led to a flurry of new open-source and proprietary RLMs; for example, Hugging Face already lists 2,710 RLMs as of September 17th, 2025. These models have varying sizes, specialize in different domains, and offer various configurations, including reasoning efforts to balance performance and cost. For example, OpenAI's reasoning models (OpenAI et al., 2024) have "low", "medium", and "high" reasoning budgets for developers to choose from depending on their application. Choosing the "best" and most expensive RLM configuration with the highest reasoning budget is not the "right" choice for every query: for some simpler queries, there might exist a "worse" and cheaper RLM configuration with low or no reasoning budget that correctly answers the query, resulting in significant cost savings without sacrificing performance. Indeed, we empirically observe this phenomenon in Figure 1, where we show that over 50% of the queries from MATH-500 (Hendrycks et al., 2021c) can be solved using an RLM as small as Qwen3-0.6B with minimal reasoning budget (measured by the number of reasoning tokens). Conversely, some challenging queries require a much more capable RLM with high reasoning effort.
Strong RLMs can also "over-think", which could hurt performance even for simple queries (Su et al., 2025; Hassid et al., 2025; Hong et al., 2025; Shojaee et al., 2025; Ghosal et al., 2025). This performance-cost tradeoff presents a challenge for practitioners: how to choose the "right" RLM and its configuration ...

Figure 1: Left: Our pilot study on MATH-500 (Hendrycks et al., 2021c) shows a performance differential over (RLM, reasoning budget) configurations, with the smallest RLM already solving over 50% of the queries with minimal reasoning.

Work done during an internship at Adobe.
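
The tradeoff described above can be made concrete with a small hindsight calculation. The sketch below is not the RADAR method, which this excerpt does not describe; it only illustrates what "the cheapest configuration that still answers correctly" means. The model names, costs, and correctness outcomes are all made-up toy values, and the selection uses outcomes that a real router would have to predict before running any model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Config:
    model: str             # placeholder model name (hypothetical)
    reasoning_tokens: int  # reasoning budget, in tokens
    cost: float            # made-up relative cost of answering one query


# Hypothetical (RLM, reasoning budget) configurations, cheapest first.
CONFIGS: List[Config] = [
    Config("small-rlm", 0, 1.0),
    Config("small-rlm", 1024, 3.0),
    Config("large-rlm", 0, 8.0),
    Config("large-rlm", 1024, 20.0),
]

# Toy correctness outcomes, one flag per entry of CONFIGS (not real data).
OUTCOMES: Dict[str, List[bool]] = {
    "easy": [True, True, True, True],
    "medium": [False, True, True, True],
    "hard": [False, False, False, True],
    # Over-thinking case: extra reasoning turns a right answer into a wrong one.
    "overthought": [True, False, True, False],
}


def cheapest_correct(outcome: List[bool]) -> Optional[Config]:
    """Return the lowest-cost configuration that answered correctly, if any."""
    solved = [cfg for cfg, ok in zip(CONFIGS, outcome) if ok]
    return min(solved, key=lambda cfg: cfg.cost, default=None)


def oracle_cost(outcomes: Dict[str, List[bool]]) -> float:
    """Total cost when each query goes to its cheapest correct configuration.

    Queries that no configuration solves are charged the most expensive one.
    """
    fallback = max(CONFIGS, key=lambda cfg: cfg.cost)
    total = 0.0
    for outcome in outcomes.values():
        choice = cheapest_correct(outcome)
        total += (choice if choice is not None else fallback).cost
    return total


def always_largest_cost(outcomes: Dict[str, List[bool]]) -> float:
    """Total cost when every query goes to the most expensive configuration."""
    return max(cfg.cost for cfg in CONFIGS) * len(outcomes)


if __name__ == "__main__":
    for query, outcome in OUTCOMES.items():
        print(query, "->", cheapest_correct(outcome))
    print("hindsight routing cost:", oracle_cost(OUTCOMES))
    print("always-largest cost:", always_largest_cost(OUTCOMES))
```

In this toy setup the most expensive configuration also gets the "overthought" query wrong, so per-query selection can help on accuracy as well as cost; how closely a learned router can approach this hindsight bound is the practical question.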

large language model, machine learning, natural language


Country:
- North America > United States (0.28)

Genre:
- Research Report (0.64)

Industry:
- Education > Assessment & Standards (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.46)
