When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

Boizard, Nicolas, Gisserot-Boukhlef, Hippolyte, El-Haddad, Kevin, Hudelot, Céline, Colombo, Pierre

arXiv.org Artificial Intelligence 

MICS, CentraleSupélec, Université Paris-Saclay

Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference cost, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks. Reasoning helps most on open-ended and math tasks; gains are limited or inconsistent on general multiple-choice tasks.

Large Language Models (LLMs) that generate explicit Chains of Thought (CoT) have rapidly become a defining paradigm. The research community is releasing increasingly capable reasoning models, which consistently outperform standard Instruction Fine-Tuned (IFT) counterparts at test time, especially on math, coding, and other reasoning-heavy tasks (DeepSeek-AI, 2025; OpenAI, 2024; Mistral-AI, 2025). Despite rapid progress, we still lack clarity on when explicit reasoning is most beneficial. Both prior evidence and our findings (Figure 1) point to a highly task-dependent picture: reasoning yields substantial gains on math and coding benchmarks where multi-step problem solving is essential (Zhu et al., 2024), but provides only limited improvements on simpler factual or classification tasks (Liu et al., 2024).
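To make the experimental setup concrete, the comparison above can be sketched as building two supervised fine-tuning datasets from the same teacher outputs: IFT targets keep only the final answer, while reasoning targets also keep the CoT trace. This is a toy illustration under our own assumptions; all function and field names here are hypothetical placeholders, not the paper's actual code.

```python
# Toy sketch of the synthetic-data distillation setup: the same prompts
# yield either answer-only (IFT) or trace-plus-answer (reasoning) targets.
# teacher_generate is a stand-in for querying a large teacher model.

def teacher_generate(prompt: str) -> dict:
    """Placeholder for a teacher model emitting a CoT trace and an answer."""
    trace = f"Let's reason step by step about: {prompt}"
    answer = "42"  # placeholder final answer
    return {"prompt": prompt, "reasoning": trace, "answer": answer}

def build_distillation_dataset(prompts, with_reasoning: bool):
    """IFT targets keep only the answer; reasoning targets prepend the trace."""
    dataset = []
    for p in prompts:
        sample = teacher_generate(p)
        if with_reasoning:
            target = sample["reasoning"] + "\n" + sample["answer"]
        else:
            target = sample["answer"]
        dataset.append({"input": p, "target": target})
    return dataset

prompts = ["What is 6 * 7?"]
ift_data = build_distillation_dataset(prompts, with_reasoning=False)
reasoning_data = build_distillation_dataset(prompts, with_reasoning=True)
```

Both datasets share identical inputs, so any downstream performance gap between the two fine-tuned students can be attributed to the presence of the reasoning trace in the supervision signal.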
As Figure 1 shows, these gains concentrate on reasoning-intensive (e.g., GSM8K, AIME) and open-ended tasks, while benefits on general multiple-choice tasks are much smaller or inconsistent. Meanwhile, the scaling dynamics of reasoning models pose further challenges.