Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA

Raphael Schumann, Stefan Riezler

arXiv.org Artificial Intelligence 

Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.

In many applications of CoT reasoning, the generated thought process is as important as the final answer. While some tasks provide gold-standard reasoning chains that can effectively be used for supervised training (Nye et al., 2021; Dziri et al., 2023; Hochlehnert et al., 2025), most datasets lack such annotations. For these cases, correct reasoning has to be incentivized by rewards on correct final answers (Wen et al., 2025). It is known that CoTs can lead to the correct answer despite an incorrect explanation. Grattafiori et al. (2024) note that this often occurs for questions where only a small fraction of the generated answers is correct. In this work, we investigate this observation in controlled experiments on multiple datasets. To avoid confounding factors of noisy answer extraction and matching, we focus on multiple-choice question answering.
This format is popular for evaluating models, and widely used training sets such as NuminaMath (Li et al., 2024) contain a large fraction of multiple-choice questions. The fixed number of answer options also allows us to explicitly model the solvability of a question.
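To illustrate why fixed answer options make solvability estimable, the following minimal sketch estimates a question's solvability from the empirical pass rate of sampled answers, tested against the 1/K random-guessing baseline. This is our own illustration of the general idea, not the paper's estimator; the function name, sample size, and the one-sided z-test are assumptions for the sketch.

```python
import math

def estimate_solvability(correct_flags, num_options):
    """Illustrative solvability estimate for a multiple-choice question.

    correct_flags: list of 0/1 outcomes for N sampled model answers.
    num_options:   K, the fixed number of answer choices.

    A question answered at chance level (1/K) is treated as effectively
    unsolvable for the model; pass rates significantly above chance
    suggest it is solvable. Returns the pass rate and a one-sided
    z-statistic against the chance baseline (an assumed test choice).
    """
    n = len(correct_flags)
    pass_rate = sum(correct_flags) / n
    chance = 1.0 / num_options
    # Standard error of the pass rate under the chance-level null.
    se = math.sqrt(chance * (1.0 - chance) / n)
    z = (pass_rate - chance) / se
    return pass_rate, z

# Hypothetical example: 16 samples on a 4-option question, 10 correct.
pass_rate, z = estimate_solvability([1] * 10 + [0] * 6, num_options=4)
```

With 10 of 16 samples correct against a 0.25 chance baseline, the pass rate of 0.625 lies well above chance, placing such a question in the intermediate, learnable regime rather than the unsolvable one where spurious CoTs dominate.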