Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
arXiv.org Artificial Intelligence
Large language models (LLMs) have recently shown remarkable capabilities in reasoning-intensive tasks such as coding (Chen et al., 2021; Li et al., 2022; Rozière et al., 2023) and solving complex mathematical problems (Shao et al., 2024; Azerbayev et al., 2024). Prompting strategies like chain-of-thought prompting (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022; Adolphs et al., 2022) and self-consistency sampling (Wang et al., 2023) enhance these models' final-answer accuracy by encouraging them to articulate intermediate reasoning steps. However, a significant issue remains: even when these methods boost final-answer correctness, the internal reasoning steps are often unreliable or logically inconsistent (Uesato et al., 2022; Lightman et al., 2024). This discrepancy between correct final answers and flawed intermediate reasoning limits our ability to trust LLMs in scenarios where transparency and correctness of each reasoning stage are crucial (Lanham et al., 2023). For example, in mathematical problem-solving, a model might produce the right answer for the wrong reasons (Lyu et al., 2023; Zheng et al., 2024), confounding our understanding of its true capabilities (Turpin et al., 2023).
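To make the self-consistency idea referenced above concrete, the following is a minimal sketch, not the paper's method: it samples several chain-of-thought completions and majority-votes on the final answer. The `generate` callable and the `(reasoning, answer)` return shape are hypothetical stand-ins for any stochastic LLM sampling API.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=10):
    """Majority-vote over sampled chain-of-thought completions.

    `generate` is a hypothetical callable that, for one stochastic
    sample, returns a (reasoning_text, final_answer) pair; any LLM
    sampling API with temperature > 0 could stand in for it.
    """
    answers = [generate(prompt)[1] for _ in range(n_samples)]
    # Return the most frequent final answer, regardless of whether
    # the reasoning traces that produced it were sound.
    return Counter(answers).most_common(1)[0][0]
```

Note that the vote inspects only final answers: two traces can agree on the answer while reasoning differently, or incorrectly. That is precisely the gap between final-answer accuracy and step-level reliability that the paragraph above identifies.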
Jan-18-2025