Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
arXiv.org Artificial Intelligence
Large language models (LLMs) have recently shown remarkable capabilities in reasoning-intensive tasks such as coding (Chen et al., 2021; Li et al., 2022; Rozière et al., 2023) and solving complex mathematical problems (Shao et al., 2024; Azerbayev et al., 2024). Prompting strategies like chain-of-thought prompting (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022; Adolphs et al., 2022) and self-consistency sampling (Wang et al., 2023) enhance these models' final-answer accuracy by encouraging them to articulate intermediate reasoning steps. However, a significant issue remains: even when these methods boost final-answer correctness, the internal reasoning steps are often unreliable or logically inconsistent (Uesato et al., 2022; Lightman et al., 2024). This discrepancy between correct final answers and flawed intermediate reasoning limits our ability to trust LLMs in scenarios where transparency and correctness of each reasoning stage are crucial (Lanham et al., 2023). For example, in mathematical problem-solving, a model might produce the right answer for the wrong reasons (Lyu et al., 2023; Zheng et al., 2024), confounding our understanding of its true capabilities (Turpin et al., 2023).
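To make the self-consistency idea referenced above concrete, the following is a minimal sketch, not the paper's method: it samples several chain-of-thought completions and majority-votes on the final answer. The `generate` callable and the `(reasoning, answer)` return shape are hypothetical stand-ins for any stochastic LLM sampling API.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=10):
    """Majority-vote over sampled chain-of-thought completions.

    `generate` is a hypothetical callable that, for one stochastic
    sample, returns a (reasoning_text, final_answer) pair; any LLM
    sampling API with temperature > 0 could stand in for it.
    """
    answers = [generate(prompt)[1] for _ in range(n_samples)]
    # Return the most frequent final answer, regardless of whether
    # the reasoning traces that produced it were sound.
    return Counter(answers).most_common(1)[0][0]
```

Note that the vote inspects only final answers: two traces can agree on the answer while reasoning differently, or incorrectly. That is precisely the gap between final-answer accuracy and step-level reliability that the paragraph above identifies.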
Jan-18-2025