Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Shen, Maohao, Zeng, Guangtao, Qi, Zhenting, Hong, Zhang-Wei, Chen, Zhenfang, Lu, Wei, Wornell, Gregory, Das, Subhro, Cox, David, Gan, Chuang

arXiv.org Artificial Intelligence 

Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process

Large language models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks, including mathematical problems (Cobbe et al., 2021; Hendrycks et al., 2021a), programming (Chen et al., 2021; Zhuo et al., 2024) and logical reasoning (Han et al., 2024; Liu et al., 2020). One of the key techniques enabling these strong reasoning capabilities is Chain-of-Thought (CoT) prompting (Wei et al., 2022), which allows LLMs to address complex tasks by generating a series of intermediate reasoning steps. As a result, many early efforts focus on finetuning LLMs using large-scale, high-quality CoT reasoning chains, either through human annotation (Hendrycks et al., 2021a; Yue et al., 2024) or by distilling synthetic data from more advanced models (Yu et al., 2024; Toshniwal et al., 2024a; Ding et al., 2024). However, human annotation is extremely labor-intensive, and distillation often limits the model's reasoning capabilities to a certain level.
