Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Shen, Maohao, Zeng, Guangtao, Qi, Zhenting, Hong, Zhang-Wei, Chen, Zhenfang, Lu, Wei, Wornell, Gregory, Das, Subhro, Cox, David, Gan, Chuang
arXiv.org Artificial Intelligence
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process ...

Large language models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks, including mathematical problems (Cobbe et al., 2021; Hendrycks et al., 2021a), programming (Chen et al., 2021; Zhuo et al., 2024) and logical reasoning (Han et al., 2024; Liu et al., 2020). One of the key techniques enabling these strong reasoning capabilities is Chain-of-Thought (CoT) prompting (Wei et al., 2022), which allows LLMs to address complex tasks by generating a series of intermediate reasoning steps. As a result, many early efforts focus on finetuning LLMs using large-scale, high-quality CoT reasoning chains, either through human annotation (Hendrycks et al., 2021a; Yue et al., 2024) or by distilling synthetic data from more advanced models (Yu et al., 2024; Toshniwal et al., 2024a; Ding et al., 2024). However, human annotation is extremely labor intensive, and distillation often limits the model's reasoning capabilities to a certain level.
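For readers unfamiliar with the verifier-guided sampling that the abstract contrasts against, the following is a minimal sketch of best-of-N selection, one common form of a two-player system. It is not the method proposed in this paper. The names `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical, and the generator and verifier are toy stand-ins for an LLM sampler and an external LLM verifier.

```python
import random
from typing import Callable, List


def best_of_n(
    question: str,
    generate: Callable[[str, random.Random], str],
    verify: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> str:
    """Sample n candidate answers and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(question, rng) for _ in range(n)]
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda c: verify(question, c))


# Toy stand-in (assumption): a noisy "generator" for questions of the form "a+b".
def toy_generate(question: str, rng: random.Random) -> str:
    a, b = (int(t) for t in question.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))


# Toy stand-in (assumption): an exact "verifier" that scores 1.0 for a correct sum.
def toy_verify(question: str, answer: str) -> float:
    a, b = (int(t) for t in question.split("+"))
    return 1.0 if int(answer) == a + b else 0.0


if __name__ == "__main__":
    print(best_of_n("17+25", toy_generate, toy_verify, n=8))
```

In this setup the generator never sees the verifier's scores, so any search happens outside the model. The research question posed in the abstract is whether that search can instead be internalized by a single LLM.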
Feb-4-2025