Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently
Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu
Large language models (LLMs), with the transformer architecture as their core building block, have been remarkably successful across a wide range of tasks, in particular reasoning. LLMs excel at solving complex reasoning tasks by iteratively generating intermediate steps [Wei et al., 2022], an approach known as Chain-of-Thought (CoT). Fine-tuning has proven to be a powerful method for eliciting efficient CoT generation in LLMs, which in turn significantly improves their multi-step reasoning performance [Wei et al., 2022, Zelikman et al., 2022, Lightman et al., 2024]. A widely adopted approach to fine-tuning for CoT generation is supervised fine-tuning (SFT), where transformers are trained to minimize a loss over pairs of inputs and labeled outputs. While straightforward, SFT is limited by its demand for large amounts of labeled CoT data. As a result, fine-tuning approaches based on reinforcement learning (RL) [DeepSeek-AI et al., 2025, Ouyang et al., 2022, Bai et al., 2022, Christiano et al., 2023, Kumar et al., 2024] are increasingly prevalent. Instead of minimizing a loss over labeled CoT data, RL guides transformers to generate CoT for solving complex reasoning tasks by maximizing a reward function via policy gradient methods [Mnih et al., 2016, Schulman et al., 2017, DeepSeek-AI et al., 2025], an approach that has shown significant potential for improving the reasoning capabilities of LLMs.
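To make the contrast concrete, the two fine-tuning objectives sketched above can be written in generic notation not taken from the paper (a policy \pi_\theta over CoT sequences c given input x, a labeled-CoT dataset \mathcal{D}, and a reward function R):

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,c)\sim\mathcal{D}}\big[\log \pi_\theta(c \mid x)\big],
\qquad
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}\big[R(x, c)\big],
\]

with the policy gradient estimated REINFORCE-style as \nabla_\theta J_{\mathrm{RL}}(\theta) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}\big[R(x, c)\,\nabla_\theta \log \pi_\theta(c \mid x)\big]. The key difference is that SFT requires the labeled CoT c for each input, whereas RL only requires sampling c from the model and scoring it with R; practical methods such as PPO-style algorithms [Schulman et al., 2017] refine this basic estimator.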
Nov-25-2025