Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently
Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu
Large language models (LLMs), with the transformer architecture as their core building block, have been remarkably successful across a wide range of tasks, in particular reasoning. LLMs excel at solving complex reasoning tasks by iteratively generating intermediate steps [Wei et al., 2022], an approach known as Chain-of-Thought (CoT). Fine-tuning has proven to be a powerful method for eliciting efficient CoT generation in LLMs, which in turn significantly improves their multi-step reasoning performance [Wei et al., 2022, Zelikman et al., 2022, Lightman et al., 2024]. A widely adopted approach to fine-tuning for CoT generation is supervised fine-tuning (SFT), where transformers are trained to minimize a loss over pairs of inputs and labeled outputs. While straightforward, SFT is limited by its demand for large amounts of labeled CoT data. As a result, fine-tuning approaches based on reinforcement learning (RL) [DeepSeek-AI et al., 2025, Ouyang et al., 2022, Bai et al., 2022, Christiano et al., 2023, Kumar et al., 2024] are increasingly prevalent. Instead of minimizing a loss over labeled CoT data, RL guides transformers to generate CoT for solving complex reasoning tasks by maximizing a reward function via policy gradient methods [Mnih et al., 2016, Schulman et al., 2017, DeepSeek-AI et al., 2025], an approach that has shown significant potential for improving the reasoning capabilities of LLMs.
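To make the contrast concrete, the two fine-tuning objectives sketched above can be written in generic notation not taken from the paper (a policy \pi_\theta over CoT sequences c given input x, a labeled-CoT dataset \mathcal{D}, and a reward function R):

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,c)\sim\mathcal{D}}\big[\log \pi_\theta(c \mid x)\big],
\qquad
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}\big[R(x, c)\big],
\]

with the policy gradient estimated REINFORCE-style as \nabla_\theta J_{\mathrm{RL}}(\theta) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}\big[R(x, c)\,\nabla_\theta \log \pi_\theta(c \mid x)\big]. The key difference is that SFT requires the labeled CoT c for each input, whereas RL only requires sampling c from the model and scoring it with R; practical methods such as PPO-style algorithms [Schulman et al., 2017] refine this basic estimator.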
Nov-25-2025