From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

arXiv.org Artificial Intelligence 

Reinforcement Learning (RL) following Supervised Fine-Tuning (SFT) has become the standard paradigm for post-training Large Language Models (LLMs). However, the mechanism by which RL contributes to reasoning capabilities remains a subject of intense debate: does it incentivize the synthesis of new skills, or merely amplify existing behaviors? In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge encoded in model parameters) and Contextual Reasoning (depending on novel information provided in the context window). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with out-of-distribution generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: models supervised solely on the composite task achieve near-perfect in-distribution accuracy (90%) but collapse on out-of-distribution generalization (18%), indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT.
These findings challenge the view of RL as a mere amplifier, suggesting that, given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks. Code and data will be available at https://github.com/sitaocheng/from

The rapid evolution of Large Language Models (LLMs) has been fundamentally driven by advanced post-training strategies, specifically an initial Supervised Fine-Tuning (SFT) stage followed by a Reinforcement Learning (RL) stage (Achiam et al., 2023; Team et al., 2024; Guo et al., 2025). While SFT is effective at establishing behavioral norms and imparting foundational knowledge, it fundamentally relies on maximum likelihood estimation, which tends to favor memorization of the training distribution.
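The memorization tendency of maximum likelihood estimation can be seen directly from the SFT objective, the average negative log-likelihood of the gold tokens. A minimal sketch, using hypothetical per-token probabilities (not values from the paper), illustrates why the objective rewards fitting the training distribution while never directly penalizing failure on novel recombinations:

```python
import math

def sft_nll(token_probs):
    """Average negative log-likelihood over gold tokens.

    This is the quantity SFT minimizes; token_probs are hypothetical
    probabilities the model assigns to each gold token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A memorized training sequence: high probability on each gold token -> low loss.
memorized_loss = sft_nll([0.95, 0.90, 0.92])

# A novel recombination of the same facts: lower per-token probability.
# Crucially, this sequence never appears in training, so its higher loss
# is never optimized against.
novel_loss = sft_nll([0.40, 0.30, 0.50])

print(memorized_loss < novel_loss)  # memorization yields the lower objective
```

The gap between the two losses is the sense in which maximum likelihood "favors the training distribution": gradient descent drives the first quantity down, while the second is invisible to the objective.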
