On the Inductive Bias of Stacking Towards Improving Reasoning
Neural Information Processing Systems
Given the increasing scale of model sizes, efficient training strategies like gradual stacking [Gong et al., 2019, Reddi et al., 2023] have garnered interest. Stacking enables efficient training by gradually growing the depth of a model in stages, using the layers of the smaller model from an earlier stage to initialize the deeper model in the next stage. Although efficient for training, the inductive biases induced by such growing approaches are largely unexplored. In this work, we examine this fundamental aspect of gradual stacking, going beyond its efficiency benefits.
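To make the growth step concrete, below is a minimal sketch of one common stacking variant: a deeper model is initialized by copying a block of layers from the shallower model trained in the previous stage. The function name `grow_by_stacking` and the specific rule of duplicating the top layers are illustrative assumptions, not necessarily the exact schedule used by Gong et al. [2019] or Reddi et al. [2023].

```python
import copy
import torch.nn as nn

def grow_by_stacking(layers: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """Grow a stack of layers from depth L to depth L + num_new.

    The new layers are initialized as deep copies of the top
    `num_new` layers of the current (shallower) model, so the next
    training stage starts from the previous stage's weights rather
    than from random initialization.
    """
    start = len(layers) - num_new
    copied = [copy.deepcopy(layers[i]) for i in range(start, len(layers))]
    return nn.ModuleList(list(layers) + copied)

# Stage 1: a shallow 6-layer stack (e.g., transformer blocks).
base = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(6)]
)
# Stage 2: grow to 12 layers, reusing the top 6 as initialization.
deeper = grow_by_stacking(base, num_new=6)
assert len(deeper) == 12
```

In a full training pipeline, each stage would train the current stack for some budget before the next growth step; the sketch shows only the initialization that transfers weights between stages.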