In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett

arXiv.org Machine Learning 

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a nonzero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, an estimator using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.
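To make the GD-β construction concrete, the following is a minimal numerical sketch (not from the paper) of a one-step gradient descent estimator with a learnable initialization for in-context linear regression. The function and variable names (gd_beta_predict, beta0, step_size) and the loss normalization are illustrative assumptions; in the paper the initialization and step size are learned by minimizing the ICL risk.

```python
import numpy as np

def gd_beta_predict(X_ctx, y_ctx, x_query, beta0, step_size):
    """GD-beta sketch: one gradient step from a learnable initialization beta0.

    X_ctx: (n, d) context covariates, y_ctx: (n,) context labels,
    x_query: (d,) query covariate. Returns the predicted query label.
    """
    n = X_ctx.shape[0]
    # Gradient of the average squared loss over the context, evaluated at beta0.
    residual = X_ctx @ beta0 - y_ctx      # (n,)
    grad = X_ctx.T @ residual / n         # (d,)
    beta1 = beta0 - step_size * grad      # one gradient descent step
    return x_query @ beta1                # linear prediction at the query

# Toy usage: tasks drawn from a Gaussian prior with a nonzero mean.
rng = np.random.default_rng(0)
d, n = 5, 20
prior_mean = np.ones(d)
task = prior_mean + rng.normal(size=d)
X_ctx = rng.normal(size=(n, d))
y_ctx = X_ctx @ task + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

# Initializing at the prior mean (rather than zero) illustrates why a learnable
# initialization helps when the task prior has nonzero mean.
print(gd_beta_predict(X_ctx, y_ctx, x_query, beta0=prior_mean, step_size=0.5))
```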
