In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
Zhang, Ruiqi; Wu, Jingfeng; Bartlett, Peter L.
We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a nonzero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.
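To make the GD-β idea concrete, below is a minimal NumPy sketch of a one-step gradient descent estimator with a learnable initialization, as the abstract describes. The function name, the matrix-valued step size `Gamma`, and the loss normalization are illustrative assumptions for the sketch, not the paper's exact parameterization.

```python
import numpy as np

def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Illustrative one-step GD estimator with learnable initialization (GD-β sketch).

    X: (n, d) in-context covariates; y: (n,) in-context responses; x_query: (d,) query.
    beta: (d,) learnable initialization; Gamma: (d, d) learnable step-size matrix
    (an assumption for this sketch). Starting from beta, take one gradient step on
    the in-context squared loss and predict linearly with the updated weights.
    """
    n = X.shape[0]
    residual = y - X @ beta                  # residuals of the initialization on the context
    grad = -(X.T @ residual) / n             # gradient of (1/2n) * ||y - X w||^2 at w = beta
    w_one_step = beta - Gamma @ grad         # one preconditioned GD step from beta
    return x_query @ w_one_step              # linear prediction on the query

# Toy usage on a synthetic linear-regression task
rng = np.random.default_rng(0)
n, d = 16, 4
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
x_q = rng.normal(size=d)
beta = np.zeros(d)          # learnable init (zeros here only for illustration)
Gamma = 0.5 * np.eye(d)     # learnable step size (isotropic here only for illustration)
print(gd_beta_predict(X, y, x_q, beta, Gamma))
```

In this sketch, beta and Gamma would be the parameters trained across tasks; a nonzero learned beta is what lets the estimator exploit a Gaussian prior with nonzero mean.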
Feb-22-2024
- Country:
- North America > United States > California (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Technology: