In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
Zhang, Ruiqi; Wu, Jingfeng; Bartlett, Peter L.
We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a nonzero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.
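To make the GD-β idea concrete, below is a minimal NumPy sketch of a one-step gradient descent estimator with a learnable initialization, as the abstract describes. The function name, the matrix-valued step size `Gamma`, and the loss normalization are illustrative assumptions for the sketch, not the paper's exact parameterization.

```python
import numpy as np

def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Illustrative one-step GD estimator with learnable initialization (GD-β sketch).

    X: (n, d) in-context covariates; y: (n,) in-context responses; x_query: (d,) query.
    beta: (d,) learnable initialization; Gamma: (d, d) learnable step-size matrix
    (an assumption for this sketch). Starting from beta, take one gradient step on
    the in-context squared loss and predict linearly with the updated weights.
    """
    n = X.shape[0]
    residual = y - X @ beta                  # residuals of the initialization on the context
    grad = -(X.T @ residual) / n             # gradient of (1/2n) * ||y - X w||^2 at w = beta
    w_one_step = beta - Gamma @ grad         # one preconditioned GD step from beta
    return x_query @ w_one_step              # linear prediction on the query

# Toy usage on a synthetic linear-regression task
rng = np.random.default_rng(0)
n, d = 16, 4
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
x_q = rng.normal(size=d)
beta = np.zeros(d)          # learnable init (zeros here only for illustration)
Gamma = 0.5 * np.eye(d)     # learnable step size (isotropic here only for illustration)
print(gd_beta_predict(X, y, x_q, beta, Gamma))
```

In this sketch, beta and Gamma would be the parameters trained across tasks; a nonzero learned beta is what lets the estimator exploit a Gaussian prior with nonzero mean.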
Feb-22-2024
- Country:
- North America > United States > California (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Technology: