AITopics | linear transformer block

linear transformer block

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Neural Information Processing Systems, Mar-18-2026, 22:37:19 GMT

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a non-zero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.
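To make the idea of a one-step gradient descent estimator with an initialization concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the function names `gd_beta_predict` and `icl_risk`, the parameterization of the step by a matrix `Gamma`, the choice `Gamma = I`, and all task settings (dimension, context length, noise level, prior mean) are hypothetical. In the paper both the initialization and the step matrix are learned; here they are fixed by hand, and zero initialization serves only as a baseline for comparison.

```python
import numpy as np


def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Predict with one gradient step on the least-squares loss, started at beta.

    Assumed form: w = beta - Gamma @ grad, where grad is the gradient of
    (1 / 2n) * ||X @ beta - y||^2 with respect to beta.
    """
    n = X.shape[0]
    grad = X.T @ (X @ beta - y) / n
    w = beta - Gamma @ grad
    return float(x_query @ w)


def icl_risk(beta, Gamma, prior_mean, n=20, noise_std=0.1, trials=2000, seed=0):
    """Monte Carlo estimate of the squared prediction error on a query point.

    Each trial draws a task vector from a Gaussian prior centred at
    prior_mean, a context of n examples, and one query example.
    """
    rng = np.random.default_rng(seed)
    d = prior_mean.shape[0]
    errors = []
    for _ in range(trials):
        w_star = prior_mean + rng.standard_normal(d)
        X = rng.standard_normal((n, d))
        y = X @ w_star + noise_std * rng.standard_normal(n)
        x_q = rng.standard_normal(d)
        y_q = x_q @ w_star + noise_std * rng.standard_normal()
        errors.append((gd_beta_predict(X, y, x_q, beta, Gamma) - y_q) ** 2)
    return float(np.mean(errors))


d = 5
prior_mean = np.full(d, 3.0)   # hypothetical non-zero prior mean
Gamma = np.eye(d)              # hypothetical step matrix, not an optimized one

risk_zero_init = icl_risk(np.zeros(d), Gamma, prior_mean)
risk_mean_init = icl_risk(prior_mean.copy(), Gamma, prior_mean)
print(f"zero init:       {risk_zero_init:.3f}")
print(f"prior-mean init: {risk_mean_init:.3f}")
```

With these settings, starting the single step at the prior mean should give a visibly smaller estimated risk than starting at zero, which is the intuition behind letting the initialization carry information about a non-zero-mean prior.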

artificial intelligence, machine learning, proceedings, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.60)


In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Neural Information Processing Systems, May-26-2025, 18:31:11 GMT

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a non-zero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.

artificial intelligence, linear transformer block, machine learning, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.64)


In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett

arXiv.org Machine Learning, Feb-22-2024

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a nonzero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-β), in the sense that every GD-β estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-β estimator. Finally, we show that GD-β estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GD-β, and they highlight the role of MLP layers in reducing approximation error.

icl risk, in-context learning, matrix, (12 more...)

arXiv.org Machine Learning

arXiv:2402.14951

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Denmark (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
