Neural Information Processing Systems
Recent research has shown that Transformers with linear attention are capable of in-context learning (ICL) by implementing a linear estimator through gradient descent steps. However, the existing results on the optimization landscape apply under stylized settings where task and feature vectors are assumed to be IID and the attention weights are fully parameterized. In this work, we develop a stronger characterization of the optimization and generalization landscape of ICL through contributions on architectures, low-rank parameterization, and correlated designs: (1) We study the landscape of 1-layer linear attention and 1-layer H3, a state-space model. Under a suitable correlated design assumption, we prove that both implement 1-step preconditioned gradient descent. We show that, thanks to its native convolution filters, H3 can additionally implement sample weighting, allowing it to outperform linear attention in suitable settings.
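To make the central claim concrete, the following minimal NumPy sketch (not the paper's code) illustrates the standard construction under which a single linear-attention layer with suitably block-structured weights reproduces one step of preconditioned gradient descent on an in-context linear-regression prompt. The dimensions, the preconditioner A, and the specific weight parameterization are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: 1-layer linear attention = 1 step of preconditioned GD
# on in-context linear regression (illustrative construction, assumed setup).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                                  # feature dimension, context length

# In-context regression task: y_i = <w_star, x_i>
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_query = rng.normal(size=d)

# Tokens z_i = (x_i, y_i); the query token carries a zero label placeholder.
Z = np.hstack([X, y[:, None]])                # (n, d+1) context tokens
z_q = np.concatenate([x_query, [0.0]])        # query token

# Hypothetical preconditioner A (any (d, d) matrix works for this identity).
A = rng.normal(size=(d, d))

# Block-structured attention weights:
#   the combined key-query matrix acts only on the feature part through A,
#   the value matrix copies only the label coordinate.
W_kq = np.zeros((d + 1, d + 1)); W_kq[:d, :d] = A
W_v  = np.zeros((d + 1, d + 1)); W_v[d, d] = 1.0

# Softmax-free (linear) attention output for the query token, averaged over context.
attn_out = (1.0 / n) * sum((z_q @ W_kq @ Z[i]) * (W_v @ Z[i]) for i in range(n))
pred_attention = attn_out[d]                  # read the prediction off the label slot

# One preconditioned GD step on L(w) = (1/2n) * sum_i (y_i - <w, x_i>)^2
# starting from w_0 = 0:  w_1 = A @ (X^T y / n).
w_1 = A @ (X.T @ y / n)
pred_gd = x_query @ w_1

print(pred_attention, pred_gd)
assert np.isclose(pred_attention, pred_gd)
```

The identity holds for any choice of A in this construction; the abstract's landscape results concern what the trained layer implements under the correlated design assumption, which this sketch does not attempt to reproduce.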