Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression

Neural Information Processing Systems 

Transformers excel at in-context learning (ICL)--learning from demonstrations without parameter updates--but how they do so remains a mystery. Recent work suggests that Transformers may internally run Gradient Descent (GD), a first-order optimization method, to perform ICL. In this paper, we instead demonstrate that Transformers learn to approximate second-order optimization methods for ICL. For in-context linear regression, Transformers converge at a rate similar to that of Iterative Newton's Method; both are exponentially faster than GD. Empirically, predictions from successive Transformer layers closely match successive iterations of Newton's Method, and the correspondence between layers and iterations is roughly linear, with each middle layer computing approximately 3 iterations; thus, Transformers and Newton's Method converge at roughly the same rate.
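To make the contrast in the abstract concrete, below is a minimal sketch (not the paper's code) of the two baseline solvers it compares on in-context linear regression: Gradient Descent, which converges linearly, and Iterative Newton's Method (a Newton-Schulz iteration for the pseudo-inverse of X^T X), which converges quadratically. All names and settings (n, d, the step size, the number of steps) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 10                       # in-context examples, feature dimension
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true                      # noiseless linear regression targets

S = X.T @ X                         # d x d Gram matrix
w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS solution for reference

# --- Gradient Descent: linear convergence (error shrinks by a constant factor) ---
w_gd = np.zeros(d)
lr = 1.0 / np.linalg.norm(S, 2)     # step size below 2 / lambda_max for stability
for t in range(15):
    w_gd -= lr * (S @ w_gd - X.T @ y)          # gradient of 0.5 * ||Xw - y||^2
    print(f"GD     step {t:2d}: ||w - w*|| = {np.linalg.norm(w_gd - w_star):.3e}")

# --- Iterative Newton: quadratic convergence via Newton-Schulz on the inverse of S ---
M = S / np.linalg.norm(S, 2) ** 2   # init so eigenvalues of S @ M lie in (0, 1]
for k in range(15):
    M = 2 * M - M @ S @ M           # M_{k+1} = M_k (2I - S M_k)
    w_newton = M @ X.T @ y
    print(f"Newton step {k:2d}: ||w - w*|| = {np.linalg.norm(w_newton - w_star):.3e}")
```

Running the sketch shows the gap the abstract describes: the Newton iterates reach machine precision in a handful of steps, while GD's error decays only geometrically, which is why matching Transformer layers to Newton iterations (rather than GD steps) implies an exponentially faster effective convergence rate.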