Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression
Neural Information Processing Systems
Transformers excel at in-context learning (ICL), the ability to learn from demonstrations without parameter updates, but how they do so remains a mystery. Recent work suggests that Transformers may internally run Gradient Descent (GD), a first-order optimization method, to perform ICL. In this paper, we instead demonstrate that Transformers learn to approximate second-order optimization methods for ICL. For in-context linear regression, Transformers converge at a rate similar to that of Iterative Newton's Method; both are exponentially faster than GD. Empirically, the predictions of successive Transformer layers closely match successive iterates of Newton's Method, with the layer index mapping roughly linearly onto the iteration count: each middle layer computes approximately three iterations. Thus, Transformers and Newton's Method converge at roughly the same rate.
Mar-27-2025, 01:03:07 GMT
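For readers who want a concrete feel for the two optimization baselines contrasted in the abstract, here is a minimal sketch (not from the paper) comparing Iterative Newton's Method with plain gradient descent on a synthetic least-squares problem. The initialization, step size, problem sizes, and function names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def newton_iterates(X, y, num_iters=10):
    """Weight estimates from Iterative Newton's Method for least squares:
    M_{k+1} = 2 M_k - M_k A M_k approximates A^{-1}, where A = X^T X."""
    A = X.T @ X
    # Scaled initialization keeps the spectral radius of (I - M A) below 1.
    M = A / np.linalg.norm(A, 2) ** 2
    iterates = []
    for _ in range(num_iters):
        M = 2 * M - M @ A @ M          # one Newton update of the inverse estimate
        iterates.append(M @ X.T @ y)   # current weight estimate
    return iterates

def gd_iterates(X, y, num_iters=10):
    """Weight estimates from plain gradient descent on the same objective."""
    A = X.T @ X
    eta = 1.0 / np.linalg.norm(A, 2)   # conservative step size, below 2 / lambda_max
    w = np.zeros(X.shape[1])
    iterates = []
    for _ in range(num_iters):
        w = w - eta * (A @ w - X.T @ y)
        iterates.append(w.copy())
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 8))
    y = X @ rng.normal(size=8)
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    for name, iters in [("Newton", newton_iterates(X, y)), ("GD", gd_iterates(X, y))]:
        errors = [np.linalg.norm(w - w_ols) for w in iters]
        print(name, [f"{e:.2e}" for e in errors])
```

Running this prints the distance of each method's iterate from the exact least-squares solution; the Newton errors shrink far faster per iteration than the gradient-descent errors, illustrating the rate gap the abstract refers to.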