Transformers learn to implement preconditioned gradient descent for in-context learning

Jan-19-2025, 14:59:38 GMT–Neural Information Processing Systems

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: \emph{Can transformers learn to implement such algorithms by training over random problem instances?} To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent.

gradient descent, preconditioned gradient descent, transformer, (5 more...)

Neural Information Processing Systems

Jan-19-2025, 14:59:38 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)