Transformers learn to implement preconditioned gradient descent for in-context learning

Neural Information Processing Systems 

Several recent works demonstrate that transformers can implement algorithms such as gradient descent. Through careful weight constructions, these works show that multiple transformer layers are expressive enough to simulate iterations of gradient descent.
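As a minimal sketch of the idea (not the paper's exact construction): for in-context linear regression, one step of preconditioned gradient descent from a zero initialization predicts `x_q^T P (sum_i y_i x_i)`, and a single linear self-attention layer with suitably constructed weights computes the same quantity. The preconditioner `P` below is an illustrative choice, not one prescribed by the paper.

```python
import numpy as np

# In-context linear regression: n examples (x_i, y_i) plus one query x_q.
rng = np.random.default_rng(0)
d, n = 4, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

# One step of preconditioned gradient descent on
#   L(w) = 0.5 * sum_i (w . x_i - y_i)^2, starting from w = 0,
# gives w_1 = P @ (X^T y), where P is the (symmetric) preconditioner.
P = np.linalg.inv(X.T @ X + 1e-3 * np.eye(d))  # illustrative choice of P
w_1 = P @ (X.T @ y)
pred_gd = x_q @ w_1

# The same prediction via one linear self-attention layer whose
# query/key weights encode P: the output on the query token is
#   sum_i <x_q, P x_i> * y_i  =  x_q^T P (sum_i y_i x_i).
scores = (X @ P) @ x_q   # attention scores <x_q, P x_i>, no softmax
pred_attn = scores @ y   # value aggregation over the context labels

assert np.allclose(pred_gd, pred_attn)
```

Stacking such layers then corresponds to running multiple preconditioned gradient-descent iterations in the forward pass.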
