Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Neural Information Processing Systems 

The theoretical understanding of training transformers for next-token prediction (NTP) is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer.
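The architecture analyzed — a single self-attention module followed by a feed-forward layer — can be sketched as below. This is a minimal illustrative implementation in NumPy; the weight names, single attention head, and ReLU activation are assumptions for exposition, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_layer_transformer(X, Wq, Wk, Wv, Wff):
    """One-layer transformer: self-attention, then a feed-forward layer.

    X: (T, d) sequence of token embeddings.
    Returns a (T, d) representation; in NTP the last row is used to
    score the next token (e.g. via a linear readout to the vocabulary).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, T) attention weights
    H = A @ V                                    # attended representation
    return np.maximum(H @ Wff, 0.0)              # ReLU feed-forward layer

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wff = rng.normal(size=(d, d))
out = one_layer_transformer(X, Wq, Wk, Wv, Wff)
print(out.shape)  # (5, 8)
```

A non-asymptotic analysis would then track how gradient updates to these weight matrices reduce the NTP loss as a function of the finite number of training steps, rather than only in the limit.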
