Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Neural Information Processing Systems
The theoretical understanding of transformers in next-token prediction (NTP) is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer.
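To make the architecture named above concrete, the following is a minimal NumPy sketch of a one-layer transformer used for next-token prediction: causal single-head softmax self-attention followed by a feed-forward readout. It is an illustration under stated assumptions, not the paper's exact parameterization. The function name, the weight names (W_q, W_k, W_v, W_ff), the scaling by sqrt(d), the causal mask, and the reduction of the feed-forward layer to a single linear map are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def one_layer_transformer(X, W_q, W_k, W_v, W_ff):
    """Return next-token probabilities for a sequence of embeddings.

    X    : (T, d) token embeddings for a length-T prefix.
    W_q, W_k, W_v : (d, d) attention weights (hypothetical names).
    W_ff : (d, vocab) feed-forward layer, assumed here to be one linear map.
    """
    T, d = X.shape
    # Attention scores between every pair of positions.
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)          # (T, T), rows sum to 1
    H = attn @ (X @ W_v)                     # (T, d) attended features
    logits = H[-1] @ W_ff                    # predict from the last position
    return softmax(logits)                   # (vocab,) distribution
```

A training loop for NTP would minimize the cross-entropy between this output distribution and the observed next token; the non-asymptotic analysis described in the abstract concerns how such training behaves after a finite number of steps.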
Oct-10-2025, 09:58:56 GMT
- Country:
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- Genre:
- Research Report > Experimental Study (0.93)
- Technology: