Attention with Trained Embeddings Provably Selects Important Tokens

Neural Information Processing Systems 

Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding is limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent.