Attention with Trained Embeddings Provably Selects Important Tokens
Neural Information Processing Systems
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\mathrm{Softmax}(p^\top E_X^\top) E_X v =$
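For concreteness, the forward pass of the model named in the abstract can be sketched in NumPy. This is only an illustrative sketch under stated assumptions: the shapes (an embedding table `E` of size vocabulary × d, vectors `p` and `v` of dimension d), the random initialization, the toy token sequence, and the sign-based decision rule are all choices made here and are not taken from the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def attention_weights(tokens, E, p):
    """Softmax attention weights over the positions of one sequence.

    tokens: integer token ids, shape (T,)
    E:      embedding table, shape (vocab_size, d)  (assumed layout)
    p:      attention vector, shape (d,)
    """
    E_X = E[tokens]              # embedded sequence, shape (T, d)
    return softmax(E_X @ p)      # shape (T,), sums to 1


def attention_score(tokens, E, p, v):
    """Scalar output Softmax(p^T E_X^T) E_X v for one sequence."""
    E_X = E[tokens]
    weights = attention_weights(tokens, E, p)
    return float(weights @ (E_X @ v))


# Toy example with random parameters (hypothetical values).
rng = np.random.default_rng(0)
vocab_size, d = 5, 4
E = rng.normal(size=(vocab_size, d))
p = rng.normal(size=d)
v = rng.normal(size=d)

tokens = np.array([0, 3, 3, 1])
weights = attention_weights(tokens, E, p)
score = attention_score(tokens, E, p, v)

# Assumed decision rule: classify by the sign of the scalar output.
label = 1 if score >= 0 else -1
```

The output is a softmax-weighted average of the per-token values `E_X @ v`, so it always lies between the smallest and largest per-token value; with `p = 0` the weights are uniform and the output reduces to the plain mean.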
Jun-15-2026, 20:17:18 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Research Report
- Industry:
- Government (0.46)
- Technology: