Selective Attention: Enhancing Transformer through Principled Context Control

Neural Information Processing Systems 

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query.
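To make the weighting concrete, here is a minimal sketch of standard scaled dot-product attention; the function names, shapes, and use of NumPy are illustrative assumptions, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    Each query's output is a convex combination of the value rows,
    weighted by the query-key relevance scores.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of each key to each query
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over tokens, which is exactly the "weigh and combine" step the text describes.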