When Attention Sink Emerges in Language Models: An Empirical View
