Wavy Transformer
–Neural Information Processing Systems
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feedforward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP, CV, and sparse-graph tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
Neural Information Processing Systems
Jun-22-2026, 06:23:48 GMT