Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Nguyen, Tam, Nguyen, Tan M., Baraniuk, Richard G.

Dec-1-2023–arXiv.org Artificial Intelligence

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.

neutreno, proceedings, transformer, (13 more...)

arXiv.org Artificial Intelligence

Dec-1-2023

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Texas > Travis County
      - Austin (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
  - Canada > Quebec
    - Montreal (0.04)
- Europe
  - France (0.04)
  - Austria (0.04)
  - Italy > Tuscany
    - Florence (0.04)
- Asia
  - Singapore (0.04)
  - China > Hong Kong (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.68)