On Separate Normalization in Self-supervised Transformers

Yinkai Wang
Department of Computer Science
Tufts University

Neural Information Processing Systems 

Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the class token [CLS] and the normal tokens. In this paper, we propose a simple yet effective normalization method that separately normalizes the embedding vectors corresponding to normal tokens and those corresponding to the [CLS] token, so as to better capture their distinct characteristics and enhance downstream task performance. Our empirical study shows that the [CLS] embeddings learned with our separate normalization layer better encode global contextual information and are distributed more uniformly in the anisotropic embedding space. When the conventional normalization layer is replaced with our separate normalization layer, we observe an average 2.7% performance improvement on learning tasks from the image, natural language, and graph domains.
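To make the idea concrete, the sketch below illustrates one plausible reading of the separate-normalization layer in PyTorch: two independent LayerNorm modules, one applied to the [CLS] embedding and one to the normal-token embeddings. The module name `SeparateLayerNorm` and the assumption that the [CLS] token occupies position 0 of the sequence are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SeparateLayerNorm(nn.Module):
    """Normalize the [CLS] embedding and the normal-token embeddings
    with two independent LayerNorm layers (hypothetical sketch)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        # Each LayerNorm keeps its own affine parameters, so the [CLS]
        # token and the normal tokens are normalized independently.
        self.cls_norm = nn.LayerNorm(dim, eps=eps)
        self.token_norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); assumes index 0 holds the [CLS] token.
        cls_emb = self.cls_norm(x[:, :1, :])    # normalize [CLS] separately
        tok_emb = self.token_norm(x[:, 1:, :])  # normalize remaining tokens
        return torch.cat([cls_emb, tok_emb], dim=1)


# Usage: a drop-in replacement for a shared nn.LayerNorm in a transformer block.
x = torch.randn(8, 197, 768)  # e.g., ViT-Base: 1 [CLS] token + 196 patch tokens
out = SeparateLayerNorm(768)(x)
assert out.shape == x.shape
```

Under this reading, the only change to a standard transformer block is swapping a shared normalization layer for the split version, which leaves the parameter count and compute cost essentially unchanged.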