layernorm
- North America > United States > Massachusetts (0.04)
- Asia > China (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- North America > United States > Texas > Travis County > Austin (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Arizona (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Mateo County > Menlo Park (0.04)
- Asia > Middle East > Jordan (0.04)
Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers
Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus regarding the preferred normalization technique, as some models employ LayerNorm while others utilize RMSNorm, especially in recent large language models. It is challenging to convert Transformers with one normalization to the other type. While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. This implies that Pre-LN Transformers can be substituted with Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by 1% - 10%.
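A minimal sketch of the core observation, assuming plain PyTorch rather than the authors' released code: on zero-mean inputs LayerNorm and RMSNorm coincide, so once the redundant mean is removed from the Pre-LN main branch the cheaper RMSNorm can stand in for LayerNorm.

```python
# Hypothetical illustration (not the paper's code): LayerNorm vs. RMSNorm,
# and their equivalence on zero-mean inputs.
import torch

def layer_norm(x, eps=1e-6):
    # LayerNorm: re-center by the mean, then re-scale by the standard deviation.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm: re-scale by the root mean square only; no re-centering.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(4, 8)
x_zero_mean = x - x.mean(dim=-1, keepdim=True)

# On zero-mean vectors, var(x) == mean(x**2), so the two normalizations agree;
# this is why discarding the mean from the residual stream reduces LayerNorm to RMSNorm.
print(torch.allclose(layer_norm(x_zero_mean), rms_norm(x_zero_mean), atol=1e-5))
```

CRMSNorm then compresses the zero-mean vector losslessly: the last coordinate equals the negative sum of the others, so it can be dropped and recovered when needed.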
Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm, which regularizes the summed inputs to a neuron according to their root mean square only and is thus computationally simpler and more efficient than LayerNorm.
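For concreteness, a hedged RMSNorm module sketch in plain PyTorch (the epsilon value and gain initialization are assumptions, not taken from the paper):

```python
# Minimal RMSNorm sketch: re-scale by the root mean square and apply a
# learnable gain, skipping LayerNorm's mean subtraction entirely.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x_i^2)); no mean statistic is computed at all.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms
```

Skipping the re-centering removes the mean reduction and subtraction over the hidden dimension, which is the source of RMSNorm's efficiency gain over LayerNorm.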
L-JacobiNet and S-JacobiNet: An Analysis of Adaptive Generalization, Stabilization, and Spectral Domain Trade-offs in GNNs
Spectral GNNs such as ChebyNet are limited by heterophily and over-smoothing due to their static, low-pass filter design. This work investigates the "Adaptive Orthogonal Polynomial Filter" (AOPF) class as a solution. We introduce two models operating in the [-1, 1] domain: 1) `L-JacobiNet`, the adaptive generalization of `ChebyNet` with learnable alpha, beta shape parameters, and 2) `S-JacobiNet`, a novel baseline representing a LayerNorm-stabilized static `ChebyNet`. Our analysis, comparing these models against AOPFs in the [0, infty) domain (e.g., `LaguerreNet`), reveals critical, previously unknown trade-offs. We find that the [0, infty) domain is superior for modeling heterophily, while the [-1, 1] (Jacobi) domain provides superior numerical stability at high K (K > 20). Most significantly, we find that `ChebyNet`'s main flaw is a lack of stabilization, not its static filter: the static `S-JacobiNet` (ChebyNet + LayerNorm) outperforms the adaptive `L-JacobiNet` on 4 of 5 benchmark datasets, identifying `S-JacobiNet` as a powerful, overlooked baseline and suggesting that adaptation in the [-1, 1] domain can lead to overfitting.
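A hedged sketch of the S-JacobiNet idea described above, under assumptions not stated in the abstract (module name, dense scaled Laplacian, and normalization placement are illustrative): a static Chebyshev recurrence, i.e. the Jacobi alpha = beta = -1/2 special case, with LayerNorm applied to each propagated basis to keep high-order terms stable. The adaptive L-JacobiNet variant would instead use the general Jacobi three-term recurrence with learnable alpha, beta.

```python
# Illustrative sketch only, not the authors' code.
import torch
import torch.nn as nn

class StabilizedChebConv(nn.Module):
    """Static Chebyshev (Jacobi alpha = beta = -1/2) filter with LayerNorm stabilization."""
    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.K = K
        self.lins = nn.ModuleList([nn.Linear(in_dim, out_dim, bias=False) for _ in range(K + 1)])
        self.norm = nn.LayerNorm(in_dim)  # assumed placement: normalize each propagated basis

    def forward(self, x: torch.Tensor, L_hat: torch.Tensor) -> torch.Tensor:
        # Chebyshev recurrence on the scaled Laplacian (spectrum in [-1, 1]):
        # T_0 = x, T_1 = L_hat x, T_k = 2 L_hat T_{k-1} - T_{k-2}.
        bases = [x, self.norm(L_hat @ x)]
        for _ in range(2, self.K + 1):
            bases.append(self.norm(2 * (L_hat @ bases[-1]) - bases[-2]))
        return sum(lin(t) for lin, t in zip(self.lins, bases))

# Toy usage with an assumed symmetric adjacency; the usual lambda_max ~= 2
# approximation shifts the normalized Laplacian's spectrum from [0, 2] into [-1, 1].
N, F = 5, 16
A = torch.rand(N, N); A = (A + A.T) / 2
d_inv_sqrt = A.sum(-1).rsqrt()
L_sym = torch.eye(N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
L_hat = L_sym - torch.eye(N)
out = StabilizedChebConv(F, 32, K=8)(torch.randn(N, F), L_hat)
```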
- North America > United States > Texas (0.05)
- Asia > Middle East > Republic of Türkiye > Antalya Province > Antalya (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > United States > Michigan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
- Health & Medicine (0.67)
- Education (0.67)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Puerto Rico > San Juan > San Juan (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Spain (0.04)