AITopics | normalizer

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Neural Information Processing Systems | Apr-25-2026, 04:01:45 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to several recently proposed normalization layers. Our primary findings follow: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric techniques require explicit remedies; (ii) use of GroupNorm can ensure an informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; and (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
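The group-size axis in findings (ii) and (iii) can be made concrete with a minimal sketch, which is not the paper's code. It assumes a single sample stored as `x[channel][position]`, omits the learnable scale and shift, and uses a hypothetical helper named `group_norm`. With one group, all channels share statistics as in LayerNorm; with one channel per group, each channel is normalized alone as in Instance Normalization.

```python
from statistics import fmean


def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample x[channel][position] within channel groups.

    Hypothetical helper for illustration: no learnable scale or shift.
    """
    num_channels = len(x)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    group_size = num_channels // num_groups
    out = []
    for start in range(0, num_channels, group_size):
        group = x[start:start + group_size]
        # Statistics are shared by every value in the group.
        values = [v for channel in group for v in channel]
        mean = fmean(values)
        var = fmean((v - mean) ** 2 for v in values)
        scale = (var + eps) ** 0.5
        out.extend([[(v - mean) / scale for v in channel] for channel in group])
    return out


if __name__ == "__main__":
    sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
    print(group_norm(sample, 1))  # one group: LayerNorm-like
    print(group_norm(sample, 2))  # two channels per group
    print(group_norm(sample, 4))  # one channel per group: InstanceNorm-like
```

Varying `num_groups` between these two extremes is the design axis along which the abstract describes a speed-stability tradeoff.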

artificial intelligence, batchnorm, machine learning, (16 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Neural Information Processing Systems | Apr-25-2026, 04:01:41 GMT

artificial intelligence, batchnorm, machine learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Online Normalization for Training Neural Networks

Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, Michael James

Neural Information Processing Systems | Feb-14-2026, 03:07:06 GMT

Neural Information Processing Systems http://nips.cc/

gradient, normalization, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > Richmond County > New York City (0.04)
North America > United States > New York > Queens County > New York City (0.04)
(8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

d33174c464c877fb03e77efdab4ae804-Paper.pdf

Neural Information Processing Systems | Feb-10-2026, 13:00:46 GMT

arxiv preprint arxiv, gradient, optimization, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Bristol (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

2578eb9cdf020730f77793e8b58e165a-Supplemental.pdf

Neural Information Processing Systems | Feb-7-2026, 22:14:55 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning.

artificial intelligence, inproc, machine learning, (18 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

2578eb9cdf020730f77793e8b58e165a-Paper.pdf

Neural Information Processing Systems | Feb-7-2026, 22:14:52 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning.

artificial intelligence, inproc, machine learning, (16 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

20d749bc05f47d2bd3026ce457dcfd8e-Supplemental.pdf

Neural Information Processing Systems | Feb-7-2026, 18:45:42 GMT

functional fisher information, regularization term, representation, (11 more...)

Neural Information Processing Systems

Country: North America > Canada (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

Neural Information Processing Systems | Dec-24-2025, 16:40:34 GMT

We formulate the problem of neural network optimization as Bayesian filtering, where the observations are backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods, which are closely related to Bayesian inference, these methods were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead yielding a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.
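The difference between the two normalizers can be sketched in a few lines. This is not AdaBayes or the paper's derivation; it is a simplified illustration that assumes a scalar parameter, omits momentum and bias correction, and uses a hypothetical helper named `normalized_steps`. Dividing by the root of the running mean-square gradient, as Adam and RMSprop do, makes the step size invariant to rescaling the gradients, while dividing by the mean-square itself does not.

```python
def normalized_steps(grads, lr=0.1, beta2=0.9, eps=1e-8):
    """Return per-step updates under two gradient normalizers.

    Hypothetical helper for illustration: scalar parameter, no momentum,
    no bias correction.
    """
    v = 0.0  # exponential moving average of squared gradients
    rms_steps, ms_steps = [], []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        rms_steps.append(lr * g / (v ** 0.5 + eps))  # root-mean-square normalizer
        ms_steps.append(lr * g / (v + eps))          # mean-square normalizer
    return rms_steps, ms_steps


if __name__ == "__main__":
    small = normalized_steps([1.0, 2.0, 3.0])
    large = normalized_steps([10.0, 20.0, 30.0])  # same gradients, scaled by 10
    print("rms:", small[0], large[0])
    print("ms: ", small[1], large[1])
```

In this sketch, multiplying every gradient by 10 leaves the root-mean-square steps essentially unchanged and shrinks the mean-square steps by a factor of about 10.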

bayesian, name change, non-adaptive neural network optimization method, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.57)

The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks

Than, Khoat

arXiv.org Machine Learning | Nov-4-2025

Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when using many normalization layers in a DNN architecture. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting. There are uncountably many such bad DNNs. In contrast, the insertion of normalization layers can provably reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
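As a hedged illustration of why depth can inflate the Lipschitz constant, consider the standard composition bound below. It is a textbook fact rather than the paper's theorem, and the symbols $W_l$, $\sigma$, $L$ and $c$ are notation assumed here: a feed-forward network with weight matrices $W_l$ and a 1-Lipschitz activation $\sigma$ such as ReLU.

```latex
f(x) = W_L\,\sigma\bigl(W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x))\bigr),
\qquad
\operatorname{Lip}(f) \;\le\; \prod_{l=1}^{L} \lVert W_l \rVert_2 .
% If every layer has spectral norm \lVert W_l \rVert_2 = c > 1, the bound equals c^L,
% which grows exponentially in the depth L. The bound is attained, for example,
% by W_l = c I with ReLU, since then f(x) = c^L x on nonnegative inputs.
```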

artificial intelligence, lipschitz constant, machine learning, (17 more...)

arXiv.org Machine Learning

2511.00958

Country: Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models

Mudi, Prasenjit K, Sachan, Anshi, Devapriya, Dahlia, Kalyani, Sheetal

arXiv.org Artificial Intelligence | Oct-15-2025

Whisper models have achieved remarkable progress in speech recognition, yet their large size remains a bottleneck for deployment on resource-constrained edge devices. This paper proposes a framework to design fine-tuned variants of Whisper that address this problem. Structured sparsity is enforced via the Sparse Group LASSO penalty as a loss regularizer to reduce the number of floating-point operations (FLOPs). Further, a weight-statistics-aware pruning algorithm is proposed. On the Common Voice 11.0 Hindi dataset, we obtain, without degrading WER, (a) a 35.4% reduction in model parameters, 14.25% lower memory consumption and 18.5% fewer FLOPs on Whisper-small, and (b) a 31% reduction in model parameters, 15.29% lower memory consumption and 16.95% fewer FLOPs on Whisper-medium. We also (c) substantially outperform the state-of-the-art Iterative Magnitude Pruning based method, pruning 18.7% more parameters along with a 12.31 reduction in WER.
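For readers unfamiliar with the Sparse Group LASSO penalty, the sketch below computes its commonly used form: an L1 term over all weights mixed with a group-wise L2 term weighted by the square root of each group's size. The grouping of Whisper weights, the mixing weight `alpha`, the strength `lam`, and the helper name `sparse_group_lasso` are all assumptions for illustration; the abstract does not specify them.

```python
import math


def sparse_group_lasso(groups, lam=1.0, alpha=0.5):
    """Sparse Group LASSO penalty for weights partitioned into groups.

    Hypothetical helper: `groups` is a list of lists of floats, `alpha` mixes
    the L1 term (alpha) with the group L2 term (1 - alpha).
    """
    # Element-wise sparsity: sum of absolute values over all weights.
    l1 = sum(abs(w) for group in groups for w in group)
    # Group-wise sparsity: L2 norm of each group, scaled by sqrt(group size).
    group_l2 = sum(
        math.sqrt(len(group)) * math.sqrt(sum(w * w for w in group))
        for group in groups
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group_l2)


if __name__ == "__main__":
    weights = [[3.0, 4.0], [0.0, 0.0]]  # second group is already fully pruned
    print(sparse_group_lasso(weights))
```

Because the group term is not squared, an entire group can be driven exactly to zero, which is what makes the resulting sparsity structured; a zeroed group, like the second one above, contributes nothing to the penalty.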

artificial intelligence, machine learning, pruning, (19 more...)

arXiv.org Artificial Intelligence

2510.12666

Country:

Asia (0.28)
North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.70)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)
