AITopics | weight normalization

15de21c670ae7c3f6f3f1f37029303c9-Paper.pdf

Neural Information Processing Systems, Apr-24-2026, 20:33:02 GMT

artificial intelligence, machine learning, pruning, (18 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Communications > Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Tim Salimans, Durk P. Kingma

Neural Information Processing Systems, Apr-22-2026, 12:13:32 GMT

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
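
As a rough illustration of the reparameterization this abstract refers to, the sketch below writes a single weight vector as w = g * v / ||v||, so that the scalar g carries the length and v carries only the direction. The function names `weight_norm` and `weight_norm_grads` are hypothetical, and the code is a minimal pure-Python sketch for one weight vector, not the authors' implementation.

```python
import math

def weight_norm(v, g):
    """Return w = g * v / ||v|| for one weight vector v and scalar gain g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

def weight_norm_grads(v, g, grad_w):
    """Map a gradient with respect to w back to gradients for (v, g)."""
    norm = math.sqrt(sum(x * x for x in v))
    # dL/dg = (grad_w . v) / ||v||
    grad_g = sum(gw * x for gw, x in zip(grad_w, v)) / norm
    # dL/dv = (g / ||v||) * grad_w - (g * dL/dg / ||v||^2) * v
    grad_v = [g / norm * gw - g * grad_g / (norm * norm) * x
              for gw, x in zip(grad_w, v)]
    return grad_v, grad_g

# Illustrative values only: the resulting w has Euclidean norm g = 2.0.
w = weight_norm([3.0, 4.0], g=2.0)
grad_v, grad_g = weight_norm_grads([3.0, 4.0], 2.0, grad_w=[1.0, -1.0])
```

With this parameterization the gradient for v is orthogonal to v, so a gradient step on v changes the direction of w, while the length of w is controlled by g alone.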

artificial intelligence, machine learning, normalization, (17 more...)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Inherent Weight Normalization in Stochastic Neural Networks

Georgios Detorakis, Sourav Dutta, Abhishek Khanna, Matthew Jerry, Suman Datta, Emre Neftci

Neural Information Processing Systems, Feb-14-2026, 16:00:18 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, neural network, nsm, (12 more...)

Country:

North America > United States > California > Orange County > Irvine (0.14)
North America > United States > Indiana > St. Joseph County > Notre Dame (0.05)
North America > Canada (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Norm matters: efficient and accurate normalization schemes in deep networks

Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry

Neural Information Processing Systems, Feb-13-2026, 22:42:23 GMT

Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.

artificial intelligence, deep learning, machine learning, (19 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

cf38eb1549024cce4b3d2c1bb87a6c27-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-12-2026, 00:13:05 GMT

equation, learning rate, neural architecture, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

1de7d2b90d554be9f0db1c338e80197d-Paper.pdf

Neural Information Processing Systems, Feb-7-2026, 17:24:18 GMT

arxiv preprint arxiv, converge, normalization, (13 more...)

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Asia > India (0.04)
North America > United States > Pennsylvania (0.04)
(3 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.71)

Implicit Regularization and Convergence for Weight Normalization

Neural Information Processing Systems, Dec-23-2025, 20:06:38 GMT

Normalization methods such as batch, weight, instance, and layer normalization are commonly used in modern machine learning. Here, we study the weight normalization (WN) method (Salimans and Kingma, 2016) and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least squares regression and some more general loss functions. WN and rPGD reparametrize the weights with a scale $g$ and a unit vector such that the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge linearly close to the minimum $\ell_2$ norm solution even for initializations far from zero.
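
The setting in this abstract can be sketched on a toy problem: plain gradient descent on 0.5 * ||X w - y||^2 with the weights written as w = g * v / ||v||. The function name `wn_least_squares`, the data, the initialization, the step size, and the step count are all illustrative assumptions. This is not the paper's algorithm (the rPGD variant and the paper's step-size choices are not reproduced here).

```python
import math

def wn_least_squares(X, y, v, g, lr=0.1, steps=500):
    """Gradient descent on 0.5 * ||X w - y||^2 with w = g * v / ||v||.

    X is a list of rows and y a list of targets. The defaults for lr and
    steps are illustrative choices, not values taken from the paper.
    """
    losses = []
    for _ in range(steps):
        norm = math.sqrt(sum(c * c for c in v))
        w = [g * c / norm for c in v]
        # Residuals r = X w - y and the loss before this update.
        r = [sum(a * b for a, b in zip(row, w)) - t for row, t in zip(X, y)]
        losses.append(0.5 * sum(e * e for e in r))
        # Gradient for w is X^T r; the chain rule gives gradients for g and v.
        grad_w = [sum(row[j] * e for row, e in zip(X, r)) for j in range(len(w))]
        grad_g = sum(a * b for a, b in zip(grad_w, v)) / norm
        grad_v = [g / norm * gw - g * grad_g / norm ** 2 * c
                  for gw, c in zip(grad_w, v)]
        g -= lr * grad_g
        v = [c - lr * gv for c, gv in zip(v, grad_v)]
    norm = math.sqrt(sum(c * c for c in v))
    return [g * c / norm for c in v], losses

# One equation, two unknowns: a tiny overparametrized problem.
w, losses = wn_least_squares(X=[[1.0, 2.0]], y=[1.0], v=[1.0, 0.0], g=0.1)
```

For this toy problem the minimum l2-norm interpolating solution is [0.2, 0.4], so comparing the returned w against it gives a quick, informal feel for the regularization effect the abstract describes.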

implicit regularization and convergence, name change, weight normalization, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

arXiv.org Artificial Intelligence, Nov-25-2025

Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2511.1889

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.67)

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Neural Information Processing Systems, Nov-21-2025, 15:33:34 GMT

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.

accelerate training, simple reparameterization, weight normalization, (6 more...)

Technology: