AITopics | gradient explosion

Revitalizing SVD for Global Covariance Pooling: Halley's Method to Overcome Over-Flattening

Neural Information Processing Systems, Jun-10-2026, 12:19:28 GMT

Global Covariance Pooling (GCP) has garnered increasing attention in visual recognition tasks, where second-order statistics frequently yield stronger representations than first-order approaches.

artificial intelligence, machine learning, proceedings, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.39)

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Neural Information Processing Systems, Apr-25-2026, 04:01:45 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to several recently proposed normalization layers. Our primary findings follow: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric techniques require explicit remedies; (ii) use of GroupNorm can ensure an informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; and (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.

artificial intelligence, batchnorm, machine learning, (16 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
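
The abstract above ties GroupNorm's behavior to its group size, with Instance Normalization and LayerNorm at the two extremes. A minimal pure-Python sketch of that relationship follows; it is an illustration only, not the paper's code. The function name `group_norm`, the toy `sample`, and the omission of the learnable affine parameters are all assumptions made here for brevity.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample x, given as a list of channels (lists of floats).

    Channels are split into num_groups contiguous groups, and each group is
    shifted to zero mean and scaled to unit variance. The learnable scale and
    shift of a real normalization layer are left out (assumption for brevity).
    """
    num_channels = len(x)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    group_size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        channels = x[g * group_size:(g + 1) * group_size]
        values = [v for ch in channels for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in channels])
    return out


# Hypothetical toy input: 4 channels with 4 spatial positions each.
sample = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [-1.0, 0.0, 1.0, 2.0],
    [5.0, 5.5, 6.0, 6.5],
]

# Group size 1 (one group per channel) behaves like Instance Normalization:
# every channel is normalized on its own, so each channel mean is ~0.
instance_like = group_norm(sample, num_groups=4)

# Group size 4 (a single group) behaves like LayerNorm without affine terms:
# statistics are shared, so individual channel means stay distinct.
layer_like = group_norm(sample, num_groups=1)
```

Varying `num_groups` between these two extremes moves along the group-size axis that the abstract relates to the speed-stability tradeoff.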

2578eb9cdf020730f77793e8b58e165a-Supplemental.pdf

Neural Information Processing Systems, Feb-7-2026, 22:14:55 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning.

artificial intelligence, inproc, machine learning, (18 more...)

Country: Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

2578eb9cdf020730f77793e8b58e165a-Paper.pdf

Neural Information Processing Systems, Feb-7-2026, 22:14:52 GMT

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning.

artificial intelligence, inproc, machine learning, (16 more...)

Country: Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Preventing Gradient Explosions in Gated Recurrent Units

Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura

Neural Information Processing Systems, Nov-21-2025, 13:47:49 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (18 more...)

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
(2 more...)

Industry:

Media > Music (0.47)
Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Mean Field Residual Networks: On the Edge of Chaos

Greg Yang

Neural Information Processing Systems, Nov-21-2025, 11:26:51 GMT

Work done while at Harvard University. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. These works have focused on vanilla (fully connected) feedforward networks.

artificial intelligence, machine learning, residual network, (19 more...)

Country: North America > United States > California > Los Angeles County > Long Beach (0.24)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

a7453a5f026fb6831d68bdc9cb0edcae-AuthorFeedback.pdf

Neural Information Processing Systems, Aug-15-2025, 15:32:54 GMT

batch size, reviewer, weight norm, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.74)

Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li

arXiv.org Artificial Intelligence, Feb-25-2025

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.

enabling stable and effective training, gradient explosion, scale-distribution decoupling, (11 more...)

2502.15499

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
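
The SDD abstract above describes a fully-connected layer whose output is normalized and then rescaled by a learnable vector. The sketch below illustrates that general idea in pure Python; it is not the authors' implementation. The abstract does not say which normalization is used, so the RMS normalization here is an assumption, as are the names `sdd_like_linear`, `weight`, and `alpha` and the toy values.

```python
import math


def sdd_like_linear(x, weight, alpha, eps=1e-6):
    """Hypothetical layer: alpha * normalize(weight @ x), element-wise in alpha.

    ASSUMPTION: RMS normalization stands in for the unspecified
    "normalization mechanism" mentioned in the abstract.
    """
    # Plain matrix-vector product: one pre-activation per output unit.
    z = [sum(w * v for w, v in zip(row, x)) for row in weight]
    # Dividing by the root mean square removes the overall scale of z.
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    # The learnable vector alpha alone sets the output scale.
    return [a * v / rms for a, v in zip(alpha, z)]


# Hypothetical toy values.
x = [1.0, 2.0, 3.0]
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
alpha = [1.0, 2.0]

out_small = sdd_like_linear(x, weight, alpha)

# Scaling the weight matrix by 1000 leaves the output almost unchanged,
# because the output scale is carried by alpha rather than by the weights.
big_weight = [[1000.0 * w for w in row] for row in weight]
out_large = sdd_like_linear(x, big_weight, alpha)
```

In this toy setting the magnitude of `weight` no longer controls the magnitude of the activations, which is one way to read the "decoupling" of scale and distribution the abstract refers to.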

Multi-Objective Large Language Model Unlearning

Zibin Pan, Shuwen Zhang, Yuesheng Zheng, Chi Li, Yuheng Cheng, Junhua Zhao

arXiv.org Artificial Intelligence, Jan-4-2025

Machine unlearning for large language models (LLMs), which aims to eliminate undesirable behaviors from LLMs without retraining from scratch, has recently attracted great attention. In this paper, we explore the Gradient Ascent (GA) approach to LLM unlearning, which proactively decreases the model's prediction probability on the target data in order to remove its influence. We analyze two challenges that render the process impractical: gradient explosion and catastrophic forgetting. To address these issues, we propose the Multi-Objective Large Language Model Unlearning (MOLLM) algorithm. We first formulate LLM unlearning as a multi-objective optimization problem, in which the cross-entropy loss is modified to an unlearning version to overcome the gradient explosion issue. A common descent update direction is then calculated, which enables the model to forget the target data while preserving the utility of the LLM. Our empirical results verify that MOLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation. The source code is available at https://github.com/zibinpan/MOLLM.

fgt, large language model, machine learning, (16 more...)

2412.20412

Country: Asia > China > Guangdong Province (0.16)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
