AITopics | training instability

training instability

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Kalra, Dayal Singh, Barkeshli, Maissam

arXiv.org Machine Learning, May-21-2026

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
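The embedding-layer adjustment described in the abstract can be illustrated with a minimal sketch. It assumes parameters are identified by name and that any name containing `embed` belongs to the embedding layer; the helper `build_lr_groups`, the parameter names, and the numeric values are hypothetical and are not taken from the paper. How the readout layer should be treated is not stated in the abstract, so it is left at the base rate here.

```python
def build_lr_groups(param_names, base_lr, width, scale_embedding=True):
    """Map each parameter name to a learning rate.

    Assumption: names containing 'embed' denote the embedding layer.
    With scale_embedding=True, the embedding learning rate is the
    standard-parameterization rate multiplied by the model width.
    """
    lrs = {}
    for name in param_names:
        is_embedding = "embed" in name
        if is_embedding and scale_embedding:
            lrs[name] = base_lr * width  # raised by a factor of width
        else:
            lrs[name] = base_lr  # all other layers keep the base rate
    return lrs


# Hypothetical parameter names and values, for illustration only.
names = ["embed.weight", "block0.attn.weight", "block0.mlp.weight", "lm_head.weight"]
lrs = build_lr_groups(names, base_lr=1e-4, width=512)
for name in names:
    print(f"{name}: {lrs[name]:.4g}")
```

In a framework that supports per-parameter optimizer options, a mapping like this would typically be turned into separate parameter groups, one per distinct learning rate.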

large language model, machine learning, parameterization, (18 more...)

arXiv.org Machine Learning

2605.21486

Country: North America (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

aac02401755a65904cf977a33136af4a-Paper-Conference.pdf

Neural Information Processing Systems, Feb-11-2026, 06:57:33 GMT

batch size, sequence length, training instability, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Appendices

Neural Information Processing Systems, Feb-8-2026, 22:29:37 GMT

The reason is discussed in Appendix E. A library

anomaly detection, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Israel (0.04)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision (0.70)
Information Technology > Sensing and Signal Processing > Image Processing (0.69)

ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction

Neural Information Processing Systems, Feb-8-2026, 22:29:33 GMT

UAD has a wide range of applications, e.g., industrial

artificial intelligence, data mining, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.46)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

Neural Information Processing Systems, Dec-24-2025, 23:14:17 GMT

Recent works have demonstrated great success in pre-training large-scale autoregressive language models (e.g., GPT-3) on large GPU clusters. To reduce wall-clock training time, a common practice is to increase the batch size and learning rate. However, this practice is often brittle and leads to a so-called stability-efficiency dilemma: increasing the batch size and learning rate improves training efficiency but can also cause training instability, leading to poor generalization accuracy or failed runs. To better understand this phenomenon, we conduct an in-depth analysis of large-scale pre-training experiments replicating the GPT-2 model on a public dataset. We find a strong correlation between training instability and extreme values of gradient variance.
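The title names sequence length warmup, but the excerpt above does not describe the schedule itself. The following is only a sketch of one plausible form, a linear ramp from a short starting length to the full length; the function `seq_len_at_step`, the rounding to a multiple of 8, and all numeric values are assumptions for illustration, not the authors' method.

```python
def seq_len_at_step(step, warmup_steps, start_len, full_len, multiple=8):
    """Linearly ramp the training sequence length over warmup_steps.

    Assumption: a linear ramp, rounded down to a multiple of `multiple`.
    The schedule used in the paper is not given in the excerpt.
    """
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    raw = start_len + frac * (full_len - start_len)
    rounded = int(raw // multiple) * multiple
    # Clamp so rounding never drops below the start or exceeds the full length.
    return max(start_len, min(full_len, rounded))


# Hypothetical settings: ramp from 64 to 1024 tokens over 100 steps.
for step in (0, 25, 50, 99, 100):
    print(step, seq_len_at_step(step, warmup_steps=100, start_len=64, full_len=1024))
```

In a training loop, each batch would be truncated to the returned length before the forward pass, so early steps see shorter sequences.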

sequence length warmup, stability-efficiency dilemma, training instability, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

CalFAT: Calibrated Federated Adversarial Training with Label Skewness

Neural Information Processing Systems, Dec-23-2025, 20:03:49 GMT

Recent studies have shown that, like traditional machine learning, federated learning (FL) is also vulnerable to adversarial attacks. To improve the adversarial robustness of FL, federated adversarial training (FAT) methods have been proposed to apply adversarial training locally before global aggregation. Although these methods demonstrate promising results on independent and identically distributed (IID) data, they suffer from training instability on non-IID data with label skewness, resulting in degraded natural accuracy. This tends to hinder the application of FAT in real-world settings where the label distribution across clients is often skewed. In this paper, we study the problem of FAT under label skewness, and reveal one root cause of the training instability and natural accuracy degradation issues: skewed labels lead to non-identical class probabilities and heterogeneous local models. We then propose a Calibrated FAT (CalFAT) approach to tackle the instability issue by calibrating the logits adaptively to balance the classes. We show both theoretically and empirically that the optimization of CalFAT leads to homogeneous local models across the clients and better convergence points.
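The abstract does not spell out how the logits are calibrated. One common way to balance classes under label skew is to add the log of each class's local frequency to its logit before the softmax; the sketch below shows that form as an assumption, not as the confirmed CalFAT loss. The function name `calibrated_cross_entropy` and the example counts are hypothetical.

```python
import math


def calibrated_cross_entropy(logits, label, class_counts):
    """Cross-entropy on logits shifted by log class priors.

    Assumption: calibration means z_j + log(prior_j), with priors taken
    from a client's local class counts. The exact loss used by CalFAT
    is not stated in the excerpt.
    """
    if any(c <= 0 for c in class_counts):
        raise ValueError("every class needs a positive count")
    total = sum(class_counts)
    adjusted = [z + math.log(c / total) for z, c in zip(logits, class_counts)]
    # Numerically stable log-sum-exp over the adjusted logits.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


# With balanced counts the shift cancels and this is ordinary cross-entropy.
print(calibrated_cross_entropy([1.0, 2.0, 3.0], label=2, class_counts=[10, 10, 10]))
# With skewed counts, a rare class incurs a larger loss at equal logits.
print(calibrated_cross_entropy([0.0, 0.0, 0.0], label=1, class_counts=[90, 5, 5]))
```

Under this form, a client that rarely sees a class must produce a larger raw logit for it to reach the same loss, which counteracts the bias toward locally frequent classes.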

calfat, calibrated federated adversarial training, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

Training Instabilities Induce Flatness Bias in Gradient Descent

Wang, Lawrence, Roberts, Stephen J.

arXiv.org Artificial Intelligence, Nov-18-2025

Classical analyses of gradient descent (GD) define a stability threshold based on the largest eigenvalue of the loss Hessian, often termed sharpness. When the learning rate lies below this threshold, training is stable and the loss decreases monotonically. Yet, modern deep networks often achieve their best performance beyond this regime. We demonstrate that such instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape and thereby improving generalization. The key mechanism is the Rotational Polarity of Eigenvectors (RPE), a geometric phenomenon in which the leading eigenvectors of the Hessian rotate during training instabilities. These rotations, which grow with the learning rate, promote exploration and provably lead to flatter minima. This theoretical framework extends to stochastic GD, where instability-driven flattening persists and its empirical effects outweigh minibatch noise. Finally, we show that restoring instabilities in Adam further improves generalization. Together, these results establish and explain the constructive role of training instabilities in deep learning.
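The classical threshold mentioned at the start of the abstract can be checked on a toy problem. For a quadratic loss with Hessian eigenvalues λ_i, gradient descent is stable when the learning rate is below 2/λ_max. The sketch below uses a diagonal quadratic with hypothetical eigenvalues; it illustrates only the textbook threshold, not the paper's RPE analysis.

```python
def stability_threshold(eigenvalues):
    """Classical GD stability threshold for a quadratic loss: 2 / lambda_max."""
    return 2.0 / max(eigenvalues)


def run_gd(eigenvalues, lr, steps=20, x0=1.0):
    """Run GD on L(x) = 0.5 * sum(lam_i * x_i**2) and return the loss per step."""
    xs = [x0] * len(eigenvalues)
    losses = []
    for _ in range(steps + 1):
        losses.append(0.5 * sum(lam * x * x for lam, x in zip(eigenvalues, xs)))
        # The gradient of the loss along coordinate i is lam_i * x_i.
        xs = [x - lr * lam * x for lam, x in zip(eigenvalues, xs)]
    return losses


# Hypothetical Hessian spectrum; sharpness (largest eigenvalue) is 4.
eigs = [1.0, 4.0]
print("threshold:", stability_threshold(eigs))
stable = run_gd(eigs, lr=0.4)    # below the threshold
unstable = run_gd(eigs, lr=0.6)  # above the threshold
print("stable final loss:", stable[-1])
print("unstable final loss:", unstable[-1])
```

Below the threshold every coordinate contracts at each step, so the loss falls monotonically; above it, the sharpest direction overshoots and its contribution grows without bound on this quadratic.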

artificial intelligence, instability, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2511.12558

Country:

North America > Canada > Ontario (0.27)
Europe > United Kingdom > England (0.27)

Genre:

Research Report > New Finding (1.00)
Workflow (0.92)
Instructional Material > Course Syllabus & Notes (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

A Non-Adversarial Approach to Idempotent Generative Modelling

Al-Jaff, Mohammed, Marchetti, Giovanni Luca, Welle, Michael C, Lundell, Jens, Gustafsson, Mats G., Henter, Gustav Eje, Azizpour, Hossein, Kragic, Danica

arXiv.org Artificial Intelligence, Nov-5-2025

Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially -- an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN's ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.3233/FAIA251035

2511.02614

Country: North America (0.28)

Genre: