AITopics | diagonal linear network

Incremental Learning in Mirror Flows

Neural networks trained with gradient descent often learn solutions of increasing complexity: the model first captures simple structure, then progressively incorporates finer details [AJB+17, KKN+19, ZSL25]. This incremental learning phenomenon, often visible as plateaus in the training loss separated by rapid transitions, has attracted significant theoretical attention. The most detailed analyses of incremental learning have been carried out for diagonal linear networks, including precise descriptions of transition times and plateau levels [Ber23, PF23]. This level of detail is possible because the training dynamics of these networks reduce to a mirror flow [WGL+20]. Mirror flows themselves feature incremental learning when initialized near the boundary of the domain of the mirror potential. This paper gives a rigorous description of this phenomenon for a broad class of mirror flows, thereby generalizing the previous analyses of diagonal linear networks.
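The plateau-and-transition behaviour described in this abstract can be reproduced in a toy setting. The sketch below is not taken from the paper: it assumes the parametrization w = u*u - v*v, a decoupled quadratic loss, and hypothetical values for the target, initialization scale `alpha`, and step size `lr`. With a small initialization, coordinates with larger targets leave the neighbourhood of zero first, so the coordinates are learned one after another.

```python
import numpy as np

def train_diagonal_network(target, alpha=1e-4, lr=0.01, steps=4000):
    """Gradient descent on 0.5 * ||u*u - v*v - target||^2 from u = v = alpha.

    The parametrization and the loss are illustrative assumptions.
    """
    u = np.full_like(target, alpha, dtype=float)
    v = np.full_like(target, alpha, dtype=float)
    history = []
    for _ in range(steps):
        residual = u * u - v * v - target
        # Chain rule: dL/du = 2 * u * residual, dL/dv = -2 * v * residual.
        grad_u = 2.0 * u * residual
        grad_v = -2.0 * v * residual
        u = u - lr * grad_u
        v = v - lr * grad_v
        history.append(u * u - v * v)
    return np.array(history)

def half_times(history, target):
    """First step at which each coordinate reaches half of its target."""
    reached = np.abs(history) >= 0.5 * np.abs(target)
    return reached.argmax(axis=0)

# Hypothetical targets of decreasing magnitude.
target = np.array([3.0, 1.5, 0.5])
history = train_diagonal_network(target)
times = half_times(history, target)
print("steps to reach half of each target:", times)
```

Each coordinate stays near zero for a number of steps that grows as its target shrinks and as `alpha` decreases, which is the toy analogue of a plateau in the training loss followed by a rapid transition.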

artificial intelligence, machine learning, mirror flow, (14 more...)

arXiv.org Machine Learning

2606.23198

Country: Europe > France (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

High-dimensional Limit of SGD for Diagonal Linear Networks

Malaxechebarría, Begoña García; Paquette, Courtney; Fazel, Maryam; Drusvyatskiy, Dmitriy

arXiv.org Machine Learning | May-19-2026

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Machine Learning

2605.17177

Country: North America > United States > New York (0.27)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

17a9ab4190289f0e1504bbb98d1d111a-Paper-Conference.pdf

Neural Information Processing Systems | Apr-25-2026, 08:29:00 GMT

artificial intelligence, iterate, machine learning, (15 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Inductive biases of multi-task learning and finetuning: multiple regimes of feature reuse

Neural Information Processing Systems | Feb-18-2026, 07:44:09 GMT

Neural networks are often trained on multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In particular, it is common practice to pretrain neural networks on a large auxiliary task before finetuning on a downstream task with fewer samples. Despite the prevalence of this approach, the inductive biases that arise from learning multiple tasks are poorly characterized. In this work, we address this gap.

artificial intelligence, machine learning, relu network, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

b5b528767aa35f5b1a60fe0aaeca0563-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 16:48:19 GMT

artificial intelligence, linear network, machine learning, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Implicit Bias of (Stochastic) Gradient Descent for Rank-1 Linear Neural Network

Neural Information Processing Systems | Feb-16-2026, 16:48:15 GMT

Unfortunately, even for standard linear networks in the regression setting, a comprehensive characterization of the implicit bias is still an open problem.

artificial intelligence, linear network, machine learning, (18 more...)

Neural Information Processing Systems

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

(S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability

Neural Information Processing Systems | Feb-12-2026, 10:13:19 GMT

Currently, most theoretical works on implicit regularisation have focused on continuous-time approximations of (S)GD, in which the impact of crucial hyperparameters such as the stepsize and the minibatch size is ignored. One common simplification is to analyse gradient flow, which is the continuous-time limit of GD and minibatch SGD as the stepsize becomes infinitesimal. By definition, this analysis does not capture the effect of stepsize or stochasticity.

artificial intelligence, machine learning, stepsize, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

5da6ce80e97671b70c01a2e703b868b3-Paper-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 10:13:16 GMT

artificial intelligence, machine learning, stepsize, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Saddle-to-Saddle Dynamics in Diagonal Linear Networks

Neural Information Processing Systems | Feb-8-2026, 08:35:12 GMT

The main result is informally presented here.

artificial intelligence, iterate, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

(S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability

Neural Information Processing Systems | Dec-25-2025, 13:25:36 GMT

In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over 2-layer diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime. Our findings are supported by experimental results.

diagonal linear network, implicit bias, stepsize and edge, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
