AITopics | sgd

Collaborating Authors

sgd

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias

Das, Mohua, Beneventano, Pierfrancesco, Dey, Shibshankar, McKinkey, Gareth H., Poggio, Tomaso

arXiv.org Machine LearningMay-29-2026

Randomly initialized neural networks induce a prior over functions, but the predictor used in practice is produced only after training. We ask how much of this initial bias survives the training pipeline. To make the question measurable, we introduce initialization memory: the dependence of the validation-selected predictor on the scale of the random initialization. We perform controlled CIFAR-10 experiments on ResNets where initialization memory already sharply separates training regimes. Low-learning-rate SGD can interpolate while still remembering its initialization: on ResNet-9 with batch size $b=128$, test accuracy varies by $26.5$ percentage points across initialization scales despite $\ge99.5\%$ training accuracy. This is not undertraining: extending the same low-learning-rate regime to $5{,}000$ epochs leaves the spread essentially unchanged. In contrast, Adam-family methods largely erase the dependence. SGD can also be made to forget when larger learning rates are paired with explicit $L_2$ norm control. We interpret these findings in terms of the time scale of forgetting: gradient-flow-like dynamics can preserve initialization memory, whereas stochastic finite-step effects, explicit norm decay, and adaptive preconditioning erase it on scales governed by the size of explicit or implicit regularization. The practical inductive bias of a trained network is therefore not the architectural prior alone, but the architectural prior after being filtered by the forgetting dynamics of the training pipeline; and the same regularizers that improve generalization are precisely those that erase memory of initialization.

artificial intelligence, international conference, machine learning, (15 more...)

arXiv.org Machine Learning

2605.29152

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

Jolakoski, Petar

arXiv.org Machine LearningMay-29-2026

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $λ=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.

artificial intelligence, machine learning, noise, (19 more...)

arXiv.org Machine Learning

2605.30059

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)

Add feedback

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

Maziane, Yassine, Mahran, Ammar, Maranjyan, Artavazd, Richtárik, Peter

arXiv.org Machine LearningMay-21-2026

Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.

artificial intelligence, communication, machine learning, (18 more...)

arXiv.org Machine Learning

2605.20866

Country: North America (0.28)

Genre: Research Report > Experimental Study (0.46)

Industry: Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights

Ricci, Fabiola, Merger, Claudia, Goldt, Sebastian

arXiv.org Machine LearningMay-19-2026

Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within $n \ll N^3$ steps, but needs at least $n \gg N^3 \log^2{N}$ steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.

artificial intelligence, fourier data model, machine learning, (15 more...)

arXiv.org Machine Learning

2605.16913

Country: Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Add feedback

High-dimensional Limit of SGD for Diagonal Linear Networks

Malaxechebarría, Begoña García, Paquette, Courtney, Fazel, Maryam, Drusvyatskiy, Dmitriy

arXiv.org Machine LearningMay-19-2026

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Machine Learning

2605.17177

Country: North America > United States > New York (0.27)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Add feedback

Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

Kassing, Sebastian, Kruse, Thomas

arXiv.org Machine LearningMay-15-2026

Stochastic gradient descent (SGD) has been studied extensively over the past decades due to its simplicity and broad applicability in machine learning. In this work, we analyze the local behavior of gradient descent and stochastic gradient descent for minimizing $C^2$-functions that satisfy the Polyak-Lojasiewicz (PL) inequality and under a multiplicative gradient noise model motivated by overparameterized neural networks. Using a geometric interpretation of the PL-condition, we prove a simple yet surprising fact: in this possibly non-convex setting, the asymptotic convergence rate of (S)GD matches the rate obtained for strongly convex quadratics.

artificial intelligence, machine learning, projn, (17 more...)

arXiv.org Machine Learning

2605.14663

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Hyperparameter Transfer for Dense Associative Memories

Holtzman, Roi, Krotov, Dmitry, Hanin, Boris

arXiv.org Machine LearningMay-12-2026

Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

artificial intelligence, denseam, machine learning, (18 more...)

arXiv.org Machine Learning

2605.10164

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

Sahu, Sharan, Sarkar, Abir, Hogan, Cameron J., Wells, Martin T.

arXiv.org Machine LearningMay-7-2026

We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $β_1$, and a preconditioner perturbation governed by $β_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise--drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(β_1,β_2,ε)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.

artificial intelligence, log 2, machine learning, (15 more...)

arXiv.org Machine Learning

2605.04269

Country: North America > United States > New York (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models Supplementary material

Neural Information Processing SystemsMay-1-2026, 06:27:07 GMT

The appendix is organized into five sections as follows: 1. Appendix A derives the Volterra equation and proves the main result for the homogenized SGD (Theorem 1). 2. We show in Appendix B a heuristic derivation of the homogenized SGD approximation to the SDA class of algorithms on the least squares problem and we show that SGD and homogenized SGD are close under orthogonal invariance (Theorem 2). 3. We give in Appendix C a general overview of the analysis of a convolution Volterra equation of the type that arises in the SDA class. Unless otherwise stated, all the results hold under Assumptions 1 and 2. We include all statements from the previous sections for clarity. The results presented in this paper concern the analysis of existing methods and a new method that is a variant of an existing method. The results are theoretical and we do not anticipate any direct ethical and societal issues. We believe the results will be used by machine learning practitioners and we encourage them to use it to build a more just, prosperous world. A.1 Homogenized SGD We recall that the diffusion model is given by dXt = 2 dZt 1 To connect these diffusions to SGD on the least squares problem (2.1) f(x)= 1 2 kAx bk2, we will use the singular value decomposition of U VT of A. We order the singular values 1 2 3 in decreasing order. We then let t = VT(Xt ex), where we recall that b = Aex+ . We may do a similar computation with N and conclude that: J(1) = 2 2 2jJ 2 1 '(t) '(s)d s,j In summary, we may express J in terms of N by J(1) = 2 2 2jJ 1 '2(t) N(1) + 22 dh t,jiwith J(0) = EH When (k,n)= k+n and thus '(t)=(1+ t) with (t)= 1+t, the corresponding ODE is precisely bJ(3) The other case is when (k,n)= n, or '(t)=exp( t). We call this the general SDAHB; one recovers SDAHB when 1 =, 2 =0, and = .

artificial intelligence, equation, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.54)

Add feedback

Filters

Collaborating Authors

sgd

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights

High-dimensional Limit of SGD for Diagonal Linear Networks

Optimal Asymptotic Rates for (Stochastic) Gradient Descent under the Local PL-Condition: A Geometric Approach

Hyperparameter Transfer for Dense Associative Memories

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models Supplementary material

4cf0ed8641cfcbbf46784e620a0316fb-Paper.pdf