AITopics

We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information obtained from Hessian-vector products or finite differences of parameters and gradients, similar to the BFGS algorithm. Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner. We update both preconditioners online using a criterion that is robust to stochastic gradient noise and does not require line search or damping. To preserve the corresponding symmetry or invariance, our preconditioners are constrained to certain connected Lie groups. The Lie group's equivariance property simplifies the preconditioner fitting process, while its invariance property eliminates the need for damping, which is commonly required in second-order optimizers. As a result, the learning rate for parameter updating and the step size for preconditioner fitting are naturally normalized, and their default values work well in most scenarios. Our proposed approach offers a promising direction for improving the convergence of SGD with low computational overhead. We demonstrate that Preconditioned SGD (PSGD) outperforms SoTA on Vision, NLP, and RL tasks across multiple modern deep-learning architectures. We have provided code for reproducing toy and large scale experiments in this paper.

optimizer, preconditioner, psgd, (14 more...)

2402.04553

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > Tompkins County > Ithaca (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Han, Yinbin, Razaviyayn, Meisam, Xu, Renyuan

Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded input, vector-valued output, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with a properly designed neural network architecture, the score function can be accurately approximated by a reproducing kernel Hilbert space induced by neural tangent kernels. Furthermore, by applying an early-stopping rule for gradient descent and leveraging certain coupling arguments between neural network training and kernel regression, we establish the first generalization error (sample complexity) bounds for learning the score function despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques.

conference paper, neural network, probability, (13 more...)

2401.15604

Country:

North America > United States > California (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Ito, Akira, Yamada, Masanori, Kumagai, Atsutoshi

Analysis of Linear Mode Connectivity via Permutation-Based Weight Matching

Recently, Ainsworth et al. showed that using weight matching (WM) to minimize the $L_2$ distance in a permutation search of model parameters effectively identifies permutations that satisfy linear mode connectivity (LMC), in which the loss along a linear path between two independently trained models with different seeds remains nearly constant. This paper provides a theoretical analysis of LMC using WM, which is crucial for understanding stochastic gradient descent's effectiveness and its application in areas like model merging. We first experimentally and theoretically show that permutations found by WM do not significantly reduce the $L_2$ distance between two models and the occurrence of LMC is not merely due to distance reduction by WM in itself. We then provide theoretical insights showing that permutations can change the directions of the singular vectors, but not the singular values, of the weight matrices in each layer. This finding shows that permutations found by WM mainly align the directions of singular vectors associated with large singular values across models. This alignment brings the singular vectors with large singular values, which determine the model functionality, closer between pre-merged and post-merged models, so that the post-merged model retains functionality similar to the pre-merged models, making it easy to satisfy LMC. Finally, we analyze the difference between WM and straight-through estimator (STE), a dataset-dependent permutation search method, and show that WM outperforms STE, especially when merging three or more models.

permutation, singular value, singular vector, (13 more...)

2402.04051

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Räisä, Ossi, Jälkö, Joonas, Honkela, Antti

Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic Optimisation

We study the effect of the batch size to the total gradient variance in differentially private stochastic gradient descent (DP-SGD), seeking a theoretical explanation for the usefulness of large batch sizes. As DP-SGD is the basis of modern DP deep learning, its properties have been widely studied, and recent works have empirically found large batch sizes to be beneficial. However, theoretical explanations of this benefit are currently heuristic at best. We first observe that the total gradient variance in DP-SGD can be decomposed into subsampling-induced and noise-induced variances. We then prove that in the limit of an infinite number of iterations, the effective noise-induced variance is invariant to the batch size. The remaining subsampling-induced variance decreases with larger batch sizes, so large batches reduce the effective total gradient variance. We confirm numerically that the asymptotic regime is relevant in practical settings when the batch size is not small, and find that outside the asymptotic regime, the total gradient variance decreases even more with large batch sizes. We also find a sufficient condition that implies that large batch sizes similarly reduce effective DP noise variance for one iteration of DP-SGD.

gaussian mechanism, mechanism, variance, (17 more...)

2402.0399

Country: Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

Hong, Yusu, Lin, Junhong

The Adaptive Momentum Estimation (Adam) algorithm is highly effective in training various deep learning tasks. Despite this, there's limited theoretical understanding for Adam, especially when focusing on its vanilla form in non-convex smooth scenarios with potential unbounded gradients and affine variance noise. In this paper, we study vanilla Adam under these challenging conditions. We introduce a comprehensive noise model which governs affine variance noise, bounded noise and sub-Gaussian noise. We show that Adam can find a stationary point with a $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ rate in high probability under this general noise model where $T$ denotes total number iterations, matching the lower rate of stochastic first-order algorithms up to logarithm factors. More importantly, we reveal that Adam is free of tuning step-sizes with any problem-parameters, yielding a better adaptation property than the Stochastic Gradient Descent under the same conditions. We also provide a probabilistic convergence result for Adam under a generalized smooth condition which allows unbounded smoothness parameters and has been illustrated empirically to more accurately capture the smooth property of many practical objective functions.

convergence, gradient, probability, (14 more...)

2402.03982

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(2 more...)

Genre: Research Report (0.63)

Industry: Education > Educational Setting > Online (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

He, Juncai, Liu, Liangchen, Tsai, Yen-Hsi Richard

Data-induced multiscale losses and efficient multirate gradient descent schemes

data-induced multiscale loss, efficient multirate gradient descent scheme, equation, (11 more...)

This paper investigates the impact of multiscale data on machine learning algorithms, particularly in the context of deep learning. A dataset is multiscale if its distribution shows large variations in scale across different directions. This paper reveals multiscale structures in the loss landscape, including its gradients and Hessians inherited from the data. Correspondingly, it introduces a novel gradient descent approach, drawing inspiration from multiscale algorithms used in scientific computing. This approach seeks to transcend empirical learning rate selection, offering a more systematic, data-informed strategy to enhance training efficiency, especially in the later stages.

2402.03021

Country:

North America > United States > Texas > Travis County > Austin (0.14)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

arXiv.org Machine LearningFeb-5-2024

Training morphological neural networks with gradient descent: some theoretical insights

Blusseau, Samy

Morphological neural networks, or layers, can be a powerful tool to boost the progress in mathematical morphology, either on theoretical aspects such as the representation of complete lattice operators, or in the development of image processing pipelines. However, these architectures turn out to be difficult to train when they count more than a few morphological layers, at least within popular machine learning frameworks which use gradient descent based optimization algorithms. In this paper we investigate the potential and limitations of differentiation based approaches and back-propagation applied to morphological networks, in light of the non-smooth optimization concept of Bouligand derivative. We provide insights and first theoretical guidelines, in particular regarding initialization and learning rates.

morphological layer, morphological network, neural network, (16 more...)

arXiv.org Machine Learning

2403.12975

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > France (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

arXiv.org Artificial IntelligenceFeb-5-2024

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

Lin, Wu, Dangel, Felix, Eschenhagen, Runa, Bae, Juhan, Turner, Richard E., Makhzani, Alireza

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e. strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart's performance on transformers. The second-order perspective also has practical benefits for the development of adaptive methods with non-diagonal preconditioner. In contrast to root-based counterparts like Shampoo, they do not require numerically unstable matrix square roots and therefore work well in low precision, which we demonstrate empirically. This raises important questions regarding the currently overlooked role of adaptivity for the success of adaptive methods.

adaptive method, approximation, square root, (15 more...)

2402.03496

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Indiana > Hamilton County > Fishers (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Zehtabi, Shahryar, Han, Dong-Jun, Parasnis, Rohit, Hosseinalipour, Seyyedali, Brinton, Christopher G.

Decentralized Sporadic Federated Learning: A Unified Methodology with Generalized Convergence Guarantees

arXiv.org Artificial IntelligenceFeb-5-2024

Decentralized Federated Learning (DFL) has received significant recent research attention, capturing settings where both model updates and model aggregations -- the two key FL processes -- are conducted by the clients. In this work, we propose Decentralized Sporadic Federated Learning ($\texttt{DSpodFL}$), a DFL methodology which generalizes the notion of sporadicity in both of these processes, modeling the impact of different forms of heterogeneity that manifest in realistic DFL settings. $\texttt{DSpodFL}$ unifies many of the prominent decentralized optimization methods, e.g., distributed gradient descent (DGD), randomized gossip (RG), and decentralized federated averaging (DFedAvg), under a single modeling framework. We analytically characterize the convergence behavior of $\texttt{DSpodFL}$, showing, among other insights, that we can match a geometric convergence rate to a finite optimality gap under more general assumptions than in existing works. Through experiments, we demonstrate that $\texttt{DSpodFL}$ achieves significantly improved training speeds and robustness to variations in system parameters compared to the state-of-the-art.

assumption 4, decentralized sporadic federated learning, nullnull 1, (13 more...)

2402.03448

Country:

North America > United States > New York > Erie County > Buffalo (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Artificial IntelligenceFeb-5-2024

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

Dandi, Yatin, Troiani, Emanuele, Arnaboldi, Luca, Pesce, Luca, Zdeborová, Lenka, Krzakala, Florent

We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.

algorithm, gradient descent, target function, (15 more...)

2402.0322

Country:

North America > United States (0.28)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.25)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)