Gradient Descent
Desynchronous Learning in a Physics-Driven Learning Network
Wycoff, Jacob F, Dillavou, Sam, Stern, Menachem, Liu, Andrea J, Durian, Douglas J
Here we demonstrate that desynchronous implementation of coupled learning is effective in self-adjusting resistor networks, in both simulation and experiment. Furthermore, we Learning is a special case of memory [1, 2], where the goal show that desynchronous learning can actually improve performance is to encode targeted functional responses in a network [3-by allowing the system to evolve indefinitely, escaping 6]. Artificial Neural Networks (ANNs) are complex functions local minima. We draw a direct analogy between stochastic designed to achieve such targeted responses. These networks gradient descent and desynchronous learning, and show are trained by using gradient descent on a cost function, they have similar effects on the learning degrees of freedom which evolves the system's parameters until a local minimum in our system. Thus we are able to remove the final vestige of is found [7, 8]. Typically, this algorithm is modified non-locality from our physics-driven learning network, moving such that subsections (batches) of data are used at each training it closer to biological implementations of learning. The step, effectively adding noise to the gradient calculation, ability to learn with entirely independent learning elements is known as Stochastic Gradient Descent (SGD) [9]. This algorithm expected to greatly improve the scalability of such physical produces more generalizable results [10-12], i.e. better learning systems.
#NeurIPS2022 outstanding paper โ Gradient descent: the ultimate optimizer
Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley and Erik Meijer won a NeurIPS 2022 outstanding paper award for their work Gradient descent: the ultimate optimizer. Here, they tell us more about their work, the methodology and their main findings. Our paper studies the classic problem of "hyperparameter optimization". Nearly all of today's machine learning algorithms use a process called "stochastic gradient descent" (SGD) to train neural networks. SGD requires users to pick certain settings, or "hyperparameters," before running it.
Penalized Langevin and Hamiltonian Monte Carlo Algorithms for Constrained Sampling
Gรผrbรผzbalaban, Mert, Hu, Yuanhan, Zhu, Lingjiong
We consider the constrained sampling problem where the goal is to sample from a distribution $\pi(x)\propto e^{-f(x)}$ and $x$ is constrained on a convex body $\mathcal{C}\subset \mathbb{R}^d$. Motivated by penalty methods from optimization, we propose penalized Langevin Dynamics (PLD) and penalized Hamiltonian Monte Carlo (PHMC) that convert the constrained sampling problem into an unconstrained one by introducing a penalty function for constraint violations. When $f$ is smooth and the gradient is available, we show $\tilde{\mathcal{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error where the error is measured in terms of the total variation distance and $\tilde{\mathcal{O}}(\cdot)$ hides some logarithmic factors. For PHMC, we improve this result to $\tilde{\mathcal{O}}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $\mathcal{C}$ is sufficiently smooth. To our knowledge, these are the first convergence rate results for Hamiltonian Monte Carlo methods in the constrained sampling setting that can handle non-convex $f$ and can provide guarantees with the best dimension dependency among existing methods with deterministic gradients. We then consider the setting where unbiased stochastic gradients are available. We propose PSGLD and PSGHMC that can handle stochastic gradients without Metropolis-Hasting correction steps. When $f$ is strongly convex and smooth, we obtain an iteration complexity of $\tilde{\mathcal{O}}(d/\varepsilon^{18})$ and $\tilde{\mathcal{O}}(d\sqrt{d}/\varepsilon^{39})$ respectively in the 2-Wasserstein distance. For the more general case, when $f$ is smooth and non-convex, we also provide finite-time performance bounds and iteration complexity results. Finally, we test our algorithms on Bayesian LASSO regression and Bayesian constrained deep learning problems.
Infinite-width limit of deep linear neural networks
Chizat, Lรฉnaรฏc, Colombo, Maria, Fernรกndez-Real, Xavier, Figalli, Alessio
This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We obtain that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear neural network. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of neurons. We finally study the continuous-time limit obtained for infinitely wide linear neural networks and show that the linear predictors of the neural network converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
Nonconvex Matrix Factorization is Geodesically Convex: Global Landscape Analysis for Fixed-rank Matrix Optimization From a Riemannian Perspective
Luo, Yuetian, Trillos, Nicolas Garcia
We study a general matrix optimization problem with a fixed-rank positive semidefinite (PSD) constraint. We perform the Burer-Monteiro factorization and consider a particular Riemannian quotient geometry in a search space that has a total space equipped with the Euclidean metric. When the original objective f satisfies standard restricted strong convexity and smoothness properties, we characterize the global landscape of the factorized objective under the Riemannian quotient geometry. We show the entire search space can be divided into three regions: (R1) the region near the target parameter of interest, where the factorized objective is geodesically strongly convex and smooth; (R2) the region containing neighborhoods of all strict saddle points; (R3) the remaining regions, where the factorized objective has a large gradient. To our best knowledge, this is the first global landscape analysis of the Burer-Monteiro factorized objective under the Riemannian quotient geometry. Our results provide a fully geometric explanation for the superior performance of vanilla gradient descent under the Burer-Monteiro factorization. When f satisfies a weaker restricted strict convexity property, we show there exists a neighborhood near local minimizers such that the factorized objective is geodesically convex. To prove our results we provide a comprehensive landscape analysis of a matrix factorization problem with a least squares objective, which serves as a critical bridge. Our conclusions are also based on a result of independent interest stating that the geodesic ball centered at Y with a radius 1/3 of the least singular value of Y is a geodesically convex set under the Riemannian quotient geometry, which as a corollary, also implies a quantitative bound of the convexity radius in the Bures-Wasserstein space. The convexity radius obtained is sharp up to constants.
How Non-Convex Optimization works part2(Machine Learning)
Abstract: In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take the hitting time of processes to the diagonal set into consideration and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively against SDEs with different drift terms. As an application of ARC, an evaluation on the effectiveness of the stochastic gradient descent in a non-convex setting is also described. Abstract: The online optimization problem with non-convex loss functions over a closed convex set, coupled with a set of inequality (possibly non-convex) constraints is a challenging online learning problem.
Disentangling the Mechanisms Behind Implicit Regularization in SGD
Novack, Zachary, Kaur, Simran, Marwah, Tanya, Garg, Saurabh, Lipton, Zachary C.
A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD)leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e. disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets.
Stochastic Steffensen method
Zhao, Minda, Lai, Zehua, Lim, Lek-Heng
Is it possible for a first-order method, i.e., only first derivatives allowed, to be quadratically convergent? For univariate loss functions, the answer is yes -- the Steffensen method avoids second derivatives and is still quadratically convergent like Newton method. By incorporating an optimal step size we can even push its convergence order beyond quadratic to $1+\sqrt{2} \approx 2.414$. While such high convergence orders are a pointless overkill for a deterministic algorithm, they become rewarding when the algorithm is randomized for problems of massive sizes, as randomization invariably compromises convergence speed. We will introduce two adaptive learning rates inspired by the Steffensen method, intended for use in a stochastic optimization setting and requires no hyperparameter tuning aside from batch size. Extensive experiments show that they compare favorably with several existing first-order methods. When restricted to a quadratic objective, our stochastic Steffensen methods reduce to randomized Kaczmarz method -- note that this is not true for SGD or SLBFGS -- and thus we may also view our methods as a generalization of randomized Kaczmarz to arbitrary objectives.
Statistical Learning and Inverse Problems: A Stochastic Gradient Approach
Fonseca, Yuri R., Saporito, Yuri F.
Inverse Problems (IP) might be described as the search of an unknown parameter (that could be a function) that satisfies a given, known equation. Considering the notation: y = A[f] + noise, where f and y are elements of given Hilbert spaces, we would like to compute (or estimate) f given the data y for some level of noise. Typically, IPs are ill-posed in the sense that the solution does not depend continuously on the data. There are several very important and impressive examples of IPs in our daily lives. Medical imaging has been using IPs for decades and it has shaped the area, as for instance, Computerized Tomography (CT) and Magnetic Resonance Imaging (MRI). For an introductory text, see Vogel (2002). A vast literature of IPs is devoted to deterministic problems where the noise term is also a element of a Hilbert space and commonly assumed small in norm, which is not usually verified in practice.
A Particle-based Sparse Gaussian Process Optimizer
Bajaj, Chandrajit, Vaidya, Omatharv Bharat, Wang, Yi
Task learning in neural networks typically requires finding a globally optimal minimizer to a loss function objective. Conventional designs of swarm based optimization methods apply a fixed update rule, with possibly an adaptive step-size for gradient descent based optimization. While these methods gain huge success in solving different optimization problems, there are some cases where these schemes are either inefficient or suffering from local-minimum. We present a new particle-swarm-based framework utilizing Gaussian Process Regression to learn the underlying dynamical process of descent. The biggest advantage of this approach is greater exploration around the current state before deciding a descent direction. Empirical results show our approach can escape from the local minima compare with the widely-used state-of-the-art optimizers when solving non-convex optimization problems. We also test our approach under high-dimensional parameter space case, namely, image classification task.