Goto

Collaborating Authors

 Gradient Descent


Smooth over-parameterized solvers for non-smooth structured optimization

arXiv.org Machine Learning

Non-smooth optimization is a core ingredient of many imaging or machine learning pipelines. Non-smoothness encodes structural constraints on the solutions, such as sparsity, group sparsity, low-rank and sharp edges. It is also the basis for the definition of robust loss functions and scale-free functionals such as square-root Lasso. Standard approaches to deal with non-smoothness leverage either proximal splitting or coordinate descent. These approaches are effective but usually require parameter tuning, preconditioning or some sort of support pruning. In this work, we advocate and study a different route, which operates a non-convex but smooth over-parametrization of the underlying non-smooth optimization problems. This generalizes quadratic variational forms that are at the heart of the popular Iterative Reweighted Least Squares (IRLS). Our main theoretical contribution connects gradient descent on this reformulation to a mirror descent flow with a varying Hessian metric. This analysis is crucial to derive convergence bounds that are dimension-free. This explains the efficiency of the method when using small grid sizes in imaging. Our main algorithmic contribution is to apply the Variable Projection (VarPro) method which defines a new formulation by explicitly minimizing over part of the variables. This leads to a better conditioning of the minimized functional and improves the convergence of simple but very efficient gradient-based methods, for instance quasi-Newton solvers. We exemplify the use of this new solver for the resolution of regularized regression problems for inverse problems and supervised learning, including total variation prior and non-convex regularizers.


Ridgeless Regression with Random Features

arXiv.org Artificial Intelligence

Recent theoretical studies illustrated that kernel ridgeless regression can guarantee good generalization ability without an explicit regularization. In this paper, we investigate the statistical properties of ridgeless regression with random features and stochastic gradient descent. We explore the effect of factors in the stochastic gradient and random features, respectively. Specifically, random features error exhibits the double-descent curve. Motivated by the theoretical findings, we propose a tunable kernel algorithm that optimizes the spectral density of kernel during training. Our work bridges the interpolation theory and practical algorithm.


Implicit Regularization Properties of Variance Reduced Stochastic Mirror Descent

arXiv.org Artificial Intelligence

In statistics and machine learning, it is common to optimize an objective function that is a finitesum. SMD efficiently optimizes such an objective by using a subset of data to do one step update of the variable/parameter. Further adopting the variance reduction technique to SMD, we get the VRSMD algorithm that enjoys fast convergence [1], [2]. The implicit regularization is a relatively new concept [3] that explains why a result of an algorithm generalizes well in some overparameterized models [3], [4]. It refers to the fact that an algorithm can automatically select a minimum norm solution, which is not explicitly induced by the objective function. There are works on implicit regularization for Gradient Descent [5]- [8], Stochastic Gradient Descent [9]-[12], and Stochastic Mirror Descent [13]. Considering the computational advantage of VRSMD compared to all the algorithms above, it would be even better if VRSMD also has the useful implicit regularization property. From technical point of view, our work contains the following two results: In linear regression (including underfitting and overfitting), we show that the solution sequence of VRSMD converges to the minimum mirror interpolant, which is the implicit regularization property of VRSMD, and we also specify the convergence rate.


The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models

arXiv.org Artificial Intelligence

The Stochastic Gradient Descent (SGD) is a popular optimization algorithm that has a wide range of applications, including generalized linear model in statistics and deep Neural Network in machine learning. One main advantage of the SGD is the computational scalability due to low cost per iteration. Recent work also indicates that the SGD might also lead to outcomes that possess nice statistical properties under the linear regression framework, see [19]. In this paper, we study the statistical properties of the SGD under nonparametric regression models. We focus on the Reproducing Kernel Hilbert Space (RKHS) model, which is popular in both statistics and machine learning communities and is often simply referred to as the "kernel trick," see [2, 27].


A Novel Fast Exact Subproblem Solver for Stochastic Quasi-Newton Cubic Regularized Optimization

arXiv.org Machine Learning

In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of optimization strategies that utilize a cubic-regularization (CR) term in place of trust-regions and line-searches. LQN methods offer a large-scale alternative to using explicit second-order information by taking identical inputs to those used by popular first-order methods such as stochastic gradient descent (SGD). Solving the CR subproblem exactly requires Newton's method, yet using properties of the internal structure of LQN matrices, we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing large speedups and scaling into modern size requirements. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide experimental results when the SR1 update is used, which show substantial speed-ups and competitive performance compared to Adam and other second order optimizers on deep neural networks (DNNs). We find that our new approach, ARCLQN, compares to modern optimizers with minimal tuning, a common pain-point for second order methods.


A stochastic Stein Variational Newton method

arXiv.org Machine Learning

Stein variational gradient descent (SVGD) is a general-purpose optimization-based sampling algorithm that has recently exploded in popularity, but is limited by two issues: it is known to produce biased samples, and it can be slow to converge on complicated distributions. A recently proposed stochastic variant of SVGD (sSVGD) addresses the first issue, producing unbiased samples by incorporating a special noise into the SVGD dynamics such that asymptotic convergence is guaranteed. Meanwhile, Stein variational Newton (SVN), a Newton-like extension of SVGD, dramatically accelerates the convergence of SVGD by incorporating Hessian information into the dynamics, but also produces biased samples. In this paper we derive, and provide a practical implementation of, a stochastic variant of SVN (sSVN) which is both asymptotically correct and converges rapidly. We demonstrate the effectiveness of our algorithm on a difficult class of test problems -- the Hybrid Rosenbrock density -- and show that sSVN converges using three orders of magnitude fewer gradient evaluations of the log likelihood than its stochastic SVGD counterpart. Our results show that sSVN is a promising approach to accelerating high-precision Bayesian inference tasks with modest-dimension, $d\sim\mathcal{O}(10)$.


#003C Gradient Descent in Python - Master Data Science

#artificialintelligence

We will first import libraries as NumPy, matplotlib, pyplot and derivative function. Then with a NumPy function – linspace() we define our variable \(w \) domain between 1.0 and 5.0 and 100 points. Also we define alpha which will represent learning rate. Next, we will define our \(y \) ( in our case \(J(w) \)) and plot to see a convex function, we will use \((w-3) 2 \). So we can see that we plotted our convex function as an example.


Optimizing differential equations to fit data and predict outcomes

arXiv.org Artificial Intelligence

Many scientific problems focus on observed patterns of change or on how to design a system to achieve particular dynamics. Those problems often require fitting differential equation models to target trajectories. Fitting such models can be difficult because each evaluation of the fit must calculate the distance between the model and target patterns at numerous points along a trajectory. The gradient of the fit with respect to the model parameters can be challenging. Recent technical advances in automatic differentiation through numerical differential equation solvers potentially change the fitting process into a relatively easy problem, opening up new possibilities to study dynamics. However, application of the new tools to real data may fail to achieve a good fit. This article illustrates how to overcome a variety of common challenges, using the classic ecological data for oscillations in hare and lynx populations. Models include simple ordinary differential equations (ODEs) and neural ordinary differential equations (NODEs), which use artificial neural networks to estimate the derivatives of differential equation systems. Comparing the fits obtained with ODEs versus NODEs, representing small and large parameter spaces, and changing the number of variable dimensions provide insight into the geometry of the observed and model trajectories. To analyze the quality of the models for predicting future observations, a Bayesian-inspired preconditioned stochastic gradient Langevin dynamics (pSGLD) calculation of the posterior distribution of predicted model trajectories clarifies the tendency for various models to underfit or overfit the data. Coupling fitted differential equation systems with pSGLD sampling provides a powerful way to study the properties of optimization surfaces, raising an analogy with mutation-selection dynamics on fitness landscapes.


On Acceleration of Gradient-Based Empirical Risk Minimization using Local Polynomial Regression

arXiv.org Machine Learning

We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD) recently proposed for the approximate solution of empirical risk minimization problems (ERM). We focus on loss functions that are strongly convex and smooth with condition number $\sigma$. We additionally assume the loss function is $\eta$-H\"older continuous with respect to the data. The oracle complexity of LPI-GD is $\tilde{O}\left(\sigma m^d \log(1/\varepsilon)\right)$ for a desired accuracy $\varepsilon$, where $d$ is the dimension of the parameter space, and $m$ is the cardinality of an approximation grid. The factor $m^d$ can be shown to scale as $O((1/\varepsilon)^{d/2\eta})$. LPI-GD has been shown to have better oracle complexity than gradient descent (GD) and stochastic gradient descent (SGD) for certain parameter regimes. We propose two accelerated methods for the ERM problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma} m^d \log(1/\varepsilon)\right)$. Moreover, we provide the first empirical study on local polynomial interpolation-based gradient methods and corroborate that LPI-GD has better performance than GD and SGD in some scenarios, and the proposed methods achieve acceleration.


Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression

arXiv.org Artificial Intelligence

Traditional one-bit compressed stochastic gradient descent can not be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, due to the cascading compression, the training process has considerable deterioration on the convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and its specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces up to 35% training time while preserving the same accuracy as training without compression.