 Gradient Descent


Why Gradient Descent Works?

#artificialintelligence

Gradient descent is an iterative optimization algorithm used to fit the weights of a machine learning model (linear regression, neural networks, etc.) by minimizing the model's cost function. The intuition behind gradient descent is this: picture the cost function (denoted by f(Θ), where Θ = [Θ₁, …, Θₙ]) plotted over its n parameters as a bowl. Imagine a randomly placed point on that bowl, represented by n coordinates (these are the initial parameter values, and the height of the bowl at that point is the initial cost). The minimum of the function is then the bottom of the bowl. The goal is to reach the bottom of the bowl (that is, to minimize the cost) by progressively moving downhill.
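
The bowl picture can be made concrete with a short sketch. It assumes a hypothetical bowl-shaped cost f(Θ) = Θ₁² + … + Θₙ², whose gradient is 2Θ; the function names, learning rate, and starting point are illustrative choices, not taken from the article.

```python
def cost(theta):
    # Hypothetical bowl-shaped cost: sum of squares, minimized at the origin.
    return sum(t * t for t in theta)


def gradient(theta):
    # Gradient of the sum-of-squares cost is 2 * theta, component by component.
    return [2.0 * t for t in theta]


def gradient_descent(theta0, learning_rate=0.1, steps=100):
    # Start from the given point and repeatedly step against the gradient.
    theta = list(theta0)
    for _ in range(steps):
        grad = gradient(theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Illustrative starting point on the bowl.
theta_final = gradient_descent([3.0, -4.0])
print(cost(theta_final))
```

With this cost and learning rate, every step shrinks each coordinate by a constant factor, so the point slides steadily toward the bottom of the bowl.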


Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent

arXiv.org Machine Learning

We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we first prove that $T$ iterations of SGD with Nesterov acceleration and exponentially decreasing step-sizes can achieve a near-optimal $\tilde{O}(\exp(-T/\sqrt{\kappa}) + \sigma^2/T)$ convergence rate. Under a relaxed assumption on the noise, with the same step-size scheme and knowledge of the smoothness, we prove that SGD can achieve an $\tilde{O}(\exp(-T/\kappa) + \sigma^2/T)$ rate. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD converges at the desired rate, but only to a neighbourhood of the solution. Next, we use SGD with an offline estimate of the smoothness and prove convergence to the minimizer. However, its convergence is slowed down proportional to the estimation error and we prove a lower-bound justifying this slowdown. Compared to other step-size schemes, we empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
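
As a rough illustration of an exponentially decreasing step-size, the sketch below runs plain SGD (no Nesterov acceleration, no line-search) on a noisy one-dimensional quadratic. The abstract does not state the schedule; the form η_t = η₀·α^t with α = (β/T)^(1/T) is an assumed parameterization, and the function names, noise level, and constants are hypothetical.

```python
import random


def exponential_step_sizes(eta0, T, beta=1.0):
    # Assumed schedule: eta_t = eta0 * alpha**t with alpha = (beta / T)**(1 / T),
    # so the step decays from eta0 to roughly eta0 * beta / T over T iterations.
    alpha = (beta / T) ** (1.0 / T)
    return [eta0 * alpha ** t for t in range(T)]


def sgd_noisy_quadratic(x0, steps, sigma, seed=0):
    # Minimize f(x) = 0.5 * x**2 using gradients corrupted by Gaussian noise.
    rng = random.Random(seed)
    x = x0
    for eta in steps:
        noisy_grad = x + rng.gauss(0.0, sigma)
        x -= eta * noisy_grad
    return x


T = 1000
steps = exponential_step_sizes(eta0=0.5, T=T)
x_final = sgd_noisy_quadratic(x0=5.0, steps=steps, sigma=0.1)
print(steps[0], steps[-1], x_final)
```

The large early steps remove the initial error quickly, while the small late steps damp the effect of gradient noise, which matches the two terms in the rates quoted above only in spirit.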


Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation

arXiv.org Machine Learning

Noise-contrastive estimation (NCE) is a statistically consistent method for learning unnormalized probabilistic models. It has been empirically observed that the choice of the noise distribution is crucial for NCE's performance. However, such observations have never been made formal or quantitative. In fact, it is not even clear whether the difficulties arising from a poorly chosen noise distribution are statistical or algorithmic in nature. In this work, we formally pinpoint reasons for NCE's poor performance when an inappropriate noise distribution is used. Namely, we prove these challenges arise due to an ill-behaved (more precisely, flat) loss landscape. To address this, we introduce a variant of NCE called "eNCE" which uses an exponential loss and for which normalized gradient descent addresses the landscape issues provably when the target and noise distributions are in a given exponential family.


FlyingSquid: A Python Framework For Interactive Weak Supervision

#artificialintelligence

In this article, we discuss key points about FlyingSquid through the paper 'Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods', published in 2020 by Stanford researchers. Weak supervision is a common method for building machine learning models without relying on ground truth annotations. It generates probabilistic training labels by estimating the accuracy of multiple noisy labeling sources (e.g., heuristics). While it might seem like the easiest way to get started with ML, weakly supervised training can be costly and time-consuming in practice. A group of computer science researchers from Stanford University showed that, for a class of latent variable models highly applicable to weak supervision, there is an explicit closed-form solution, obviating the need for iterative solutions like stochastic gradient descent (SGD). The research team used these insights to build the FlyingSquid framework, which is faster than previous weak supervision approaches and requires fewer assumptions.


Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

arXiv.org Machine Learning

In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
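
The difference between with-replacement sampling and Random Reshuffling comes down to how the sample indices for one epoch are drawn. The single-worker sketch below illustrates only that difference; it does not implement the distributed minibatch or local variants, or synchronized shuffling. The component functions f_i(x) = 0.5·(x − targets[i])² and all names are hypothetical.

```python
import random


def with_replacement_indices(n, rng):
    # Each draw is independent, so an index may repeat or never appear.
    return [rng.randrange(n) for _ in range(n)]


def reshuffled_indices(n, rng):
    # One random permutation per epoch: every index appears exactly once.
    order = list(range(n))
    rng.shuffle(order)
    return order


def sgd_epoch(x, targets, indices, step_size):
    # One pass of SGD on f_i(x) = 0.5 * (x - targets[i])**2,
    # whose gradient with respect to x is (x - targets[i]).
    for i in indices:
        x -= step_size * (x - targets[i])
    return x


rng = random.Random(0)
targets = [1.0, 2.0, 3.0, 4.0]
order = reshuffled_indices(len(targets), rng)
x_rr = sgd_epoch(0.0, targets, order, step_size=0.1)
x_wr = sgd_epoch(0.0, targets, with_replacement_indices(len(targets), rng), step_size=0.1)
print(order, x_rr, x_wr)
```

In practice most training loops shuffle the dataset once per epoch, which is the without-replacement scheme the abstract describes as closer to practice.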


Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple descent

arXiv.org Machine Learning

At the core of statistical learning lies the problem of understanding the generalization performance (e.g., out-of-sample errors) of the learning algorithms in use. Conventional wisdom in statistics held that including too many covariates when training statistical models can hurt generalization (despite improving training accuracy), due to the undesired over-fit. This leads to the classical conclusion that: proper regularization -- through either adding certain penalty functions to the loss function or algorithmic self-regularization -- seems to be critical in achieving the desired accuracy (e.g., Friedman et al. (2001); Wei et al. (2019)). However, an evolving line of works in machine learning observes empirical evidence that suggests, to the surprise of many statisticians, over-parameterization is not necessarily harmful. Indeed, many machine learning models (such as random forests or deep neural networks) are trained until the training error vanishes to zero -- meaning that they are able to perfectly interpolate the data -- while still generalizing well (e.g., Zhang et al. (2021); Wyner et al. (2017); Belkin et al. (2019)). As a key observation to explain this phenomenon, many models when trained by gradient type methods (e.g., gradient descent, stochastic gradient descent, AdaBoost) converge to certain minimum norm interpolators, which implicitly favor models with smaller model complexity. These empirical mysteries inspire a recent flurry of activity towards understanding the generalization properties of various interpolators.
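
The claim that gradient-type methods can converge to minimum norm interpolators is easiest to see in the ℓ2 case, which is a simpler setting than the ℓ1 interpolators studied in the paper. The sketch below assumes a hypothetical problem with one equation and two unknowns, a·w = b, and runs gradient descent on the squared residual from a zero initialization; the iterates never leave the direction of a, so they approach the minimum ℓ2-norm solution b·a/‖a‖².

```python
def gd_underdetermined(a, b, learning_rate=0.1, steps=60):
    # Gradient descent on 0.5 * (a . w - b)**2, started at w = 0.
    w = [0.0] * len(a)
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - b
        w = [wi - learning_rate * residual * ai for ai, wi in zip(a, w)]
    return w


# Hypothetical problem: infinitely many w satisfy 1*w1 + 2*w2 = 5.
a = [1.0, 2.0]
b = 5.0
w_gd = gd_underdetermined(a, b)

# Closed-form minimum l2-norm interpolator: b * a / ||a||^2.
norm_sq = sum(ai * ai for ai in a)
w_min_norm = [b * ai / norm_sq for ai in a]
print(w_gd, w_min_norm)
```

Other interpolating solutions, such as w = (5, 0), fit the data equally well but have a larger ℓ2 norm; gradient descent from zero does not find them here.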


Pareto Navigation Gradient Descent: a First-Order Algorithm for Optimization in Pareto Set

arXiv.org Artificial Intelligence

Many modern machine learning applications, such as multi-task learning, require finding optimal model parameters to trade-off multiple objective functions that may conflict with each other. The notion of the Pareto set allows us to focus on the set of (often infinite number of) models that cannot be strictly improved. But it does not provide an actionable procedure for picking one or a few special models to return to practical users. In this paper, we consider optimization in Pareto set (OPT-in-Pareto), the problem of finding Pareto models that optimize an extra reference criterion function within the Pareto set. This function can either encode a specific preference from the users, or represent a generic diversity measure for obtaining a set of diversified Pareto models that are representative of the whole Pareto set. Unfortunately, despite being a highly useful framework, efficient algorithms for OPT-in-Pareto have been largely missing, especially for large-scale, non-convex, and non-linear objectives in deep learning. A naive approach is to apply Riemannian manifold gradient descent on the Pareto set, which yields a high computational cost due to the need for eigen-calculation of Hessian matrices. We propose a first-order algorithm that approximately solves OPT-in-Pareto using only gradient information, with both high practical efficiency and theoretically guaranteed convergence property. Empirically, we demonstrate that our method works efficiently for a variety of challenging multi-task-related problems.


Gradient descent Method in Machine Learning

#artificialintelligence

Many deep learning models learn their objectives using gradient descent. Gradient-descent optimization needs a large number of training samples for a model to converge, which makes it poorly suited to few-shot learning. In generic deep learning, we train a model to achieve one specific objective. Humans, however, learn how to learn any objective. Other optimization methods therefore emphasize learn-to-learn mechanisms.


Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization

arXiv.org Machine Learning

The quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of the first-order derivatives. In this study, we propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates. Specifically, we compute a partial column Hessian of size ($d\times k$) with $k\ll d$ randomly selected variables, then use the Nyström method to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. Furthermore, to address large-scale scenarios in which even computing a partial Hessian may require significant time, we used distribution-preserving (DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates $p$ sub-samples with similar first and second-order distribution statistics and selects a single sub-sample at each epoch in a round-robin manner to compute the partial Hessian. We integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradients to solve the logistic regression problem. The numerical experiments show that the proposed approach was able to obtain a better approximation of Newton's method with performance competitive with the state-of-the-art first-order and the stochastic quasi-Newton methods.


Towards Statistical and Computational Complexities of Polyak Step Size Gradient Descent

arXiv.org Machine Learning

We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growth on the concentration bound between the gradients of sample and population loss functions. We demonstrate that the Polyak step size gradient descent iterates reach a final statistical radius of convergence around the true parameter after logarithmic number of iterations in terms of the sample size. It is computationally cheaper than the polynomial number of iterations on the sample size of the fixed-step size gradient descent algorithm to reach the same final statistical radius when the population loss function is not locally strongly convex. Finally, we illustrate our general theory under three statistical examples: generalized linear model, mixture model, and mixed linear regression model.
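
For intuition, the classical Polyak rule sets the step to (f(x) − f*) / ‖∇f(x)‖², which requires knowing the optimal value f*. The sketch below applies it to a hypothetical one-dimensional loss f(x) = x⁴, which is convex but not strongly convex near its minimizer, and compares it with a fixed step. It illustrates the step-size rule only, not the paper's statistical setting with empirical and population losses; the names and constants are assumptions.

```python
def f(x):
    # Hypothetical loss, flat near its minimizer x = 0 (not locally strongly convex).
    return x ** 4


def grad_f(x):
    return 4.0 * x ** 3


F_STAR = 0.0  # Optimal value, assumed known for the Polyak rule.


def polyak_gd(x0, iterations):
    # Polyak step size: (f(x) - f*) / |f'(x)|**2, recomputed every iteration.
    x = x0
    for _ in range(iterations):
        g = grad_f(x)
        if g == 0.0:
            break  # Already at a stationary point.
        step = (f(x) - F_STAR) / (g * g)
        x -= step * g
    return x


def fixed_step_gd(x0, iterations, step=0.1):
    # Baseline: the same update with a constant step size.
    x = x0
    for _ in range(iterations):
        x -= step * grad_f(x)
    return x


x_polyak = polyak_gd(1.0, 50)
x_fixed = fixed_step_gd(1.0, 50)
print(x_polyak, x_fixed)
```

On this loss the Polyak update simplifies to x ← 0.75·x, a geometric contraction, whereas the fixed-step iterate slows down as the gradient flattens near zero.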