Gradient Descent
The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well-understood. shamir2013stochastic gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions, which is $O(\log T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/ T)$ with additional assumption on strongly convexity. The best known lower bounds, however, are worse than the upper bounds by a factor of $\log T$. harvey2019tight gave matching lower bounds but their construction requires dimension $d= T$. It was then asked by koren2020open how to characterize the final-iterate convergence of SGD in the constant dimension setting. In this paper, we answer this question in the more general setting for any $d\leq T$, proving $\Omega(\log d/\sqrt{T})$ and $\Omega(\log d/T)$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex and strongly convex functions respectively with standard step size schedules. Our results provide the first general dimension dependent lower bound on the convergence of SGD's final iterate, partially resolving a COLT open question raised by koren2020open. We also present further evidence to show the correct rate in one dimension should be $\Theta(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound for one-dimensional special cases in settings more general than koren2020open.
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high-probability with best-known rates for smooth losses when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which to our knowledge have not been studied in this setting, and again obtain high-probability bounds for any $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis which requires a Hilbert space norm. Further, we show that after a suitable "burn-in" period, the objective value will monotonically decrease for every iteration until a critical point is identified, which provides intuition behind the popular practice of learning rate "warm-up" and also yields a last-iterate guarantee.
Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization
We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U} \in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top -\mathbf{\Sigma}\|_F^2\] where $\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the $\mathbf{U}$ and $\mathbf{V}$. This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
Gradient Descent Optimization With AdaMax From Scratch
Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function. A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, like the Adaptive Movement Estimation (Adam) algorithm, use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values. AdaMax is an extension to the Adam version of gradient descent that generalizes the approach to the infinite norm (max) and may result in a more effective optimization on some problems. In this tutorial, you will discover how to develop gradient descent optimization with AdaMax from scratch.
Tighter Analysis of Alternating Stochastic Gradient Method for Stochastic Nested Problems
Chen, Tianyi, Sun, Yuejiao, Yin, Wotao
Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately, and thus develop problem-specific algorithms and their analyses. Among various exciting developments, simple SGD-type updates (potentially on multiple variables) are still prevalent in solving this class of nested problems, but they are believed to have slower convergence rate compared to that of the non-nested problems. This paper unifies several SGD-type updates for stochastic nested problems into a single SGD approach that we term ALternating Stochastic gradient dEscenT (ALSET) method. By leveraging the hidden smoothness of the problem, this paper presents a tighter analysis of ALSET for stochastic nested problems. Under the new analysis, to achieve an $\epsilon$-stationary point of the nested problem, it requires ${\cal O}(\epsilon^{-2})$ samples. Under certain regularity conditions, applying our results to stochastic compositional, min-max and reinforcement learning problems either improves or matches the best-known sample complexity in the respective cases. Our results explain why simple SGD-type algorithms in stochastic nested problems all work very well in practice without the need for further modifications.
Private Adaptive Gradient Methods for Convex Optimization
Asi, Hilal, Duchi, John, Fallah, Alireza, Javidbakht, Omid, Talwar, Kunal
While the success of stochastic gradient methods for solving empirical risk minimization has motivated their adoption across much of machine learning, increasing privacy risks in data-intensive tasks have made applying them more challenging [DMNS06]: gradients can leak users' data, intermediate models can compromise individuals, and even final trained models may be non-private without substantial care. This motivates a growing line of work developing private variants of stochastic gradient descent (SGD), where algorithms guarantee differential privacy by perturbing individual gradients with random noise [DJW13; ST13b; ACGMMTZ16; DJW18; BFTT19; FKT20]. Yet these noise addition procedures typically fail to reflect the geometry underlying the optimization problem, which in non-private cases is essential: for high-dimensional problems with sparse parameters, mirror descent and its variants [BT03; NJLS09] are essential, while in the large-scale stochastic settings prevalent in deep learning, AdaGrad and other adaptive variants [DHS11] provide stronger theoretical and practical performance. Even more, methods that do not adapt (or do not leverage geometry) can be provably sub-optimal, in that there exist problems where their convergence is much slower than adaptive variants that reflect appropriate geometry [LD19]. To address these challenges, we introduce Pagan (Private AdaGrad with Adaptive Noise), a new differentially private variant of stochastic gradient descent and AdaGrad.
Assessing Generalization of SGD via Disagreement
Jiang, Yiding, Nagarajan, Vaishnavh, Baek, Christina, Kolter, J. Zico
We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on -- and is a stronger version of -- the observation in Nakkiran & Bansal '20, which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the \emph{well-calibrated} nature of \emph{ensembles} of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
Robust Regression Revisited: Acceleration and Improved Estimation Rates
Jambulapati, Arun, Li, Jerry, Schramm, Tselil, Tian, Kevin
We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the robust gradient descent framework of Prasad et. al., a first-order method using biased gradient queries, or the Sever framework of Diakonikolas et. al., an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art. For the general case of smooth GLMs (e.g. logistic regression), we show that the robust gradient descent framework of Prasad et. al. can be accelerated, and show our algorithm extends to optimizing the Moreau envelopes of Lipschitz GLMs (e.g. support vector machines), answering several open questions in the literature. For the well-studied case of robust linear regression, we present an alternative approach obtaining improved estimation rates over prior nearly-linear time algorithms. Interestingly, our method starts with an identifiability proof introduced in the context of the sum-of-squares algorithm of Bakshi and Prasad, which achieved optimal error rates while requiring large polynomial runtime and sample complexity. We reinterpret their proof within the Sever framework and obtain a dramatically faster and more sample-efficient algorithm under fewer distributional assumptions.
Let's Develop Artificial Neural Network in 30 lines of code -- II
II Simple yet Complete Guide on how to apply ANN for Regression with K-Fold Validation for accuracy over accuracy OMG! Cheers, Nice to see you again โฆ! Previously we have already learn what is ANN and performed ANN with real life example. If not follow this link. However i will be briefing the definitions of ANN terminologies just in case if i haven't bored you:) I believe you are already aware of how Neural Networks work if notโฆdon't worry,, there are plenty of resource available in web to get started with. However i will too walk you through in brief of what is neuron networks and how it learns?
A 2021 Guide to improving CNNs-Optimizers: Adam vs SGD
This will be my third post on my series A 2021 Guide to improving CNNs. Optimizers can be explained as a mathematical function to modify the weights of the network given the gradients and additional information, depending on the formulation of the optimizer. Optimizers are built upon the idea of gradient descent, the greedy approach of iteratively decreasing the loss function by following the gradient. Such functions can be as simple as subtracting the gradients from the weights, or can also be very complex. Better optimizers are mainly focused on being faster and efficient but are also often known to generalize well(less overfitting) compared to others.