Escaping Saddle-Point Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local minimizers much faster. A fundamental aspect of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in the over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $O(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under the same interpolation-like conditions and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $O(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $O(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method.
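As a concrete illustration, below is a minimal sketch of the generic perturbed-SGD template: ordinary SGD steps, plus an occasional random perturbation when the stochastic gradient is small, which is what allows the iterates to escape strict saddle points. Here `grad_fn` stands for any routine returning an unbiased stochastic gradient (e.g., a minibatch gradient), and the step size, perturbation radius, and escape interval are illustrative placeholders rather than the theoretically tuned constants or the exact variant analyzed in the paper.

```python
import numpy as np

def perturbed_sgd(grad_fn, x0, lr=1e-2, radius=1e-3, grad_tol=1e-3,
                  escape_interval=100, n_steps=10000, seed=0):
    """Sketch of Perturbed SGD: plain SGD steps, plus a uniform-ball
    perturbation whenever the stochastic gradient is small and no
    perturbation was added recently.  grad_fn(x) is assumed to return
    an unbiased stochastic gradient at x; all constants are illustrative."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturbed = -escape_interval  # allow a perturbation at step 0
    for t in range(n_steps):
        g = grad_fn(x)
        if np.linalg.norm(g) <= grad_tol and t - last_perturbed >= escape_interval:
            # Small gradient: possibly near a saddle point, so inject noise
            # sampled uniformly from a ball of the given radius
            # (Gaussian direction, radius rescaled by U^(1/d)).
            noise = rng.standard_normal(x.shape)
            noise *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(noise)
            x = x + noise
            last_perturbed = t
        x = x - lr * g  # standard stochastic gradient step
    return x
```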
Review for NeurIPS paper: Escaping Saddle-Point Faster under Interpolation-like Conditions
Weaknesses: - The importance of the SGC condition remains unclear. In Line 129, the authors claim that the SGC condition is satisfied in some practical settings, such as the training of deep neural networks, and that the SGC condition should therefore be regarded as an interesting special setting for nonconvex optimization. However, recent work [1,2] showed that the training of deep neural networks can further be regarded as a special case of convex optimization in the Neural Tangent Kernel (NTK) regime, which imposes a condition stronger than SGC. Therefore, the authors may want to clarify the importance of SGC by giving more examples from machine learning. As the authors note, [VBS18] first studied the SGC condition in the nonconvex setting and showed that SGD requires $O(1/\epsilon^{2})$ gradient queries to find first-order stationary points. Meanwhile, note that [AZL18] proposed a generic framework that can turn any algorithm for finding first-order stationary points into an algorithm for finding approximate local minimizers without hurting the convergence rate.
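For reference, the strong growth condition (SGC) discussed in this review is commonly stated as follows, where $F(\cdot,\xi)$ denotes the per-sample loss and $f$ the risk (matching the notation of the conditional-gradient abstract below), and $\rho \ge 1$ is the growth constant; this is the standard form from the SGC literature, not necessarily the exact variant assumed in the paper:

$$\mathbb{E}_{\xi}\big[\|\nabla F(x,\xi)\|^{2}\big] \;\le\; \rho\,\|\nabla f(x)\|^{2} \qquad \text{for all } x.$$

Under interpolation every sample loss is minimized at a common point, so the stochastic gradient noise vanishes exactly where $\nabla f$ vanishes; this is the mechanism behind the improved complexities.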
Improved Complexities for Stochastic Conditional Gradient Methods under Interpolation-like Conditions
Tesi Xiao, Krishnakumar Balasubramanian, Saeed Ghadimi
In a machine learning setup, the function F can be interpreted as the loss associated with a sample ξ, and the function f as the risk, defined as the expected loss. Such constrained stochastic optimization problems arise frequently in statistical machine learning applications. The conditional gradient algorithm, also called the Frank-Wolfe algorithm, is an efficient method for solving constrained optimization problems of the form in (1) due to its projection-free nature [Jag13, HJN15, FGM17, LPZZ17, BZK18, RDLS18]. In each step of the conditional gradient method, it is only required to minimize a linear objective over the set Ω. This operation can be implemented efficiently for a variety of sets arising in statistical machine learning, in contrast to projecting onto the set Ω, which is required, for example, by the projected gradient method. Hence, the conditional gradient method has regained popularity in the last decade in the optimization and machine learning community.
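To make the projection-free structure concrete, here is a minimal Frank-Wolfe loop. The $\ell_1$-ball linear minimization oracle (LMO) and the classical $2/(t+2)$ step size are illustrative choices, not the specific stochastic variants analyzed in the paper; `grad_fn` again stands for any (possibly stochastic) gradient routine.

```python
import numpy as np

def l1_ball_lmo(grad, radius=1.0):
    """LMO for the l1 ball: argmin over ||s||_1 <= radius of <grad, s>.
    The minimizer is a signed, scaled coordinate vector, so it costs O(d)."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_fn, x0, lmo=l1_ball_lmo, n_steps=1000):
    """Minimal Frank-Wolfe loop: each iteration calls the LMO instead of a
    projection, then moves toward the returned vertex."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        g = grad_fn(x)            # (stochastic) gradient at x
        s = lmo(g)                # linear objective minimized over the set
        gamma = 2.0 / (t + 2.0)   # classical open-loop step size
        x = (1.0 - gamma) * x + gamma * s
    return x
```

The example highlights why the LMO matters: for the $\ell_1$ ball it is a single coordinate selection, and for sets such as the nuclear-norm ball it needs only a leading singular-vector pair, whereas a Euclidean projection onto the same set requires a full SVD.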