Goto

Collaborating Authors

 saddle point


Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent

Sun, Yihang, Wang, Huaijin, Hayden, Patrick, Blanchet, Jose

arXiv.org Machine Learning

The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over respective gradient descent baselines--stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.


Distributionally Robust K-Means Clustering

Malik, Vikrant, Kargin, Taylan, Hassibi, Babak

arXiv.org Machine Learning

In recent years, the widespreadavailability of large-scale, high-dimensionaldatasets has driven significant interest in clustering algorithms that are both computationally efficient and robust to distributional shifts and outliers. The classical clustering method, K-means, can be seen as an application of the Lloyd-Max quantization algorithm, in which the distribution being quantized is the empirical distribution of the points to be clustered. This empirical distribution generally differs from the true underlying distribution, especially when the number of points to be clustered is small. This induces a distributional shift, which can also arise in many real-world settings, such as image segmentation, biological data analysis, and sensor networks, due to noise variations, sensor inaccuracies, or environmental changes. Distributional shifts can severely impact the performance of clustering algorithms, leading to degraded cluster assignments and unreliable downstream analysis. The field of clustering has a rich history. One of the most popular algorithms in this field is theK-means (KM) algorithm, introduced by [1], which computes centroids by iteratively updating the conditional mean of the data in the Voronoi regions induced by the centroids. However, standardK-means is sensitive to initialization and, in general, converges only to a local minimum.


Gradient Descent Can Take Exponential Time to Escape Saddle Points

Neural Information Processing Systems

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points--it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.


Deep Learning without Poor Local Minima

Neural Information Processing Systems

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. For an expected loss function of a deep nonlinear neural network, we prove the following statements under the independence assumption adopted from recent work: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) the property of saddle points differs for shallow networks (with three layers) and deeper networks (with more than three layers). Moreover, we prove that the same four statements hold for deep linear neural networks with any depth, any widths and no unrealistic assumptions. As a result, we present an instance, for which we can answer to the following question: how difficult to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima and the property of the saddle points). We note that even though we have advanced the theoretical foundations of deep learning, there is still a gap between theory and practice.