Gradient Descent
Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features
Yashima, Shingo, Nitanda, Atsushi, Suzuki, Taiji
Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. In recent research, an exponential convergence rate of the expected classification error under the strong low-noise condition has been shown. We extend these analyses to a random features setting, analyzing the error induced by the approximation of random features in terms of the distance between the generated hypothesis including population risk minimizers and empirical risk minimizers when using general Lipschitz loss functions, to show that an exponential convergence of the expected classification error is achieved even if random features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features and there is a significant computational benefit in using random features in classification problems because of the strong low-noise condition.
Convergence to minima for the continuous version of Backtracking Gradient Descent
The main result of this paper is: {\bf Theorem.} Let $f:\mathbb{R}^k\rightarrow \mathbb{R}$ be a $C^{1}$ function, so that $\nabla f$ is locally Lipschitz continuous. Assume moreover that $f$ is $C^2$ near its generalised saddle points. Fix real numbers $\delta_0>0$ and $0<\alpha <1$. Then there is a smooth function $h:\mathbb{R}^k\rightarrow (0,\delta_0]$ so that the map $H:\mathbb{R}^k\rightarrow \mathbb{R}^k$ defined by $H(x)=x-h(x)\nabla f(x)$ has the following property: (i) For all $x\in \mathbb{R}^k$, we have $f(H(x)))-f(x)\leq -\alpha h(x)||\nabla f(x)||^2$. (ii) For every $x_0\in \mathbb{R}^k$, the sequence $x_{n+1}=H(x_n)$ either satisfies $\lim_{n\rightarrow\infty}||x_{n+1}-x_n||=0$ or $ \lim_{n\rightarrow\infty}||x_n||=\infty$. Each cluster point of $\{x_n\}$ is a critical point of $f$. If moreover $f$ has at most countably many critical points, then $\{x_n\}$ either converges to a critical point of $f$ or $\lim_{n\rightarrow\infty}||x_n||=\infty$. (iii) There is a set $\mathcal{E}_1\subset \mathbb{R}^k$ of Lebesgue measure $0$ so that for all $x_0\in \mathbb{R}^k\backslash \mathcal{E}_1$, the sequence $x_{n+1}=H(x_n)$, {\bf if converges}, cannot converge to a {\bf generalised} saddle point. (iv) There is a set $\mathcal{E}_2\subset \mathbb{R}^k$ of Lebesgue measure $0$ so that for all $x_0\in \mathbb{R}^k\backslash \mathcal{E}_2$, any cluster point of the sequence $x_{n+1}=H(x_n)$ is not a saddle point, and more generally cannot be an isolated generalised saddle point. Some other results are proven.
Intro to optimization in deep learning: Momentum, RMSProp and Adam
In another post, we covered the nuts and bolts of Stochastic Gradient Descent and how to address problems like getting stuck in a local minima or a saddle point. In this post, we take a look at another problem that plagues training of neural networks, pathological curvature. While local minima and saddle points can stall our training, pathological curvature can slow down training to an extent that the machine learning practitioner might think that search has converged to a sub-optimal minma. Let us understand in depth what pathological curvature is. Consider the following loss contour.
bims-arihec 2019-11-10 papers
OBJECTIVE: To evaluate the potential value of the machine learning (ML)-based MRI texture analysis for predicting 1p/19q codeletion status of lower-grade gliomas (LGG), using various state-of-the-art ML algorithms.MATERIALS AND METHODS: For this retrospective study, 107 patients with LGG were included from a public database. Texture features were extracted from conventional T2-weighted and contrast-enhanced T1-weighted MRI images, using LIFEx software. Training and unseen validation splits were created using stratified 10-fold cross-validation technique along with minority over-sampling. Dimension reduction was done using collinearity analysis and feature selection (ReliefF). Classifications were done using adaptive boosting, k-nearest neighbours, naive Bayes, neural network, random forest, stochastic gradient descent, and support vector machine.
Neural Architecture Search Could Tune AI's Algorithmic Heart - InformationWeek
Data science has evolved far beyond science. It now represents the heart and soul of many disruptive business applications. Everywhere you look, enterprise data science practices have become industrialized within 24x7 DevOps workflows. Under that trend, automation has come to practically every process in the machine-learning DevOps pipeline that surrounds AI. Modeling is the next and perhaps ultimate milestone in the move toward end-to-end, data-science pipeline automation.
Nonconvex Low-Rank Symmetric Tensor Completion from Noisy Data
Cai, Changxiao, Li, Gen, Poor, H. Vincent, Chen, Yuxin
We study a noisy symmetric tensor completion problem of broad practical interest, namely, the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm --- (vanilla) gradient descent following a rough initialization --- that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all individual tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e. minimal sample complexity and optimal estimation accuracy). The estimation errors are evenly spread out across all entries, thus achieving optimal $\ell_{\infty}$ statistical accuracy. The insight conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.
Variance Reduced Stochastic Proximal Algorithm for AUC Maximization
Stochastic Gradient Descent has been widely studied with classification accuracy as a performance measure. However, these stochastic algorithms cannot be directly used when non-decomposable pairwise performance measures are used such as Area under the ROC curve (AUC) which is a common performance metric when the classes are imbalanced. There have been several algorithms proposed for optimizing AUC as a performance metric, and one of the recent being a stochastic proximal gradient algorithm (SPAM). But the downside of the stochastic methods is that they suffer from high variance leading to slower convergence. To combat this issue, several variance reduced methods have been proposed with faster convergence guarantees than vanilla stochastic gradient descent. Again, these variance reduced methods are not directly applicable when non-decomposable performance measures are used. In this paper, we develop a Variance Reduced Stochastic Proximal algorithm for AUC Maximization (\textsc{VRSPAM}) and perform a theoretical analysis as well as empirical analysis to show that our algorithm converges faster than SPAM which is the previous state-of-the-art for the AUC maximization problem.
r/MachineLearning - [P] Predict figure skating world championship ranking from season performances (part 2: hybrid models learned by gradient descent)
I previous posted the write-up on the first part of my project (Github repo) to predict how skaters would rank in the figure skating world championship from earlier scores that they earned in the season. The main idea is to separate the skater effect, the intrinsic ability of each skater, from the event effect, the influence of an event on a skater's performance, so that a more accurate ranking could be built. Unfortunately, this model does not have a closed-form solution to learn the parameters as opposed to the earlier models. Therefore, gradient descent was used to learn them, which resulted in this neat little animation that tracks how the model residuals, RMSE, as well as predicted ranking gets better and better as gradient descent runs. I also explore different strategies to reduce model overfit (so that it can predict skater ranking more accurately), using familiar methods such as model penalization and early stopping. Lastly, note that this hybrid model is nothing but factorizing the event-skater score matrix into an event-specific vector and skater-specific vector, which can multiply together to approximate the score matrix.
How implicit regularization of Neural Networks affects the learned function -- Part I
Heiss, Jakob, Teichmann, Josef, Wutte, Hanna
Today, various forms of neural networks are trained to perform approximation tasks in many fields. However, the solutions obtained are not wholly understood. Empirical results suggest that the training favors regularized solutions. These observations motivate us to analyze properties of the solutions found by the gradient descent algorithm frequently employed to perform the training task. As a starting point, we consider one dimensional (shallow) neural networks in which weights are chosen randomly and only the terminal layer is trained. We show, that the resulting solution converges to the smooth spline interpolation of the training data as the number of hidden nodes tends to infinity. This might give valuable insight on the properties of the solutions obtained using gradient descent methods in general settings.
Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates
Negrea, Jeffrey, Haghifam, Mahdi, Dziugaite, Gintare Karolina, Khisti, Ashish, Roy, Daniel M.
Information-Theoretic Generalization Bounds for SGLD via Data-Dependent EstimatesJeffrey Negrea University of Toronto, V ector Institute Mahdi Haghifam University of Toronto, Element AI Gintare Karolina Dziugaite Element AI Ashish Khisti University of Toronto Daniel M. Roy University of Toronto, V ector Institute Abstract In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and V eeravalli (2019). Our main contributions are significantly improved mutual information bounds for Stochastic Gradient Langevin Dynamics via data-dependent estimates. Our approach is based on the variational characterization of mutual information and the use of data-dependent priors that forecast the mini-batch gradient based on a subset of the training samples. Our approach is broadly applicable within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). Our bound can be tied to a measure of flatness of the empirical risk surface. As compared with other bounds that depend on the squared norms of gradients, empirical investigations show that the terms in our bounds are orders of magnitude smaller. 1 Introduction Stochastic subgradient methods, especially stochastic gradient descent (SGD), are at the core of recent advances in deep-learning practice. Despite some progress, developing a precise understanding of generalization error for that class of algorithms remains wide open. Concurrently, there has been steady progress for noisy variants of SGD, such as stochastic gradient Langevin dynamics (SGLD) [13, 25, 33] and its full-batch counterpart, the Langevin algorithm [13]. The introduction of Gaussian noise to the iterates of SGD expands the set of theoretical frameworks that can be brought to bear on the study of generalization. In pioneering work, Raginsky, Rakhlin, and Telgarsky [25] exploit the fact that SGLD approximates Langevin diffusion, a continuous time Markov process, in the small step size limit. One drawback of this and related analyses involving Markov processes is the reliance on mixing. We hypothesize that SGLD is not mixing in practice, so results based upon mixing may not be representative of empirical performance. In recent work, Pensia, Jog, and Loh [23] perform a stepwise analysis of a family of noisy iterative algorithms that includes SGLD and the Langevin algorithm. At the foundation of this work is the framework of Russo and Zou [28] and Xu and Raginsky [34], where mean generalization error is controlled in terms of the mutual information between the dataset and the learned parameters.