Gradient Descent
AI Nanodegree Program Syllabus: Term 2 (Deep Learning), In Depth
Here at Udacity, we are tremendously excited to announce the kick-off of the second term of our Artificial Intelligence Nanodegree program. Because we are able to provide a depth of education that is commensurate with university education; because we are bridging the gap between universities and industry by providing you with hands-on projects and partnering with the top industries in the field; and last but certainly not least, because we are able to bring this education to many more people across the globe, at a cost that makes a top-notch AI education realistic for all aspiring learners. During the first term, you've enjoyed learning about Game Playing Agents, Simulated Annealing, Constraint Satisfaction, Logic and Planning, and Probabilistic AI from some of the biggest names in the field: Sebastian Thrun, Peter Norvig, and Thad Starner. Term 2 will be focused on one of the cutting-edge advancements of AI -- Deep Learning. In this Term, you will learn about the foundations of neural networks, understand how to train these neural networks with techniques such as gradient descent and backpropagation, and learn different types of architectures that make neural networks work for a variety of different applications.
Performance Limits of Stochastic Sub-Gradient Learning, Part I: Single Agent Case
The minimization of non-differentiable convex cost functions is a critical step in the solution of many design problems [3]-[5], including the design of sparse-aware (LASSO) solutions [6], [7], support-vector machine (SVM) learners [8]-[12], or total-variation-based image denoising solutions [13], [14]. Several powerful techniques have been proposed in the literature to deal with the non-differentiability aspect of the problem formulation, including methods that employ sub-gradient iterations [3]-[5], cutting-plane techniques [15], or proximal iterations [16], [17]. This work focuses on the class of sub-gradient methods for the reasons explained in the sequel. The sub-gradient technique is closely related to the traditional gradient-descent method [3], [4] where the actual gradient is replaced by a sub-gradient at points of nondifferentiability. It is one of the simplest methods in current practice but is known to suffer from slow convergence. For instance, it is shown in [5] that, for convex cost functions, the optimal convergence rate that can be delivered by sub-gradient methods in deterministic optimization problems cannot be faster than O(1/ i) under worst case conditions, where i is the iteration index.
Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Gross, Sam, Ranzato, Marc'Aurelio, Szlam, Arthur
Training convolutional networks (CNN's) that fit on a single GPU with minibatch stochastic gradient descent has become effective in practice. However, there is still no effective method for training large CNN's that do not fit in the memory of a few GPU cards, or for parallelizing CNN training. In this work we show that a simple hard mixture of experts model can be efficiently trained to good effect on large scale hashtag (multilabel) prediction tasks. Mixture of experts models are not new [7, 3], but in the past, researchers have had to devise sophisticated methods to deal with data fragmentation. We show empirically that modern weakly supervised data sets are large enough to support naive partitioning schemes where each data point is assigned to a single expert. Because the experts are independent, training them in parallel is easy, and evaluation is cheap for the size of the model. Furthermore, we show that we can use a single decoding layer for all the experts, allowing a unified feature embedding space. We demonstrate that it is feasible (and in fact relatively painless) to train far larger models than could be practically trained with standard CNN architectures, and that the extra capacity can be well used on current datasets.
Larger is Better: The Effect of Learning Rates Enjoyed by Stochastic Optimization with Progressive Variance Reduction
In this paper, we propose a simple variant of the original stochastic variance reduction gradient (SVRG), where hereafter we refer to as the variance reduced stochastic gradient descent (VR-SGD). Different from the choices of the snapshot point and starting point in SVRG and its proximal variant, Prox-SVRG, the two vectors of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. This setting allows us to use much larger learning rates or step sizes than SVRG, e.g., 3/(7L) for VR-SGD vs 1/(10L) for SVRG, and also makes our convergence analysis more challenging. In fact, a larger learning rate enjoyed by VR-SGD means that the variance of its stochastic gradient estimator asymptotically approaches zero more rapidly. Unlike common stochastic methods such as SVRG and proximal stochastic methods such as Prox-SVRG, we design two different update rules for smooth and non-smooth objective functions, respectively. In other words, VR-SGD can tackle non-smooth and/or non-strongly convex problems directly without using any reduction techniques such as quadratic regularizers. Moreover, we analyze the convergence properties of VR-SGD for strongly convex problems, which show that VR-SGD attains a linear convergence rate. We also provide the convergence guarantees of VR-SGD for non-strongly convex problems. Experimental results show that the performance of VR-SGD is significantly better than its counterparts, SVRG and Prox-SVRG, and it is also much better than the best known stochastic method, Katyusha.
On the Gap Between Strict-Saddles and True Convexity: An Omega(log d) Lower Bound for Eigenvector Approximation
Simchowitz, Max, Alaoui, Ahmed El, Recht, Benjamin
We prove a \emph{query complexity} lower bound on rank-one principal component analysis (PCA). We consider an oracle model where, given a symmetric matrix $M \in \mathbb{R}^{d \times d}$, an algorithm is allowed to make $T$ \emph{exact} queries of the form $w^{(i)} = Mv^{(i)}$ for $i \in \{1,\dots,T\}$, where $v^{(i)}$ is drawn from a distribution which depends arbitrarily on the past queries and measurements $\{v^{(j)},w^{(j)}\}_{1 \le j \le i-1}$. We show that for a small constant $\epsilon$, any adaptive, randomized algorithm which can find a unit vector $\widehat{v}$ for which $\widehat{v}^{\top}M\widehat{v} \ge (1-\epsilon)\|M\|$, with even small probability, must make $T = \Omega(\log d)$ queries. In addition to settling a widely-held folk conjecture, this bound demonstrates a fundamental gap between convex optimization and "strict-saddle" non-convex optimization of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension. Our argument proceeds via a reduction to estimating the rank-one spike in a deformed Wigner model. We establish lower bounds for this model by developing a "truncated" analogue of the $\chi^2$ Bayes-risk lower bound of Chen et al.
Gaussian variational approximation with sparse precision matrices
Tan, Linda S. L., Nott, David J.
The stochastic gradients constructed in this manner are "doubly stochastic" as they are built upon two sources of stochasticity that comes from sampling from the variational distribution and the full data set. This approach is very general in that it can be applied to any model where the joint density is differentiable. Unlike variational Bayes, it does not assume independence relationships among blocks of an appropriate partition of θ. Such independence assumptions have been shown to result in underestimation of the posterior variance (Wang and Titterington, 2005; Bishop, 2006). The quality of the resulting approximation is thus limited only by how well the form of q(θ) matches the true posterior. Using this approach, Kucukelbir et al. (2016) develop an automatic differentiation variational inference (ADVI) algorithm in Stan, where q(θ) is assumed to be either a diagonal (meanfield) or unrestricted Gaussian variational approximation. Constrained variables are transformed to the real line via Stan's library of transformations and the gradients are computed using Monte Carlo integration. They note that while unrestricted ADVI is able to capture posterior correlations and hence produces more accurate marginal variance estimates than mean field ADVI, it can be prohibitively slow for large data since the number of variational parameters scales as the square of the length of θ. In this article, we consider variational approximations which take the form of a multivariate Gaussian distribution N(µ, Σ) for models with high-dimensional parameters (µ denotes the mean and Σ the covariance matrix).
Riemannian stochastic variance reduced gradient
Sato, Hiroyuki, Kasai, Hiroyuki, Mishra, Bamdev
Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large but finite number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a manifold search space. The key challenges of averaging, adding, and subtracting multiple gradients are addressed with retraction and vector transport. We present a global convergence analysis of the proposed algorithm with a decay step size and a local convergence rate analysis under a fixed step size under some natural assumptions. The proposed algorithm is applied to problems on the Grassmann manifold, such as principal component analysis, low-rank matrix completion, and computation of the Karcher mean of subspaces, and outperforms the standard Riemannian stochastic gradient descent algorithm in each case.
Riemannian stochastic variance reduced gradient on Grassmann manifold
Kasai, Hiroyuki, Sato, Hiroyuki, Mishra, Bamdev
Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite, number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a compact manifold search space. To this end, we show the developments on the Grassmann manifold. The key challenges of averaging, addition, and subtraction of multiple gradients are addressed with notions like logarithm mapping and parallel translation of vectors on the Grassmann manifold. We present a global convergence analysis of the proposed algorithm with decay step-sizes and a local convergence rate analysis under fixed step-size with some natural assumptions. The proposed algorithm is applied on a number of problems on the Grassmann manifold like principal components analysis, low-rank matrix completion, and the Karcher mean computation. In all these cases, the proposed algorithm outperforms the standard Riemannian stochastic gradient descent algorithm.
Optimal algorithms for smooth and strongly convex distributed optimization in networks
Scaman, Kevin, Bach, Francis, Bubeck, Sébastien, Lee, Yin Tat, Massoulié, Laurent
In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_g}(1+\Delta\tau)\ln(1/\varepsilon))$, where $\kappa_g$ is the condition number of the (global) function to optimize, $\Delta$ is the diameter of the network, and $\tau$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_l}(1+\frac{\tau}{\sqrt{\gamma}})\ln(1/\varepsilon))$, where $\kappa_l$ is the condition number of the local functions and $\gamma$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.