Gradient Descent
A stochastic behavior analysis of stochastic restricted-gradient descent algorithm in reproducing kernel Hilbert spaces
Takizawa, Masa-aki, Yukawa, Masahiro, Richard, Cedric
This paper presents a stochastic behavior analysis of a kernel-based stochastic restricted-gradient descent method. The restricted gradient gives a steepest ascent direction within the so-called dictionary subspace. The analysis provides the transient and steady state performance in the mean squared error criterion. It also includes stability conditions in the mean and mean-square sense. The present study is based on the analysis of the kernel normalized least mean square (KNLMS) algorithm initially proposed by Chen et al. Simulation results validate the analysis.
Autoencoder Trees
ฤฐrsoy, Ozan, Alpaydฤฑn, Ethem
We discuss an autoencoder model in which the encoding and decoding functions are implemented by decision trees. We use the soft decision tree where internal nodes realize soft multivariate splits given by a gating function and the overall output is the average of all leaves weighted by the gating values on their path. The encoder tree takes the input and generates a lower dimensional representation in the leaves and the decoder tree takes this and reconstructs the original input. Exploiting the continuity of the trees, autoencoder trees are trained with stochastic gradient descent. On handwritten digit and news data, we see that the autoencoder trees yield good reconstruction error compared to traditional autoencoder percep-trons. We also see that the autoencoder tree captures hierarchical representations at different granularities of the data on its different levels and the leaves capture the localities in the input space.
Global Convergence of Online Limited Memory BFGS
Mokhtari, Aryan, Ribeiro, Alejandro
Global convergence of an online (stochastic) limited memory version of the Broyden-Fletcher- Goldfarb-Shanno (BFGS) quasi-Newton method for solving optimization problems with stochastic objectives that arise in large scale machine learning is established. Lower and upper bounds on the Hessian eigenvalues of the sample functions are shown to suffice to guarantee that the curvature approximation matrices have bounded determinants and traces, which, in turn, permits establishing convergence to optimal arguments with probability 1. Numerical experiments on support vector machines with synthetic data showcase reductions in convergence time relative to stochastic gradient descent algorithms as well as reductions in storage and computation relative to other online quasi-Newton methods. Experimental evaluation on a search engine advertising problem corroborates that these advantages also manifest in practical applications.
Optimal rates for zero-order convex optimization: the power of two function evaluations
Duchi, John C., Jordan, Michael I., Wainwright, Martin J., Wibisono, Andre
We consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients. Focusing on non-asymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and non-smooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \ge 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors.
ProPPR: Efficient First-Order Probabilistic Logic Programming for Structure Discovery, Parameter Learning, and Scalable Inference
Wang, William Yang (Carnegie Mellon University) | Mazaitis, Kathryn (Carnegie Mellon University) | Cohen, William W (Carnegie Mellon University)
A key challenge in statistical relational learning is to develop a semantically rich formalism that supports efficient probabilistic reasoning using large collections of extracted information. This paper presents a new, scalable probabilistic logic called ProPPR, which further extends stochastic logic programs (SLP) to a framework that enables efficient learning and inference on graphs: using an abductive second-order probabilistic logic, we show that first-order theories can be automatically generated via parameter learning; that in parameter learning, weight learning can be performed using parallel stochastic gradient descent with a supervised personalized PageRank algorithm; and that most importantly, queries can be approximately grounded with a small graph, and inference is independent of the size of the database.
Gradient Descent with Proximal Average for Nonconvex and Composite Regularization
Zhong, Wenliang (Hong Kong University of Science and Technology) | Kwok, James T. (Hong Kong University of Science and Technology)
Sparse modeling has been highly successful in many real-world applications. While a lot of interests have been on convex regularization, recent studies show that nonconvexregularizers can outperform their convex counterparts in many situations.However, the resulting nonconvex optimization problems are often challenging, especiallyfor composite regularizers such as the nonconvex overlapping group lasso. In thispaper, byusing a recent mathematical tool known as the proximal average,we propose a novel proximal gradient descent method for optimization with a wide class of nonconvex and composite regularizers.Instead of directlysolving the proximal stepassociated with a composite regularizer, we average thesolutions from the proximal problems of the constituent regularizers. This simple strategy has guaranteed convergenceand low per-iteration complexity.Experimental results on a number of synthetic andreal-world data sets demonstrate the effectiveness and efficiency of theproposed optimization algorithm, and also the improved classification performanceresulting from thenonconvex regularizers.
Learning Relative Similarity by Stochastic Dual Coordinate Ascent
Wu, Pengcheng (Nanyang Technological University) | Ding, Yi (Nanyang Technological University) | Zhao, Peilin (Rutgers University) | Miao, Chunyan (Nanyang Technological University) | Hoi, Steven C. H. (Nanyang Technological University)
Learning relative similarity from pairwise instances is an important problem in machine learning and has a wide range of applications. Despite being studied for years, some existing methods solved by Stochastic Gradient Descent (SGD) techniques generally suffer from slow convergence. In this paper, we investigate the application of Stochastic Dual Coordinate Ascent (SDCA) technique to tackle the optimization task of relative similarity learning by extending from vector to matrix parameters. Theoretically, we prove the optimal linear convergence rate for the proposed SDCA algorithm, beating the well-known sublinear convergence rate by the previous best metric learning algorithms. Empirically, we conduct extensive experiments on both standard and large-scale data sets to validate the effectiveness of the proposed algorithm for retrieval tasks.
Fast Multi-Instance Multi-Label Learning
Huang, Sheng-Jun (Nanjing University) | Gao, Wei (Nanjing University) | Zhou, Zhi-Hua (Nanjing University)
In multi-instance multi-label learning (MIML), one object is represented by multiple instances and simultaneously associated with multiple labels. Existing MIML approaches have been found useful in many applications; however, most of them can only handle moderate-sized data. To efficiently handle large data sets, we propose the MIMLfast approach, which first constructs a low-dimensional subspace shared by all labels, and then trains label specific linear models to optimize approximated ranking loss via stochastic gradient descent. Although the MIML problem is complicated, MIMLfast is able to achieve excellent performance by exploiting label relations with shared space and discovering sub-concepts for complicated labels. Experiments show that the performance of MIMLfast is highly competitive to state-of-the-art techniques, whereas its time cost is much less; particularly, on a data set with 30K bags and 270K instances, where none of existing approaches can return results in 24 hours, MIMLfast takes only 12 minutes. Moreover, our approach is able to identify the most representative instance for each label, and thus providing a chance to understand the relation between input patterns and output semantics.
Natural Temporal Difference Learning
Dabney, William (University of Massachusetts Amherst) | Thomas, Philip (University of Massachusetts Amherst)
In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. We present and analyze quadratic and linear time natural temporal difference learning algorithms, and prove that they are covariant. We conclude with experiments which suggest that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon their non-natural counterparts when using non-linear function approximation.
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance, thanks to their scalability. While various methods have been proposed to speed up their convergence, the model selection phase is often ignored. In fact, in theoretical works most of the time assumptions are made, for example, on the prior knowledge of the norm of the optimal solution, while in the practical world validation methods remain the only viable approach. In this paper, we propose a new kernel-based stochastic gradient descent algorithm that performs model selection while training, with no parameters to tune, nor any form of cross-validation. The algorithm builds on recent advancement in online learning theory for unconstrained settings, to estimate over time the right regularization in a data-dependent way. Optimal rates of convergence are proved under standard smoothness assumptions on the target function, using the range space of the fractional integral operator associated with the kernel.