Optimization
Random Shuffling Beats SGD after Finite Epochs
HaoChen, Jeffery Z., Sra, Suvrit
A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.
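The difference between the two sampling schemes in the abstract above can be illustrated with a minimal sketch. It is not the authors' experiment: the toy objective f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2, the 1/t step size, and the function name `run` are all assumptions made for illustration. On this particular problem with this step size the iterate is a running average of the sampled a_i, so the without-replacement variant lands on the exact minimizer at the end of every epoch. That is a special case and says nothing about the general rates stated in the abstract.

```python
import random


def run(targets, epochs, shuffle, seed=0):
    """Incremental gradient steps on f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2.

    shuffle=True  -> RandomShuffle: a fresh permutation of the n components per epoch.
    shuffle=False -> SGD: n indices drawn uniformly with replacement per epoch.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = 0.0
    t = 0
    for _ in range(epochs):
        if shuffle:
            order = list(range(n))
            rng.shuffle(order)  # without replacement
        else:
            order = [rng.randrange(n) for _ in range(n)]  # with replacement
        for i in order:
            t += 1
            step = 1.0 / t  # assumed step-size schedule, chosen for illustration
            x -= step * (x - targets[i])  # gradient of 0.5 * (x - a_i)^2
    return x
```

Calling `run(targets, epochs, True)` and `run(targets, epochs, False)` on the same data and comparing both results with the mean of `targets` shows the gap between the two schemes on this toy problem.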
Quadratic Decomposable Submodular Function Minimization
Li, Pan, He, Niao, Milenkovic, Olgica
We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization. The problem arises in many learning on graphs and hypergraphs settings and is closely related to decomposable submodular function minimization. We approach the problem via a new dual strategy and describe an objective that may be optimized via random coordinate descent (RCD) methods and projections onto cones. We also establish the linear convergence rate of the RCD algorithm and develop efficient projection algorithms with provable performance guarantees. Numerical experiments in transductive learning on hypergraphs confirm the efficiency of the proposed algorithm and demonstrate the significant improvements in prediction accuracy with respect to state-of-the-art methods.
Piecewise Approximations of Black Box Models for Model Interpretation
Ahuja, Kartik, Zame, William R., van der Schaar, Mihaela
Machine learning models have proved extremely successful for a wide variety of supervised learning problems, but the predictions of many of these models are difficult to interpret. A recent line of work interprets the predictions of more general "black-box" machine learning models by approximating these models in terms of simpler models such as piecewise linear or piecewise constant models. The existing literature constructs these approximations in an ad hoc manner. We provide a tractable dynamic programming algorithm that partitions the feature space into clusters in a principled way and then uses this partition to provide both piecewise constant and piecewise linear interpretations of an arbitrary "black-box" model. When loss is measured in terms of mean squared error, our approximation is optimal (under certain conditions); for more general loss functions, our interpretation is probably approximately optimal (in the sense of PAC learning). Experiments with real and synthetic data show that it continues to provide significant improvements (in terms of mean squared error) over competing approaches.
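To give a feel for how dynamic programming can produce a piecewise constant approximation that is optimal in squared error, here is a minimal sketch of the classic one-dimensional segmentation recurrence. It is not the algorithm from the abstract above, which partitions a feature space. The function name `piecewise_constant_fit`, the restriction to an ordered 1D sequence, and the fixed number of pieces `k` are assumptions made for illustration.

```python
def piecewise_constant_fit(values, k):
    """Split `values` into k contiguous pieces minimizing total squared error.

    Each piece is approximated by its mean. Returns (bounds, cost), where bounds
    is a list of half-open index ranges (start, end). Requires 1 <= k <= len(values).
    """
    n = len(values)
    # Prefix sums let us evaluate the squared error of any piece in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        # Squared error of values[i:j] around its own mean (j > i).
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of covering values[:j] with exactly m pieces.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + sse(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    cut[m][j] = i

    # Walk the stored cut points backwards to recover the pieces.
    bounds = []
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, best[k][n]
```

In an interpretation setting, `values` would be the black-box model's predictions sorted along a single feature, which is again an assumption of this sketch rather than a description of the paper's method.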
Volkswagen electric car powered by sweeteners smashes hill climbing record at Pikes Peak
Volkswagen has shown off the sporty side of its electric technology by setting an all-time record in the annual Pikes Peak International Hill Climb in Colorado. Former Le Mans winner Romain Dumas took the I.D. R Pikes Peak prototype up in a time of seven minutes 57.148 seconds on the 19.9 km mountain road on Sunday. That was 16 seconds quicker than the 2013 record set by fellow-Frenchman Sebastien Loeb in a 3.2 litre V6 engined Peugeot 208. The radical car was fuelled by glycerol, a sugar alcohol often used as a sweetener in food.
Multi-objective Model-based Policy Search for Data-efficient Learning with Sparse Rewards
Kaushik, Rituraj, Chatzilygeroudis, Konstantinos, Mouret, Jean-Baptiste
The most data-efficient algorithms for reinforcement learning in robotics are model-based policy search algorithms, which alternate between learning a dynamical model of the robot and optimizing a policy to maximize the expected return given the model and its uncertainties. However, the current algorithms lack an effective exploration strategy to deal with sparse or misleading reward scenarios: if they do not experience any state with a positive reward during the initial random exploration, they are very unlikely to solve the problem. Here, we propose a novel model-based policy search algorithm, Multi-DEX, that leverages a learned dynamical model to efficiently explore the task space and solve tasks with sparse rewards in a few episodes. To achieve this, we frame the policy search problem as a multi-objective, model-based policy optimization problem with three objectives: (1) generate maximally novel state trajectories, (2) maximize the expected return and (3) keep the system in state-space regions for which the model is as accurate as possible. We then optimize these objectives using a Pareto-based multi-objective optimization algorithm. The experiments show that Multi-DEX is able to solve sparse reward scenarios (with a simulated robotic arm) in much lower interaction time than VIME, TRPO, GEP-PG, CMA-ES and Black-DROPS.
Accelerating likelihood optimization for ICA on real signals
Ablin, Pierre, Cardoso, Jean-François, Gramfort, Alexandre
We study optimization methods for solving the maximum likelihood formulation of independent component analysis (ICA). We consider both the problem constrained to white signals and the unconstrained problem. The Hessian of the objective function is costly to compute, which renders Newton's method impractical for large data sets. Many algorithms proposed in the literature can be rewritten as quasi-Newton methods, for which the Hessian approximation is cheap to compute. These algorithms are very fast on simulated data where the linear mixture assumption really holds. However, on real signals, we observe that their rate of convergence can be severely impaired. In this paper, we investigate the origins of this behavior, and show that the recently proposed Preconditioned ICA for Real Data (Picard) algorithm overcomes this issue on both constrained and unconstrained problems.
A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm
Mishchenko, Konstantin, Iutzeler, Franck, Malick, Jérôme
We develop and analyze an asynchronous algorithm for distributed convex optimization when the objective is the sum of smooth functions, local to each worker, and a non-smooth function. Unlike many existing methods, our distributed algorithm is adjustable to various levels of communication cost, delays, machines' computational power, and functions' smoothness. A unique feature is that the stepsizes depend on neither the communication delays nor the number of machines, which is highly desirable for scalability. We prove that the algorithm converges linearly in the strongly convex case, and provide guarantees of convergence for the non-strongly convex case. The obtained rates are the same as those of the vanilla proximal gradient algorithm over some introduced epoch sequence that subsumes the delays of the system. We provide numerical results on large-scale machine learning problems to demonstrate the merits of the proposed method.
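For reference, the vanilla (serial, synchronous) proximal gradient iteration that the abstract above compares against can be sketched as follows. This is not the asynchronous distributed algorithm of the paper. The toy objective 0.5 * ||x - b||^2 + lam * ||x||_1, the fixed step size, and the names `soft_threshold` and `proximal_gradient` are assumptions made for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.| applied to a scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def proximal_gradient(b, lam, step=0.5, iters=100):
    """Minimize 0.5 * ||x - b||^2 + lam * ||x||_1 by proximal gradient steps.

    The smooth part is 0.5 * ||x - b||^2 (gradient x - b); the non-smooth part
    is lam * ||x||_1, handled through its proximal operator.
    """
    x = [0.0] * len(b)
    for _ in range(iters):
        # Gradient step on the smooth term.
        forward = [xi - step * (xi - bi) for xi, bi in zip(x, b)]
        # Proximal step on the non-smooth term.
        x = [soft_threshold(fi, step * lam) for fi in forward]
    return x
```

For this separable toy problem the minimizer is known in closed form, namely soft-thresholding of `b` at level `lam`, which makes it easy to check that the iteration converges to the right point.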
Diversified Late Acceptance Search
Namazi, Majid, Sanderson, Conrad, Newton, M. A. Hakim, Polash, M. M. A., Sattar, Abdul
The well-known Late Acceptance Hill Climbing (LAHC) search aims to overcome the main downside of traditional Hill Climbing (HC) search, which is often quickly trapped in a local optimum due to strictly accepting only non-worsening moves within each iteration. In contrast, LAHC also accepts worsening moves, by keeping a circular array of fitness values of previously visited solutions and comparing the fitness values of candidate solutions against the least recent element in the array. While the straightforward strategy followed by LAHC has proven effective, there are nevertheless situations where LAHC can unfortunately behave in a similar manner to HC, even when using a large fitness array. This happens, for example, when the same fitness value is stored many times in the array, particularly when a new local optimum is found. To address this shortcoming, we propose to improve both the diversity of the accepted solutions and the diversity of values in the array through new acceptance and replacement strategies. The proposed Diversified Late Acceptance Search approach is shown to outperform the current state-of-the-art LAHC method on benchmark sets of Travelling Salesman Problem and Quadratic Assignment Problem instances.
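The baseline LAHC acceptance rule described above (a circular array of past fitness values, with candidates compared against its least recent entry) can be sketched as follows for a minimization problem. This shows plain LAHC, not the proposed diversified variant. The function signature, the `cost` and `neighbor` callables, and the default parameter values are assumptions made for illustration.

```python
import random


def lahc(cost, neighbor, start, history_length=10, iterations=2000, seed=0):
    """Late Acceptance Hill Climbing for minimization (baseline sketch).

    cost(solution) -> number; neighbor(solution, rng) -> candidate solution.
    Returns the best solution seen and its cost.
    """
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    # Circular array of fitness values, initialized with the starting cost.
    history = [current_cost] * history_length
    for it in range(iterations):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        slot = it % history_length  # least recent entry of the array
        # Accept if no worse than the value stored history_length steps ago,
        # or no worse than the current solution.
        if candidate_cost <= history[slot] or candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
        history[slot] = current_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost
```

If the array fills up with one repeated value, the first condition reduces to comparing against that single value, which is the HC-like behaviour the abstract points out.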
The Insider's Guide to Adam Optimization Algorithm for Deep Learning
Adam is the superstar optimization algorithm of deep learning. Optimization algorithms aim to find optimum weights, minimize error and maximize accuracy. We find the partial derivative of the total error with respect to each weight and use this calculation to update the weights. This is common because it works slowly but surely. In 2015, the Adam optimization algorithm was introduced. The name of the algorithm refers to adaptive moment estimation.
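As a rough illustration of what "adaptive moment estimation" means, here is a minimal sketch of the Adam update for a single scalar weight. The function name `adam_minimize`, the toy gradient used in the usage note, and the learning rate of 0.1 are assumptions made for illustration. The values of `beta1`, `beta2` and `eps` are the commonly used defaults.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam on a scalar weight, given a function returning the gradient."""
    x = x0
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both estimates start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

For example, `adam_minimize(lambda x: 2 * (x - 3), 0.0)` minimizes the error (x - 3)^2 starting from a weight of 0. The very first update has size close to `lr` regardless of the gradient's magnitude, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is then close to plus or minus one.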
Towards Optimal Transport with Global Invariances
Alvarez-Melis, David, Jegelka, Stefanie, Jaakkola, Tommi S.
Many problems in machine learning involve calculating correspondences between sets of objects, such as point clouds or images. Discrete optimal transport (OT) provides a natural and successful approach to such tasks whenever the two sets of objects can be represented in the same space or when we can evaluate distances between the objects. Unfortunately neither requirement is likely to hold when object representations are learned from data. Indeed, automatically derived representations such as word embeddings are typically fixed only up to some global transformations, for example, reflection or rotation. As a result, pairwise distances across the two types of objects are ill-defined without specifying their relative transformation. In this work, we propose a general framework for optimal transport in the presence of latent global transformations. We discuss algorithms for the specific case of orthonormal transformations, and show promising results in unsupervised word alignment.