Optimization
Learning of Gaussian Processes in Distributed and Communication Limited Systems
Tavassolipour, Mostafa, Motahari, Seyed Abolfazl, Shalmani, Mohammad-Taghi Manzuri
It is of fundamental importance to find algorithms obtaining optimal performance for learning of statistical models in distributed and communication limited systems. Aiming at characterizing the optimal strategies, we consider learning of Gaussian Processes (GPs) in distributed systems as a pivotal example. We first address a very basic problem: how many bits are required to estimate the inner-products of Gaussian vectors across distributed machines? Using information theoretic bounds, we obtain an optimal solution for the problem which is based on vector quantization. Two suboptimal and more practical schemes are also presented as substitute for the vector quantization scheme. In particular, it is shown that the performance of one of the practical schemes which is called per-symbol quantization is very close to the optimal one. Schemes provided for the inner-product calculations are incorporated into our proposed distributed learning methods for GPs. Experimental results show that with spending few bits per symbol in our communication scheme, our proposed methods outperform previous zero rate distributed GP learning schemes such as Bayesian Committee Model (BCM) and Product of experts (PoE).
Metacontrol for Adaptive Imagination-Based Optimization
Hamrick, Jessica B., Ballard, Andrew J., Pascanu, Razvan, Vinyals, Oriol, Heess, Nicolas, Battaglia, Peter W.
Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run--especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this "one-size-fits-all" approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of "imagined" internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run, as well as which model to consult on each iteration. The models (which we call "experts") can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with "interaction networks" (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex non-linear dynamics. The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning. While there have been significant recent advances in deep reinforcement learning (Mnih et al., 2015; Silver et al., 2016) and control (Lillicrap et al., 2015; Levine et al., 2016), most efforts train a network that performs a fixed sequence of computations. Here we introduce an alternative in which an agent uses a metacontroller to choose which, and how many, computations to perform.
The GPU-based Parallel Ant Colony System
The Ant Colony System (ACS) is, next to Ant Colony Optimization (ACO) and the MAX-MIN Ant System (MMAS), one of the most efficient metaheuristic algorithms inspired by the behavior of ants. In this article we present three novel parallel versions of the ACS for the graphics processing units (GPUs). To the best of our knowledge, this is the first such work on the ACS which shares many key elements of the ACO and the MMAS, but differences in the process of building solutions and updating the pheromone trails make obtaining an efficient parallel version for the GPUs a difficult task. The proposed parallel versions of the ACS differ mainly in their implementations of the pheromone memory. The first two use the standard pheromone matrix, and the third uses a novel selective pheromone memory. Computational experiments conducted on several Travelling Salesman Problem (TSP) instances of sizes ranging from 198 to 2392 cities showed that the parallel ACS on Nvidia Kepler GK104 GPU (1536 CUDA cores) is able to obtain a speedup up to 24.29x vs the sequential ACS running on a single core of Intel Xeon E5-2670 CPU. The parallel ACS with the selective pheromone memory achieved speedups up to 16.85x, but in most cases the obtained solutions were of significantly better quality than for the sequential ACS.
Sketching for Large-Scale Learning of Mixture Models
Keriven, Nicolas, Bourrier, Anthony, Gribonval, Rรฉmi, Pรฉrez, Patrick
Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10 8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
Scalable and Flexible Multiview MAX-VAR Canonical Correlation Analysis
Fu, Xiao, Huang, Kejun, Hong, Mingyi, Sidiropoulos, Nicholas D., So, Anthony Man-Cho
Generalized canonical correlation analysis (GCCA) aims at finding latent low-dimensional common structure from multiple views (feature vectors in different domains) of the same entities. Unlike principal component analysis (PCA) that handles a single view, (G)CCA is able to integrate information from different feature spaces. Here we focus on MAX-VAR GCCA, a popular formulation which has recently gained renewed interest in multilingual processing and speech modeling. The classic MAX-VAR GCCA problem can be solved optimally via eigen-decomposition of a matrix that compounds the (whitened) correlation matrices of the views; but this solution has serious scalability issues, and is not directly amenable to incorporating pertinent structural constraints such as non-negativity and sparsity on the canonical components. We posit regularized MAX-VAR GCCA as a non-convex optimization problem and propose an alternating optimization (AO)-based algorithm to handle it. Our algorithm alternates between {\em inexact} solutions of a regularized least squares subproblem and a manifold-constrained non-convex subproblem, thereby achieving substantial memory and computational savings. An important benefit of our design is that it can easily handle structure-promoting regularization. We show that the algorithm globally converges to a critical point at a sublinear rate, and approaches a global optimal solution at a linear rate when no regularization is considered. Judiciously designed simulations and large-scale word embedding tasks are employed to showcase the effectiveness of the proposed algorithm.
One-Class Semi-Supervised Learning: Detecting Linearly Separable Class by its Mean
Bauman, Evgeny, Bauman, Konstantin
In this paper, we presented a novel semi-supervised one-class classification algorithm which assumes that class is linearly separable from other elements. We proved theoretically that class is linearly separable if and only if it is maximal by probability within the sets with the same mean. Furthermore, we presented an algorithm for identifying such linearly separable class utilizing linear programming. We described three application cases including an assumption of linear separability, Gaussian distribution, and the case of linear separability in transformed space of kernel functions. Finally, we demonstrated the work of the proposed algorithm on the USPS dataset and analyzed the relationship of the performance of the algorithm and the size of the initially labeled sample.
Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks
Nokleby, Matthew, Bajwa, Waheed U.
Motivated by machine learning applications in networks of sensors, internet-of-things (IoT) devices, and autonomous agents, we propose techniques for distributed stochastic convex learning from high-rate data streams. The setup involves a network of nodes---each one of which has a stream of data arriving at a constant rate---that solve a stochastic convex optimization problem by collaborating with each other over rate-limited communication links. To this end, we present and analyze two algorithms---termed distributed stochastic approximation mirror descent (D-SAMD) and {\em accelerated} distributed stochastic approximation mirror descent (AD-SAMD)---that are based on two stochastic variants of mirror descent. The main collaborative step in the proposed algorithms is approximate averaging of the local, noisy subgradients using distributed consensus. While distributed consensus is well suited for collaborative learning, its use in our setup results in perturbed subgradient averages due to rate-limited links, which may slow down or prevent convergence. Our main contributions in this regard are: (i) bounds on the convergence rates of D-SAMD and AD-SAMD in terms of the number of nodes, network topology, and ratio of the data streaming and communication rates, and (ii) sufficient conditions for order-optimum convergence of D-SAMD and AD-SAMD. In particular, we show that there exist regimes under which AD-SAMD, when compared to D-SAMD, achieves order-optimum convergence with slower communications rates. This is in contrast to the centralized setting in which use of accelerated mirror descent results in a modest improvement over regular mirror descent for stochastic composite optimization. Finally, we demonstrate the effectiveness of the proposed algorithms using numerical experiments.
Group-sparse block PCA and explained variance
The paper addresses the simultneous determination of goup-sparse loadings by block optimization, and the correlated problem of defining explained variance for a set of non orthogonal components. We give in both cases a comprehensive mathematical presentation of the problem, which leads to propose i) a new formulation/algorithm for group-sparse block PCA and ii) a framework for the definition of explained variance with the analysis of five definitions. The numerical results i) confirm the superiority of block optimization over deflation for the determination of group-sparse loadings, and the importance of group information when available, and ii) show that ranking of algorithms according to explained variance is essentially independant of the definition of explained variance. These results lead to propose a new optimal variance as the definition of choice for explained variance.
Seeded Graph Matching
Fishkind, Donniell E., Adali, Sancar, Patsolic, Heather G., Meng, Lingyao, Lyzinski, Vince, Priebe, Carey E.
Given two graphs, the graph matching problem is to align the two vertex sets so as to minimize the number of adjacency disagreements between the two graphs. The seeded graph matching problem is the graph matching problem when we are first given a partial alignment that we are tasked with completing. In this paper, we modify the state-of-the-art approximate graph matching algorithm "FAQ" of Vogelstein et al. (2015) to make it a fast approximate seeded graph matching algorithm, adapt its applicability to include graphs with differently sized vertex sets, and extend the algorithm so as to provide, for each individual vertex, a nomination list of likely matches. We demonstrate the effectiveness of our algorithm via simulation and real data experiments; indeed, knowledge of even a few seeds can be extremely effective when our seeded graph matching algorithm is used to recover a naturally existing alignment that is only partially observed.
Matrix Completion via Max-Norm Constrained Optimization
Matrix completion has been well studied under the uniform sampling model and the trace-norm regularized methods perform well both theoretically and numerically in such a setting. However, the uniform sampling model is unrealistic for a range of applications and the standard trace-norm relaxation can behave very poorly when the underlying sampling scheme is non-uniform. In this paper we propose and analyze a max-norm constrained empirical risk minimization method for noisy matrix completion under a general sampling model. The optimal rate of convergence is established under the Frobenius norm loss in the context of approximately low-rank matrix reconstruction. It is shown that the max-norm constrained method is minimax rate-optimal and yields a unified and robust approximate recovery guarantee, with respect to the sampling distributions. The computational effectiveness of this method is also discussed, based on first-order algorithms for solving convex optimizations involving max-norm regularization.