Goto

Collaborating Authors

 Optimization


Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

arXiv.org Machine Learning

Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While current methods offer efficiencies by adaptively choosing new configurations to train, an alternative strategy is to adaptively allocate resources across the selected configurations. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinitely many armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with state-of-the-art methods on a suite of hyperparameter optimization problems. We observe that Hyperband provides five times to thirty times speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems.


On Design Mining: Coevolution and Surrogate Models

arXiv.org Artificial Intelligence

Design mining [54, 55, 56] is the use of computational intelligence techniques to iteratively search and model the attribute space of physical objects evaluated directly through rapid prototyping to meet given objectives. It enables the exploitation of novel materials and processes without formal models or complex simulation, whilst harnessing the creativity of both computational and human design methods. A sample-model-search-sample loop creates an agile/flexible approach, i.e., primarily test-driven, enabling a continuing process of prototype design consideration and criteria refinement by both producers and users. Computational intelligence techniques have long been used in design, particularly for optimisation within simulations/models. Recent developments in additive-layer manufacturing (3D printing) means that it is now possible to work with over a hundred different materials, from ceramics to cells.


Randomized Distributed Mean Estimation: Accuracy vs Communication

arXiv.org Machine Learning

We consider the problem of estimating the arithmetic average of a finite collection of real vectors stored in a distributed fashion across several compute nodes subject to a communication budget constraint. Our analysis does not rely on any statistical assumptions about the source of the vectors. This problem arises as a subproblem in many applications, including reduce-all operations within algorithms for distributed and federated optimization and learning. We propose a flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error. Our family contains the full-communication and zero-error method on one extreme, and an $\epsilon$-bit communication and ${\cal O}\left(1/(\epsilon n)\right)$ error method on the opposite extreme. In the special case where we communicate, in expectation, a single bit per coordinate of each vector, we improve upon existing results by obtaining $\mathcal{O}(r/n)$ error, where $r$ is the number of bits used to represent a floating point value.


Limbo: A Fast and Flexible Library for Bayesian Optimization

arXiv.org Machine Learning

Bayesian Optimization (BO) is designed for the most challenging ones: when the gradient is unknown, evaluating a solution is costly, and evaluations are noisy. This is, for instance, the case when we want to find optimal parameters for a machine learning algorithm [Snoek et al., 2012], because testing a set of parameters can take hours, and because of the stochastic nature of many machine learning algorithms. Besides parameter tuning, Bayesian optimization recently attracted a lot of interest for direct policy search in robot learning [Lizotte et al., 2007, Wilson et al., 2014, Calandra et al., 2016] and online adaptation; for example, it was recently used to allow a legged robot to learn a new gait after a mechanical damage in about 10-15 trials (2 minutes) [Cully et al., 2015]. At its core, Bayesian optimization builds a probabilistic model of the function to be optimized (the reward/performance/cost function) using the samples that have already been evaluated [Shahriari et al., 2016]; usually, this model is a Gaussian process [Williams and Rasmussen, 2006]. To select the next sample to be evaluated, Bayesian optimization optimizes an acquisition function which leverages the model to predict the most promising areas of the search space.


A Randomized Rounding Algorithm for Sparse PCA

arXiv.org Machine Learning

Large matrices are a common way of representing modern, massive datasets, since an m n real-valued matrix X provides a natural structure for encoding information about m objects, each of which is described by n features. Principal Components Analysis (PCA) and the Singular Value Decomposition (SVD) are fundamental data analysis tools, expressing a data matrix in terms of a sequence of orthogonal vectors of decreasing importance. While these vectors enjoy strong optimality properties and are often interpreted as fundamental latent factors that underlie the observed data, they are linear combinations of up to all the data points and features. As a result, they are notoriously difficult to interpret in terms of the underlying processes generating the data [MD09]. The seminal work of [dGJ07] introduced the concept of Sparse PCA, where sparsity constraints are enforced on the singular vectors in order to improve interpretability. As noted in [dGJ07, MD09, PDK13], an example where sparsity implies interpretability is document analysis, where sparse principal components can be mapped to specific topics by inspecting the (few) keywords in their support.


An Elementary Proof of Convex Phase Retrieval in the Natural Parameter Space via the Linear Program PhaseMax / Compressed Sensing from Phaseless Gaussian Measurements via Linear Programming in the Natural Parameter Space

#artificialintelligence

The phase retrieval problem has garnered significant attention since the development of the PhaseLift algorithm, which is a convex program that operates in a lifted space of matrices. Because of the substantial computational cost due to lifting, many approaches to phase retrieval have been developed, including non-convex optimization algorithms which operate in the natural parameter space, such as Wirtinger Flow. Very recently, a convex formulation called PhaseMax has been discovered, and it has been proven to achieve phase retrieval via linear programming in the natural parameter space under optimal sample complexity. The current proofs of PhaseMax rely on statistical learning theory or geometric probability theory. Here, we present a short and elementary proof that PhaseMax exactly recovers real-valued vectors from random measurements under optimal sample complexity.


A new approach to Laplacian solvers and flow problems

arXiv.org Machine Learning

This paper investigates message-passing algorithms for solving systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. These two problems are fundamental primitives that arise in several domains such as computer science, electrical engineering, operations research, and machine learning. Despite the extensive literature on approximately solving these problems in quasi-linear time, the algorithms that have been proposed are typically centralized and involve multiple graph theoretic constructions or sampling mechanisms that make them difficult to implement and analyze. On the other hand, message-passing routines are distributed, simple, and easy to implement. In this paper we establish a framework to analyze message-passing algorithms to solve voltage and flow problems. We characterize the error committed by the algorithms in d-regular graphs with equal weights. We show that the convergence of the algorithms is controlled by the total variation distance between the distributions of non-backtracking random walks that start from neighbor nodes. More broadly, our analysis of message-passing introduces new insights to address generic optimization problems with constraints. Laplacian solver, flow problem, message-passing, min-sum, distributed algo-Keywords: rithms.


Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent

arXiv.org Machine Learning

A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling large scale problems in machine learning, signal processing, and convex programming. This work is developing techniques that help to explain the surprising effectiveness of relatively simple first-order algorithms for certain nonconvex optimizations. When applied to problems that can be formulated as semidefinite programs, these techniques can often be viewed as part of a framework proposed by Burer and Monteiro [4]. The Burer-Monteiro technique is based on factoring the semidefinite variable, and applying classical optimization techniques to the resulting nonconvex objective over the factor. While worst-case complexity considerations imply that such an approach cannot succeed in general, a series of recent papers [11, 40, 35, 13, 1] has shown the strategy to be remarkably effective for a number of problems of practical interest, with analytical convergence guarantees and strong empirical performance. In this paper, we enlarge the collection of problems to which the Burer-Monteiro technique can be successfully applied, by analyzing the convergence properties of gradient descent applied to the problem of rectangular matrix completion from incomplete measurements.


Faster variational inducing input Gaussian process classification

arXiv.org Machine Learning

Gaussian processes (GP) provide a prior over functions and allow finding complex regularities in data. Gaussian processes are successfully used for classification/regression problems and dimensionality reduction. In this work we consider the classification problem only. The complexity of standard methods for GP-classification scales cubically with the size of the training dataset. This complexity makes them inapplicable to big data problems. Therefore, a variety of methods were introduced to overcome this limitation. In the paper we focus on methods based on so called inducing inputs. This approach is based on variational inference and proposes a particular lower bound for marginal likelihood (evidence). This bound is then maximized w.r.t. parameters of kernel function of the Gaussian process, thus fitting the model to data. The computational complexity of this method is $O(nm^2)$, where $m$ is the number of inducing inputs used by the model and is assumed to be substantially smaller than the size of the dataset $n$. Recently, a new evidence lower bound for GP-classification problem was introduced. It allows using stochastic optimization, which makes it suitable for big data problems. However, the new lower bound depends on $O(m^2)$ variational parameter, which makes optimization challenging in case of big m. In this work we develop a new approach for training inducing input GP models for classification problems. Here we use quadratic approximation of several terms in the aforementioned evidence lower bound, obtaining analytical expressions for optimal values of most of the parameters in the optimization, thus sufficiently reducing the dimension of optimization space. In our experiments we achieve as well or better results, compared to the existing method. Moreover, our method doesn't require the user to manually set the learning rate, making it more practical, than the existing method.


Generalizing diffuse interface methods on graphs: non-smooth potentials and hypergraphs

arXiv.org Machine Learning

Diffuse interface methods have recently been introduced for the task of semi-supervised learning. The underlying model is well-known in materials science but was extended to graphs using a Ginzburg--Landau functional and the graph Laplacian. We here generalize the previously proposed model by a non-smooth potential function. Additionally, we show that the diffuse interface method can be used for the segmentation of data coming from hypergraphs. For this we show that the graph Laplacian in almost all cases is derived from hypergraph information. Additionally, we show that the formerly introduced hypergraph Laplacian coming from a relaxed optimization problem is well suited to be used within the diffuse interface method. We present computational experiments for graph and hypergraph Laplacians.