Goto

Collaborating Authors

 Optimization


The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

arXiv.org Machine Learning

Various applications involve assigning discrete label values to a collection of objects based on some pairwise noisy data. Due to the discrete---and hence nonconvex---structure of the problem, computing the optimal assignment (e.g.~maximum likelihood assignment) becomes intractable at first sight. This paper makes progress towards efficient computation by focusing on a concrete joint alignment problem---that is, the problem of recovering $n$ discrete variables $x_i \in \{1,\cdots, m\}$, $1\leq i\leq n$ given noisy observations of their modulo differences $\{x_i - x_j~\mathsf{mod}~m\}$. We propose a low-complexity and model-free procedure, which operates in a lifted space by representing distinct label values in orthogonal directions, and which attempts to optimize quadratic functions over hypercubes. Starting with a first guess computed via a spectral method, the algorithm successively refines the iterates via projected power iterations. We prove that for a broad class of statistical models, the proposed projected power method makes no error---and hence converges to the maximum likelihood estimate---in a suitable regime. Numerical experiments have been carried out on both synthetic and real data to demonstrate the practicality of our algorithm. We expect this algorithmic framework to be effective for a broad range of discrete assignment problems.


Collective Neurodynamic Optimization Technology For Distributed Big Data Processing

@machinelearnbot

With the development of artificial intelligence, especially in big data, machine learning and related areas, the size and complexity of modern datasets are increasing explosively. To solve these problems with a large-scale dataset, the distributed/decentralized computing frame has been proposed and well established. There are mainly two considerations for the distributed computing: The first one is that the data itself is stored distributedly due to the considering from the sense of data storing and security. In this case, the data must be processed in a distributed manner. The second one is that the data is very large and it is difficult to be processed using the centralized method.


Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

arXiv.org Machine Learning

We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes on finite sum optimization problems. First, we show that, perhaps surprisingly, the finite sum structure by itself, is not sufficient for obtaining a complexity bound of $\tilde{\cO}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly convex individual functions - one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an `accelerated' complexity bound of $\tilde{\cO}((n+\sqrt{n L/\mu})\ln(1/\epsilon))$, unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and convex finite sums, the optimal complexity bound is $\tilde{\cO}(n+L/\epsilon)$, assuming that (on average) the same update rule is used in every iteration, and $\tilde{\cO}(n+\sqrt{nL/\epsilon})$, otherwise.


A Local Analysis of Block Coordinate Descent for Gaussian Phase Retrieval

arXiv.org Machine Learning

While convergence of the Alternating Direction Method of Multipliers (ADMM) on convex problems is well studied, convergence on nonconvex problems is only partially understood. In this paper, we consider the Gaussian phase retrieval problem, formulated as a linear constrained optimization problem with a biconvex objective. The particular structure allows for a novel application of the ADMM. It can be shown that the dual variable is zero at the global minimizer. This motivates the analysis of a block coordinate descent algorithm, which is equivalent to the ADMM with the dual variable fixed to be zero. We show that the block coordinate descent algorithm converges to the global minimizer at a linear rate, when starting from a deterministically achievable initialization point.


Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization

arXiv.org Artificial Intelligence

Many problems in artificial intelligence require adaptively making a sequence of decisions with uncertain outcomes under partial observability. Solving such stochastic optimization problems is a fundamental but notoriously difficult challenge. In this paper, we introduce the concept of adaptive submodularity, generalizing submodular set functions to adaptive policies. We prove that if a problem satisfies this property, a simple adaptive greedy algorithm is guaranteed to be competitive with the optimal policy. In addition to providing performance guarantees for both stochastic maximization and coverage, adaptive submodularity can be exploited to drastically speed up the greedy algorithm by using lazy evaluations. We illustrate the usefulness of the concept by giving several examples of adaptive submodular objectives arising in diverse AI applications including management of sensing resources, viral marketing and active learning. Proving adaptive submodularity for these problems allows us to recover existing results in these applications as special cases, improve approximation guarantees and handle natural generalizations.


Stochastic Cubic Regularization for Fast Nonconvex Optimization

arXiv.org Machine Learning

In this setting, we only have access to the stochastic function f(x; ฮพ), where the random variable ฮพ is sampled from an underlying distribution D. The task is to optimize the expected function f(x), which in general may be nonconvex. This framework covers a wide range of problems, including the offline setting where we minimize the empirical loss over a fixed amount of data, and the online setting where data arrives sequentially. One of the most prominent applications of stochastic optimization has been in large-scale statistics and machine learning problems, such as the optimization of deep neural networks. Classical analysis in nonconvex optimization only guarantees convergence to a first-order stationary point (i.e., a point x satisfying โ€– f(x)โ€– 0), which can be a local minimum, a local maximum, or a saddle point. This paper goes further, proposing an algorithm that escapes saddle points and converges to a local minimum.


Vprop: Variational Inference using RMSprop

arXiv.org Machine Learning

Many computationally-efficient methods for Bayesian deep learning rely on continuous optimization algorithms, but the implementation of these methods requires significant changes to existing code-bases. In this paper, we propose Vprop, a method for Gaussian variational inference that can be implemented with two minor changes to the off-the-shelf RMSprop optimizer. Vprop also reduces the memory requirements of Black-Box Variational Inference by half. We derive Vprop using the conjugate-computation variational inference method, and establish its connections to Newton's method, natural-gradient methods, and extended Kalman filters. Overall, this paper presents Vprop as a principled, computationally-efficient, and easy-to-implement method for Bayesian deep learning.


Saturating Splines and Feature Selection

arXiv.org Machine Learning

We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data using a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.


Dependent relevance determination for smooth and structured sparse regression

arXiv.org Machine Learning

In many problem settings, parameter vectors are not merely sparse, but dependent in such a way that non-zero coefficients tend to cluster together. We refer to this form of dependency as "region sparsity". Classical sparse regression methods, such as the lasso and automatic relevance determination (ARD), which model parameters as independent a priori, and therefore do not exploit such dependencies. Here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. Our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed Gaussian process to model the dependencies between the prior variances of regression weights. We combine this with a structured model of the prior variances of Fourier coefficients, which eliminates unnecessary high frequencies. The resulting prior encourages weights to be region-sparse in two different bases simultaneously. We develop Laplace approximation and Monte Carlo Markov Chain (MCMC) sampling to provide efficient inference for the posterior. Furthermore, a two-stage convex relaxation of the Laplace approximation approach is also provided to relax the inevitable non-convexity during the optimization. We finally show substantial improvements over comparable methods for both simulated and real datasets from brain imaging.


A Novel Stochastic Stratified Average Gradient Method: Convergence Rate and Its Complexity

arXiv.org Machine Learning

SGD (Stochastic Gradient Descent) is a popular algorithm for large scale optimization problems due to its low iterative cost. However, SGD can not achieve linear convergence rate as FGD (Full Gradient Descent) because of the inherent gradient variance. To attack the problem, mini-batch SGD was proposed to get a trade-off in terms of convergence rate and iteration cost. In this paper, a general CVI (Convergence-Variance Inequality) equation is presented to state formally the interaction of convergence rate and gradient variance. Then a novel algorithm named SSAG (Stochastic Stratified Average Gradient) is introduced to reduce gradient variance based on two techniques, stratified sampling and averaging over iterations that is a key idea in SAG (Stochastic Average Gradient). Furthermore, SSAG can achieve linear convergence rate of $\mathcal {O}((1-\frac{\mu}{8CL})^k)$ at smaller storage and iterative costs, where $C\geq 2$ is the category number of training data. This convergence rate depends mainly on the variance between classes, but not on the variance within the classes. In the case of $C\ll N$ ($N$ is the training data size), SSAG's convergence rate is much better than SAG's convergence rate of $\mathcal {O}((1-\frac{\mu}{8NL})^k)$. Our experimental results show SSAG outperforms SAG and many other algorithms.