Optimization
Methods for Integrating Knowledge with the Three-Weight Optimization Algorithm for Hybrid Cognitive Processing
Derbinsky, Nate (Disney Research) | Bento, Jose (Disney Research) | Yedidia, Jonathan S. (Disney Research)
In this paper we consider optimization as an approach for quickly and flexibly developing hybrid cognitive capabilities that are efficient, scalable, and can exploit knowledge to improve solution speed and quality. In this context, we focus on the Three-Weight Algorithm, which aims to solve general optimization problems. We propose novel methods by which to integrate knowledge with this algorithm to improve expressiveness, efficiency, and scaling, and demonstrate these techniques on two example problems (Sudoku and circle packing).
Predictable Feature Analysis
Richthofer, Stefan, Wiskott, Laurenz
Every organism in an environment, whether biological, robotic or virtual, must be able to predict certain aspects of its environment in order to survive or perform whatever task is intended. It needs a model that is capable of estimating the consequences of possible actions, so that planning, control, and decision-making become feasible. For scientific purposes, such models are usually created in a problem specific manner using differential equations and other techniques from control- and system-theory. In contrast to that, we aim for an unsupervised approach that builds up the desired model in a self-organized fashion. Inspired by Slow Feature Analysis (SFA), our approach is to extract sub-signals from the input, that behave as predictable as possible. These "predictable features" are highly relevant for modeling, because predictability is a desired property of the needed consequence-estimating model by definition. In our approach, we measure predictability with respect to a certain prediction model. We focus here on the solution of the arising optimization problem and present a tractable algorithm based on algebraic methods which we call Predictable Feature Analysis (PFA). We prove that the algorithm finds the globally optimal signal, if this signal can be predicted with low error. To deal with cases where the optimal signal has a significant prediction error, we provide a robust, heuristically motivated variant of the algorithm and verify it empirically. Additionally, we give formal criteria a prediction-model must meet to be suitable for measuring predictability in the PFA setting and also provide a suitable default-model along with a formal proof that it meets these criteria.
Exploiting correlation and budget constraints in Bayesian multi-armed bandit optimization
Hoffman, Matthew W., Shahriari, Bobak, de Freitas, Nando
We address the problem of finding the maximizer of a nonlinear smooth function, that can only be evaluated point-wise, subject to constraints on the number of permitted function evaluations. This problem is also known as fixed-budget best arm identification in the multi-armed bandit literature. We introduce a Bayesian approach for this problem and show that it empirically outperforms both the existing frequentist counterpart and other Bayesian optimization methods. The Bayesian approach places emphasis on detailed modelling, including the modelling of correlations among the arms. As a result, it can perform well in situations where the number of arms is much larger than the number of allowed function evaluation, whereas the frequentist counterpart is inapplicable. This feature enables us to develop and deploy practical applications, such as automatic machine learning toolboxes. The paper presents comprehensive comparisons of the proposed approach, Thompson sampling, classical Bayesian optimization techniques, more recent Bayesian bandit approaches, and state-of-the-art best arm identification methods. This is the first comparison of many of these methods in the literature and allows us to examine the relative merits of their different features.
Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
We investigate two new optimization problems -- minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [14, 37] which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms.
The Squared-Error of Generalized LASSO: A Precise Analysis
Oymak, Samet, Thrampoulidis, Christos, Hassibi, Babak
We consider the problem of estimating an unknown signal $x_0$ from noisy linear observations $y = Ax_0 + z\in R^m$. In many practical instances, $x_0$ has a certain structure that can be captured by a structure inducing convex function $f(\cdot)$. For example, $\ell_1$ norm can be used to encourage a sparse solution. To estimate $x_0$ with the aid of $f(\cdot)$, we consider the well-known LASSO method and provide sharp characterization of its performance. We assume the entries of the measurement matrix $A$ and the noise vector $z$ have zero-mean normal distributions with variances $1$ and $\sigma^2$ respectively. For the LASSO estimator $x^*$, we attempt to calculate the Normalized Square Error (NSE) defined as $\frac{\|x^*-x_0\|_2^2}{\sigma^2}$ as a function of the noise level $\sigma$, the number of observations $m$ and the structure of the signal. We show that, the structure of the signal $x_0$ and choice of the function $f(\cdot)$ enter the error formulae through the summary parameters $D(cone)$ and $D(\lambda)$, which are defined as the Gaussian squared-distances to the subdifferential cone and to the $\lambda$-scaled subdifferential, respectively. The first LASSO estimator assumes a-priori knowledge of $f(x_0)$ and is given by $\arg\min_{x}\{{\|y-Ax\|_2}~\text{subject to}~f(x)\leq f(x_0)\}$. We prove that its worst case NSE is achieved when $\sigma\rightarrow 0$ and concentrates around $\frac{D(cone)}{m-D(cone)}$. Secondly, we consider $\arg\min_{x}\{\|y-Ax\|_2+\lambda f(x)\}$, for some $\lambda\geq 0$. This time the NSE formula depends on the choice of $\lambda$ and is given by $\frac{D(\lambda)}{m-D(\lambda)}$. We then establish a mapping between this and the third estimator $\arg\min_{x}\{\frac{1}{2}\|y-Ax\|_2^2+ \lambda f(x)\}$. Finally, for a number of important structured signal classes, we translate our abstract formulae to closed-form upper bounds on the NSE.
Stochastic Dual Coordinate Ascent with Alternating Direction Multiplier Method
We propose a new stochastic dual coordinate ascent technique that can be applied to a wide range of regularized learning problems. Our method is based on Alternating Direction Multiplier Method (ADMM) to deal with complex regularization functions such as structured regularizations. Although the original ADMM is a batch method, the proposed method offers a stochastic update rule where each iteration requires only one or few sample observations. Moreover, our method can naturally afford mini-batch update and it gives speed up of convergence. We show that, under mild assumptions, our method converges exponentially. The numerical experiments show that our method actually performs efficiently.
Convergence Properties of Kronecker Graphical Lasso Algorithms
Tsiligkaridis, Theodoros, Hero, Alfred O. III, Zhou, Shuheng
Covariance estimation is a problem of great interest in many different disciplines, including machine learning, signal processing, economics and bioinformatics. In many applications the number of variables is very large, e.g., in the tens or hundreds of thousands, leading to a number of covariance parameters that greatly exceeds the number of observations. To address this problem constraints are frequently imposed on the covariance to reduce the number of parameters in the model. For example, the Glasso model of Yuan and Lin [2] and Banerjee et al [3] imposes sparsity constraints on the covariance. The Kronecker product model of Dutilleul [4] and Werner et al [5] assumes that the covariance can be represented as the Kronecker product of two lower dimensional covariance matrices. The transposable regularized covariance model of Allen et al [1] imposes a combination of sparsity and Kronecker product form on the covariance. When there is no missing data, an extension of the alternating optimization algorithm of [4], [5], called the flip flop (FF) algorithm, can be applied to estimate the parameters of this combined sparse and Kronecker product model. In this report we call this algorithm the Kronecker Glasso (KGlasso) and we thoroughly analyze convergence of the algorithm in the high dimensional setting.
Nonlinear unmixing of hyperspectral images using a semiparametric model and spatial regularization
Chen, Jie, Richard, Cรฉdric, Hero, Alfred O. III
Incorporating spatial information into hyperspectral unmixing procedures has been shown to have positive effects, due to the inherent spatial-spectral duality in hyperspectral scenes. Current research works that consider spatial information are mainly focused on the linear mixing model. In this paper, we investigate a variational approach to incorporating spatial correlation into a nonlinear unmixing procedure. A nonlinear algorithm operating in reproducing kernel Hilbert spaces, associated with an $\ell_1$ local variation norm as the spatial regularizer, is derived. Experimental results, with both synthetic and real data, illustrate the effectiveness of the proposed scheme.
Safe and Efficient Screening For Sparse Support Vector Machine
Assume that X E Him" is a data set containing 71 samples, X: (x1, . . . Let w*()\) be the optimal solution of Eq. (1) All the features With nonzero values in "w" (A) are called active The Lagrangian multiplier [1] of the problem defined in Eq. (1) is: The Eq. (2) can be reformulated as: Since the problem defined in Eq. (1) is convex and the optimal value of the In the preceding equation i'j: ij, and Y is a diagonal matrix and YM: When the input is given, it can be obtained in a closed form. The Ll--regularized L2--Loss SVM in Eq. (1) can be rewritten in an uncon-- Eq. (22) shows that the necessary condition for a feature f to be active in the To bound value of 0Tf' 7 we need to first construct a closed convex set K that We first study how to construct the convex set K. In the following, we construct a closed convex set K based on Eq. (19) and The proof of this proposition can be found in [2]. Let 01 and 02 be the optimal solutions of the problem defined in Eq. (19) for Assume that /\1 A2, and 01 is known. In the preceding equations, 01, A1, and /\2 are known. Figure 1 shows an example of the K in a two dimensional space. And K is indicated by the shaded area. It is indicated by the shaded area. Besides the n dimensional hyperball defined in Eq. (32), it is possible to By applying Proposition 6.1 to the objective function defined in Eq. (33) for 01, Let t::--: Z 0. By substituting 0: 02 and 0: 01 into Eq. Eq. (35)7 respectively, and then combining the two obtained equations7 the As the value of t change from 0 to 007 Eq. (36) generates a series of hyperball. Eq. (36) reaches it minimum when, The theorem can be proved by minimizing the 7" defined in Eq. (36).
Structured Optimal Transmission Control in Network-coded Two-way Relay Channels
Ding, Ni, Sadeghi, Parastoo, Kennedy, Rodney A.
This paper considers a transmission control problem in network-coded two-way relay channels (NC-TWRC), where the relay buffers random symbol arrivals from two users, and the channels are assumed to be fading. The problem is modeled by a discounted infinite horizon Markov decision process (MDP). The objective is to find a transmission control policy that minimizes the symbol delay, buffer overflow and transmission power consumption and error rate simultaneously and in the long run. By using the concepts of submodularity, multimodularity and L-natural convexity, we study the structure of the optimal policy searched by dynamic programming (DP) algorithm. We show that the optimal transmission policy is nondecreasing in queue occupancies or/and channel states under certain conditions such as the chosen values of parameters in the MDP model, channel modeling method, modulation scheme and the preservation of stochastic dominance in the transitions of system states. The results derived in this paper can be used to relieve the high complexity of DP and facilitate real-time control.