Optimization
Seam Carving: Using Dynamic Programming to implement Context-Aware Image Resizing in Python
The following problem appeared as an assignment in the Algorithm Course (COS 226) at Princeton University taught by Prof. Sedgewick. The following description of the problem is taken from the assignment itself. The first step is to calculate the energy of a pixel, which is a measure of its importance--the higher the energy, the less likely that the pixel will be included as part of a seam (as you will see in the next step). In this assignment, we shall use the dual-gradient energy function, which is described below. The next step is to find a vertical seam of minimum total energy.
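The two steps described above can be sketched in plain Python. This is a minimal illustration, not the assignment's reference solution: the function names `dual_gradient_energy` and `find_vertical_seam` are made up here, the energy is taken to be the sum of squared RGB central differences in x and y, and border pixels wrap around to the opposite edge. The snippet above does not specify either of those last two choices.

```python
def dual_gradient_energy(img):
    """img: list of rows, each row a list of (r, g, b) tuples.

    Assumption: energy = squared x-gradient + squared y-gradient,
    with wrap-around neighbours at the image border.
    """
    h, w = len(img), len(img[0])

    def sq_diff(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    energy = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = sq_diff(img[y][(x + 1) % w], img[y][(x - 1) % w])
            dy = sq_diff(img[(y + 1) % h][x], img[(y - 1) % h][x])
            energy[y][x] = dx + dy
    return energy


def find_vertical_seam(energy):
    """Return one column index per row for a minimum-energy vertical seam.

    Dynamic programme: cost[y][x] is the cheapest seam ending at (y, x),
    reached from one of the (up to) three pixels directly above it.
    """
    h, w = len(energy), len(energy[0])
    cost = [list(energy[0])]
    back = [[0] * w]  # back[y][x] = column chosen in row y - 1
    for y in range(1, h):
        row, ptr = [], []
        for x in range(w):
            candidates = range(max(x - 1, 0), min(x + 1, w - 1) + 1)
            best = min(candidates, key=lambda c: cost[y - 1][c])
            row.append(energy[y][x] + cost[y - 1][best])
            ptr.append(best)
        cost.append(row)
        back.append(ptr)

    # Start from the cheapest bottom pixel and follow the back-pointers up.
    x = min(range(w), key=lambda c: cost[-1][c])
    seam = [x]
    for y in range(h - 1, 0, -1):
        x = back[y][x]
        seam.append(x)
    seam.reverse()
    return seam


toy_energy = [[1, 4, 3],
              [5, 2, 6],
              [7, 8, 1]]
print(find_vertical_seam(toy_energy))
```

Removing the returned column from each row shrinks the image by one pixel in width. Repeating the energy and seam steps gives the resizing loop.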
Duality-free Methods for Stochastic Composition Optimization
Liu, Liu, Liu, Ji, Tao, Dacheng
We consider the composition optimization with two expected-value functions in the form of $\frac{1}{n}\sum\nolimits_{i = 1}^n F_i(\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x))+R(x)$, which formulates many important problems in statistical learning and machine learning such as solving Bellman equations in reinforcement learning and nonlinear embedding. Full gradient or classical stochastic gradient descent based optimization algorithms are unsuitable or computationally expensive to solve this problem due to the inner expectation $\frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x)$. We propose a duality-free based stochastic composition method that combines variance reduction methods to address the stochastic composition problem. We apply SVRG and SAGA based methods to estimate the inner function, and duality-free method to estimate the outer function. We prove the linear convergence rate not only for the convex composition problem, but also for the case that the individual outer functions are non-convex while the objective function is strongly-convex. We also provide the results of experiments that show the effectiveness of our proposed methods.
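To see why the inner average is the bottleneck, it may help to write out the gradient of the objective. This is a sketch under the assumption that every $F_i$, every $G_j$, and $R$ is differentiable. The name $f$ for the objective, the shorthand $G(x)$, and the Jacobian notation $\partial G_j$ are introduced here for illustration and do not come from the abstract.

$$
G(x) = \frac{1}{m}\sum\nolimits_{j = 1}^m G_j(x), \qquad
\nabla f(x) = \Big(\frac{1}{m}\sum\nolimits_{j = 1}^m \partial G_j(x)\Big)^{\top} \frac{1}{n}\sum\nolimits_{i = 1}^n \nabla F_i\big(G(x)\big) + \nabla R(x)
$$

Each $\nabla F_i$ is evaluated at the full average $G(x)$, so an exact gradient touches all $m$ inner functions. Replacing $G(x)$ with a single sampled $G_j(x)$ generally gives a biased estimate, because $\nabla F_i$ of an average is not the average of $\nabla F_i$ unless $\nabla F_i$ is affine.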
Social Cognitive Optimization (SCO): Project Portal - Xiao-Feng Xie, Ph.D.
Social Cognitive Optimization (SCO) is an optimization algorithm for solving the (constrained) numerical optimization problem. SCO is a simple agent-based model based on the observational learning mechanism in human social cognition. Related Information: Please find other related code and software in our Source Code Library. License information: SCO is free software; you can redistribute and/or modify it under the terms of Creative Commons Non-Commercial License 3.0. Problem to be solved: the (constrained) numerical optimization problem (NOP), also called the nonlinear programming problem.
Feature learning in feature-sample networks using multi-objective optimization
Verri, Filipe Alves Neto, Tinós, Renato, Zhao, Liang
Data and knowledge representation are fundamental concepts in machine learning. The quality of the representation impacts the performance of the learning model directly. Feature learning transforms or enhances raw data into structures that are effectively exploited by those models. In recent years, several works have used complex networks for data representation and analysis. However, no feature learning method has been proposed for this category of techniques. Here, we present an unsupervised feature learning mechanism that works on datasets with binary features. First, the dataset is mapped into a feature--sample network. Then, a multi-objective optimization process selects a set of new vertices to produce an enhanced version of the network. The new features depend on a nonlinear function of a combination of preexisting features. Effectively, the process projects the input data into a higher-dimensional space. To solve the optimization problem, we design two metaheuristics based on the lexicographic genetic algorithm and the improved strength Pareto evolutionary algorithm (SPEA2). We show that the enhanced network contains more information and can be exploited to improve the performance of machine learning methods. The advantages and disadvantages of each optimization strategy are discussed.
Curvature-aided Incremental Aggregated Gradient Method
Wai, Hoi-To, Shi, Wei, Nedic, Angelia, Scaglione, Anna
We propose a new algorithm for finite sum optimization which we call the curvature-aided incremental aggregated gradient (CIAG) method. Motivated by the problem of training a classifier for a $d$-dimensional problem, where the number of training data is $m$ and $m \gg d \gg 1$, the CIAG method seeks to accelerate incremental aggregated gradient (IAG) methods using aids from the curvature (or Hessian) information, while avoiding the evaluation of matrix inverses required by the incremental Newton (IN) method. Specifically, our idea is to exploit the incrementally aggregated Hessian matrix to trace the full gradient vector at every incremental step, therefore achieving an improved linear convergence rate over the state-of-the-art IAG methods. For strongly convex problems, the fast linear convergence rate requires the objective function to be close to quadratic, or the initial point to be close to the optimal solution. Importantly, we show that running one iteration of the CIAG method yields the same improvement to the optimality gap as running one iteration of the full gradient method, while the complexity is $O(d^2)$ for CIAG and $O(md)$ for the full gradient. Overall, the CIAG method strikes a balance between the high computational complexity of incremental Newton-type methods and the slow IAG method. Our numerical results support the theoretical findings and show that the CIAG method often converges with far fewer iterations than IAG, and requires much shorter running time than IN when the problem dimension is high.
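The idea of tracing the full gradient with stale gradients plus Hessian corrections can be illustrated on a scalar problem. The sketch below is one reading of the abstract, not the authors' code. It assumes cyclic component selection and a first-order Taylor correction $\nabla f_i(\theta_i) + \nabla^2 f_i(\theta_i)(\theta - \theta_i)$ around each component's last visited point $\theta_i$. The function name `ciag_scalar`, the toy components, and the step size are all hypothetical.

```python
import math


def ciag_scalar(grads, hessians, theta0, step, n_iters):
    """Scalar sketch of a curvature-aided incremental aggregated gradient loop.

    grads[i] and hessians[i] return the first and second derivative of
    component f_i. Only one component is re-evaluated per iteration.
    """
    m = len(grads)
    anchor = [theta0] * m                    # point where f_i was last expanded
    g = [grads[i](theta0) for i in range(m)]
    h = [hessians[i](theta0) for i in range(m)]
    b = sum(g[i] - h[i] * anchor[i] for i in range(m))  # aggregated offset
    H = sum(h)                                          # aggregated Hessian
    theta = theta0
    for k in range(n_iters):
        i = k % m  # assumption: cyclic selection
        # Swap component i's stale expansion for a fresh one at theta.
        b -= g[i] - h[i] * anchor[i]
        H -= h[i]
        anchor[i], g[i], h[i] = theta, grads[i](theta), hessians[i](theta)
        b += g[i] - h[i] * anchor[i]
        H += h[i]
        # b + H * theta approximates the full gradient at theta.
        theta -= step * (b + H * theta)
    return theta


# Toy strongly convex components: f_i(t) = 0.5 (t - c)^2 + log cosh(t - c).
centers = [-1.0, 0.0, 2.0]
grads = [lambda t, c=c: (t - c) + math.tanh(t - c) for c in centers]
hessians = [lambda t, c=c: 2.0 - math.tanh(t - c) ** 2 for c in centers]

theta_hat = ciag_scalar(grads, hessians, theta0=0.0, step=0.1, n_iters=200)
full_gradient = sum(gr(theta_hat) for gr in grads)
print(theta_hat, full_gradient)
```

In $d$ dimensions `H` becomes a $d \times d$ matrix and `H * theta` a matrix-vector product, which is where the $O(d^2)$ per-iteration cost quoted above comes from.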
Distributionally Ambiguous Optimization Techniques in Batch Bayesian Optimization
Rontsis, Nikitas, Osborne, Michael A., Goulart, Paul J.
We propose a novel, theoretically-grounded, acquisition function for batch Bayesian optimization informed by insights from distributionally ambiguous optimization. Our acquisition function is a lower bound on the well-known Expected Improvement function -- which requires a multi-dimensional Gaussian Expectation over a piecewise affine function -- and is computed by evaluating instead the best-case expectation over all probability distributions consistent with the same mean and variance as the original Gaussian distribution. Unlike alternative approaches including Expected Improvement, our proposed acquisition function avoids multi-dimensional integrations entirely, and can be computed exactly as the solution of a convex optimization problem in the form of a tractable semidefinite program (SDP). Moreover, we prove that the solution of this SDP also yields exact numerical derivatives, which enable efficient optimization of the acquisition function. Finally, it efficiently handles marginalized posteriors with respect to the Gaussian Process' hyperparameters. We demonstrate superior performance to heuristic alternatives and approximations of the intractable expected improvement, justifying this performance difference based on simple examples that break the assumptions of state-of-the-art methods.
Efficient Online Minimization for Low-Rank Subspace Clustering
Low-rank representation (LRR) has been a significant method for segmenting data that are generated from a union of subspaces. It is, however, known that solving the LRR program is challenging in terms of time complexity and memory footprint, in that the size of the nuclear norm regularized matrix is $n$-by-$n$ (where $n$ is the number of samples). In this paper, we thereby develop a fast online implementation of LRR that reduces the memory cost from $O(n^2)$ to $O(pd)$, with $p$ being the ambient dimension and $d$ being some estimated rank ($d < p \ll n$). The crux for this end is a non-convex reformulation of the LRR program, which pursues the basis dictionary that generates the (uncorrupted) observations. We build the theoretical guarantee that the sequence of the solutions produced by our algorithm converges to a stationary point of the empirical and the expected loss function asymptotically. Extensive experiments on synthetic and realistic datasets further substantiate that our algorithm is fast, robust and memory efficient.
Hierarchical State Abstractions for Decision-Making Problems with Computational Constraints
Larsson, Daniel T., Braun, Daniel, Tsiotras, Panagiotis
In this semi-tutorial paper, we first review the information-theoretic approach to account for the computational costs incurred during the search for optimal actions in a sequential decision-making problem. The traditional Markov decision process (MDP) framework ignores computational limitations while searching for optimal policies, essentially assuming that the acting agent is perfectly rational and aims for exact optimality. Using the free-energy, a variational principle is introduced that accounts not only for the value of a policy but also for the cost of finding this optimal policy. The solution of the variational equations arising from this formulation can be obtained using familiar Bellman-like value iterations from dynamic programming (DP) and the Blahut-Arimoto (BA) algorithm from rate distortion theory. Finally, we demonstrate the utility of the approach for generating hierarchies of state abstractions that can be used to best exploit the available computational resources.
First-order Methods Almost Always Avoid Saddle Points
Lee, Jason D., Panageas, Ioannis, Piliouras, Georgios, Simchowitz, Max, Jordan, Michael I., Recht, Benjamin
Saddle points have long been regarded as a major obstacle for non-convex optimization over continuous spaces. It is well understood that in many applications of interest, saddle points significantly outnumber local minima, which is especially problematic when the solutions associated with worst-case saddle points are considerably worse than those associated with worst-case local minima [12, 14, 34]. Moreover, it is not hard to construct examples where a worst-case initialization of gradient descent (or other first-order methods) provably converges to saddle points [30, Section 1.2.3]. The main message of our paper is that, under very mild regularity conditions, saddle points have little effect on the asymptotic behavior of first-order methods. Building on tools from the theory of dynamical systems, we generalize recent analysis of gradient descent [24, 33] to establish that a wide variety of first-order methods -- including gradient descent, proximal point algorithm, block coordinate descent, mirror descent -- avoid so-called "strict" saddle points for almost all initializations; that is, saddle points where the Hessian of the objective function admits at least one direction of negative curvature (see Definition 1). Our results provide a unified theoretical framework for analyzing the asymptotic behavior of a wide variety of classic optimization heuristics in non-convex optimization. Furthermore, we believe that furthering our understanding of the behavior and geometry of deterministic optimization techniques with random initialization can serve in the development of stochastic algorithms which improve upon their deterministic counterparts and achieve strong convergence-rate results; indeed, such insights have already led to significant improvements in modifying gradient descent to navigate saddle-point geometry [15, 21].
This paper significantly extends the special case of gradient descent dynamics developed in the conference proceedings of the authors [24, 33].
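A small numerical experiment can make the "almost all initializations" claim concrete. The toy objective $f(x, y) = \frac{1}{2}x^2 + \frac{1}{4}y^4 - \frac{1}{2}y^2$ is chosen here for illustration and does not come from the paper. It has a strict saddle at the origin (Hessian eigenvalues $1$ and $-1$) and minima at $(0, \pm 1)$. The step size and iteration count are likewise assumptions.

```python
import random


def grad(x, y):
    """Gradient of f(x, y) = x^2 / 2 + y^4 / 4 - y^2 / 2."""
    return x, y ** 3 - y


def gradient_descent(x, y, step=0.1, n_iters=2000):
    """Plain gradient descent with a fixed step size."""
    for _ in range(n_iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


# Random starting points: each run should end near one of the two minima.
rng = random.Random(0)
endpoints = [
    gradient_descent(rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(20)
]

# A start with y = 0 exactly lies on the saddle's stable manifold,
# a measure-zero set, and is the kind of initialization that fails.
on_stable_manifold = gradient_descent(0.5, 0.0)

print(endpoints[:3])
print(on_stable_manifold)
```

Starts with a tiny but nonzero $y$ still escape, because the $y$ coordinate grows by a factor of about $1.1$ per step near the saddle. They just take longer to do so.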
A complete characterization of optimal dictionaries for least squares representation
Sheriff, Mohammed Rayyan, Chatterjee, Debasish
Dictionaries are collections of vectors used for representations of elements in Euclidean spaces. While recent research on optimal dictionaries is focussed on providing sparse (i.e., $\ell_0$-optimal) representations, here we consider the problem of finding optimal dictionaries such that representations of samples of a random vector are optimal in an $\ell_2$-sense. For us, optimality of representation is equivalent to minimization of the average $\ell_2$-norm of the coefficients used to represent the random vector, with the lengths of the dictionary vectors being specified a priori. With the help of recent results on rank-$1$ decompositions of symmetric positive semidefinite matrices and the theory of majorization, we provide a complete characterization of $\ell_2$-optimal dictionaries. Our results are accompanied by polynomial time algorithms that construct $\ell_2$-optimal dictionaries from given data.