Optimization
Variance Reduced methods for Non-convex Composition Optimization
Liu, Liu, Liu, Ji, Tao, Dacheng
This composition between two finite-sum structures 1 n n i 1 F i ( 1 m m j 1 G j (x)) arises in many machine learning applications such as reinforcement learning [1, 2, 3] and nonlinear embedding [4]. For example, stochastic neighbor embedding (SNE) [4] is a powerful approach to map data from a high dimensional space to a low dimensional space. Let{ z i} n i 1 and { x i} n i 1 denote the representation ofn data points in the high dimensional space and the low dimensional space, respectively. The objective is to pursue a low dimensional representation{ x i} n i 1, such that the distribution in the low dimensional space is as close to the distribution in the high dimensional space as possible. This problem is essentially a composition optimization problem: min x t i p i t log p i t q i t, (2) where p i t exp( ‖ z t z i‖ 2 / 2σ 2 i) j 6 t exp( ‖ z t z j ‖ 2 /2σ 2 i), q i t exp( ‖ x t x i‖ 2) j 6 t exp( ‖ x t x j ‖ 2), lliu8101@uni.sydney.edu.au
Should You Derive, Or Let the Data Drive? An Optimization Framework for Hybrid First-Principles Data-Driven Modeling
Lam, Remi R., Horesh, Lior, Avron, Haim, Willcox, Karen E.
Mathematical models are used extensively for diverse tasks including analysis, optimization, and decision making. Frequently, those models are principled but imperfect representations of reality. This is either due to incomplete physical description of the underlying phenomenon (simplified governing equations, defective boundary conditions, etc.), or due to numerical approximations (discretization, linearization, round-off error, etc.). Model misspecification can lead to erroneous model predictions, and respectively suboptimal decisions associated with the intended end-goal task. To mitigate this effect, one can amend the available model using limited data produced by experiments or higher fidelity models. A large body of research has focused on estimating explicit model parameters. This work takes a different perspective and targets the construction of a correction model operator with implicit attributes. We investigate the case where the end-goal is inversion and illustrate how appropriate choices of properties imposed upon the correction and corrected operator lead to improved end-goal insights.
FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods
Cheng, Xiang, Roosta-Khorasani, Farbod, Palombo, Stefan, Bartlett, Peter L., Mahoney, Michael W.
We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications.
GPflowOpt: A Bayesian Optimization Library using TensorFlow
Knudde, Nicolas, van der Herten, Joachim, Dhaene, Tom, Couckuyt, Ivo
A novel Python framework for Bayesian optimization known as GPflowOpt is introduced. The package is based on the popular GPflow library for Gaussian processes, leveraging the benefits of TensorFlow including automatic differentiation, parallelization and GPU computations for Bayesian optimization. Design goals focus on a framework that is easy to extend with custom acquisition functions and models. The framework is thoroughly tested and well documented, and provides scalability. The current released version of GPflowOpt includes some standard single-objective acquisition functions, the state-of-the-art max-value entropy search, as well as a Bayesian multi-objective approach. Finally, it permits easy use of custom modeling strategies implemented in GPflow.
The Future Fabric of Data Analysis Quanta Magazine
When subatomic particles smash together at the Large Hadron Collider in Switzerland, they create showers of new particles whose signatures are recorded by four detectors. The LHC captures 5 trillion bits of data -- more information than all of the world's libraries combined -- every second. After the judicious application of filtering algorithms, more than 99 percent of those data are discarded, but the four experiments still produce a whopping 25 petabytes (25 1015 bytes) of data per year that must be stored and analyzed. That is a scale far beyond the computing resources of any single facility, so the LHC scientists rely on a vast computing grid of 160 data centers around the world, a distributed network that is capable of transferring as much as 10 gigabytes per second at peak performance. The LHC's approach to its big data problem reflects just how dramatically the nature of computing has changed over the last decade.
Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization
Alacaoglu, Ahmet, Tran-Dinh, Quoc, Fercoq, Olivier, Cevher, Volkan
We propose a new randomized coordinate descent method for a convex optimization template with broad applications. Our analysis relies on a novel combination of four ideas applied to the primal-dual gap function: smoothing, acceleration, homotopy, and coordinate descent with non-uniform sampling. As a result, our method features the first convergence rate guarantees among the coordinate descent methods, that are the best-known under a variety of common structure assumptions on the template. We provide numerical evidence to support the theoretical results with a comparison to state-of-the-art algorithms.
Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls
Allen-Zhu, Zeyuan, Hazan, Elad, Hu, Wei, Li, Yuanzhi
We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation ($1$-SVD) in Frank-Wolfe with a top-$k$ singular-vector computation ($k$-SVD), which can be done by repeatedly applying $1$-SVD $k$ times. Alternatively, our algorithm can be viewed as a rank-$k$ restricted version of projected gradient descent. We show that our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most $k$. This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants.
Dictionary-based Tensor Canonical Polyadic Decomposition
Cohen, Jérémy E., Gillis, Nicolas
To ensure interpretability of extracted sources in tensor decomposition, we introduce in this paper a dictionarybased tensor canonical polyadic decomposition which enforces one factor to belong exactly to a known dictionary. A new formulation of sparse coding is proposed which enables high dimensional tensors dictionary-based canonical polyadic decomposition. The benefits of using a dictionary in tensor decomposition models are explored both in terms of parameter identifiability and estimation accuracy. Performances of the proposed algorithms are evaluated on the decomposition of simulated data and the unmixing of hyperspectral images. Index Terms tensor, multiway analysis, sparse coding, constrained optimization, spectral unmixing.
Multi-Period Flexibility Forecast for Low Voltage Prosumers
Pinto, Rui, Bessa, Ricardo, Matos, Manuel
Near-future electric distribution grids operation will have to rely on demand-side flexibility, both by implementation of demand response strategies and by taking advantage of the intelligent management of increasingly common small-scale energy storage. The Home energy management system (HEMS), installed at low voltage residential clients, will play a crucial role on the flexibility provision to both system operators and market players like aggregators. Modeling and forecasting multi-period flexibility from residential prosumers, such as battery storage and electric water heater, while complying with internal constraints (comfort levels, data privacy) and uncertainty is a complex task. This papers describes a computational method that is capable of efficiently learn and define the feasibility flexibility space from controllable resources connected to a HEMS. An Evolutionary Particle Swarm Optimization (EPSO) algorithm is adopted and reshaped to derive a set of feasible temporal trajectories for the residential net-load, considering storage, flexible appliances, and predefined costumer preferences, as well as load and photovoltaic (PV) forecast uncertainty. A support vector data description (SVDD) algorithm is used to build models capable of classifying feasible and non-feasible HEMS operating trajectories upon request from an optimization/control algorithm operated by a DSO or market player.
Gaussian Lower Bound for the Information Bottleneck Limit
Painsky, Amichai, Tishby, Naftali
The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables, with respect to the target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and its predictive power. Specifically, it is achieved by minimizing the level of mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is shown to be in general NP hard. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) is known to obtain an analytical closed form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its "Gaussian part", on which we apply the GIB. This embedding provides an efficient (and practical) representation of any arbitrary data-set (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This allows a fundamental limit for our ability to Gaussianize arbitrary data-sets and solve complex problems by linear methods.