Optimization
Google Brain Paper Demystifies Learned Optimizers
Learned optimizers are algorithms that can be trained to solve optimization problems. Although learned optimizers can outperform baseline optimizers in restricted settings, the ML research community understands remarkably little about their inner workings or why they work as well as they do. In a paper currently under review for ICLR 2021, a Google Brain research team attempts to shed some light on the matter. The researchers explain that optimization algorithms can be considered the basis of modern machine learning. A popular research area in recent years has focused on learning optimization algorithms by directly parameterizing and training an optimizer on a distribution of tasks. Research on learned optimizers aims to replace the baseline "hand-designed" optimizers with a parametric optimizer trained on a set of tasks, which can then be applied more generally.
Convergence Properties of Stochastic Hypergradients
Grazzi, Riccardo, Pontil, Massimiliano, Salzo, Saverio
Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems in the design of optimization algorithms for bilevel optimization is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. We provide iteration complexity bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. Preliminary numerical experiments support our theoretical analysis.
Convex Optimization with an Interpolation-based Projection and its Application to Deep Learning
Akrour, Riad, Atamna, Asma, Peters, Jan
Convex optimizers have known many applications as differentiable layers within deep neural architectures. One application of these convex layers is to project points into a convex set. However, both forward and backward passes of these convex layers are significantly more expensive to compute than those of a typical neural network. We investigate in this paper whether an inexact, but cheaper projection, can drive a descent algorithm to an optimum. Specifically, we propose an interpolation-based projection that is computationally cheap and easy to compute given a convex, domain defining, function. We then propose an optimization algorithm that follows the gradient of the composition of the objective and the projection and prove its convergence for linear objectives and arbitrary convex and Lipschitz domain defining inequality constraints. In addition to the theoretical contributions, we demonstrate empirically the practical interest of the interpolation projection when used in conjunction with neural networks in a reinforcement learning and a supervised learning setting.
Improving Offline Contextual Bandits with Distributional Robustness
Sakhi, Otmane, Faury, Louis, Vasile, Flavian
This paper extends the Distributionally Robust Optimization (DRO) approach for offline contextual bandits. Specifically, we leverage this framework to introduce a convex reformulation of the Counterfactual Risk Minimization principle. Besides relying on convex programs, our approach is compatible with stochastic optimization, and can therefore be readily adapted tothe large data regime. Our approach relies on the construction of asymptotic confidence intervals for offline contextual bandits through the DRO framework. By leveraging known asymptotic results of robust estimators, we also show how to automatically calibrate such confidence intervals, which in turn removes the burden of hyper-parameter selection for policy optimization. We present preliminary empirical results supporting the effectiveness of our approach.
A differential evolution-based optimization tool for interplanetary transfer trajectory design
Zuo, Mingcheng, Dai, Guangming, Peng, Lei, Tang, Zhe
The difficulty of optimizing interplanetary trajectory design problem is caused by extremely sensitive and highly nonlinear search space [3]. Figure 1 shows a highly nonlinear search space of Messenger (full) problem around the current known best solution. The real interplanetary exploration mission usually has to consider two basic objectives: minimum mission period and minimum fuel consumption. However, even though only one objective function is considered for mission optimization, the optimum is still very hard to be found. In 2005, the GTOP database published by ESA includes some interplanetary transfer trajectory design problems, which can be simply divided by multiple gravity-assist (MGA) problems and multiple gravity assists with deep space manoeuvre (MGA-1DSM) problems. In the GTOP database, MGA problem includes Cassini1 and GTOC1; MGA-1DSM problem includes Sagas, Rosetta, Cassini2, Messenger (reduced) and Messenger (full). With the given search space and special boundaries on each variable, the GTOP problems have shown big challenges on global optimization. Three aspects of the challenge can be described as follows: - The attraction basin of global optimum is rather smaller than those of other local optima, and unluckily, this optimal basin is usually hidden in the "not promising" local spaces. During the evolutionary process, it is so difficult to locate these not promising local spaces when considering the population motivation based on greedy gradient information. This causes a result that the general swarm intelligent algorithms can only find sub-optimal results, and the difference between global optimum and found local optima is not acceptable.
How Many Samples is a Good Initial Point Worth in Low-rank Matrix Recovery?
Zhang, Gavin, Zhang, Richard Y.
Given a sufficiently large amount of labeled data, the non-convex low-rank matrix recovery problem contains no spurious local minima, so a local optimization algorithm is guaranteed to converge to a global minimum starting from any initial guess. However, the actual amount of data needed by this theoretical guarantee is very pessimistic, as it must prevent spurious local minima from existing anywhere, including at adversarial locations. In contrast, prior work based on good initial guesses have more realistic data requirements, because they allow spurious local minima to exist outside of a neighborhood of the solution. In this paper, we quantify the relationship between the quality of the initial guess and the corresponding reduction in data requirements. Using the restricted isometry constant as a surrogate for sample complexity, we compute a sharp "threshold" number of samples needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. Optimizing the threshold over regions of the landscape, we see that for initial points around the ground truth, a linear improvement in the quality of the initial guess amounts to a constant factor improvement in the sample complexity.
A Knowledge Representation Approach to Automated Mathematical Modelling
Ofoghi, Bahadorreza, Mak, Vicky, Yearwood, John
Mathematicians formulate complex mathematical models based on user requirements to solve a diverse range of problems in different domains. These models are, in most cases, represented through several mathematical equations and constraints. This modelling task comprises several time-intensive processes that require both mathematical expertise and (problem) domain knowledge. In an attempt to automate these processes, we have developed an ontology for Mixed Integer Linear Programming (MILP) problems to formulate expert mathematician knowledge and in this paper, we show how this new ontology can be utilized for modelling a relatively straightforward MILP problem, a Machine Scheduling example. We also show that more complex MILP problems, such as the Asymmetric Travelling Salesman Problem (ATSP), however, are not readily amenable to simple elicitation of user requirements and the utilization of the proposed mathematical model ontology. Therefore, an automatic mathematical modelling framework is proposed for such complex MILP problems, which includes a problem (requirement) elicitation module connected to a model extraction module through a translation engine that bridges between the non-expert problem domain and the expert mathematical model domain. This framework is argued to have the necessary components to effectively tackle the automation of modelling task of the more intricate MILP problems such as the ATSP.
Stochastic Hill Climbing in Python from Scratch
Stochastic Hill climbing is an optimization algorithm. It makes use of randomness as part of the search process. This makes the algorithm appropriate for nonlinear objective functions where other local search algorithms do not operate well. It is also a local search algorithm, meaning that it modifies a single solution and searches the relatively local area of the search space until the local optima is located. This means that it is appropriate on unimodal optimization problems or for use after the application of a global optimization algorithm.
Robust multi-stage model-based design of optimal experiments for nonlinear estimation
Mukkula, Anwesh Reddy Gottu, Mateáš, Michal, Fikar, Miroslav, Paulen, Radoslav
Recently it has also become increasingly important in marketing, medicine and political sciences. Process systems engineering community adopts mathematical models successfully in various endeavors such as product and plant design, control system design, operations optimization, etc. (Pantelides and Renfro, 2013; Fung et al., 2016; Safdarnejad et al., 2016). A mathematical model is usually an abstract representation of a true system via sets of equations (algebraic, ordinary differential or partial differential), inequalities (e.g., a range of model validity), and logical conditions. Model development is usually divided into three major steps a) identification of the model structure, b) design and realization of the experiments, and c) estimation of the unknown parameters. In the latter phase, one often realizes maximum-likelihood estimation via least-squares methodology as he/she assumes--knowingly or not--that the measurement error present in the measured data is statistically distributed as a white Gaussian noise. Once the parameter estimates are known, the experimenter commonly determines the quality of the obtained model. This can be done either by using some validation data--if available--or via assessing the joint-confidence regions of the estimated parameters (Beale, 1960; Bates and Watts, 1988; Rooney and Biegler, 2001; Seber and Wild, 2003).
Non-local Optimization: Imposing Structure on Optimization Problems by Relaxation
Müller, Nils, Glasmachers, Tobias
In stochastic optimization, particularly in evolutionary computation and reinforcement learning, the optimization of a function $f: \Omega \to \mathbb{R}$ is often addressed through optimizing a so-called relaxation $\theta \in \Theta \mapsto \mathbb{E}_\theta(f)$ of $f$, where $\Theta$ resembles the parameters of a family of probability measures on $\Omega$. We investigate the structure of such relaxations by means of measure theory and Fourier analysis, enabling us to shed light on the success of many stochastic optimization methods. The main structural traits we derive, and that allow fast and reliable optimization of relaxations, are the resemblance of optimal values of $f$, Lipschitzness of gradients, and convexity.