Optimization
Semi-discrete optimization through semi-discrete optimal transport: a framework for neural architecture search
Trillos, Nicolas Garcia, Morales, Javier
In this paper we introduce a theoretical framework for semi-discrete optimization using ideas from optimal transport. Our primary motivation is in the field of deep learning, and specifically in the task of neural architecture search. With this aim in mind, we discuss the geometric and theoretical motivation for new techniques for neural architecture search (in the companion work \cite{practical}; we show that algorithms inspired by our framework are competitive with contemporaneous methods). We introduce a Riemannian like metric on the space of probability measures over a semi-discrete space $\mathbb{R}^d \times \mathcal{G}$ where $\mathcal{G}$ is a finite weighted graph. With such Riemmanian structure in hand, we derive formal expressions for the gradient flow of a relative entropy functional, as well as second order dynamics for the optimization of said energy. Then, with the aim of providing a rigorous motivation for the gradient flow equations derived formally we also consider an iterative procedure known as minimizing movement scheme (i.e., Implicit Euler scheme, or JKO scheme) and apply it to the relative entropy with respect to a suitable cost function. For some specific choices of metric and cost, we rigorously show that the minimizing movement scheme of the relative entropy functional converges to the gradient flow process provided by the formal Riemannian structure. This flow coincides with a system of reaction-diffusion equations on $\mathbb{R}^d$.
Algorithm for Computing Approximate Nash equilibrium in Continuous Games with Application to Continuous Blotto
Successful algorithms have been developed for computing Nash equilibrium in a variety of finite game classes. However, solving continuous games---in which the pure strategy space is (potentially uncountably) infinite---is far more challenging. Nonetheless, many real-world domains have continuous action spaces, e.g., where actions refer to an amount of time, money, or other resource that is naturally modeled as being real-valued as opposed to integral. We present a new algorithm for computing Nash equilibrium strategies in continuous games. In addition to two-player zero-sum games, our algorithm also applies to multiplayer games and games of imperfect information. We experiment with our algorithm on a continuous imperfect-information Blotto game, in which two players distribute resources over multiple battlefields. Blotto games have frequently been used to model national security scenarios and have also been applied to electoral competition and auction theory. Experiments show that our algorithm is able to quickly compute close approximations of Nash equilibrium strategies for this game.
Understanding Notions of Stationarity in Non-Smooth Optimization
Li, Jiajin, So, Anthony Man-Cho, Ma, Wing-Kin
Many contemporary applications in signal processing and machine learning give rise to structured non-convex non-smooth optimization problems that can often be tackled by simple iterative methods quite effectively. One of the keys to understanding such a phenomenon---and, in fact, one of the very difficult conundrums even for experts---lie in the study of "stationary points" of the problem in question. Unlike smooth optimization, for which the definition of a stationary point is rather standard, there is a myriad of definitions of stationarity in non-smooth optimization. In this article, we give an introduction to different stationarity concepts for several important classes of non-convex non-smooth functions and discuss the geometric interpretations and further clarify the relationship among these different concepts. We then demonstrate the relevance of these constructions in some representative applications and how they could affect the performance of iterative methods for tackling these applications.
Does the $\ell_1$-norm Learn a Sparse Graph under Laplacian Constrained Graphical Models?
Ying, Jiaxi, Cardoso, Josรฉ Vinรญcius de M., Palomar, Daniel P.
We consider the problem of learning a sparse graph under Laplacian constrained Gaussian graphical models. This problem can be formulated as a penalized maximum likelihood estimation of the precision matrix under Laplacian structural constraints. Like in the classical graphical lasso problem, recent works made use of the $\ell_1$-norm regularization with the goal of promoting sparsity in Laplacian structural precision matrix estimation. However, we find that the widely used $\ell_1$-norm is not effective in imposing a sparse solution in this problem. Through empirical evidence, we observe that the number of nonzero graph weights grows with the increase of the regularization parameter. From a theoretical perspective, we prove that a large regularization parameter will surprisingly lead to a fully connected graph. To address this issue, we propose a nonconvex estimation method by solving a sequence of weighted $\ell_1$-norm penalized sub-problems and prove that the statistical error of the proposed estimator matches the minimax lower bound. To solve each sub-problem, we develop a projected gradient descent algorithm that enjoys a linear convergence rate. Numerical experiments involving synthetic and real-world data sets from the recent COVID-19 pandemic and financial stock markets demonstrate the effectiveness of the proposed method. An open source $\mathsf{R}$ package containing the code for all the experiments is available at https://github.com/mirca/sparseGraph.
Counterfactual explanation of machine learning survival models
Kovalev, Maxim S., Utkin, Lev V.
A method for counterfactual explanation of machine learning survival models is proposed. One of the difficulties of solving the counterfactual explanation problem is that the classes of examples are implicitly defined through outcomes of a machine learning survival model in the form of survival functions. A condition that establishes the difference between survival functions of the original example and the counterfactual is introduced. This condition is based on using a distance between mean times to event. It is shown that the counterfactual explanation problem can be reduced to a standard convex optimization problem with linear constraints when the explained black-box model is the Cox model. For other black-box models, it is proposed to apply the well-known Particle Swarm Optimization algorithm. A lot of numerical experiments with real and synthetic data demonstrate the proposed method.
Covariance-engaged Classification of Sets via Linear Programming
Ren, Zhao, Jung, Sungkyu, Qiao, Xingye
Set classification aims to classify a set of observations as a whole, as opposed to classifying individual observations separately. To formally understand the unfamiliar concept of binary set classification, we first investigate the optimal decision rule under the normal distribution, which utilizes the empirical covariance of the set to be classified. We show that the number of observations in the set plays a critical role in bounding the Bayes risk. Under this framework, we further propose new methods of set classification. For the case where only a few parameters of the model drive the difference between two classes, we propose a computationally-efficient approach to parameter estimation using linear programming, leading to the Covariance-engaged LInear Programming Set (CLIPS) classifier. Its theoretical properties are investigated for both independent case and various (short-range and long-range dependent) time series structures among observations within each set. The convergence rates of estimation errors and risk of the CLIPS classifier are established to show that having multiple observations in a set leads to faster convergence rates, compared to the standard classification situation in which there is only one observation in the set. The applicable domains in which the CLIPS performs better than competitors are highlighted in a comprehensive simulation study. Finally, we illustrate the usefulness of the proposed methods in classification of real image data in histopathology.
On Dissipative Symplectic Integration with Applications to Gradient-Based Optimization
Franรงa, Guilherme, Jordan, Michael I., Vidal, Renรฉ
Recently, continuous dynamical systems have proved useful in providing conceptual and quantitative insights into gradient-based optimization, widely used in modern machine learning and statistics. An important question that arises in this line of work is how to discretize the system in such a way that its stability and rates of convergence are preserved. In this paper we propose a geometric framework in which such discretizations can be realized systematically, enabling the derivation of "rate-matching" optimization algorithms without the need for a discrete convergence analysis. More specifically, we show that a generalization of symplectic integrators to dissipative Hamiltonian systems is able to preserve continuous rates of convergence up to a controlled error. Moreover, such methods preserve a perturbed Hamiltonian despite the absence of a conservation law, extending key results of symplectic integrators to dissipative cases. Our arguments rely on a combination of backward error analysis with fundamental results from symplectic geometry.
Empirical Study on the Benefits of Multiobjectivization for Solving Single-Objective Problems
Steinhoff, Vera, Kerschke, Pascal, Grimme, Christian
Whether it is in the field of production, logistics, in medicine or biology; everywhere the global optimal solution or the set of global optimal solutions is sought. However, most real-world problems are of nonlinear nature and naturally multimodal which poses severe problems to global optimization. Multimodality, the existence of multiple (local) optima, is regarded as one of the biggest challenges for continuous single-objective problems [23]. A lot of algorithms get stuck searching for the global optimum or are requiring many function evaluations to escape local optima. One of the most popular strategies for dealing with multimodal problems are population-based methods like evolutionary algorithms due to their global search abilities [2]. In this paper we will examine another approach of coping with local traps, namely multiobjectivization. By transforming a single-objective into a multi-objective problem, we aim at exploiting the properties of multi-objective landscapes. So far, the characteristics of single-objective optimization problems have often been directly transferred to the multiobjective domain.
Optimizing AI for Teamwork
Bansal, Gagan, Nushi, Besmira, Kamar, Ece, Horvitz, Eric, Weld, Daniel S.
In many high-stakes domains such as criminal justice, finance, and healthcare, AI systems may recommend actions to a human expert responsible for final decisions, a context known as AI-advised decision making. When AI practitioners deploy the most accurate system in these domains, they implicitly assume that the system will function alone in the world. We argue that the most accurate AI team-mate is not necessarily the em best teammate; for example, predictable performance is worth a slight sacrifice in AI accuracy. So, we propose training AI systems in a human-centered manner and directly optimizing for team performance. We study this proposal for a specific type of human-AI team, where the human overseer chooses to accept the AI recommendation or solve the task themselves. To optimize the team performance we maximize the team's expected utility, expressed in terms of quality of the final decision, cost of verifying, and individual accuracies. Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the improvements in utility while being small and varying across datasets and parameters (such as cost of mistake), are real and consistent with our definition of team utility. We discuss the shortcoming of current optimization approaches beyond well-studied loss functions such as log-loss, and encourage future work on human-centered optimization problems motivated by human-AI collaborations.
The Effect of Optimization Methods on the Robustness of Out-of-Distribution Detection Approaches
Abdelzad, Vahdat, Czarnecki, Krzysztof, Salay, Rick
Deep neural networks (DNNs) have become the de facto learning mechanism in different domains. Their tendency to perform unreliably on out-of-distribution (OOD) inputs hinders their adoption in critical domains. Several approaches have been proposed for detecting OOD inputs. However, existing approaches still lack robustness. In this paper, we shed light on the robustness of OOD detection (OODD) approaches by revealing the important role of optimization methods. We show that OODD approaches are sensitive to the type of optimization method used during training deep models. Optimization methods can provide different solutions to a non-convex problem and so these solutions may or may not satisfy the assumptions (e.g., distributions of deep features) made by OODD approaches. Furthermore, we propose a robustness score that takes into account the role of optimization methods. This provides a sound way to compare OODD approaches. In addition to comparing several OODD approaches using our proposed robustness score, we demonstrate that some optimization methods provide better solutions for OODD approaches.