Optimization
Federated Learning with Compression: Unified Analysis and Sharp Guarantees
Haddadpour, Farzin, Kamani, Mohammad Mahdi, Mokhtari, Aryan, Mahdavi, Mehrdad
In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends to deal with the communication overhead of federated algorithms are \emph{gradient compression} and \emph{local computation with periodic communication}. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodical compressed (quantized or sparsified) communication and analyze their convergence properties in both homogeneous and heterogeneous local data distributions settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both \emph{strongly convex} and \emph{non-convex} objective functions. To mitigate data heterogeneity, we introduce a \emph{local gradient tracking} scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and nonconvex settings. We complement our theoretical results and demonstrate the effectiveness of our proposed methods by several experiments on real-world datasets.
Learning to search efficiently for causally near-optimal treatments
Håkansson, Samuel, Lindblom, Viktor, Gottesman, Omer, Johansson, Fredrik D.
Finding an effective medical treatment often requires a search by trial and error. Making this search more efficient by minimizing the number of unnecessary trials could lower both costs and patient suffering. We formalize this problem as learning a policy for finding a near-optimal treatment in a minimum number of trials using a causal inference framework. We give a model-based dynamic programming algorithm which learns from observational data while being robust to unmeasured confounding. To reduce time complexity, we suggest a greedy algorithm which bounds the near-optimality constraint. The methods are evaluated on synthetic and real-world healthcare data and compared to model-free reinforcement learning. We find that our methods compare favorably to the model-free baseline while offering a more transparent trade-off between search time and treatment efficacy.
BOSH: Bayesian Optimization by Sampling Hierarchically
Moss, Henry B., Leslie, David S., Rayson, Paul
Deployments of Bayesian Optimization (BO) for functions with stochastic evaluations, such as parameter tuning via cross validation and simulation optimization, typically optimize an average of a fixed set of noisy realizations of the objective function. However, disregarding the true objective function in this manner finds a high-precision optimum of the wrong function. To solve this problem, we propose Bayesian Optimization by Sampling Hierarchically (BOSH), a novel BO routine pairing a hierarchical Gaussian process with an information-theoretic framework to generate a growing pool of realizations as the optimization progresses. We demonstrate that BOSH provides more efficient and higher-precision optimization than standard BO across synthetic benchmarks, simulation optimization, reinforcement learning and hyper-parameter tuning tasks.
A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton
Liu, Risheng, Mu, Pan, Yuan, Xiaoming, Zeng, Shangzhi, Zhang, Jin
In recent years, a variety of gradient-based first-order methods have been developed to solve bi-level optimization problems for learning applications. However, theoretical guarantees of these existing approaches heavily rely on the simplification that for each fixed upper-level variable, the lower-level solution must be a singleton (a.k.a., Lower-Level Singleton, LLS). In this work, we first design a counter-example to illustrate the invalidation of such LLS condition. Then by formulating BLPs from the view point of optimistic bi-level and aggregating hierarchical objective information, we establish Bi-level Descent Aggregation (BDA), a flexible and modularized algorithmic framework for generic bi-level optimization. Theoretically, we derive a new methodology to prove the convergence of BDA without the LLS condition. Our investigations also demonstrate that BDA is indeed compatible to a verify of particular first-order computation modules. Additionally, as an interesting byproduct, we also improve these conventional first-order bi-level schemes (under the LLS simplification). Particularly, we establish their convergences with weaker assumptions. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed BDA for different tasks, including hyper-parameter optimization and meta learning.
Adversarial Example Games
Bose, Avishek Joey, Gidel, Gauthier, Berrard, Hugo, Cianflone, Andre, Vincent, Pascal, Lacoste-Julien, Simon, Hamilton, William L.
The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks, in order to guide the development of safeguards against them. It includes attack methods in the highly challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior works in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), and the field currently lacks a firm theoretical basis for understanding transferability in adversarial attacks. In this work, we address this gap and lay the theoretical foundations for crafting transferable adversarial examples to entire function classes. We introduce Adversarial Examples Games (AEG), a novel framework that models adversarial examples as two-player min-max games between an attack generator and a representative classifier. We prove that the saddle point of an AEG game corresponds to a generating distribution of adversarial examples against entire function classes. Training the generator only requires the ability to optimize a representative classifier from a given hypothesis class, enabling BlackBox transfer to unseen classifiers from the same class. We demonstrate the efficacy of our approach on the MNIST and CIFAR-10 datasets against both undefended and robustified models, achieving competitive performance with state-of-the-art BlackBox transfer approaches.
Bayesian Coresets: An Optimization Perspective
Zhang, Jacky Y., Khanna, Rajiv, Kyrillidis, Anastasios, Koyejo, Oluwasanmi
Bayesian coresets have emerged as a promising approach for scalable Bayesian inference [22, 12, 13, 11]. The key idea is to select a (weighted) subset of the data such that posterior inference using the selected subset closely approximates posterior inference using the full dataset. This creates a tradeoff, where using Bayesian coresets as opposed to the full dataset exchanges approximation accuracy for computational speedups. We study Bayesian coresets as they are easy to implement, effective in practice, and come with useful theoretical guarantees that relate the coreset size with the approximation quality. The main technical challenge in the Bayesian coreset problem lies in handling the combinatorial constraints - we desire to select a few data points out of many as the coreset. The state of the art approaches rely on two ideas: convexification and greedy methods. In convexification [13], the sparsity constraint - i.e., selection of k data samples - is relaxed into a convex l
Sparse Randomized Shortest Paths Routing with Tsallis Divergence Regularization
Leleux, Pierre, Courtain, Sylvain, Guex, Guillaume, Saerens, Marco
This work elaborates on the important problem of (1) designing optimal randomized routing policies for reaching a target node t from a source note s on a weighted directed graph G and (2) defining distance measures between nodes interpolating between the least cost (based on optimal movements) and the commute-cost (based on a random walk on G), depending on a temperature parameter T. To this end, the randomized shortest path formalism (RSP, [2,99,124]) is rephrased in terms of Tsallis divergence regularization, instead of Kullback-Leibler divergence. The main consequence of this change is that the resulting routing policy (local transition probabilities) becomes sparser when T decreases, therefore inducing a sparse random walk on G converging to the least-cost directed acyclic graph when T tends to 0. Experimental comparisons on node clustering and semi-supervised classification tasks show that the derived dissimilarity measures based on expected routing costs provide state-of-the-art results. The sparse RSP is therefore a promising model of movements on a graph, balancing sparse exploitation and exploration in an optimal way.
Can Global Optimization Strategy Outperform Myopic Strategy for Bayesian Parameter Estimation?
Bayesian adaptive inference is widely used in psychophysics to estimate psychometric parameters. Most applications used myopic one-step ahead strategy which only optimizes the immediate utility. The widely held expectation is that global optimization strategies that explicitly optimize over some horizon can largely improve the performance of the myopic strategy. With limited studies that compared myopic and global strategies, the expectation was not challenged and researchers are still investing heavily to achieve global optimization. Is that really worthwhile? This paper provides a discouraging answer based on experimental simulations comparing the performance improvement and computation burden between global and myopic strategies in parameter estimation of multiple models. The finding is that the added horizon in global strategies has negligible contributions to the improvement of optimal global utility other than the most immediate next steps (of myopic strategy). Mathematical recursion is derived to prove that the contribution of utility improvement of each added horizon step diminishes fast as that step moves further into the future.
A Survey on Recent Progress in the Theory of Evolutionary Algorithms for Discrete Optimization
Doerr, Benjamin, Neumann, Frank
The theory of evolutionary computation for discrete search spaces has made a lot of progress during the last ten years. This survey summarizes some of the most important recent results obtained in this research area. It reviews important methods such as drift analysis, discusses theoretical insight on parameter tuning and parameter control, and summarizes the advances made for stochastic and dynamic problems. Furthermore, the survey highlights important results in the area of combinatorial optimization with a focus on parameterized complexity and the optimization of submodular functions. Finally, it gives an overview on the large amount of new important results for estimation of distribution algorithms.
Overview of Gaussian process based multi-fidelity techniques with variable relationship between fidelities
Brevault, Loïc, Balesdent, Mathieu, Hebbal, Ali
The design process of complex systems such as new configurations of aircraft or launch vehicles is usually decomposed in different phases which are characterized for instance by the depth of the analyses in terms of number of design variables and fidelity of the physical models. At each phase, the designers have to compose with accurate but computationally intensive models as well as cheap but inaccurate models. Multi-fidelity modeling is a way to merge different fidelity models to provide engineers with accurate results with a limited computational cost. Within the context of multi-fidelity modeling, approaches relying on Gaussian Processes emerge as popular techniques to fuse information between the different fidelity models. The relationship between the fidelity models is a key aspect in multi-fidelity modeling. This paper provides an overview of Gaussian process-based multi-fidelity modeling techniques for variable relationship between the fidelity models (e.g., linearity, non-linearity, variable correlation). Each technique is described within a unified framework and the links between the different techniques are highlighted. All the approaches are numerically compared on a series of analytical test cases and four aerospace related engineering problems in order to assess their benefits and disadvantages with respect to the problem characteristics.