Optimization
Efficient Algorithms for Outlier-Robust Regression
Klivans, Adam, Kothari, Pravesh K., Meka, Raghu
We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is \emph{certifiably hypercontractive}. This natural property is satisfied by many well-studied distributions such as Gaussian, strongly log-concave distributions and, uniform distribution on the hypercube among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets.
Densify uses AI to cut businesses' cloud spending by up to 80%
Densify, a company that helps enterprises make sure that they're using their compute resources to the fullest extent possible, announced a new service today that uses AI to cut down customers' cloud bills. The Cloud Learning Optimization Engine (Cloe for short) analyzes workloads using machine learning to determine how much CPU, RAM, and storage they need, then suggests ways to save money. Cloe has helped customers like Bank of America, Honda, and IBM save an average of 40 percent on their cloud bills, with some customers seeing savings of more than 80 percent. After analyzing the needs of each workload, it suggests compute instances companies can shut down, workloads that can sit on the same instance to save money, and ways to optimize which types of virtual machines customers use in order to reduce their spend. As more companies move their workloads to the public cloud, the sort of optimization work that Cloe helps with is an important component of ensuring that businesses aren't spending too much on their computing infrastructure.
Distributed Computation of Wasserstein Barycenters over Networks
Uribe, César A., Dvinskikh, Darina, Dvurechensky, Pavel, Gasnikov, Alexander, Nedić, Angelia
Optimal Transport distances (also known as earth mover's distances or Wasserstein distances) design an optimal plan to move "mass" from one probability distribution to another. This problem can be traced back to the early work of Monge [1] and Kantorovich [2] and has been of constant interest for allowing natural formulations to the problems of comparing, interpolating, and measuring distances of functions [3]. On the other hand, computational optimal transport has gain popularity for its applications in learning theory [4], computer vision [5], computer graphics [6], statistical inference [7], information fusion [8]; and its relative complexity advantages with respect to classical methods [9]. Particularly, large-scale optimal transport has been of recent interest for the latest applications where large quantities of data are available and efficient algorithms are required [10, 11, 12]. Comprehensive accounts of the optimal transport problem and its computational aspects can be found in [13, 14, 15, 3]. One of the common uses of the Wasserstein distance is the aggregation of distributions by considering their barycenter [16], which itself is another distribution [17]. Wasserstein Barycenters has been shown superior to traditional Euclidean-based methods in a range of application such as image processing [16], economics and finance [18] and condensed matter physics [19]. Figure 1 shows a sample of 100 images of the digit 7 from the MNIST dataset [20] and their respective Euclidean mean and Wasserstein mean.
Boosting Variational Inference: an Optimization Perspective
Locatello, Francesco, Khanna, Rajiv, Ghosh, Joydeep, Rätsch, Gunnar
Variational inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, boosting variational inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. However, as is the case with many other variational inference algorithms, its theoretical properties have not been studied. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analyses yields novel theoretical insights regarding the sufficient conditions for convergence, explicit rates, and algorithmic simplifications. Since a lot of focus in previous works for variational inference has been on tractability, our work is especially important as a much needed attempt to bridge the gap between probabilistic models and their corresponding theoretical properties.
Matched Filters for Noisy Induced Subgraph Detection
Sussman, Daniel L., Lyzinski, Vince, Park, Youngser, Priebe, Carey E.
We consider the problem of finding the vertex correspondence between two graphs with different number of vertices where the smaller graph is still potentially large. We propose a solution to this problem via a graph matching matched filter: padding the smaller graph in different ways and then using graph matching methods to align it to the larger network. Under a statistical model for correlated pairs of graphs, which yields a noisy copy of the small graph within the larger graph, the resulting optimization problem can be guaranteed to recover the true vertex correspondence between the networks, though there are currently no efficient algorithms for solving this problem. We consider an approach that exploits a partially known correspondence and show via varied simulations and applications to the Drosophila connectome that in practice this approach can achieve good performance.
Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Wu, Yuhuai, Ren, Mengye, Liao, Renjie, Grosse, Roger
Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.
Differentiable Submodular Maximization
Tschiatschek, Sebastian, Sahin, Aytunc, Krause, Andreas
We consider learning of submodular functions from data. These functions are important in machine learning and have a wide range of applications, e.g. data summarization, feature selection and active learning. Despite their combinatorial nature, submodular functions can be maximized approximately with strong theoretical guarantees in polynomial time. Typically, learning the submodular function and optimization of that function are treated separately, i.e. the function is first learned using a proxy objective and subsequently maximized. In contrast, we show how to perform learning and optimization jointly. By interpreting the output of greedy maximization algorithms as distributions over sequences of items and smoothening these distributions, we obtain a differentiable objective. In this way, we can differentiate through the maximization algorithms and optimize the model to work well with the optimization algorithm. We theoretically characterize the error made by our approach, yielding insights into the trade-off of smoothness and accuracy. We demonstrate the effectiveness of our approach for jointly learning and optimizing on synthetic maxcut data, and on a real world product recommendation application.
Tuning Over-Relaxed ADMM
França, Guilherme, Bento, José
The framework of Integral Quadratic Constraints (IQC) reduces the computation of upper bounds on the convergence rate of several optimization algorithms to a semi-definite program (SDP). In the case of over-relaxed Alternating Direction Method of Multipliers (ADMM), an explicit and closed form solution to this SDP was derived in our recent work [1]. The purpose of this paper is twofold. First, we summarize these results. Second, we explore one of its consequences which allows us to obtain general and simple formulas for optimal parameter selection. These results are valid for arbitrary strongly convex objective functions.
An Explicit Rate Bound for the Over-Relaxed ADMM
França, Guilherme, Bento, José
The framework of Integral Quadratic Constraints of Lessard et al. (2014) reduces the computation of upper bounds on the convergence rate of several optimization algorithms to semi-definite programming (SDP). Followup work by Nishihara et al. (2015) applies this technique to the entire family of over-relaxed Alternating Direction Method of Multipliers (ADMM). Unfortunately, they only provide an explicit error bound for sufficiently large values of some of the parameters of the problem, leaving the computation for the general case as a numerical optimization problem. In this paper we provide an exact analytical solution to this SDP and obtain a general and explicit upper bound on the convergence rate of the entire family of over-relaxed ADMM. Furthermore, we demonstrate that it is not possible to extract from this SDP a general bound better than ours. We end with a few numerical illustrations of our result and a comparison between the convergence rate we obtain for the ADMM with known convergence rates for the Gradient Descent.
Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms
Hazimeh, Hussein, Mazumder, Rahul
We consider the canonical $L_0$-regularized least squares problem (aka best subsets) which is generally perceived as a `gold-standard' for many sparse learning regimes. In spite of worst-case computational intractability results, recent work has shown that advances in mixed integer optimization can be used to obtain near-optimal solutions to this problem for instances where the number of features $p \approx 10^3$. While these methods lead to estimators with excellent statistical properties, often there is a price to pay in terms of a steep increase in computation times, especially when compared to highly efficient popular algorithms for sparse learning (e.g., based on $L_1$-regularization) that scale to much larger problem sizes. Bridging this gap is a main goal of this paper. We study the computational aspects of a family of $L_0$-regularized least squares problems with additional convex penalties. We propose a hierarchy of necessary optimality conditions for these problems. We develop new algorithms, based on coordinate descent and local combinatorial optimization schemes, and study their convergence properties. We demonstrate that the choice of an algorithm determines the quality of solutions obtained; and local combinatorial optimization-based algorithms generally result in solutions of superior quality. We show empirically that our proposed framework is relatively fast for problem instances with $p\approx 10^6$ and works well, in terms of both optimization and statistical properties (e.g., prediction, estimation, and variable selection), compared to simpler heuristic algorithms. A version of our algorithm reaches up to a three-fold speedup (with $p$ up to $10^6$) when compared to state-of-the-art schemes for sparse learning such as glmnet and ncvreg.