Streeter, Matthew
Automatically Bounding the Taylor Remainder Series: Tighter Bounds and New Applications
Streeter, Matthew, Dillon, Joshua V.
We present a new algorithm for automatically bounding the Taylor remainder series. In the special case of a scalar function $f: \mathbb{R} \to \mathbb{R}$, our algorithm takes as input a reference point $x_0$, trust region $[a, b]$, and integer $k \ge 1$, and returns an interval $I$ such that $f(x) - \sum_{i=0}^{k-1} \frac {1} {i!} f^{(i)}(x_0) (x - x_0)^i \in I (x - x_0)^k$ for all $x \in [a, b]$. As in automatic differentiation, the function $f$ is provided to the algorithm in symbolic form, and must be composed of known atomic functions. At a high level, our algorithm has two steps. First, for a variety of commonly-used elementary functions (e.g., $\exp$, $\log$), we use recently-developed theory to derive sharp polynomial upper and lower bounds on the Taylor remainder series. We then recursively combine the bounds for the elementary functions using an interval arithmetic variant of Taylor-mode automatic differentiation. Our algorithm can make efficient use of machine learning hardware accelerators, and we provide an open source implementation in JAX. We then turn our attention to applications. Most notably, in a companion paper we use our new machinery to create the first universal majorization-minimization optimization algorithms: algorithms that iteratively minimize an arbitrary loss using a majorizer that is derived automatically, rather than by hand. We also show that our automatically-derived bounds can be used for verified global optimization and numerical integration, and to prove sharper versions of Jensen's inequality.
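To make the scalar guarantee concrete, here is a minimal sketch for the single atomic function $f = \exp$. It is not the authors' JAX implementation. It relies on one fact: when $f^{(k)}$ is increasing (as it is for $\exp$), the ratio $R_k(x) / (x - x_0)^k$ is increasing in $x$, so the interval $I$ is given by the ratio at the two endpoints of the trust region. The function name `exp_remainder_interval` is hypothetical.

```python
import math


def exp_remainder_interval(x0, a, b, k):
    """Return (lo, hi) such that, for f = exp and every x in [a, b],
    f(x) - sum_{i<k} f^(i)(x0) (x - x0)^i / i!  lies in  [lo, hi] * (x - x0)^k.

    Assumes a <= x0 <= b and k >= 1. Illustrative sketch only.
    """
    def ratio(x):
        if x == x0:
            # Limit of R_k(x) / (x - x0)^k as x -> x0 is f^(k)(x0) / k!.
            return math.exp(x0) / math.factorial(k)
        taylor = sum(math.exp(x0) * (x - x0) ** i / math.factorial(i)
                     for i in range(k))
        return (math.exp(x) - taylor) / (x - x0) ** k

    # The ratio is increasing in x because exp^(k) = exp is increasing,
    # so its extremes over [a, b] are attained at the endpoints.
    return ratio(a), ratio(b)


lo, hi = exp_remainder_interval(x0=0.0, a=-1.0, b=1.0, k=2)
print(lo, hi)
```

With $x_0 = 0$, $[a, b] = [-1, 1]$, and $k = 2$, this gives $I = [e^{-1}, e - 2]$, i.e. $e^{-1} x^2 \le e^x - 1 - x \le (e - 2) x^2$ on the trust region.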
Universal Majorization-Minimization Algorithms
Streeter, Matthew
Majorization-minimization (MM) is a family of optimization methods that iteratively reduce a loss by minimizing a locally-tight upper bound, called a majorizer. Traditionally, majorizers were derived by hand, and MM was only applicable to a small number of well-studied problems. We present optimizers that instead derive majorizers automatically, using a recent generalization of Taylor mode automatic differentiation. These universal MM optimizers can be applied to arbitrary problems and converge from any starting point, with no hyperparameter tuning.
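As an illustration of the basic MM loop, the sketch below minimizes a one-dimensional loss using a hand-derived quadratic majorizer. This is the traditional setting the abstract contrasts with, not the automatically derived majorizers of the paper. The loss $\mathrm{softplus}(x) - 0.3x$ and the function names are assumptions chosen for the example; the majorizer is valid because the second derivative of this loss is at most $1/4$.

```python
import math


def loss(x):
    # softplus(x) - 0.3 * x, with a numerically stable softplus.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0) - 0.3 * x


def grad(x):
    # Derivative of the loss: sigmoid(x) - 0.3.
    return 1.0 / (1.0 + math.exp(-x)) - 0.3


def mm_minimize(x, steps=50):
    """Iteratively minimize the quadratic majorizer
        loss(x_t) + grad(x_t) * (y - x_t) + (1/8) * (y - x_t)**2,
    which upper-bounds the loss everywhere and is tight at y = x_t.
    """
    history = [loss(x)]
    for _ in range(steps):
        # Closed-form minimizer of the majorizer above.
        x = x - 4.0 * grad(x)
        history.append(loss(x))
    return x, history


x_final, history = mm_minimize(3.0)
print(x_final, history[-1])
```

Because each step minimizes an upper bound that touches the loss at the current iterate, the recorded loss values never increase, and no step size needs tuning.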
Data-driven Science and Machine Learning Methods in Laser-Plasma Physics
Döpp, Andreas, Eberle, Christoph, Howard, Sunny, Irshad, Faran, Lin, Jinpu, Streeter, Matthew
Laser-plasma physics has developed rapidly over the past few decades as high-power lasers have become both increasingly powerful and more widely available. Early experimental and numerical research in this field was restricted to single-shot experiments with limited parameter exploration. However, recent technological improvements make it possible to gather an increasing amount of data, both in experiments and simulations. This has sparked interest in using advanced techniques from mathematics, statistics and computer science to deal with, and benefit from, big data. At the same time, sophisticated modeling techniques also provide new ways for researchers to deal effectively with situations in which only sparse data are available. This paper aims to present an overview of relevant machine learning methods with a focus on applicability to laser-plasma physics, including its important sub-fields of laser-plasma acceleration and inertial confinement fusion.
Online Learning of Assignments
Streeter, Matthew, Golovin, Daniel, Krause, Andreas
Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of information? These applications exhibit strong diminishing returns: selecting redundant ads or information sources decreases their marginal utility. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model.
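The diminishing-returns structure can be seen in a small offline example. The sketch below is the standard greedy heuristic for a single monotone submodular objective, not the online no-regret algorithm of the paper. The objective (expected number of users satisfied, given hypothetical per-position `attention` weights and item-user relevance `rel`) and all names are assumptions made for illustration.

```python
def value(assignment, rel, attention):
    """Expected number of satisfied users.

    assignment[p] is the item shown at position p; user u is satisfied by
    that slot with probability attention[p] * rel[item][u], independently
    across positions. This objective is monotone submodular in the set of
    (position, item) pairs.
    """
    n_users = len(rel[0])
    total = 0.0
    for u in range(n_users):
        miss = 1.0
        for pos, item in enumerate(assignment):
            miss *= 1.0 - attention[pos] * rel[item][u]
        total += 1.0 - miss
    return total


def greedy_assignment(rel, attention):
    """Fill positions in order, each time picking the item with the
    largest marginal gain given the items already placed."""
    assignment = []
    for _ in range(len(attention)):
        best = max(range(len(rel)),
                   key=lambda i: value(assignment + [i], rel, attention))
        assignment.append(best)
    return assignment


# Items 0 and 1 appeal to the same two users; item 2 appeals to a third.
rel = [[0.9, 0.9, 0.0],
       [0.8, 0.8, 0.0],
       [0.0, 0.0, 0.7]]
attention = [1.0, 0.5]
print(greedy_assignment(rel, attention))
```

In this example the second position goes to item 2 rather than the individually stronger but redundant item 1, because item 1's marginal gain has been largely used up by item 0.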
Learning Effective Loss Functions Efficiently
Streeter, Matthew
We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized "easy" case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.
Bayes Optimal Early Stopping Policies for Black-Box Optimization
Streeter, Matthew
We derive an optimal policy for adaptively restarting a randomized algorithm, based on observed features of the run-so-far, so as to minimize the expected time required for the algorithm to successfully terminate. Given a suitable Bayesian prior, this result can be used to select the optimal black-box optimization algorithm from among a large family of algorithms that includes random search, Successive Halving, and Hyperband. On CIFAR-10 and ImageNet hyperparameter tuning problems, the proposed policies offer up to a factor of 13 improvement over random search in terms of expected time to reach a given target accuracy, and up to a factor of 3 improvement over a baseline adaptive policy that terminates a run whenever its accuracy is below-median.
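For intuition about why restarting can help, the sketch below covers only the classical feature-free case: a fixed cutoff $c$ applied to a run whose time-to-success $T$ has a known discrete distribution. The expected total time is then $\mathbb{E}[\min(T, c)] / \Pr[T \le c]$. The adaptive, Bayesian policies described in the abstract are more general and are not reproduced here; the function names and the toy distribution are assumptions.

```python
def expected_time_with_restarts(runtimes, probs, cutoff):
    """Expected total time when every run is killed and restarted after
    `cutoff` time units, for a run whose time-to-success takes value
    runtimes[i] with probability probs[i]."""
    p_success = sum(p for t, p in zip(runtimes, probs) if t <= cutoff)
    if p_success == 0.0:
        return float("inf")  # no run can ever finish within the cutoff
    # Each attempt costs min(T, cutoff); the number of attempts is
    # geometric with mean 1 / p_success.
    expected_cost_per_attempt = sum(p * min(t, cutoff)
                                    for t, p in zip(runtimes, probs))
    return expected_cost_per_attempt / p_success


def best_fixed_cutoff(runtimes, probs):
    # An optimal fixed cutoff can always be taken from the support of T.
    return min(runtimes,
               key=lambda c: expected_time_with_restarts(runtimes, probs, c))


runtimes, probs = [1.0, 100.0], [0.5, 0.5]
print(best_fixed_cutoff(runtimes, probs))
```

On this toy distribution, restarting after one time unit brings the expected time down from 50.5 (never restarting) to 2.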
Learning Optimal Linear Regularizers
Streeter, Matthew
We present algorithms for efficiently learning regularizers that improve generalization. Our approach is based on the insight that regularizers can be viewed as upper bounds on the generalization gap, and that reducing the slack in the bound can improve performance on test data. For a broad class of regularizers, the hyperparameters that give the best upper bound can be computed using linear programming. Under certain Bayesian assumptions, solving the LP lets us "jump" to the optimal hyperparameters given very limited data. This suggests a natural algorithm for tuning regularization hyperparameters, which we show to be effective on both real and synthetic data.
Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning
McMahan, Brendan, Streeter, Matthew
We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show that when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.
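The effect of update delays can be reproduced on a toy problem. The sketch below is plain gradient descent with a fixed learning rate and a fixed delay on the objective $\tfrac{1}{2} x^2$. It is not AdaptiveRevision and does not revise learning rates; it only shows why a step size that is safe without delays can fail once gradients arrive late. The function name and all parameter values are assumptions.

```python
from collections import deque


def delayed_gradient_descent(lr, delay, steps=300, x0=1.0):
    """Minimize 0.5 * x**2 when each gradient is applied `delay` steps
    after the iterate at which it was computed. Returns all iterates."""
    x = x0
    pending = deque()  # gradients computed but not yet applied
    trajectory = [x]
    for _ in range(steps):
        pending.append(x)  # gradient of 0.5 * x**2 at the current iterate
        if len(pending) > delay:
            # Apply the oldest outstanding (stale) gradient.
            x -= lr * pending.popleft()
        trajectory.append(x)
    return trajectory


no_delay = delayed_gradient_descent(lr=0.5, delay=0)
stale = delayed_gradient_descent(lr=0.5, delay=5)
stale_small_lr = delayed_gradient_descent(lr=0.1, delay=5)
print(abs(no_delay[-1]), max(abs(v) for v in stale[-10:]), abs(stale_small_lr[-1]))
```

With no delay, a learning rate of 0.5 converges quickly. With a delay of 5 steps the same learning rate produces growing oscillations, while the smaller rate 0.1 still converges, which is the kind of delay-dependent step-size adjustment the abstract's algorithms make automatically.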
No-Regret Algorithms for Unconstrained Online Convex Optimization
McMahan, Brendan, Streeter, Matthew
Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is $\mathbb{R}^n$. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point $x^*$ are known in advance. We present an algorithm that, without such prior knowledge, offers near-optimal regret bounds with respect to _any_ choice of $x^*$. In particular, regret with respect to $x^* = 0$ is _constant_. We then prove lower bounds showing that our algorithm's guarantees are optimal in this setting up to constant factors.