Collaborating Authors

 Abernethy, Jacob


Linear Separation via Optimism

arXiv.org Machine Learning

Binary linear classification has been explored since the very early days of the machine learning literature. Perhaps the most classical algorithm is the Perceptron, where a weight vector used to classify examples is maintained, and additive updates are made as incorrect examples are discovered. The Perceptron has been thoroughly studied and several versions have been proposed over many decades. The key theoretical fact about the Perceptron is that, so long as a perfect linear classifier exists with some margin $\gamma > 0$, the number of required updates to find such a perfect linear separator is bounded by $\frac{1}{\gamma^2}$. What has never been fully addressed is: does there exist an algorithm that can achieve this with fewer updates? In this paper we answer this in the affirmative: we propose the Optimistic Perceptron algorithm, a simple procedure that finds a separating hyperplane in no more than $\frac{1}{\gamma}$ updates. We also show experimentally that this procedure can significantly outperform Perceptron.
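As a point of reference, the classical Perceptron described above can be sketched as follows. This is only the baseline algorithm with its $\frac{1}{\gamma^2}$ update bound; the abstract does not specify the Optimistic Perceptron, so it is not reproduced here. The function name `perceptron`, the `max_passes` cap, and the toy data set are assumptions made for illustration, and the bound is checked under the usual assumption that every example has norm at most 1.

```python
import math

def perceptron(examples, max_passes=1000):
    """Classical Perceptron. `examples` is a list of (x, y) with y in {-1, +1}."""
    dim = len(examples[0][0])
    w = [0.0] * dim          # weight vector, starts at zero
    updates = 0              # number of additive updates made so far
    for _ in range(max_passes):
        mistakes = 0
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                      # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                mistakes += 1
        if mistakes == 0:                           # a full pass with no errors: separated
            break
    return w, updates

# Toy data (hypothetical): unit-norm points separated by w* = (1, 1) / sqrt(2),
# which gives margin gamma = 1 / sqrt(2).
examples = [((1.0, 0.0), 1), ((0.0, 1.0), 1), ((-1.0, 0.0), -1), ((0.0, -1.0), -1)]
gamma = 1.0 / math.sqrt(2.0)
w, updates = perceptron(examples)
```

On this toy set the number of updates can be compared directly against $\frac{1}{\gamma^2}$; the gap between that bound and a $\frac{1}{\gamma}$ bound grows as the margin shrinks.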


Understanding How Over-Parametrization Leads to Acceleration: A case of learning a single teacher neuron

arXiv.org Machine Learning

Over-parametrization has become a popular technique in deep learning. It is observed that, with over-parametrization, a larger neural network needs fewer training iterations than a smaller one to achieve a certain level of performance; that is, over-parametrization leads to acceleration in optimization. However, although over-parametrization is widely used, little theory is available to explain the resulting acceleration. In this paper, we propose understanding it by studying a simple problem first. Specifically, we consider the setting in which there is a single teacher neuron with quadratic activation, where over-parametrization is realized by having multiple student neurons learn the data generated from the teacher neuron. We provably show that over-parametrization helps the iterate generated by gradient descent to enter, faster, the neighborhood of a global optimal solution that achieves zero testing error. On the other hand, we also point out an issue regarding the necessity of over-parametrization and study how the scaling of the output neurons affects the convergence time.


Provable Acceleration of Neural Net Training via Polyak's Momentum

arXiv.org Machine Learning

Incorporating a so-called "momentum" dynamic in gradient descent methods is widely used in neural net training as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence. At the same time, there are very few theoretical guarantees in the literature to explain this apparent acceleration effect. In this paper we show that Polyak's momentum, in combination with over-parameterization of the model, helps achieve faster convergence in training a one-layer ReLU network on $n$ examples. We show specifically that gradient descent with Polyak's momentum decreases the initial training error at a rate much faster than that of vanilla gradient descent. We provide a bound for a fixed sample size $n$, and we show that gradient descent with Polyak's momentum converges at an accelerated rate to a small error that is controllable by the number of neurons $m$. Prior work [DZPS19] showed that using vanilla gradient descent, and with a similar method of over-parameterization, the error decays as $(1-\kappa_n)^t$ after $t$ iterations, where $\kappa_n$ is a problem-specific parameter. Our result shows that with the appropriate choice of parameters one has a rate of $(1-\sqrt{\kappa_n})^t$. This work establishes that momentum does indeed speed up neural net training.
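The contrast between a $(1-\kappa_n)^t$ rate and a $(1-\sqrt{\kappa_n})^t$ rate can be illustrated on a much simpler problem than the one-layer ReLU network studied in the paper. The sketch below is an assumption-laden stand-in: it uses a diagonal quadratic $f(x) = \frac{1}{2}\sum_i h_i x_i^2$ with curvatures between `mu` and `L`, and the textbook step size and momentum for quadratics. All names (`run_gd`, `run_heavy_ball`, `curvatures`) are hypothetical and none of this is the paper's construction.

```python
import math

def gradient(x, curvatures):
    """Gradient of f(x) = 0.5 * sum(h_i * x_i^2)."""
    return [h * xi for h, xi in zip(curvatures, x)]

def run_gd(x0, curvatures, step, iters):
    """Vanilla gradient descent."""
    x = list(x0)
    for _ in range(iters):
        g = gradient(x, curvatures)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def run_heavy_ball(x0, curvatures, step, momentum, iters):
    """Gradient descent with Polyak's momentum: x+ = x - step * g + momentum * (x - x_prev)."""
    x_prev, x = list(x0), list(x0)
    for _ in range(iters):
        g = gradient(x, curvatures)
        x_next = [xi - step * gi + momentum * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

def norm(x):
    return math.sqrt(sum(v * v for v in x))

# Assumed toy problem: smallest curvature mu, largest curvature L (condition number 100).
mu, L = 0.01, 1.0
curvatures = [mu, 0.1, L]
x0 = [1.0, 1.0, 1.0]
iters = 200

gd_x = run_gd(x0, curvatures, step=1.0 / L, iters=iters)

# Classical tuning of Polyak's method for quadratics.
hb_step = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
hb_momentum = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2
hb_x = run_heavy_ball(x0, curvatures, hb_step, hb_momentum, iters)

gd_err, hb_err = norm(gd_x), norm(hb_x)
```

Here the minimizer is the origin, so `gd_err` and `hb_err` are distances to the optimum; on this ill-conditioned quadratic the momentum run should end far closer to it than vanilla gradient descent after the same number of iterations.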


Quickly Finding a Benign Region via Heavy Ball Momentum in Non-Convex Optimization

arXiv.org Machine Learning

The Heavy Ball Method, proposed by Polyak over five decades ago (Polyak, 1964), is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap, and in the present work we identify two non-convex problems where we provably show that Heavy Ball momentum helps the iterate enter a benign region containing a global optimal point faster. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger value of the momentum parameter for these problems. The first of these optimization problems is the phase retrieval problem, which has useful applications in physical science. The second is cubic-regularized minimization, a critical subroutine required by the Nesterov-Polyak cubic-regularized method (Nesterov & Polyak, 2006) to find second-order stationary points in general smooth non-convex problems. Polyak's Heavy Ball method (Polyak, 1964) has been very popular in modern non-convex optimization and deep learning, and the stochastic version (a.k.a. SGD with momentum) has become the de facto algorithm for training neural nets.


Online Kernel based Generative Adversarial Networks

arXiv.org Machine Learning

One of the major breakthroughs in deep learning over the past five years has been the Generative Adversarial Network (GAN), a neural network-based generative model which aims to mimic some underlying distribution given a dataset of samples. In contrast to many supervised problems, where one tries to minimize a simple objective function of the parameters, GAN training is formulated as a min-max problem over a pair of network parameters. While empirically GANs have shown impressive success in several domains, researchers have been puzzled by unusual training behavior, including cycling and so-called mode collapse. In this paper, we begin by providing a quantitative method to explore some of the challenges in GAN training, and we show empirically how this relates fundamentally to the parametric nature of the discriminator network. We propose a novel approach that resolves many of these issues by relying on a kernel-based non-parametric discriminator that is highly amenable to online training; we call this the Online Kernel-based Generative Adversarial Network (OKGAN). We show empirically that OKGANs mitigate a number of training issues, including mode collapse and cycling, and are much more amenable to theoretical guarantees. OKGANs empirically perform dramatically better, with respect to reverse KL-divergence, than other GAN formulations on synthetic data; on classical vision datasets such as MNIST, SVHN, and CelebA, they show comparable performance.


Adaptive Sampling to Reduce Disparate Performance

arXiv.org Machine Learning

Existing methods for reducing disparate performance of a classifier across different demographic groups assume that one has access to a large data set, thereby focusing on the algorithmic aspect of optimizing overall performance subject to additional constraints. However, poor data collection and imbalanced data sets can severely affect the quality of these methods. In this work, we consider a setting where data collection and optimization are performed simultaneously. In such a scenario, a natural strategy to mitigate the performance difference of the classifier is to provide additional training data drawn from the demographic groups that are worse off. In this paper, we propose to consistently follow this strategy throughout the whole training process and to guide the resulting classifier towards equal performance on the different groups by adaptively sampling each data point from the group that is currently disadvantaged. We provide a rigorous theoretical analysis of our approach in a simplified one-dimensional setting and an extensive experimental evaluation on numerous real-world data sets, including a case study on the data collected during the Flint water crisis.
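The sampling rule described above, drawing each new training point from the group that is currently worse off, might look roughly like the following. This is a minimal sketch under stated assumptions: the abstract does not say how "disadvantaged" is measured, so per-group error is used here as a hypothetical criterion, and the names `pick_disadvantaged_group`, `adaptive_sampling_round`, `pools`, and `error_by_group` are invented for illustration.

```python
import random

def pick_disadvantaged_group(error_by_group):
    """Return the group whose current error is largest (assumed criterion)."""
    return max(error_by_group, key=error_by_group.get)

def adaptive_sampling_round(pools, error_by_group, rng):
    """Draw one new training point from the currently disadvantaged group.

    `pools` maps each group name to a list of candidate data points.
    """
    group = pick_disadvantaged_group(error_by_group)
    return group, rng.choice(pools[group])

# Hypothetical state part-way through training: group "b" is currently worse off.
pools = {"a": [(0.1, 0), (0.4, 1)], "b": [(0.7, 1), (0.9, 0), (0.5, 1)]}
error_by_group = {"a": 0.10, "b": 0.30}
rng = random.Random(0)
group, point = adaptive_sampling_round(pools, error_by_group, rng)
```

In a full training loop the classifier would be updated on `point` and `error_by_group` re-estimated before the next draw, so the sampled group can change from round to round.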


Competing Against Equilibria in Zero-Sum Games with Evolving Payoffs

arXiv.org Machine Learning

We study the problem of repeated play in a zero-sum game in which the payoff matrix may change, in a possibly adversarial fashion, on each round; we call these Online Matrix Games. Finding the Nash Equilibrium (NE) of a two player zero-sum game is core to many problems in statistics, optimization, and economics, and for a fixed game matrix this can be easily reduced to solving a linear program. But when the payoff matrix evolves over time our goal is to find a sequential algorithm that can compete with, in a certain sense, the NE of the long-term-averaged payoff matrix. We design an algorithm with small NE regret--that is, we ensure that the long-term payoff of both players is close to minimax optimum in hindsight. Our algorithm achieves near-optimal dependence with respect to the number of rounds and depends poly-logarithmically on the number of available actions of the players. Additionally, we show that the naive reduction, where each player simply minimizes its own regret, fails to achieve the stated objective regardless of which algorithm is used. We also consider the so-called bandit setting, where the feedback is significantly limited, and we provide an algorithm with small NE regret using one-point estimates of each payoff matrix.


Last-iterate convergence rates for min-max optimization

arXiv.org Machine Learning

We study the problem of finding min-max solutions for smooth two-input objective functions. While classic results show average-iterate convergence rates for various algorithms, nonconvex applications such as training Generative Adversarial Networks require last-iterate convergence guarantees, which are more difficult to prove. It has been an open problem as to whether any algorithm achieves non-asymptotic last-iterate convergence in settings beyond the bilinear and convex-strongly concave settings. In this paper, we study the Hamiltonian Gradient Descent (HGD) algorithm, and we show that HGD exhibits a linear convergence rate in a variety of more general settings, including convex-concave settings that are "sufficiently bilinear." We also prove similar convergence rates for the Consensus Optimization (CO) algorithm of [MNG17] for some parameter settings of CO.


Acceleration through Optimistic No-Regret Dynamics

arXiv.org Machine Learning

We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using no-regret learning dynamics, and the standard approach leads to a rate of $O(1/T)$. But we are able to show that the game can be solved at a rate of $O(1/T^2)$, extending recent works of [RS13, SALS15] by using optimistic learning to speed up equilibrium computation. The optimization algorithm that we can extract from this equilibrium reduction coincides exactly with the well-known Nesterov's accelerated gradient method [N83a], and indeed the same story allows us to recover several variants of Nesterov's algorithm via small tweaks. This methodology unifies a number of different iterative optimization methods: we show that the Heavy Ball algorithm is precisely the non-optimistic variant of Nesterov's accelerated gradient method, and recent prior work already established a similar perspective on Frank-Wolfe [AW17, ALLW18].
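To make "solving a zero-sum game with no-regret dynamics" concrete, here is a minimal sketch on a small matrix game rather than the convex-concave game constructed in the paper. Both players run the standard (non-optimistic) exponential-weights update, and the averaged strategies approach a Nash equilibrium. The payoff matrix, the step-size choice, and the names `hedge_self_play` and `duality_gap` are assumptions for illustration; the optimistic variant and the reduction to convex optimization are not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hedge_self_play(payoff, rounds):
    """Row player minimizes x^T A y, column player maximizes it; both use exponential weights."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    eta_row = math.sqrt(math.log(n_rows) / rounds)
    eta_col = math.sqrt(math.log(n_cols) / rounds)
    row_loss = [0.0] * n_rows    # cumulative loss of each row action
    col_gain = [0.0] * n_cols    # cumulative gain of each column action
    avg_x, avg_y = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(rounds):
        x = softmax([-eta_row * s for s in row_loss])
        y = softmax([eta_col * s for s in col_gain])
        for i in range(n_rows):
            row_loss[i] += sum(payoff[i][j] * y[j] for j in range(n_cols))
        for j in range(n_cols):
            col_gain[j] += sum(payoff[i][j] * x[i] for i in range(n_rows))
        avg_x = [a + xi / rounds for a, xi in zip(avg_x, x)]
        avg_y = [a + yj / rounds for a, yj in zip(avg_y, y)]
    return avg_x, avg_y

def duality_gap(payoff, x, y):
    """Best-response value for the column player minus best-response value for the row player."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_col = max(sum(x[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols))
    best_row = min(sum(payoff[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows))
    return best_col - best_row

# Hypothetical 2x2 game with entries in [-1, 1].
payoff = [[1.0, -1.0], [-1.0, 0.5]]
avg_x, avg_y = hedge_self_play(payoff, rounds=2000)
gap = duality_gap(payoff, avg_x, avg_y)
```

The duality gap of the averaged strategies equals the sum of the two players' average regrets, which is why driving regret down drives the pair toward equilibrium.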


ActiveRemediation: The Search for Lead Pipes in Flint, Michigan

arXiv.org Machine Learning

We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents' drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide.