Recht, Benjamin
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Wilson, Ashia C., Roelofs, Rebecca, Stern, Mitchell, Srebro, Nathan, Recht, Benjamin
Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
On the Gap Between Strict-Saddles and True Convexity: An Omega(log d) Lower Bound for Eigenvector Approximation
Simchowitz, Max, Alaoui, Ahmed El, Recht, Benjamin
We prove a \emph{query complexity} lower bound on rank-one principal component analysis (PCA). We consider an oracle model where, given a symmetric matrix $M \in \mathbb{R}^{d \times d}$, an algorithm is allowed to make $T$ \emph{exact} queries of the form $w^{(i)} = Mv^{(i)}$ for $i \in \{1,\dots,T\}$, where $v^{(i)}$ is drawn from a distribution which depends arbitrarily on the past queries and measurements $\{v^{(j)},w^{(j)}\}_{1 \le j \le i-1}$. We show that for a small constant $\epsilon$, any adaptive, randomized algorithm which can find a unit vector $\widehat{v}$ for which $\widehat{v}^{\top}M\widehat{v} \ge (1-\epsilon)\|M\|$, with even small probability, must make $T = \Omega(\log d)$ queries. In addition to settling a widely-held folk conjecture, this bound demonstrates a fundamental gap between convex optimization and "strict-saddle" non-convex optimization of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension. Our argument proceeds via a reduction to estimating the rank-one spike in a deformed Wigner model. We establish lower bounds for this model by developing a "truncated" analogue of the $\chi^2$ Bayes-risk lower bound of Chen et al.
The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
Simchowitz, Max, Jamieson, Kevin, Recht, Benjamin
We propose a novel technique for analyzing adaptive sampling called the {\em Simulator}. Our approach differs from the existing methods by considering not how much information could be gathered by any fixed sampling strategy, but how difficult it is to distinguish a good sampling strategy from a bad one given the limited amount of data collected up to any given time. This change of perspective allows us to match the strength of both Fano and change-of-measure techniques, without succumbing to the limitations of either method. For concreteness, we apply our techniques to a structured multi-arm bandit problem in the fixed-confidence pure exploration setting, where we show that the constraints on the means imply a substantial gap between the moderate-confidence sample complexity, and the asymptotic sample complexity as $\delta \to 0$ found in the literature. We also prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors. Moreover, our lower bounds zero-in on the number of times each \emph{individual} arm needs to be pulled, uncovering new phenomena which are drowned out in the aggregate sample complexity. Our new analysis inspires a simple and near-optimal algorithm for the best-arm and top-k identification, the first {\em practical} algorithm of its kind for the latter problem which removes extraneous log factors, and outperforms the state-of-the-art in experiments.
The Power of Adaptivity in Identifying Statistical Alternatives
Jamieson, Kevin G., Haas, Daniel, Recht, Benjamin
This paper studies the trade-off between two different kinds of pure exploration: breadth versus depth. We focus on the most biased coin problem, asking how many total coin flips are required to identify a ``heavy'' coin from an infinite bag containing both ``heavy'' coins with mean $\theta_1 \in (0,1)$, and ``light" coins with mean $\theta_0 \in (0,\theta_1)$, where heavy coins are drawn from the bag with proportion $\alpha \in (0,1/2)$. When $\alpha,\theta_0,\theta_1$ are unknown, the key difficulty of this problem lies in distinguishing whether the two kinds of coins have very similar means, or whether heavy coins are just extremely rare. While existing solutions to this problem require some prior knowledge of the parameters $\theta_0,\theta_1,\alpha$, we propose an adaptive algorithm that requires no such knowledge yet still obtains near-optimal sample complexity guarantees. In contrast, we provide a lower bound showing that non-adaptive strategies require at least quadratically more samples. In characterizing this gap between adaptive and nonadaptive strategies, we make connections to anomaly detection and prove lower bounds on the sample complexity of differentiating between a single parametric distribution and a mixture of two such distributions.
Gradient Descent Learns Linear Dynamical Systems
Hardt, Moritz, Ma, Tengyu, Recht, Benjamin
We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.
CYCLADES: Conflict-free Asynchronous Machine Learning
Pan, Xinghao, Lam, Maximilian, Tu, Stephen, Papailiopoulos, Dimitris, Zhang, Ce, Jordan, Michael I., Ramchandran, Kannan, Re, Chris, Recht, Benjamin
We present CYCLADES, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting. CYCLADES is asynchronous during shared model updates, and requires no memory locking mechanisms, similar to HOGWILD!-type algorithms. Unlike HOGWILD!, CYCLADES introduces no conflicts during the parallel execution, and offers a black-box analysis for provable speedups across a large family of algorithms. Due to its inherent conflict-free nature and cache locality, our multi-core implementation of CYCLADES consistently outperforms HOGWILD!-type algorithms on sufficiently sparse datasets, leading to up to 40% speedup gains compared to the HOGWILD! implementation of SGD, and up to 5x gains over asynchronous implementations of variance reduction algorithms.
Perturbed Iterate Analysis for Asynchronous Stochastic Optimization
Mania, Horia, Pan, Xinghao, Papailiopoulos, Dimitris, Recht, Benjamin, Ramchandran, Kannan, Jordan, Michael I.
We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms.In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent, that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates. We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
Best-of-K Bandits
Simchowitz, Max, Jamieson, Kevin, Recht, Benjamin
This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain, favorable distributions because the influence of high-order order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these result with independent-arm lower bounds.
Gradient Descent Converges to Minimizers
Lee, Jason D., Simchowitz, Max, Jordan, Michael I., Recht, Benjamin
Large Scale Kernel Learning using Block Coordinate Descent
Tu, Stephen, Roelofs, Rebecca, Venkataraman, Shivaram, Recht, Benjamin
We demonstrate that distributed block coordinate descent can quickly solve kernel regression and classification problems with millions of data points. Armed with this capability, we conduct a thorough comparison between the full kernel, the Nystr\"om method, and random features on three large classification tasks from various domains. Our results suggest that the Nystr\"om method generally achieves better statistical accuracy than random features, but can require significantly more iterations of optimization. Lastly, we derive new rates for block coordinate descent which support our experimental findings when specialized to kernel methods.