Collaborating Authors

 Carmon, Yair


Making SGD Parameter-Free

arXiv.org Artificial Intelligence

Stochastic convex optimization (SCO) is a cornerstone of both the theory and practice of machine learning. Consequently, there is intense interest in developing SCO algorithms that require little to no prior knowledge of the problem parameters, and hence little to no tuning [27, 23, 20, 2, 22, 39]. In this work we consider the fundamental problem of non-smooth SCO (in a potentially unbounded domain) and seek methods that are adaptive to a key problem parameter: the initial distance to optimality. Current approaches for tackling this problem focus on the more general online learning problem of parameter-free regret minimization [8, 10, 11, 12, 21, 24, 25, 30, 32, 37], where the goal is to obtain regret guarantees that are valid for comparators with arbitrary norms. Research on parameter-free regret minimization has led to practical algorithms for stochastic optimization [9, 27, 32], methods that are able to adapt to many problem parameters simultaneously [37], and methods that can work with any norm [12].
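To see why the initial distance to optimality matters, the standard textbook analysis of SGD with a fixed step size can be written out as follows. This is a sketch of the classical bound, not a result taken from the abstract above; the symbols $D = \|x_0 - x^\star\|$, $G$ (a bound on stochastic gradient norms), $T$ (number of steps), $\eta$ (step size), and $\bar{x}_T$ (the average iterate) are notation assumed here.

$$
\mathbb{E}\, f(\bar{x}_T) - f(x^\star) \;\le\; \frac{D^2}{2\eta T} + \frac{\eta G^2}{2},
\qquad
\eta = \frac{D}{G\sqrt{T}} \;\Longrightarrow\; \mathbb{E}\, f(\bar{x}_T) - f(x^\star) \;\le\; \frac{DG}{\sqrt{T}}.
$$

The step size that minimizes the bound depends on $D$, which is usually unknown in advance; a parameter-free method aims for a comparable guarantee without being told $D$.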


Malign Overfitting: Interpolation Can Provably Preclude Invariance

arXiv.org Artificial Intelligence

Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of "benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that, even in the simplest of settings, any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that, in the same setting, successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.


Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization

arXiv.org Machine Learning

For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet, a synthetic pose estimation task derived from YCB objects, satellite imagery classification in FMoW-WILDS, and wildlife classification in iWildCam-WILDS. The strong correlations hold across model architectures, hyperparameters, training set size, and training duration, and are more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.


Large-Scale Methods for Distributionally Robust Optimization

arXiv.org Machine Learning

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $\chi^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $\chi^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
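As a small illustration of the CVaR objective mentioned above, the following sketch computes the empirical CVaR of a batch of losses, i.e. the average of the worst `alpha` fraction. It is a plain full-batch estimate under assumed names (`cvar`, `losses`, `alpha`), not the batch or multilevel Monte Carlo estimators analyzed in the paper.

```python
def cvar(losses, alpha):
    """Empirical CVaR at level alpha: mean of the worst alpha-fraction of losses.

    Equivalent to maximizing sum(q_i * loss_i) over weights q on the simplex
    with each q_i <= 1 / (alpha * n). Names here are illustrative assumptions.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(losses)
    if n == 0:
        raise ValueError("losses must be non-empty")

    budget = alpha * n  # total weight to distribute, in units of samples
    total = 0.0
    # Greedily give full weight to the largest losses until the budget runs out.
    for loss in sorted(losses, reverse=True):
        weight = min(1.0, budget)
        total += weight * loss
        budget -= weight
        if budget <= 0:
            break
    return total / (alpha * n)


batch = [1.0, 2.0, 3.0, 4.0]
worst_half = cvar(batch, 0.5)   # averages the two largest losses
plain_mean = cvar(batch, 1.0)   # alpha = 1 recovers the ordinary mean
```

Smaller values of `alpha` concentrate the objective on fewer, harder examples, which is what makes the uncertainty level a natural quantity for the guarantees to depend on.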


Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations

arXiv.org Machine Learning

We design an algorithm which finds an $\epsilon$-approximate stationary point (with $\|\nabla F(x)\|\le \epsilon$) using $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and---surprisingly---that it cannot be improved using stochastic $p$th order methods for any $p\ge 2$, even when the first $p$ derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding $(\epsilon,\gamma)$-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.
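The two primitives named above, a Hessian-vector product and the $\epsilon$-stationarity test $\|\nabla F(x)\| \le \epsilon$, can be sketched as follows. This is a deterministic illustration only: it approximates the Hessian-vector product by a central difference of gradients, whereas the paper works with stochastic oracles. The function names and the toy objective $F(x) = x_0^2 x_1 + x_1^3$ are assumptions made for the example.

```python
import math


def hessian_vector_product(grad, x, v, h=1e-5):
    """Approximate H(x) v by a central difference of gradients.

    Needs two gradient calls and never forms the Hessian matrix.
    """
    x_plus = [xi + h * vi for xi, vi in zip(x, v)]
    x_minus = [xi - h * vi for xi, vi in zip(x, v)]
    g_plus = grad(x_plus)
    g_minus = grad(x_minus)
    return [(gp - gm) / (2.0 * h) for gp, gm in zip(g_plus, g_minus)]


def is_eps_stationary(grad, x, eps):
    """Check the condition ||grad F(x)|| <= eps in the Euclidean norm."""
    return math.sqrt(sum(g * g for g in grad(x))) <= eps


def grad_F(x):
    """Gradient of the toy objective F(x) = x0^2 * x1 + x1^3 (hypothetical)."""
    x0, x1 = x
    return [2.0 * x0 * x1, x0 ** 2 + 3.0 * x1 ** 2]


# Hessian at (1, 2) is [[4, 2], [2, 12]], so H @ (1, 0) = (4, 2).
hv = hessian_vector_product(grad_F, [1.0, 2.0], [1.0, 0.0])
at_origin = is_eps_stationary(grad_F, [0.0, 0.0], 1e-3)
```

In practice an automatic differentiation library would typically supply exact Hessian-vector products at a cost comparable to a gradient evaluation; the finite difference is used here only to keep the sketch dependency-free.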


Unlabeled Data Improves Adversarial Robustness

arXiv.org Machine Learning

The past few years have seen an intense research interest in making models robust to adversarial examples [37]. Yet despite a wide range of proposed defenses, the state-of-the-art in adversarial robustness is far from satisfactory. Recent work points towards sample complexity as a possible reason for the small gains in robustness: Schmidt et al. [35] show that in a simple model, learning a classifier with nontrivial adversarially robust accuracy requires substantially more samples than achieving good "standard" accuracy. Furthermore, recent empirical work obtains promising gains in robustness via transfer learning of a robust classifier from a larger labeled dataset [15]. While both theory and experiments suggest that more training data leads to greater robustness, following this suggestion can be difficult due to the cost of gathering additional data and especially obtaining high-quality labels.


A Rank-1 Sketch for Matrix Multiplicative Weights

arXiv.org Machine Learning

We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a randomized mirror projection, and perform mirror descent analysis on the expected projection. Our sketch solves the online eigenvector problem, improving the best known complexity bounds. We also apply this sketch to a simple no-regret scheme for semidefinite programming in saddle-point form, where it matches the best known guarantees.
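The point about needing only a product of the form $e^A b$ can be illustrated with a short sketch. Two assumptions are made loudly here. First, the product is approximated with a truncated Taylor series rather than the Lanczos method named above, because the series fits in a few lines. Second, the rank-1 matrix is formed as $w w^\top / (w^\top w)$ with $w = e^{A/2} u$, which is one natural reading of a rank-1 sketch of the MMW update and is not spelled out in the abstract. All function names are hypothetical.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def expm_times_vector(A, b, terms=30):
    """Approximate e^A b by a truncated Taylor series using only mat-vecs.

    Stand-in for Lanczos; adequate only when the norm of A is modest.
    """
    result = list(b)
    term = list(b)
    for k in range(1, terms):
        # term becomes A^k b / k!
        term = [t / k for t in matvec(A, term)]
        result = [r + t for r, t in zip(result, term)]
    return result


def rank_one_sketch(A, u, terms=30):
    """Assumed sketch form: w w^T / (w^T w) with w = e^{A/2} u."""
    half_A = [[a / 2.0 for a in row] for row in A]
    w = expm_times_vector(half_A, u, terms)
    norm_sq = sum(wi * wi for wi in w)
    return [[wi * wj / norm_sq for wj in w] for wi in w]


# Zero matrix: e^{0} u = u, so the sketch is u u^T / ||u||^2.
X = rank_one_sketch([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0])
trace_X = X[0][0] + X[1][1]
```

The result is a rank-1 positive semidefinite matrix with unit trace, obtained without ever forming the full matrix exponential. In the randomized setting the vector `u` would be drawn at random (for example from a standard Gaussian); a fixed `u` is used above only to keep the example deterministic.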


Analysis of Krylov Subspace Solutions of Regularized Non-Convex Quadratic Problems

Neural Information Processing Systems

We provide convergence rates for Krylov subspace solutions to the trust-region and cubic-regularized (nonconvex) quadratic problems. Such solutions may be efficiently computed by the Lanczos method and have long been used in practice. We prove error bounds of the form $1/t^2$ and $e^{-4t/\sqrt{\kappa}}$, where $\kappa$ is a condition number for the problem, and $t$ is the Krylov subspace order (number of Lanczos iterations). We also provide lower bounds showing that our analysis is sharp.


No bad local minima: Data independent training error guarantees for multilayer neural networks

arXiv.org Machine Learning

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing about the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.