The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization

Schliserman, Matan, Sherman, Uri, Koren, Tomer

arXiv.org Artificial Intelligence 

The study of generalization properties of stochastic optimization algorithms has been at the heart of contemporary machine learning research. While in the more classical frameworks studies largely focused on the learning problem (e.g., Alon et al., 1997; Blumer et al., 1989), in the past decade it has become clear that in modern scenarios the particular algorithm used to learn the model plays a vital role in its generalization performance. As a prominent example, heavily over-parameterized deep neural networks trained by first order methods output models that generalize well, despite the fact that an arbitrarily chosen Empirical Risk Minimizer (ERM) may perform poorly (Zhang et al., 2017; Neyshabur et al., 2014, 2017). The present paper aims at understanding the generalization behavior of gradient methods, specifically in connection with the problem dimension, in the fundamental Stochastic Convex Optimization (SCO) learning setup; a well studied, theoretical framework widely used to study stochastic optimization algorithms. The seminal work of Shalev-Shwartz et al. (2010) was the first to show that uniform convergence, the canonical condition for generalization in statistical learning (e.g., Vapnik, 1971; Bartlett and Mendelson, 2002) may not hold in high-dimensional SCO: they demonstrated learning problems where there exist certain ERMs that overfit the training data (i.e., exhibit large population risk), while models produced by e.g., Stochastic Gradient Descent (SGD) or regularized empirical risk minimization generalize well. The construction presented by Shalev-Shwartz et al. (2010), however, featured a learning problem with dimension exponential in the number of training