Schliserman, Matan
Better Rates for Random Task Orderings in Continual Linear Models
Evron, Itay, Levinstein, Ran, Schliserman, Matan, Sherman, Uri, Koren, Tomer, Soudry, Daniel, Srebro, Nathan
We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. For linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, and apply them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from $O((d-r)/k)$ to $O(\min(k^{-1/4}, \sqrt{d-r}/k, \sqrt{Tr}/k))$, where $d$ is the dimensionality and $r$ the average task rank. Furthermore, we establish the first rates for random task orderings without replacement. The obtained rate of $O(\min(T^{-1/4}, (d-r)/T))$ proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a similar $O(k^{-1/4})$ universal rate for the forgetting in continual linear classification on separable data. Our universal rates apply to broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and one-pass orderings.
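A minimal numerical sketch of the projection viewpoint described above: jointly realizable linear regression tasks are fitted sequentially in a random order with replacement, fitting a task amounts to a block Kaczmarz / POCS projection step, and forgetting is measured as the average loss on previously seen tasks. The dimensions, task count, and helper names below are illustrative choices, not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy setup: T jointly realizable linear regression tasks in d dims.
    d, T, r = 50, 30, 3                      # dimension, number of tasks, task rank
    w_star = rng.normal(size=d)              # shared solution making all tasks realizable
    tasks = []
    for _ in range(T):
        X = rng.normal(size=(r, d))          # task data of rank r
        tasks.append((X, X @ w_star))        # targets consistent with w_star

    def fit_task(w, X, y):
        """Fit one task exactly with minimal movement: orthogonal projection of w
        onto {v : X v = y} (a block Kaczmarz / POCS step)."""
        return w + np.linalg.pinv(X) @ (y - X @ w)

    def forgetting(w, seen):
        """Average squared loss on previously seen tasks."""
        return np.mean([np.mean((X @ w - y) ** 2) for X, y in seen])

    w, seen = np.zeros(d), []
    k = 200                                  # iterations, random ordering with replacement
    for _ in range(k):
        X, y = tasks[rng.integers(T)]
        w = fit_task(w, X, y)
        seen.append((X, y))
    print("forgetting after", k, "iterations:", forgetting(w, seen))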
Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization
Schliserman, Matan, Koren, Tomer
We study the problem of learning vector-valued linear predictors: these are prediction rules parameterized by a matrix that maps an $m$-dimensional feature vector to a $k$-dimensional target. We focus on the fundamental case with a convex and Lipschitz loss function, and show several new theoretical results that shed light on the complexity of this problem and its connection to related learning models. First, we give a tight characterization of the sample complexity of Empirical Risk Minimization (ERM) in this setting, establishing that $\smash{\widetilde{\Omega}}(k/\epsilon^2)$ examples are necessary for ERM to reach $\epsilon$ excess (population) risk; this provides an exponential improvement over recent results by Magen and Shamir (2023) in terms of the dependence on the target dimension $k$, and matches a classical upper bound due to Maurer (2016). Second, we present a black-box conversion from general $d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear prediction, showing that any SCO problem can be embedded as a prediction problem with $k=\Theta(d)$ outputs. These results portray the setting of vector-valued linear prediction as bridging between two extensively studied yet disparate learning models: linear models (corresponding to $k=1$) and general $d$-dimensional SCO (with $k=\Theta(d)$).
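For concreteness, a small illustrative sketch of the vector-valued linear prediction setting: a matrix predictor mapping $m$-dimensional features to $k$-dimensional targets, trained with a convex, 1-Lipschitz loss via projected stochastic subgradient descent, one simple route toward an approximate ERM. The loss choice, norm bound, and toy data below are assumptions made for illustration only.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical toy instance: predict a k-dimensional target from an
    # m-dimensional feature vector with a matrix predictor W in R^{k x m}.
    m, k, n = 20, 5, 200
    W_true = rng.normal(size=(k, m))
    X = rng.normal(size=(n, m))
    Y = X @ W_true.T + 0.1 * rng.normal(size=(n, k))

    B = 10.0  # Frobenius-norm bound on the predictor class (illustrative choice)

    def loss_and_grad(W, x, y):
        """Convex, 1-Lipschitz loss ell(Wx, y) = ||Wx - y||_2 and a subgradient in W."""
        res = W @ x - y
        nrm = np.linalg.norm(res)
        g = np.zeros_like(W) if nrm == 0 else np.outer(res / nrm, x)
        return nrm, g

    def project(W, B):
        """Project onto the Frobenius-norm ball of radius B."""
        nrm = np.linalg.norm(W)
        return W if nrm <= B else W * (B / nrm)

    # Projected stochastic subgradient descent on the empirical risk.
    W, eta, steps = np.zeros((k, m)), 0.05, 2000
    for _ in range(steps):
        i = rng.integers(n)
        _, g = loss_and_grad(W, X[i], Y[i])
        W = project(W - eta * g, B)

    emp_risk = np.mean([loss_and_grad(W, X[i], Y[i])[0] for i in range(n)])
    print("empirical risk:", emp_risk)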
The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization
Schliserman, Matan, Sherman, Uri, Koren, Tomer
The study of generalization properties of stochastic optimization algorithms has been at the heart of contemporary machine learning research. While studies in the more classical frameworks largely focused on the learning problem itself (e.g., Alon et al., 1997; Blumer et al., 1989), in the past decade it has become clear that in modern scenarios the particular algorithm used to learn the model plays a vital role in its generalization performance. As a prominent example, heavily over-parameterized deep neural networks trained by first-order methods output models that generalize well, despite the fact that an arbitrarily chosen Empirical Risk Minimizer (ERM) may perform poorly (Zhang et al., 2017; Neyshabur et al., 2014, 2017). The present paper aims at understanding the generalization behavior of gradient methods, specifically in connection with the problem dimension, in the fundamental Stochastic Convex Optimization (SCO) learning setup: a well-studied theoretical framework widely used to analyze stochastic optimization algorithms. The seminal work of Shalev-Shwartz et al. (2010) was the first to show that uniform convergence, the canonical condition for generalization in statistical learning (e.g., Vapnik and Chervonenkis, 1971; Bartlett and Mendelson, 2002), may not hold in high-dimensional SCO: they demonstrated learning problems where there exist certain ERMs that overfit the training data (i.e., exhibit large population risk), while models produced by, e.g., Stochastic Gradient Descent (SGD) or regularized empirical risk minimization generalize well. The construction presented by Shalev-Shwartz et al. (2010), however, featured a learning problem with dimension exponential in the number of training examples.
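To fix ideas, a minimal sketch of the algorithm under study: one-pass projected SGD on a convex, Lipschitz stochastic objective over the unit ball, reporting empirical versus held-out (test) risk. The sample loss and data distribution are illustrative assumptions only and are unrelated to the lower-bound constructions discussed in the paper.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical toy SCO instance: minimize F(w) = E_z[f(w; z)] over the unit ball,
    # with the convex, Lipschitz sample loss f(w; z) = |<w, z> - 1|.
    d, n = 100, 500

    def sample(size):                        # data distribution (illustrative)
        return rng.normal(size=(size, d)) / np.sqrt(d)

    def risk(w, Z):
        return np.mean(np.abs(Z @ w - 1.0))

    def project_unit_ball(w):
        nrm = np.linalg.norm(w)
        return w if nrm <= 1.0 else w / nrm

    train, test = sample(n), sample(10_000)

    # One-pass projected SGD: a single stochastic (sub)gradient step per training example.
    w, eta = np.zeros(d), 1.0 / np.sqrt(n)
    for z in train:
        g = np.sign(z @ w - 1.0) * z         # subgradient of |<w, z> - 1|
        w = project_unit_ball(w - eta * g)

    print("train risk:", risk(w, train), " test risk:", risk(w, test))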
Tight Risk Bounds for Gradient Descent on Separable Data
Schliserman, Matan, Koren, Tomer
We study the generalization properties of unregularized gradient methods applied to separable linear classification -- a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight upper and lower (population) risk bounds for gradient descent in this setting, for any smooth loss function, expressed in terms of its tail decay rate. Our bounds take the form $\Theta(r_{\ell,T}^2 / \gamma^2 T + r_{\ell,T}^2 / \gamma^2 n)$, where $T$ is the number of gradient steps, $n$ is the size of the training set, $\gamma$ is the data margin, and $r_{\ell,T}$ is a complexity term that depends on the tail decay rate of the loss function (and on $T$). Our upper bound matches the best known upper bounds due to Shamir (2021) and Schliserman and Koren (2022), while extending their applicability to virtually any smooth loss function and relaxing the technical assumptions they impose. Our risk lower bounds are the first in this context and establish the tightness of our upper bounds for any given tail decay rate and in all parameter regimes. The proof technique used to show these results is also markedly simpler compared to previous work, and is straightforward to extend to other gradient methods; we illustrate this by providing analogous results for Stochastic Gradient Descent.
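A small illustrative sketch of the setting: unregularized full-batch gradient descent on the (smooth) logistic loss over linearly separable data with margin $\gamma$, tracking the training loss and the normalized margin of the final iterate. The data generation, step size, and horizon below are toy choices for illustration, not the paper's construction.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical toy instance: linearly separable data with margin at least gamma,
    # fit by unregularized full-batch gradient descent on the smooth logistic loss.
    d, n, gamma = 10, 200, 0.2
    u = rng.normal(size=d); u /= np.linalg.norm(u)   # ground-truth separating direction
    X = rng.normal(size=(n, d))
    y = np.sign(X @ u)
    X += gamma * y[:, None] * u                      # enforce margin >= gamma along u

    def logistic_loss_and_grad(w):
        margins = y * (X @ w)
        loss = np.mean(np.logaddexp(0.0, -margins))            # log(1 + exp(-margins))
        p = np.exp(-np.logaddexp(0.0, margins))                 # sigmoid(-margins)
        grad = -np.mean((y * p)[:, None] * X, axis=0)
        return loss, grad

    w, eta, T = np.zeros(d), 0.1, 5000
    for _ in range(T):
        _, grad = logistic_loss_and_grad(w)
        w -= eta * grad

    loss, _ = logistic_loss_and_grad(w)
    print("final training loss:", loss)
    print("normalized margin:", np.min(y * (X @ w)) / np.linalg.norm(w))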