
 Sherman, Uri


Better Rates for Random Task Orderings in Continual Linear Models

arXiv.org Machine Learning

We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. For linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, and apply them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from $O((d-r)/k)$ to $O(\min(k^{-1/4}, \sqrt{d-r}/k, \sqrt{Tr}/k))$, where $d$ is the dimensionality and $r$ the average task rank. Furthermore, we establish the first rates for random task orderings without replacement. The obtained rate of $O(\min(T^{-1/4}, (d-r)/T))$ proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a similar $O(k^{-1/4})$ universal rate for the forgetting in continual linear classification on separable data. Our universal rates apply to broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and one-pass orderings.
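A minimal sketch of the setting described above, under illustrative choices of dimension, task rank, and ordering: jointly realizable linear regression tasks are fitted sequentially by projecting the current iterate onto each task's solution set (the block Kaczmarz / POCS view mentioned in the abstract), and forgetting is tracked as the average loss over previously seen tasks. All parameters below are placeholders, not the paper's constructions.

```python
import numpy as np

# Illustrative sketch (not the paper's construction): jointly realizable linear
# regression tasks fitted sequentially by projecting the current iterate onto
# each task's solution set -- the block Kaczmarz / POCS view of continual
# linear regression.
rng = np.random.default_rng(0)
d, T, r = 50, 200, 5                 # dimension, number of tasks, task rank
w_star = rng.normal(size=d)          # shared solution => tasks are jointly realizable

tasks = []
for _ in range(T):
    X = rng.normal(size=(r, d))      # task data matrix of rank r (w.h.p.)
    tasks.append((X, X @ w_star))    # realizable labels y = X w_star

w = np.zeros(d)
order = rng.integers(0, T, size=T)   # random ordering with replacement
for t in order:
    X, y = tasks[t]
    # Fitting task t to convergence from w = Euclidean projection of w onto {v : X v = y}
    w = w + np.linalg.pinv(X) @ (y - X @ w)

seen = np.unique(order)
forgetting = np.mean([np.mean((tasks[t][0] @ w - tasks[t][1]) ** 2) for t in seen])
print(f"average loss over the {seen.size} seen tasks: {forgetting:.2e}")
```

Because all tasks share the realizable solution $w_\star$, each projection step can only decrease the distance to $w_\star$; this is the basic structure that forgetting-rate analyses of this kind exploit.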


Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

arXiv.org Machine Learning

Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
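For orientation, a standard instance of the PMD template (the tabular case with direct parameterization over the simplex) performs, at iteration $k$ and for every state $s$, the mirror step
$$
\pi_{k+1}(\cdot \mid s) \;\in\; \operatorname*{argmax}_{p \in \Delta(\mathcal{A})} \Big\{ \eta \,\big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle \;-\; D_h\big(p,\, \pi_k(\cdot \mid s)\big) \Big\},
$$
where $Q^{\pi_k}$ is the action-value function of the current policy, $\eta$ a step size, and $D_h$ the Bregman divergence of the mirror map $h$ (negative entropy recovers softmax / natural policy gradient style updates). The framework described above replaces the simplex $\Delta(\mathcal{A})$ with a general parametric policy class; the notation here is generic and not taken from the paper.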


Rate-Optimal Policy Optimization for Linear Markov Decision Processes

arXiv.org Artificial Intelligence

Policy Optimization (PO) algorithms are a class of methods in Reinforcement Learning (RL; Sutton and Barto, 2018; Mannor et al., 2022) where the agent's policy is iteratively updated according to the (possibly preconditioned) gradient of the value function w.r.t. the policy parameters.


The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization

arXiv.org Artificial Intelligence

The study of generalization properties of stochastic optimization algorithms has been at the heart of contemporary machine learning research. While in more classical frameworks studies largely focused on the learning problem itself (e.g., Alon et al., 1997; Blumer et al., 1989), in the past decade it has become clear that in modern scenarios the particular algorithm used to learn the model plays a vital role in its generalization performance. As a prominent example, heavily over-parameterized deep neural networks trained by first-order methods output models that generalize well, despite the fact that an arbitrarily chosen Empirical Risk Minimizer (ERM) may perform poorly (Zhang et al., 2017; Neyshabur et al., 2014, 2017). The present paper aims at understanding the generalization behavior of gradient methods, specifically in connection with the problem dimension, in the fundamental Stochastic Convex Optimization (SCO) learning setup; a well-studied theoretical framework widely used to analyze stochastic optimization algorithms. The seminal work of Shalev-Shwartz et al. (2010) was the first to show that uniform convergence, the canonical condition for generalization in statistical learning (e.g., Vapnik, 1971; Bartlett and Mendelson, 2002), may not hold in high-dimensional SCO: they demonstrated learning problems in which certain ERMs overfit the training data (i.e., exhibit large population risk), while models produced by, e.g., Stochastic Gradient Descent (SGD) or regularized empirical risk minimization generalize well. The construction presented by Shalev-Shwartz et al. (2010), however, featured a learning problem with dimension exponential in the number of training examples.


Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

arXiv.org Artificial Intelligence

Reinforcement Learning (RL; Sutton and Barto, 2018; Mannor et al., 2022) studies online decision-making problems in which an agent learns through experience within a dynamic environment, with the goal of minimizing a loss function associated with the agent-environment interaction. Modern applications of RL such as robotics (Schulman et al., 2015; Lillicrap et al., 2015; Akkaya et al., 2019), game playing (Mnih et al., 2013; Silver et al., 2018), and autonomous driving (Kiran et al., 2021) almost invariably consist of large-scale environments where function approximation techniques are necessary to allow the agent to generalize across different states. Furthermore, some form of agent robustness is usually required to cope with environment irregularities that cannot be faithfully represented by stochasticity assumptions (see, e.g., Dulac-Arnold et al., 2021). Theoretical foundations for RL with function approximation (e.g., Jiang et al., 2017; Yang and Wang, 2019; Jin et al., 2020b; Agarwal et al., 2020) have been steadily coming to fruition.


Benign Underfitting of Stochastic Gradient Descent

arXiv.org Artificial Intelligence

We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
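As a point of reference, the classical protocol the abstract refers to is one-pass, without-replacement SGD over $n$ samples of a convex loss, returning the averaged iterate. The sketch below only illustrates that protocol and the $O(1/\sqrt n)$ regime on a simple least-squares instance; step sizes and problem sizes are illustrative, and the paper's $\Omega(1)$ results rely on a dedicated high-dimensional construction, not a problem like this one.

```python
import numpy as np

# Sketch of the protocol referenced above: one-pass, without-replacement SGD on
# n samples of a convex least-squares loss, returning the averaged iterate.
rng = np.random.default_rng(1)
d, n = 20, 1000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

eta = 1.0 / np.sqrt(n)               # illustrative step size
perm = rng.permutation(n)            # a single pass, without replacement
w = np.zeros(d)
w_avg = np.zeros(d)
for i, idx in enumerate(perm, start=1):
    g = (X[idx] @ w - y[idx]) * X[idx]   # gradient of 0.5 * (x.w - y)^2 at w
    w = w - eta * g
    w_avg += (w - w_avg) / i             # running average of the iterates

print(f"empirical risk of the averaged iterate: {0.5 * np.mean((X @ w_avg - y) ** 2):.4f}")
```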


Regret Minimization and Convergence to Equilibria in General-sum Markov Games

arXiv.org Artificial Intelligence

An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.


Optimal Rates for Random Order Online Optimization

arXiv.org Machine Learning

We study online convex optimization in the random order model, recently proposed by \citet{garber2020online}, where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order. Focusing on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex, we give algorithms that achieve the optimal bounds and significantly outperform the results of \citet{garber2020online}, completely removing the dimension dependence and improving their scaling with respect to the strong convexity parameter. Our analysis relies on novel connections between algorithmic stability and generalization for sampling without replacement, analogous to those studied in the with-replacement i.i.d.~setting, as well as on a refined average stability analysis of stochastic gradient descent.
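A sketch of the random-order protocol itself, with plain online gradient descent as a placeholder learner: a fixed batch of smooth losses is chosen up front, revealed in a uniformly random order, and regret is measured against the best fixed decision in hindsight. The losses and step sizes below are illustrative and are not the algorithms analyzed in the paper.

```python
import numpy as np

# Sketch of the random-order model: a fixed batch of smooth losses is chosen in
# advance, then presented to the learner (plain online gradient descent here)
# in a uniformly random order.
rng = np.random.default_rng(2)
T, d = 500, 10
A = rng.normal(size=(T, d))
b = rng.normal(size=T)
loss = lambda w, i: 0.5 * (A[i] @ w - b[i]) ** 2
grad = lambda w, i: (A[i] @ w - b[i]) * A[i]

order = rng.permutation(T)           # uniformly random presentation order
w = np.zeros(d)
cum_loss = 0.0
for t, i in enumerate(order, start=1):
    cum_loss += loss(w, i)
    w = w - (1.0 / t) * grad(w, i)   # illustrative decaying step size

w_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # best fixed decision in hindsight
regret = cum_loss - sum(loss(w_star, i) for i in range(T))
print(f"random-order regret of OGD: {regret:.3f}")
```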


Lazy OCO: Online Convex Optimization on a Switching Budget

arXiv.org Machine Learning

We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d.~losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.
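For intuition, the simplest baseline in this switching-budget regime is blocking: split the $T$ rounds into roughly $S$ blocks, keep the decision fixed within each block, and take a single projected gradient step on the block's accumulated gradient at each boundary, so that at most $S$ switches occur. The sketch below illustrates only this constraint; the gradient oracle `grad_fn` and all parameters are hypothetical placeholders, not the paper's algorithms.

```python
import numpy as np

# "Blocking" baseline for OCO with a switching budget: T rounds are split into
# roughly S blocks, the decision is held fixed inside each block, and a single
# projected gradient step on the block's accumulated gradient is taken at each
# boundary, so at most S switches occur.
def lazy_ogd(grad_fn, T, S, d, eta, radius=1.0):
    w = np.zeros(d)
    block_len = max(T // S, 1)
    g_acc = np.zeros(d)
    plays = []
    for t in range(T):
        plays.append(w.copy())
        g_acc += grad_fn(t, w)
        if (t + 1) % block_len == 0:     # switch only at block boundaries
            w = w - eta * g_acc
            norm = np.linalg.norm(w)
            if norm > radius:            # project back onto the Euclidean ball
                w = w * (radius / norm)
            g_acc = np.zeros(d)
    return plays

# Example usage with fixed linear losses g_t . w (illustrative only).
rng = np.random.default_rng(3)
G = rng.normal(size=(1000, 5))
decisions = lazy_ogd(lambda t, w: G[t], T=1000, S=20, d=5, eta=0.05)
```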