Lee, Chung-Wei
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Cai, Yang, Farina, Gabriele, Grand-Clément, Julien, Kroer, Christian, Lee, Chung-Wei, Luo, Haipeng, Zheng, Weiqiang
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages including logarithmic dependence on the size of the payoff matrix and $\widetilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This begs the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $\delta>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/\delta$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
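To make the two update rules being compared concrete, here is a minimal NumPy sketch (not the paper's construction or experiments) of OMWU self-play on a small zero-sum matrix game, tracking the last-iterate duality gap; the payoff matrix, learning rate, and horizon are arbitrary illustrative choices, and a closing comment notes how OGDA differs.

```python
# Minimal illustrative sketch: OMWU self-play on a 2x2 zero-sum game, tracking
# the duality gap of the last iterate. All constants below are arbitrary.
import numpy as np

A = np.array([[0.0, 1.0],
              [0.99, 0.0]])            # x minimizes x^T A y, y maximizes it
eta, T = 0.1, 10_000

def duality_gap(x, y):
    # exploitability of the strategy pair (x, y)
    return float(np.max(A.T @ x) - np.min(A @ y))

x_hat = np.ones(2) / 2                  # "slow" (hat) sequences of optimistic OMD
y_hat = np.ones(2) / 2
gx_prev, gy_prev = np.zeros(2), np.zeros(2)   # last observed gradients (optimism term)
for t in range(T):
    # optimistic step: respond to the previous gradient used as a prediction
    x = x_hat * np.exp(-eta * gx_prev); x /= x.sum()
    y = y_hat * np.exp(+eta * gy_prev); y /= y.sum()
    gx, gy = A @ y, A.T @ x             # gradients at the played iterates
    # update the hat sequences with the realized gradients
    x_hat = x_hat * np.exp(-eta * gx); x_hat /= x_hat.sum()
    y_hat = y_hat * np.exp(+eta * gy); y_hat /= y_hat.sum()
    gx_prev, gy_prev = gx, gy

print("OMWU last-iterate duality gap:", duality_gap(x, y))
# OGDA replaces the exponentiated steps above with Euclidean gradient steps
# projected back onto the simplex; the paper contrasts their last-iterate rates.
```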
Context-lumpable stochastic bandits
Lee, Chung-Wei, Liu, Qinghua, Abbasi-Yadkori, Yasin, Jin, Chi, Lattimore, Tor, Szepesvári, Csaba
We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples with high probability and provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
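To make the lumpability assumption concrete, a small synthetic instance can be generated as below (illustrative only; the sizes, sampler, and variable names are assumptions, not details from the paper): contexts in the same group share one mean-reward vector, so the $S\times K$ mean-reward matrix has at most $r$ distinct rows.

```python
# Illustrative only: a synthetic context-lumpable bandit instance in which the
# S contexts fall into r hidden groups with identical mean rewards within a group.
import numpy as np

rng = np.random.default_rng(0)
S, K, r = 20, 5, 3                       # arbitrary sizes with r <= min(S, K)
group_of = rng.integers(r, size=S)       # hidden assignment of contexts to groups
group_means = rng.uniform(size=(r, K))   # one mean-reward vector per group
mu = group_means[group_of]               # S x K mean-reward matrix, at most r distinct rows

def pull(context: int, action: int) -> float:
    """Bernoulli reward with mean mu[context, action]."""
    return float(rng.random() < mu[context, action])

# An epsilon-optimal policy only needs a near-best action per group; the result
# above says roughly r(S+K)/eps^2 samples suffice to find one with high probability.
```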
Regret Matching+: (In)Stability and Fast Convergence in Games
Farina, Gabriele, Grand-Clément, Julien, Kroer, Christian, Lee, Chung-Wei, Luo, Haipeng
Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chopping off the positive orthant that RM+ works in. We show that these fixes are sufficient to get $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM+-based algorithms in matrix and extensive-form games.
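For readers unfamiliar with the update, the following is a minimal sketch of vanilla RM+ self-play on a zero-sum matrix game (illustrative only; the game and horizon are arbitrary, and the paper's restarting and chopping fixes are not implemented here).

```python
# Minimal sketch of vanilla RM+ self-play on rock-paper-scissors (illustrative only).
import numpy as np

A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])      # row player's payoffs
T = 10_000

def strategy(Q):
    s = Q.sum()
    return Q / s if s > 0 else np.ones(len(Q)) / len(Q)

Qx, Qy = np.zeros(3), np.zeros(3)        # nonnegative cumulative-regret vectors
for t in range(T):
    x, y = strategy(Qx), strategy(Qy)
    ux, uy = A @ y, -(A.T @ x)           # per-action utilities for each player
    # RM+ update: add instantaneous regrets, then threshold at zero (the "+")
    Qx = np.maximum(Qx + (ux - x @ ux), 0.0)
    Qy = np.maximum(Qy + (uy - y @ uy), 0.0)
# The predictive variant feeds a guess of the next utility vector into the same
# update; the paper's fixes (restarting, chopping) act directly on Qx and Qy.
```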
Practical Knowledge Distillation: Using DNNs to Beat DNNs
Lee, Chung-Wei, Apostolopoulos, Pavlos Athanasios, Markov, Igor L.
For tabular data sets, we explore data and model distillation, as well as data denoising. These techniques improve both gradient-boosting models and a specialized DNN architecture. While gradient boosting is known to outperform DNNs on tabular data, we close the gap for datasets with 100K+ rows and give DNNs an advantage on small data sets. We extend these results with input-data distillation and optimized ensembling to help DNN performance match or exceed that of gradient boosting. As a theoretical justification of our practical method, we prove its equivalence to classical cross-entropy knowledge distillation. We also qualitatively explain the superiority of DNN ensembles over XGBoost on small data sets. For an industry end-to-end real-time ML platform with 4M production inferences per second, we develop a model-training workflow based on data sampling that distills ensembles of models into a single gradient-boosting model favored for high-performance real-time inference, without performance loss. Empirical evaluation shows that the proposed combination of methods consistently improves model accuracy over prior best models across several production applications deployed worldwide.
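For reference, the classical cross-entropy knowledge distillation that the abstract relates the method to can be sketched as follows; the temperature, mixing weight, and function names are illustrative assumptions rather than details from the paper.

```python
# Illustrative sketch of classical (soft-label) cross-entropy distillation for one
# example, in plain NumPy; temperature T and weight alpha are arbitrary choices.
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                       # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    # soft term: cross-entropy between the teacher's and student's softened outputs
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum()
    # hard term: usual cross-entropy against the ground-truth label
    hard = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft + (1.0 - alpha) * hard
```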
Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses
Luo, Haipeng, Wei, Chen-Yu, Lee, Chung-Wei
Policy optimization methods are among the most widely used methods in reinforcement learning. Their empirical success has been demonstrated in various domains such as computer games [Schulman et al., 2017] and robotics [Levine and Koltun, 2013]. However, due to their local-search nature, global optimality guarantees of policy optimization often rely on unrealistic assumptions to ensure global exploration (see e.g., [Abbasi-Yadkori et al., 2019, Agarwal et al., 2020b, Neu and Olkhovskaya, 2020, Wei et al., 2021]), making them theoretically less appealing compared to other methods. Motivated by this issue, a line of recent works [Cai et al., 2020, Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021] equips policy optimization with global exploration by adding exploration bonuses to the update, and proves favorable guarantees even without extra exploratory assumptions. Moreover, they all demonstrate some robustness aspect of policy optimization (such as being able to handle adversarial losses or a certain degree of model misspecification). Despite this important progress, however, many limitations remain, including worse regret rates compared to the best value-based or model-based approaches [Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021], or requiring full-information feedback on the entire loss function (as opposed to the more realistic bandit feedback) [Cai et al., 2020]. To address these issues, in this work we propose a new type of exploration bonus called the dilated bonus, which satisfies a certain dilated Bellman equation and provably leads to improved exploration compared to existing works (Section 3). We apply this general idea to advance the state of the art of policy optimization for learning finite-horizon episodic MDPs with adversarial losses and bandit feedback. More specifically, our main results are: - First, in the tabular setting, addressing the main open question left in [Shani et al., 2020], we improve their Õ(T
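Schematically (notation simplified and possibly differing from the paper's exact statement), the dilated bonus $B$ obeys a Bellman-type recursion whose dilation factor is slightly larger than one, and the policy update subtracts it from the loss estimate:
$$B(s,a) = b(s,a) + \Big(1+\tfrac{1}{H}\Big)\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\, a'\sim \pi(\cdot\mid s')}\big[B(s',a')\big], \qquad \pi_{t+1}(a\mid s) \;\propto\; \pi_t(a\mid s)\,\exp\!\Big(-\eta\big(\widehat{Q}_t(s,a)-B_t(s,a)\big)\Big),$$
where $b$ is a base bonus, $H$ the horizon, and $\widehat{Q}_t$ a loss estimate; see the paper for the precise definitions.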
Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games
Wei, Chen-Yu, Lee, Chung-Wei, Zhang, Mengxiao, Luo, Haipeng
We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent's best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.
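A heavily simplified sketch of the structure described above (illustrative only: the step sizes, the projection-based OGDA variant, and the critic schedule here are assumptions, not the paper's exact algorithm):

```python
# Illustrative sketch: per-state optimistic gradient descent/ascent on the induced
# matrix games, with a critic V updated slowly toward each state's current value.
# Sizes, step sizes, and the discount factor are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
R = rng.uniform(-1, 1, size=(nS, nA, nA))            # reward r(s, a, b) for the max player
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))    # transitions P(. | s, a, b)

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sorting-based)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

V = np.zeros(nS)
x_hat = np.ones((nS, nA)) / nA
y_hat = np.ones((nS, nA)) / nA
x, y = x_hat.copy(), y_hat.copy()
eta, alpha, T = 0.05, 0.01, 5000
for t in range(T):
    for s in range(nS):
        Q = R[s] + gamma * P[s] @ V                   # per-state matrix game
        gx, gy = Q @ y[s], Q.T @ x[s]                 # gradients at the previous iterates
        x[s] = project_simplex(x_hat[s] + eta * gx)   # x ascends, y descends (optimistic step)
        y[s] = project_simplex(y_hat[s] - eta * gy)
        x_hat[s] = project_simplex(x_hat[s] + eta * (Q @ y[s]))
        y_hat[s] = project_simplex(y_hat[s] - eta * (Q.T @ x[s]))
        V[s] = (1 - alpha) * V[s] + alpha * float(x[s] @ Q @ y[s])  # slowly-moving critic
```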
Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs
Lee, Chung-Wei, Luo, Haipeng, Wei, Chen-Yu, Zhang, Mengxiao
We develop a new approach to obtaining high-probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality. Besides its simplicity, our approach enjoys several advantages. First, the obtained high-probability regret bounds are data-dependent and can be much smaller than the worst-case bounds, which resolves an open problem raised by Neu (2015). Second, resolving another open problem of Bartlett et al. (2008) and Abernethy and Rakhlin (2009), our approach leads to the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, while previous methods are either inefficient or only applicable to specific action sets. Finally, our approach can also be applied to learning adversarial Markov Decision Processes and provides the first algorithm with a high-probability small-loss bound for this problem.
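For context, one common form of the classical Freedman inequality (the paper proves and uses a strengthened variant; see there for the exact form) states that for a martingale difference sequence $X_1,\dots,X_T$ adapted to a filtration $(\mathcal{F}_t)$ with $X_t \le b$ almost surely, any fixed $\lambda \in (0, 1/b]$, and any $\delta \in (0,1)$, with probability at least $1-\delta$,
$$\sum_{t=1}^{T} X_t \;\le\; \lambda \sum_{t=1}^{T} \mathbb{E}\big[X_t^2 \mid \mathcal{F}_{t-1}\big] + \frac{\ln(1/\delta)}{\lambda}.$$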
Linear Last-iterate Convergence in Constrained Saddle-point Optimization
Wei, Chen-Yu, Lee, Chung-Wei, Zhang, Mengxiao, Luo, Haipeng
Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood -- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate, which improves the result of (Daskalakis & Panageas, 2019) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory.
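For reference, one common formulation of the projected OGDA iteration analyzed in constrained settings (notation simplified here) maintains two sequences $z_t, \widehat{z}_t \in \mathcal{Z}$ and, writing $F(z) = (\nabla_x f(x,y), -\nabla_y f(x,y))$ for $z=(x,y)$, updates
$$\widehat{z}_{t+1} = \Pi_{\mathcal{Z}}\big[\widehat{z}_t - \eta F(z_t)\big], \qquad z_{t+1} = \Pi_{\mathcal{Z}}\big[\widehat{z}_{t+1} - \eta F(z_t)\big],$$
where $\Pi_{\mathcal{Z}}$ is the Euclidean projection onto the feasible set; OMWU replaces the projection step with a multiplicative-weights (entropy-regularized) update on the simplex.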
A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free
Chen, Yifang, Lee, Chung-Wei, Luo, Haipeng, Wei, Chen-Yu
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches, and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.
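As a rough schematic of the replay idea only (this is not the paper's actual test statistic or scheduling), a replay phase re-runs decisions from an earlier interval and flags non-stationarity when the replayed performance drops noticeably:

```python
# Schematic only: flag a possible distribution change when re-running an earlier
# interval's decisions yields noticeably lower reward than it did the first time.
# The threshold and interface are illustrative assumptions, not from the paper.
import numpy as np

def replay_detects_change(old_rewards, replayed_rewards, threshold=0.1):
    """Return True if the replayed mean reward falls well below the original one."""
    return float(np.mean(old_rewards)) - float(np.mean(replayed_rewards)) > threshold
```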