Neural Information Processing Systems

Policy optimization, i.e., the class of algorithms that learn to make sequential decisions by local search directly on the agent's policy, is widely used in reinforcement learning [40, 44, 45].
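As a concrete illustration of "local search on the policy directly," the sketch below implements the simplest such method, a tabular REINFORCE-style policy-gradient loop. The Gym-like `env` interface (`reset`/`step` returning `(state, reward, done)`) and the tabular softmax parameterization are assumptions for illustration, not part of the quoted text.

```python
# Minimal REINFORCE-style sketch: improve the policy by stochastic
# gradient ascent on expected return. `env` is an assumed Gym-like
# environment with integer states and actions.
import numpy as np

def softmax_policy(theta, state):
    """Action probabilities for a tabular softmax policy."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_episode(env, theta, lr=0.01, gamma=0.99):
    """Run one episode and take policy-gradient steps on theta."""
    states, actions, rewards = [], [], []
    s = env.reset()
    done = False
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
    # Monte-Carlo returns, then a local gradient step on the policy itself.
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        p = softmax_policy(theta, states[t])
        grad_log = -p
        grad_log[actions[t]] += 1.0   # grad of log pi(a_t | s_t) for softmax
        theta[states[t]] += lr * G * grad_log
    return theta
```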


Scale-Invariant Fast Convergence in Games

Tsuchiya, Taira, Luo, Haipeng, Ito, Shinji

arXiv.org Machine Learning

Scale invariance has recently emerged as a widely valued property of learning dynamics in games. Yet almost all fast convergence guarantees for learning in games require prior knowledge of the utility scale. To address this, we develop learning dynamics that achieve fast convergence while being both scale-free, requiring no prior information about utilities, and scale-invariant, remaining unchanged under positive rescaling of utilities. For two-player zero-sum games, we obtain scale-free and scale-invariant dynamics with external regret bounded by $\tilde{O}(A_{\mathrm{diff}})$, where $A_{\mathrm{diff}}$ is the payoff range, which implies an $\tilde{O}(A_{\mathrm{diff}} / T)$ convergence rate to Nash equilibrium after $T$ rounds. For multiplayer general-sum games with $n$ players and $m$ actions, we obtain scale-free and scale-invariant dynamics with swap regret bounded by $O(U_{\mathrm{max}} \log T)$, where $U_{\mathrm{max}}$ is the range of the utilities, ignoring the dependence on the number of players and actions. This yields an $O(U_{\mathrm{max}} \log T / T)$ convergence rate to correlated equilibrium. Our learning dynamics are based on optimistic follow-the-regularized-leader with an adaptive learning rate that incorporates the squared path length of the opponents' gradient vectors, together with a new stopping-time analysis that exploits negative terms in regret bounds without scale-dependent tuning. For general-sum games, scale-free learning is also enabled by a technique called doubling clipping, which clips observed gradients based on past observations.
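The sketch below illustrates only the update shape the abstract describes: optimistic FTRL over the simplex with an entropic regularizer and a learning rate driven by the squared path length of observed gradients. The tuning constant, the stopping-time analysis, and the doubling-clipping step are not reproduced; this is a minimal sketch of the mechanism, not the paper's algorithm.

```python
# Sketch of path-length-adaptive optimistic FTRL on the simplex.
# The 1/sqrt(1 + path length) tuning is an assumption for illustration.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class ScaleAdaptiveOFTRL:
    def __init__(self, n_actions):
        self.cum_grad = np.zeros(n_actions)   # sum of past gradients
        self.prev_grad = np.zeros(n_actions)  # one-step optimistic prediction
        self.path_len_sq = 0.0                # sum of ||g_t - g_{t-1}||^2

    def play(self):
        # Optimistic FTRL with entropy regularizer: act on the cumulative
        # gradient plus the predicted next gradient (the last one observed).
        eta = 1.0 / np.sqrt(1.0 + self.path_len_sq)
        return softmax(eta * (self.cum_grad + self.prev_grad))

    def observe(self, grad):
        # Grow the squared path length before overwriting the prediction.
        self.path_len_sq += np.sum((grad - self.prev_grad) ** 2)
        self.cum_grad += grad
        self.prev_grad = grad
```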




15d45097f9806983f0629a77e93ee60f-Paper-Conference.pdf

Neural Information Processing Systems

Indeed, the no-regret framework addresses the fundamental question of how independent and decentralized agents can "learn" with only limited feedback from their environment, and has led to celebrated connections with game-theoretic equilibrium concepts [Hart and Mas-Colell, 2000, Foster and Vohra, 1997].
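For concreteness, here is a minimal sketch of one such no-regret dynamic, regret matching in the spirit of Hart and Mas-Colell [2000]: each agent tracks how much better each fixed action would have done and plays in proportion to positive regret. Only per-round utility feedback is needed; the interface names are illustrative.

```python
# Regret-matching sketch: play actions in proportion to positive regret.
import numpy as np

class RegretMatcher:
    def __init__(self, n_actions):
        self.regret = np.zeros(n_actions)

    def play(self):
        """Mixed strategy proportional to positive regrets (uniform if none)."""
        pos = np.maximum(self.regret, 0.0)
        total = pos.sum()
        if total <= 0.0:
            return np.full(len(self.regret), 1.0 / len(self.regret))
        return pos / total

    def observe(self, utility, played_action):
        # Regret of action a = utility(a) - utility(action actually played).
        self.regret += utility - utility[played_action]
```

When every player runs such a dynamic, the empirical distribution of joint play converges to the set of correlated equilibria, which is the game-theoretic connection the excerpt alludes to.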


Learning from Delayed Feedback in Games via Extra Prediction

Fujimoto, Yuma, Abe, Kenshi, Ariu, Kaito

arXiv.org Artificial Intelligence

This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents learn their strategies independently, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into the algorithms, typically as in Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay degrades the performance of OFTRL in terms of both social regret and convergence. This study then proposes weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times, capturing the intuition that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers the desirable guarantees: the social regret is constant in general-sum normal-form games, and the strategies converge in the last iterate to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
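A hedged sketch of the weighted-optimism idea follows: the update is ordinary OFTRL except that the prediction term (here, the most recently observed, delayed gradient) enters with weight $n$ rather than 1. The learning-rate tuning and the delayed-feedback bookkeeping are simplified assumptions, not the paper's exact construction.

```python
# WOFTRL sketch: OFTRL whose optimistic prediction is counted n times.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class WOFTRL:
    def __init__(self, n_actions, weight, eta=0.1):
        self.cum_grad = np.zeros(n_actions)
        self.last_obs = np.zeros(n_actions)  # latest (delayed) observation
        self.weight = weight                 # optimistic weight n
        self.eta = eta                       # assumed fixed tuning

    def play(self):
        # Weight-n optimism: the prediction term is added n times, which
        # is meant to offset a feedback delay of fewer than n steps.
        return softmax(self.eta * (self.cum_grad + self.weight * self.last_obs))

    def observe(self, delayed_grad):
        # Feedback for an earlier round arrives late; accumulate as usual.
        self.cum_grad += delayed_grad
        self.last_obs = delayed_grad
```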



A Preliminaries on Self-Concordant Barriers

Neural Information Processing Systems

In this section, we provide the necessary background on self-concordant barriers. Let $f$ be a self-concordant function, and let $\omega(s) = s - \log(1 + s)$. Then, the optimization problem (4) has a unique solution.

A.3 Self-Concordant Barriers

Next, we introduce the concept of a self-concordant barrier.
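Since the excerpt refers to self-concordance without stating the definitions, the block below records the standard ones (in the sense of Nesterov and Nemirovski); this is reference material added for context, not text recovered from the paper.

```latex
% A convex, three-times differentiable f on the interior of K is
% self-concordant if, for all x and all directions h,
\[
  \left| \nabla^3 f(x)[h, h, h] \right|
    \le 2 \left( \nabla^2 f(x)[h, h] \right)^{3/2}.
\]
% It is a \nu-self-concordant barrier for K if, in addition,
\[
  \left| \langle \nabla f(x), h \rangle \right|
    \le \sqrt{\nu} \left( \nabla^2 f(x)[h, h] \right)^{1/2},
\]
% and f(x) tends to infinity as x approaches the boundary of K. The
% auxiliary function \omega(s) = s - \log(1 + s) bounds the progress of a
% Newton step, which underlies the unique-solution claim in the excerpt.
```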