Iwasaki, Atsushi
The Power of Perturbation under Sampling in Solving Extensive-Form Games
Masaka, Wataru, Sakamoto, Mitsuki, Abe, Kenshi, Ariu, Kaito, Sandholm, Tuomas, Iwasaki, Atsushi
This paper investigates how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in imperfect-information extensive-form games. Perturbing the expected payoffs guarantees that the FTRL dynamics reach an approximate equilibrium, and properly adjusting the magnitude of the perturbation leads to a Nash equilibrium (\textit{last-iterate convergence}). This approach is robust even when payoffs are estimated via sampling -- as is necessary for large games -- whereas the optimistic approach often becomes unstable. Building upon these insights, we first develop a general framework for perturbed FTRL algorithms under \textit{sampling}. We then empirically show that, in the last-iterate sense, perturbed FTRL consistently outperforms non-perturbed FTRL. We further identify a divergence function that reduces the variance of the estimates of the perturbed payoffs; with it, our algorithm significantly outperforms prior algorithms on Leduc poker (whose structure is, in a sense, more asymmetric than that of the other benchmark games) and exhibits smooth convergence on all the benchmark games.
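As a rough illustration of the core idea (not the paper's exact algorithm for extensive-form games), the sketch below runs entropy-regularized FTRL, i.e., multiplicative weights, on rock-paper-scissors. Payoff gradients are estimated from a single sampled opponent action, and each gradient is perturbed toward a uniform anchoring strategy; the parameter names `mu` and `eta`, the uniform anchor, and all constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rock-paper-scissors payoff matrix for the row player (zero-sum).
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])

def perturbed_ftrl(A, mu=0.5, eta=0.05, iters=20000):
    """Entropy-regularized FTRL (multiplicative weights) with payoffs
    perturbed toward a uniform anchoring strategy; gradients are
    estimated from a single sampled opponent action (sampling)."""
    n = A.shape[0]
    x = np.ones(n) / n          # row player's strategy
    y = np.ones(n) / n          # column player's strategy
    anchor = np.ones(n) / n     # anchoring (slingshot) strategy
    for _ in range(iters):
        i = rng.choice(n, p=x)  # sample own and opponent actions
        j = rng.choice(n, p=y)
        gx = A[:, j]            # sampled payoff estimate, row player
        gy = -A[i, :]           # sampled payoff estimate, column player
        # Perturbation: pull log-probabilities toward the anchor.
        gx = gx - mu * (np.log(x) - np.log(anchor))
        gy = gy - mu * (np.log(y) - np.log(anchor))
        x = x * np.exp(eta * gx); x = x / x.sum()
        y = y * np.exp(eta * gy); y = y / y.sum()
    return x, y

x, y = perturbed_ftrl(A)
```

On this toy game the last iterate hovers near the uniform equilibrium despite the sampled (noisy) gradients, whereas unperturbed multiplicative weights would cycle.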
Approximate State Abstraction for Markov Games
Ishibashi, Hiroki, Abe, Kenshi, Iwasaki, Atsushi
This paper introduces state abstraction for two-player zero-sum Markov games (TZMGs), where the payoffs to the two players are determined by the state representing the environment and their respective actions, with state transitions following a Markov decision process. For example, in games like soccer, the value of an action changes with the state of play, so such games should be described as Markov games. In TZMGs, computing equilibria becomes harder as the number of states grows. We therefore consider state abstraction, which reduces the number of states by treating multiple distinct states as a single state. There is a substantial body of research on finding optimal policies for Markov decision processes using state abstraction. In the multi-player setting, however, the game with state abstraction may yield equilibrium solutions different from those of the ground game. To evaluate the equilibrium solutions of the abstracted game, we derive bounds on the duality gap, which measures the distance from the equilibrium solutions of the ground game. Finally, we demonstrate our state abstraction on Markov Soccer, compute equilibrium policies, and examine the results.
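As a toy illustration of the kind of guarantee involved (a stateless simplification, not the paper's bound for full TZMGs), the sketch below merges two one-shot zero-sum games with similar payoff matrices into an averaged abstract game, solves the abstract game in closed form, and checks that the duality gap of the abstract equilibrium in each ground game is small relative to the abstraction error `eps`:

```python
import numpy as np

def duality_gap(A, x, y):
    """Distance from equilibrium of (x, y) in the zero-sum game A:
    row player's best-response value minus column player's."""
    return float(np.max(A @ y) - np.min(x @ A))

def solve_2x2(A):
    """Closed-form mixed equilibrium of a 2x2 zero-sum game
    (assumes no pure-strategy saddle point)."""
    (a, b), (c, d) = A
    denom = a - b - c + d
    x = np.array([(d - c) / denom, (a - b) / denom])  # row player
    y = np.array([(d - b) / denom, (a - c) / denom])  # column player
    return x, y

# Two "similar" ground states and their abstraction (the averaged game).
A1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
A2 = np.array([[1.2, -1.0], [-1.0, 0.8]])
A_abs = (A1 + A2) / 2

x, y = solve_2x2(A_abs)          # equilibrium of the abstract game
eps = max(np.abs(A1 - A_abs).max(), np.abs(A2 - A_abs).max())
gaps = [duality_gap(A1, x, y), duality_gap(A2, x, y)]
```

Here the abstract equilibrium incurs a duality gap of at most `2 * eps` in each ground state, mirroring the flavor of the bounds derived in the paper.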
Learning Fair Division from Bandit Feedback
Yamada, Hakuei, Komiyama, Junpei, Abe, Kenshi, Iwasaki, Atsushi
This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities. Departing from conventional online algorithms, the planner here relies on noisy, estimated values obtained only after allocating the items. We introduce wrapper algorithms utilizing \textit{dual averaging}, enabling gradual learning of both the type distribution of arriving items and the agents' values through bandit feedback. This approach enables the algorithms to asymptotically achieve the optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We establish regret bounds on the Nash social welfare and empirically validate the superior performance of our proposed algorithms across synthetic and empirical datasets.
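The greedy rule below is a minimal pacing-style sketch in the spirit of dual averaging, not the paper's algorithm: each arriving item goes to the agent maximizing its noisily estimated value divided by that agent's average utility so far, which is the greedy gradient step for the log-utility (Nash social welfare) objective. The value matrix, noise level, and horizon are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def pace_allocate(values, T):
    """Pacing sketch: allocate each arriving item to the agent with the
    highest estimated value scaled by the inverse of that agent's
    average utility so far (the gradient of sum-of-log utilities)."""
    n = values.shape[0]
    util = np.full(n, 1e-6)        # cumulative utilities (small floor)
    for t in range(1, T + 1):
        j = rng.integers(values.shape[1])                # arriving item type
        noisy = values[:, j] + 0.1 * rng.normal(size=n)  # bandit estimate
        winner = int(np.argmax(noisy / (util / t)))      # pacing rule
        util[winner] += values[winner, j]                # true value accrues
    return util / T

# Two agents, two item types, additive values; each agent favors one type.
V = np.array([[1.0, 0.2],
              [0.2, 1.0]])
avg_util = pace_allocate(V, T=5000)
```

In this symmetric instance the rule learns to route each item type to the agent who values it most while keeping average utilities balanced, which is exactly what the Nash social welfare objective rewards.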
Slingshot Perturbation to Learning in Monotone Games
Abe, Kenshi, Ariu, Kaito, Sakamoto, Mitsuki, Iwasaki, Atsushi
This paper addresses the problem of learning Nash equilibria in {\it monotone games}, where the gradient of the payoff functions is monotone in the strategy profile space and potentially contains additive noise. The optimistic family of learning algorithms, exemplified by optimistic Follow-the-Regularized-Leader and optimistic Mirror Descent, achieves last-iterate convergence in scenarios devoid of noise, leading the dynamics to a Nash equilibrium. A recent emerging trend underscores the promise of the perturbation approach, in which payoff functions are perturbed based on the distance from an anchoring, or {\it slingshot}, strategy. In response, we first establish a unified framework for learning equilibria in monotone games, accommodating both full and noisy feedback. Second, we establish convergence rates toward an approximate equilibrium, irrespective of the presence of noise. Third, we introduce a twist: the slingshot strategy is updated by re-anchoring it to the current strategy at finite intervals. This enables us to identify the exact Nash equilibrium of the underlying game with guaranteed rates. The proposed framework is all-encompassing, integrating existing payoff-perturbed algorithms. Finally, empirical demonstrations affirm that algorithms grounded in this framework exhibit significantly accelerated convergence.
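A minimal sketch of the slingshot idea in a 2x2 zero-sum (hence monotone) game, assuming entropic perturbation and full, noiseless feedback; the step size, perturbation strength, and re-anchoring interval are illustrative, not the paper's prescribed schedule:

```python
import numpy as np

# A 2x2 zero-sum (hence monotone) game whose unique equilibrium
# is x* = y* = (0.4, 0.6).
A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])

def slingshot_ftrl(A, eta=0.05, mu=0.5, inner=500, outer=100):
    """Entropy-regularized FTRL on payoffs perturbed toward a slingshot
    strategy, re-anchoring the slingshot to the current strategy every
    `inner` iterations."""
    n = A.shape[0]
    x = np.ones(n) / n
    y = np.ones(n) / n
    sx, sy = x.copy(), y.copy()          # slingshot strategies
    for _ in range(outer):
        for _ in range(inner):
            gx = A @ y - mu * (np.log(x) - np.log(sx))
            gy = -A.T @ x - mu * (np.log(y) - np.log(sy))
            x = x * np.exp(eta * gx); x = x / x.sum()
            y = y * np.exp(eta * gy); y = y / y.sum()
        sx, sy = x.copy(), y.copy()      # update the slingshot
    return x, y

x, y = slingshot_ftrl(A)
```

With a fixed slingshot the dynamics would stop at a perturbed (approximate) equilibrium; periodically re-anchoring walks that approximate equilibrium toward the exact one, which is the twist the abstract describes.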
Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games
Abe, Kenshi, Ariu, Kaito, Sakamoto, Mitsuki, Toyoshima, Kentaro, Iwasaki, Atsushi
This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings. In the former, players observe the exact gradient vectors of their utility functions; in the latter, they observe only noisy gradient vectors. Even the celebrated Multiplicative Weights Update (MWU) and Optimistic MWU (OMWU) algorithms may fail to converge to a Nash equilibrium with noisy feedback. In contrast, M2WU exhibits last-iterate convergence to a stationary point near a Nash equilibrium in both feedback settings. We then prove that it converges to an exact Nash equilibrium by iteratively adapting the mutation term. We empirically confirm that M2WU outperforms MWU and OMWU in exploitability and convergence rates.
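The toy experiment below contrasts plain MWU with a replicator-mutator-style variant on matching pennies under noisy gradients; mixing each iterate with the uniform distribution is a simple stand-in for a mutation term, not the paper's exact M2WU update, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Matching pennies: the unique equilibrium is uniform for both players.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def run(mutation, eta=0.1, iters=10000, noise=0.1):
    """MWU under noisy feedback, optionally mixed with the uniform
    distribution each step (a replicator-mutator-style mutation)."""
    x = np.array([0.6, 0.4])
    y = np.array([0.6, 0.4])
    u = np.full(2, 0.5)                       # mutation target
    for _ in range(iters):
        gx = A @ y + noise * rng.normal(size=2)
        gy = -A.T @ x + noise * rng.normal(size=2)
        x = x * np.exp(eta * gx); x = x / x.sum()
        y = y * np.exp(eta * gy); y = y / y.sum()
        if mutation > 0:
            x = (1 - mutation) * x + mutation * u
            y = (1 - mutation) * y + mutation * u
    return x, y

x_mwu, _ = run(mutation=0.0)
x_mut, _ = run(mutation=0.1)
dist_mwu = float(np.abs(x_mwu - 0.5).max())
dist_mut = float(np.abs(x_mut - 0.5).max())
```

On this instance the mutated dynamics keep the last iterate near the uniform equilibrium, while the plain MWU last iterate drifts away from it, illustrating the stabilizing role of the mutation term.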
Mutation-Driven Follow the Regularized Leader for Last-Iterate Convergence in Zero-Sum Games
Abe, Kenshi, Sakamoto, Mitsuki, Iwasaki, Atsushi
In this study, we consider a variant of the Follow the Regularized Leader (FTRL) dynamics in two-player zero-sum games. FTRL is guaranteed to converge to a Nash equilibrium when the strategies are time-averaged, while many variants suffer from limit cycling behavior, i.e., they lack a last-iterate convergence guarantee. To address this, we propose mutant FTRL (M-FTRL), an algorithm that introduces mutation as a perturbation of the action probabilities. We then investigate the continuous-time dynamics of M-FTRL and provide strong convergence guarantees toward stationary points that approximate Nash equilibria under full-information feedback. Furthermore, our simulations demonstrate that M-FTRL enjoys faster convergence rates than FTRL and optimistic FTRL under full-information feedback and, surprisingly, exhibits clear convergence under bandit feedback.
Anytime Capacity Expansion in Medical Residency Match by Monte Carlo Tree Search
Abe, Kenshi, Komiyama, Junpei, Iwasaki, Atsushi
This paper considers the capacity expansion problem in two-sided matching, where the policymaker may allocate extra seats in addition to the standard ones. In medical residency match, each hospital accepts a limited number of doctors. Such capacity constraints are typically given in advance. However, these exogenous constraints can compromise the welfare of the doctors: some popular hospitals inevitably dismiss some of their favorite doctors. Meanwhile, hospitals often benefit from accepting a few extra doctors. To tackle the problem, we propose an anytime method in which an upper confidence tree searches the space of capacity expansions, for each of which the deferred acceptance method finds a resident-optimal stable assignment. Constructing a good search-tree representation significantly boosts the performance of the proposed method. Our simulations show that the proposed method identifies an almost optimal capacity expansion with a significantly smaller computational budget than exact methods based on mixed-integer programming.
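The deferred acceptance subroutine that such a search would invoke for each candidate expansion can be sketched as follows (the tree-search layer itself is omitted). In the toy instance, the preference lists and hospital names are invented for illustration; adding one seat at the popular hospital matches an otherwise-unassigned doctor.

```python
def deferred_acceptance(doc_prefs, hosp_prefs, capacity):
    """Doctor-proposing deferred acceptance with hospital capacities.
    doc_prefs[d]  : hospitals in order of doctor d's preference
    hosp_prefs[h] : doctors in order of hospital h's preference
    capacity[h]   : number of seats at hospital h (expandable)."""
    rank = {h: {d: i for i, d in enumerate(p)} for h, p in hosp_prefs.items()}
    next_choice = {d: 0 for d in doc_prefs}
    matched = {h: [] for h in hosp_prefs}
    free = list(doc_prefs)
    while free:
        d = free.pop()
        if next_choice[d] >= len(doc_prefs[d]):
            continue                       # d has exhausted their list
        h = doc_prefs[d][next_choice[d]]   # propose to next hospital
        next_choice[d] += 1
        matched[h].append(d)
        matched[h].sort(key=lambda x: rank[h][x])
        if len(matched[h]) > capacity[h]:
            free.append(matched[h].pop())  # reject the least preferred
    return matched

doc_prefs = {"d1": ["h1", "h2"], "d2": ["h1", "h2"], "d3": ["h1", "h2"]}
hosp_prefs = {"h1": ["d1", "d2", "d3"], "h2": ["d3", "d2", "d1"]}
base = deferred_acceptance(doc_prefs, hosp_prefs, {"h1": 1, "h2": 1})
expanded = deferred_acceptance(doc_prefs, hosp_prefs, {"h1": 2, "h2": 1})
```

With base capacities, `d2` remains unmatched; expanding `h1` by one seat matches all three doctors, which is the kind of welfare gain the search evaluates for each candidate expansion.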
Approximately Stable Matchings With Budget Constraints
Kawase, Yasushi (Tokyo Institute of Technology) | Iwasaki, Atsushi (RIKEN AIP Center)
This paper examines two-sided matching with budget constraints where one side (a firm or hospital) can make monetary transfers (offer wages) to the other (a worker or doctor). In a standard model, while multiple doctors can be matched to a single hospital, a hospital has a maximum quota; thus, the number of doctors assigned to a hospital cannot exceed a certain limit. In our model, in contrast, a hospital has a fixed budget; that is, the total amount of wages each hospital pays its doctors is constrained. With budget constraints, stable matchings may fail to exist, and checking whether one exists is computationally hard. To deal with this nonexistence, we extend the "matching with contracts" model of Hatfield and Milgrom to approximately stable matchings, in which each hospital's utility after a deviation can increase by at most a certain factor. We then propose two novel mechanisms that efficiently return an approximately stable matching that exactly satisfies the budget constraints. Specifically, by sacrificing strategy-proofness, our first mechanism achieves the best possible bound. We also explore a special case in which a simple mechanism is strategy-proof for doctors while maintaining the best possible bound of the general case.
Achieving Sustainable Cooperation in Generalized Prisoner's Dilemma with Observation Errors
Shigenaka, Fuuki (Kyushu University) | Sekiguchi, Tadashi (Kyoto University) | Iwasaki, Atsushi (University of Electro-Communications) | Yokoo, Makoto (Kyushu University)
A repeated game is a formal model for analyzing cooperation in long-term relationships, e.g., in the prisoner's dilemma. Although the case where each player observes her opponent's action with some observation error (imperfect private monitoring) is difficult to analyze, a special type of equilibrium called a belief-free equilibrium has been identified that makes the analysis under private monitoring tractable. However, existing works using belief-free equilibria show that cooperative relations can be sustained only in ideal situations. We deal with a generic problem that can model both the prisoner's dilemma and the team production problem. We examine a situation with an additional action that is dominated by another action. To our surprise, by adding this seemingly irrelevant action, players can achieve sustainable cooperative relations far beyond the ideal situations. More specifically, we identify a class of strategies, called one-shot punishment strategies, that can constitute a belief-free equilibrium over a wide range of parameters. Moreover, in the two-player case, the obtained welfare matches a theoretical upper bound.
How Is Cooperation/Collusion Sustained in Repeated Multimarket Contact with Observation Errors?
Iwasaki, Atsushi (University of Electro-Communications) | Sekiguchi, Tadashi (Kyoto University) | Yamamoto, Shun (Kyushu University) | Yokoo, Makoto (Kyushu University)
This paper analyzes repeated multimarket contact with observation errors, where two players operate in multiple markets simultaneously. Multimarket contact has received much attention in the literature of economics, management, and information systems. Despite vast empirical studies examining whether multimarket contact fosters cooperation/collusion, little is theoretically known about how players behave in equilibrium when each player receives a noisy observation of the other firms' actions. This paper tackles an essentially realistic situation where the players do not share common information; each player may observe a different signal (private monitoring). Thus, players have difficulty reaching a common understanding about the market in which their opponent should be punished and when punishment should start and end. We first theoretically show that an extension of one-period mutual punishment (1MP) to an arbitrary number of markets can be an equilibrium. Second, by applying a verification method, we identify a simple equilibrium strategy called "locally cautioning" (LC) that restores collusion after an observation error or a deviation. We then numerically reveal that LC significantly outperforms 1MP and achieves the highest degree of collusion.