Agents
A Other related works
Let us discuss in passing additional prior works on learning equilibrium solutions in MARL, which have attracted an explosion of interest in recent years. Roughly speaking, previous NE-finding algorithms for two-player zero-sum Markov games can be categorized into model-based algorithms [52, 79, 43], value-based algorithms [4, 5, 73, 54, 31, 15], and policy-based algorithms [10, 22, 71, 82, 14, 81, 11]. In particular, Bai et al. [5], Jin et al. [31] developed the first algorithms to beat the curse of multiple agents in two-player zero-sum MGs, while Jin et al. [31], Daskalakis et al. [23], Mao and Ba sar [44], Song et al. [63] further demonstrated how to accomplish the same goal when learning other computationally tractable solution concepts (e.g., coarse correlated equilibria) in general-sum multi-player Markov games. The recent works Cui and Du [17, 18], Y an et al. [74] studied how to alleviate the sample size scaling with the number of agents in the presence of offline data, with Cui and Du [18] providing a sample-efficient algorithm that also learns NEs in multi-agent Markov games (despite computational intractability). We shall also briefly remark on the prior works that concern RL with a generative model.
Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model Gen Li UPenn Y uejie Chi CMU Y uting Wei UPenn Y uxin Chen UPenn
All prior results suffer from at least one of the two obstacles: the curse of multiple agents and the barrier of long horizon, regardless of the sampling protocol in use. We take a step towards settling this problem, assuming access to a flexible sampling mechanism: the generative model. Focusing on non-stationary finite-horizon Markov games, we develop a fast learning algorithm called Q-FTRL and an adaptive sampling scheme that leverage the optimism principle in online adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method).