mf-marl
Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL
Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) is attractive in applications involving a large population of homogeneous agents, as it exploits the permutation invariance of agents and avoids the curse of many agents. Most existing results focus only on online settings, in which agents can interact with the environment during training. In some applications, such as social welfare optimization, interaction during training can be prohibitive or even unethical in societal systems. To bridge this gap, we propose SAFARI (peSsimistic meAn-Field vAlue iteRatIon), an algorithm for offline MF-MARL that requires only a handful of pre-collected experience data. Theoretically, under a weak coverage assumption that the experience dataset contains enough information about the optimal policy, we prove that for an episodic mean-field MDP with horizon $H$ and $N$ training trajectories, SAFARI attains a sub-optimality gap of $\mathcal{O}(H^2 d_{\rm eff}/\sqrt{N})$, where $d_{\rm eff}$ is the effective dimension of the function class for parameterizing the value function and does not depend on the number of agents.
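The permutation invariance mentioned in the abstract can be illustrated with a minimal sketch. It is not the SAFARI algorithm; it only assumes a finite state space and homogeneous agents, and the helper name `empirical_mean_field` is hypothetical. The point it shows is that the empirical state distribution (the mean field) is unchanged when agents are relabeled, and that its size depends on the number of states rather than the number of agents.

```python
import numpy as np


def empirical_mean_field(agent_states, num_states):
    """Empirical distribution of agent states (hypothetical helper).

    agent_states: 1-D integer array, one state index per agent.
    num_states:   size of the (assumed finite) state space.
    """
    counts = np.bincount(agent_states, minlength=num_states)
    return counts / len(agent_states)


rng = np.random.default_rng(0)
num_states = 5

# 1000 homogeneous agents, each in one of 5 states.
states = rng.integers(0, num_states, size=1000)

# Relabel the agents by shuffling their order.
shuffled = rng.permutation(states)

mf_original = empirical_mean_field(states, num_states)
mf_shuffled = empirical_mean_field(shuffled, num_states)

# Same mean field regardless of agent ordering.
print(np.allclose(mf_original, mf_shuffled))
# The representation has length num_states, not the number of agents.
print(mf_original.shape)
```

A value function defined on this summary takes an input whose size is fixed by the state space, which gives some intuition for how a guarantee can avoid depending on the number of agents.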
Distributed Power Control for Large Energy Harvesting Networks: A Multi-Agent Deep Reinforcement Learning Approach
Sharma, Mohit K., Zappone, Alessio, Assaad, Mohamad, Debbah, Merouane, Vassilaras, Spyridon
In this paper, we develop a multi-agent reinforcement learning (MARL) framework to obtain online power control policies for a large energy harvesting (EH) multiple access channel, when only causal information about the EH process and the wireless channel is available. In the proposed framework, we model the online power control problem as a discrete-time mean-field game (MFG), and leverage deep reinforcement learning to learn the stationary solution of the game in a distributed fashion. We analytically show that the proposed procedure converges to the unique stationary solution of the MFG. Using the proposed framework, the power control policies are learned in a completely distributed fashion. To benchmark the performance of the distributed policies, we also develop deep neural network (DNN) based centralized and distributed online power control schemes. Our simulation results show the efficacy of the proposed power control policies. In particular, the DNN based centralized power control policies perform very well for large EH networks, for which the design of optimal policies is intractable using conventional methods such as Markov decision processes. Further, the throughput of both distributed policies is close to that achieved by the centralized policies.
The work in this paper will appear in part at IEEE ICASSP 2019 [1] and IEEE WiOpt 2019 [2]. This research has been partly supported by the ERC-PoC 727682 CacheMire project.
I. INTRODUCTION
Internet-of-things (IoT) [3] networks connect a large number of low-power sensors whose lifespan is typically limited by the energy that can be stored in their batteries. In this context, the advent of energy harvesting (EH) technology [4] promises to prolong the lifespan of IoT networks by enabling the nodes to operate by harvesting energy from environmental sources, e.g., the sun and the wind.