Social dilemma situations bring out the conflict between individual and group rationality. When individuals act rationally in such situations, the group suffers sub-optimal outcomes. The Iterative Prisoner's Dilemma (IPD) is a two-player game that offers a theoretical framework to model and study such social situations. In the Prisoner's Dilemma, individualistic behavior leads to mutual defection and sub-optimal outcomes. This result is in contrast to what one observes in human groups, where humans often sacrifice individualistic behavior for the good of the collective. It is interesting to study how and why such cooperative and individually irrational behavior emerges in human groups. To this end, recent work models this problem by treating each player as a Deep Reinforcement Learning (RL) agent and evolves cooperative behavioral policies through internal information or reward sharing mechanisms. We propose an approach to evolve cooperative behavior between RL agents playing the IPD game without sharing rewards, internal details (weights, gradients), or a communication channel. We introduce a Status-Quo loss (SQLoss) that incentivizes cooperative behavior by encouraging policy stationarity. We also describe an approach to transform a two-player game (with visual inputs) into its IPD formulation through self-supervised skill discovery (IPDistill).We show how our approach outperforms existing approaches in the Iterative Prisoner's Dilemma and the two-player Coin game.
Recent developments in deep reinforcement learning are concerned with creating decision-making agents which can perform well in various complex domains. A particular approach which has received increasing attention is multi-agent reinforcement learning, in which multiple agents learn concurrently to coordinate their actions. In such multi-agent environments, additional learning problems arise due to the continually changing decision-making policies of agents. This paper surveys recent works that address the non-stationarity problem in multi-agent deep reinforcement learning. The surveyed methods range from modifications in the training procedure, such as centralized training, to learning representations of the opponent's policy, meta-learning, communication, and decentralized learning. The survey concludes with a list of open problems and possible lines of future research.
Multi-agent reinforcement learning has received significant interest in recent years notably due to the advancements made in deep reinforcement learning which have allowed for the developments of new architectures and learning algorithms. Using social dilemmas as the training ground, we present a novel learning architecture, Learning through Probing (LTP), where agents utilize a probing mechanism to incorporate how their opponent's behavior changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experiences accounting for changes to the opponents behavior. We introduce a parameter η to determine the significance of these future changes to opponent behavior. When applied to the Iterated Prisoner's Dilemma, LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance in playing against static agents that are present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and societal-like interactions and analyze the training of Q-learning agents in makeshift societies. This is to emphasize how cooperation may emerge in societies and demonstrate this using environments where interactions with opponents are determined through a random encounter format of the iterated prisoner's dilemma.
In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule for automatically learning how to create right incentives by considering the players' anticipated parameter updates. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. However, even in the latter case, the amount of necessary additional incentives decreases over time.
Deep reinforcement learning (DRL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. In this context, first, this article provides a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Second, it provides guidelines to complement this emerging area by (i) showcasing examples on how methods and algorithms from DRL and multiagent learning (MAL) have helped solve problems in MDRL and (ii) providing general lessons learned from these works. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.