Multi-agent reinforcement learning has received significant interest in recent years notably due to the advancements made in deep reinforcement learning which have allowed for the developments of new architectures and learning algorithms. Using social dilemmas as the training ground, we present a novel learning architecture, Learning through Probing (LTP), where agents utilize a probing mechanism to incorporate how their opponent's behavior changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experiences accounting for changes to the opponents behavior. We introduce a parameter η to determine the significance of these future changes to opponent behavior. When applied to the Iterated Prisoner's Dilemma, LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance in playing against static agents that are present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and societal-like interactions and analyze the training of Q-learning agents in makeshift societies. This is to emphasize how cooperation may emerge in societies and demonstrate this using environments where interactions with opponents are determined through a random encounter format of the iterated prisoner's dilemma.
Many scenarios involve a tension between individual interest and the interests of others. Such situations are called social dilemmas. Because of their ubiquity in economic and social interactions constructing agents that can solve social dilemmas is of prime importance to researchers interested in multi-agent systems. We discuss why social dilemmas are particularly difficult, propose a way to measure the 'success' of a strategy, and review recent work on using deep reinforcement learning to construct agents that can do well in both perfect and imperfect information bilateral social dilemmas.
Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one's behavior solely on outcomes (ie. one's past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies using deep reinforcement learning techniques and demonstrate, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.
Santos, Fernando P. (INESC-ID and Instituto Superior Técnico, Universidade de Lisboa) | Pacheco, Jorge M. (Centro de Biologia Molecular e Ambiental and Universidade do Minho) | Santos, Francisco C. (INESC-ID and Instituto Superior Técnico, Universidade de Lisboa)
Social norms regulate actions in artificial societies, steering collective behavior towards desirable states. In real societies, social norms can solve cooperation dilemmas, constituting a key ingredient in systems of indirect reciprocity: reputations of agents are assigned following social norms that identify their actions as good or bad. This, in turn, implies that agents can discriminate between the different actions of others and that the behaviors of each agent are known to the population at large. This is only possible if the agents report their interactions. Reporting constitutes, this way, a fundamental ingredient of indirect reciprocity, as in its absence cooperation in a multiagent system may collapse. Yet, in most studies to date, reporting is assumed to be cost-free, which collides with many life situations, where reporting can easily incur a cost (costly reputation building). Here we develop a new model of indirect reciprocity that allows reputation building to be costly. We show that only two norms can sustain cooperation under costly reputation building, a feature that requires agents to be able to anticipate the reporting intentions of their opponents, depending sensitively on both the cost of reporting and the accuracy level of reporting anticipation.
In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule for automatically learning how to create right incentives by considering the players' anticipated parameter updates. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even if the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. However, even in the latter case, the amount of necessary additional incentives decreases over time.