Multi-agent reinforcement learning has received significant interest in recent years notably due to the advancements made in deep reinforcement learning which have allowed for the developments of new architectures and learning algorithms. Using social dilemmas as the training ground, we present a novel learning architecture, Learning through Probing (LTP), where agents utilize a probing mechanism to incorporate how their opponent's behavior changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experiences accounting for changes to the opponents behavior. We introduce a parameter η to determine the significance of these future changes to opponent behavior. When applied to the Iterated Prisoner's Dilemma, LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance in playing against static agents that are present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and societal-like interactions and analyze the training of Q-learning agents in makeshift societies. This is to emphasize how cooperation may emerge in societies and demonstrate this using environments where interactions with opponents are determined through a random encounter format of the iterated prisoner's dilemma.
Learning by observation can be of key importance whenever agents sharing similar features want to learn from each other. This paper presents an agent architecture that enables software agents to learn by direct observation of the actions executed by expert agents while they are performing a task. This is possible because the proposed architecture displays information that is essential for observation, making it possible for software agents to observe each other. The agent architecture supports a learning process that covers all aspects of learning by observation, such as discovering and observing experts, learning from the observed data, applying the acquired knowledge and evaluating the agent's progress. The evaluation provides control over the decision to obtain new knowledge or apply the acquired knowledge to new problems.
Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. This model bottlenecks the view of an agent by a soft, top-down attention mechanism, forcing the agent to focus on task-relevant information by sequentially querying its view of the environment. The output of the attention mechanism allows direct observation of the information used by the agent to select its actions, enabling easier interpretation of this model than of traditional models. We analyze the different strategies the agents learn and show that a handful of strategies arise repeatedly across different games. We also show that the model learns to query separately about space and content ( where'' vs. what'').
We discuss the role of coordination as a direct learning objective in multi-agent reinforcement learning (MARL) domains. To this end, we present a novel means of quantifying coordination in multi-agent systems, and discuss the implications of using such a measure to optimize coordinated agent policies. This concept has important implications for adversary-aware RL, which we take to be a sub-domain of multi-agent learning.
As a classic example of imperfect information games, Heads-Up No-limit Texas Holdem (HUNL), has been studied extensively in recent years. While state-of-the-art approaches based on Nash equilibrium have been successful, they lack the ability to model and exploit opponents effectively. This paper presents an evolutionary approach to discover opponent models based Long Short Term Memory neural networks and on Pattern Recognition Trees. Experimental results showed that poker agents built in this method can adapt to opponents they have never seen in training and exploit weak strategies far more effectively than Slumbot 2017, one of the cutting-edge Nash-equilibrium-based poker agents. In addition, agents evolved through playing against relatively weak rule-based opponents tied statistically with Slumbot in heads-up matches. Thus, the proposed approach is a promising new direction for building high-performance adaptive agents in HUNL and other imperfect information games.