Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical RL, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes an additional term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents can successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We also apply LOLA to a grid-world task with an embedded social dilemma using deep recurrent policies and opponent modelling. Again, by explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest.
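The shaping term can be illustrated on a toy two-player differentiable game. The sketch below is a hypothetical scalar example (the payoff functions V1, V2 and their hand-coded derivatives are illustrative assumptions, not the actual IPD objectives): a LOLA step equals the naive gradient plus a correction that differentiates through the opponent's anticipated update.

```python
# Minimal sketch of a LOLA-style update on a toy differentiable game.
# Payoffs (assumed for illustration): V1(t1, t2) = -t1**2 + t1*t2 and
# V2(t1, t2) = -t2**2 + t1*t2, so all derivatives below are hand-coded.

def naive_grad(th1, th2):
    # dV1/dth1: what an agent unaware of opponent learning would follow
    return -2.0 * th1 + th2

def lola_grad(th1, th2, eta=0.1):
    """Naive gradient plus the first-order shaping term
    eta * (dV1/dth2) * d/dth1 (dV2/dth2), where eta is the
    opponent's assumed learning rate."""
    dV1_dth2 = th1          # from V1 = -th1**2 + th1*th2
    d2V2_dth1dth2 = 1.0     # from dV2/dth2 = -2*th2 + th1
    return naive_grad(th1, th2) + eta * dV1_dth2 * d2V2_dth1dth2
```

The extra term is exactly the gradient of `V1(th1, th2 + eta * dV2/dth2)` with respect to `th1`, expanded to first order in `eta`: agent 1 ascends its payoff while accounting for the step its opponent is about to take.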
In poker, players tend to play sub-optimally due to the uncertainty in the game. Payoffs can be maximized by exploiting these sub-optimal tendencies. One way of realizing this is to acquire the opponent's strategy by recognizing the key patterns in its style of play. Existing studies on opponent modeling in poker aim at predicting the opponent's future actions or estimating the opponent's hand. In this study, we propose a machine learning method for acquiring the opponent's behavior for the purpose of predicting the opponent's future actions. We derive a number of features to be used in modeling the opponent's strategy, and propose an ensemble learning method for generalizing the model. The proposed approach is tested on a set of test scenarios and shown to be effective.
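As a rough illustration of the ensemble idea, the sketch below combines a few hand-written predictors of the opponent's next action by majority vote. The feature names and the rules themselves are invented for illustration; they are not the features derived in the study.

```python
# Hypothetical majority-vote ensemble for predicting the opponent's
# next poker action ("fold" / "call" / "raise"). Each rule is a weak,
# hand-crafted predictor over invented opponent-style features.
from collections import Counter

def aggression_rule(features):
    # a frequently raising player is predicted to raise again
    return "raise" if features["raise_freq"] > 0.5 else "call"

def pot_odds_rule(features):
    # facing a large bet relative to the pot, a tight player folds
    if features["bet_to_pot"] > 1.0 and features["tightness"] > 0.6:
        return "fold"
    return "call"

def recency_rule(features):
    # simply repeat the opponent's most recent action
    return features["last_action"]

def predict_action(features, rules=(aggression_rule, pot_odds_rule, recency_rule)):
    """Return the action that the most rules vote for."""
    votes = Counter(rule(features) for rule in rules)
    return votes.most_common(1)[0][0]
```

A real ensemble would of course train its base models from hand histories rather than hard-code them; the voting step is the part being illustrated.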
An MDP consists of the set of states S, the set of actions A, the transition function P(s′ | s, a), which gives the probability of the next state s′ given the current state s and the action a, and the reward function r(s, a, s′), which returns a scalar value conditioned on two consecutive states and the intermediate action. A policy function is used to choose an action given a state and can be stochastic, a ∼ π(a | s), or deterministic, a = µ(s). Given a policy π, the state value function is defined as V(s_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t] and the state-action value (Q-value) as Q(s_t, a_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t, a = a_t], where 0 ≤ γ ≤ 1 is the discount factor and H is the finite horizon of the episode. The goal of RL is to compute the policy that maximizes the state value function V when the transition and reward functions are unknown. There is a large number of RL algorithms; however, in this work we focus on two actor-critic algorithms: the synchronous Advantage Actor-Critic (A2C) [Mnih et al., 2016, Dhariwal et al., 2017] and the Deep Deterministic Policy Gradient (DDPG) [Silver et al., 2014, Lillicrap et al., 2015]. DDPG is an off-policy algorithm that uses an experience replay buffer to break the correlation between consecutive samples and target networks to stabilize training [Mnih et al., 2015]. Given an actor network with parameters θ and a critic network with parameters φ, the gradient updates are performed using the following update rules.
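As a minimal concrete instance of the quantity inside the expectations above, the sketch below computes the finite-horizon discounted sum of rewards for a single sampled trajectory; averaging it over many trajectories would estimate V(s_t).

```python
# Finite-horizon discounted return G_t = sum_{i=t}^{H} gamma**(i-t) * r_i,
# computed for one trajectory by a backward accumulation.
def discounted_return(rewards, gamma):
    """rewards: list [r_t, ..., r_H] of scalar rewards; gamma: discount in [0, 1]."""
    g = 0.0
    for r in reversed(rewards):
        # G_i = r_i + gamma * G_{i+1}
        g = r + gamma * g
    return g
```

The backward recursion avoids recomputing powers of gamma and is the form most RL implementations use when assembling targets for a critic.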
Opponent modeling is essential for exploiting sub-optimal opponents in strategic interactions. Most previous work focuses on building explicit models that directly predict the opponents' styles or strategies; such models require a large amount of training data and lack adaptability to unknown opponents. In this work, we propose a novel Learning to Exploit (L2E) framework for implicit opponent modeling. L2E acquires the ability to exploit opponents through a few interactions with different opponents during training, and can thus quickly adapt to new opponents with unknown styles during testing. We propose a novel opponent strategy generation algorithm that automatically produces effective opponents for training. We evaluate L2E on two poker games and one grid soccer game, which are commonly used benchmarks for opponent modeling. Comprehensive experimental results indicate that L2E quickly adapts to diverse styles of unknown opponents.
Jaderberg, Max, Czarnecki, Wojciech M., Dunning, Iain, Marris, Luke, Lever, Guy, Castaneda, Antonio Garcia, Beattie, Charles, Rabinowitz, Neil C., Morcos, Ari S., Ruderman, Avraham, Sonnerat, Nicolas, Green, Tim, Deason, Louise, Leibo, Joel Z., Silver, David, Hassabis, Demis, Kavukcuoglu, Koray, Graepel, Thore
Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments and two-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag, using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents is trained concurrently from thousands of parallel matches, with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During game-play, these agents display human-like behaviours such as navigating, following, and defending, based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation the trained agents exceeded the win-rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.
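The outer tier of such a population-based process can be sketched, under strong simplifying assumptions, as an exploit-and-explore step over per-agent settings. The stub below replaces the inner RL training with a supplied fitness function, and the agent fields (`lr`, `reward_weights`) are illustrative stand-ins for real hyperparameters and internal-reward parameters.

```python
# Hypothetical outer-loop step of a two-tier, population-based scheme:
# rank agents by a fitness function (stand-in for tournament results),
# have the bottom half copy settings from the top half, then perturb.
import random

def pbt_step(population, fitness, perturb_scale=0.2, rng=random):
    """population: list of dicts with 'lr' and 'reward_weights' fields.
    Returns the population sorted best-first, with the bottom half
    overwritten by perturbed copies of the top half."""
    ranked = sorted(population, key=fitness, reverse=True)
    half = len(ranked) // 2
    for loser, winner in zip(ranked[half:], ranked[:half]):
        # exploit: inherit the winner's settings
        loser["reward_weights"] = list(winner["reward_weights"])
        # explore: jitter the copied hyperparameter
        loser["lr"] = winner["lr"] * (1 + rng.uniform(-perturb_scale, perturb_scale))
    return ranked
```

In the actual system the inner tier is full RL training from pixels and the fitness comes from match outcomes; this sketch only shows why the two tiers decouple cleanly.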