Reinforcement Learning
Opponent Learning Awareness and Modelling in Multi-Objective Normal Form Games
Rădulescu, Roxana, Verstraeten, Timothy, Zhang, Yijie, Mannion, Patrick, Roijers, Diederik M., Nowé, Ann
Many real-world multi-agent interactions consider multiple distinct criteria, i.e. the payoffs are multi-objective in nature. However, the same multi-objective payoff vector may lead to different utilities for each participant. Therefore, it is essential for an agent to learn about the behaviour of other agents in the system. In this work, we present the first study of the effects of such opponent modelling on multi-objective multi-agent interactions with non-linear utilities. Specifically, we consider two-player multi-objective normal form games with non-linear utility functions under the scalarised expected returns optimisation criterion. We contribute novel actor-critic and policy gradient formulations to allow reinforcement learning of mixed strategies in this setting, along with extensions that incorporate opponent policy reconstruction and learning with opponent learning awareness (i.e., learning while considering the impact of one's policy when anticipating the opponent's learning step). Empirical results in five different MONFGs demonstrate that opponent learning awareness and modelling can drastically alter the learning dynamics in this setting. When equilibria are present, opponent modelling can confer significant benefits on agents that implement it. When there are no Nash equilibria, opponent learning awareness and modelling allows agents to still converge to meaningful solutions that approximate equilibria.
OpenAI proposes using reciprocity to encourage AI agents to work together
Many real-world problems require complex coordination between multiple agents -- e.g., people or algorithms. A machine learning technique called multi-agent reinforcement learning (MARL) has shown success with respect to this, mainly in two-team games like Go, DOTA 2, Starcraft, hide-and-seek, and capture the flag. But the human world is far messier than games. That's because humans face social dilemmas at multiple scales, from the interpersonal to the international, and they must decide not only how to cooperate but when to cooperate. To address this challenge, researchers at OpenAI propose training AI agents with what they call randomized uncertain social preferences (RUSP), an augmentation that expands the distribution of environments in which reinforcement learning agents train.
Gym Tutorial: The Frozen Lake
In this article, we are going to learn how to create and explore the Frozen Lake environment using the Gym library, an open source project created by OpenAI used for reinforcement learning experiments. The Gym library defines a uniform interface for environments what makes the integration between algorithms and environment easier for developers. Among many ready-to-use environments, the default installation includes a text-mode version of the Frozen Lake game, used as example in our last post. The first step to create the game is to import the Gym library and create the environment. The next line calls the method gym.make() to create the Frozen Lake environment and then we call the method env.reset() to put it on its initial state.
Towards Human-Level Learning of Complex Physical Puzzles
Ota, Kei, Jha, Devesh K., Romeres, Diego, van Baar, Jeroen, Smith, Kevin A., Semitsu, Takayuki, Oiki, Tomoaki, Sullivan, Alan, Nikovski, Daniel, Tenenbaum, Joshua B.
Humans quickly solve tasks in novel systems with complex dynamics, without requiring much interaction. While deep reinforcement learning algorithms have achieved tremendous success in many complex tasks, these algorithms need a large number of samples to learn meaningful policies. In this paper, we present a task for navigating a marble to the center of a circular maze. While this system is very intuitive and easy for humans to solve, it can be very difficult and inefficient for standard reinforcement learning algorithms to learn meaningful policies. We present a model that learns to move a marble in the complex environment within minutes of interacting with the real system. Learning consists of initializing a physics engine with parameters estimated using data from the real system. The error in the physics engine is then corrected using Gaussian process regression, which is used to model the residual between real observations and physics engine simulations. The physics engine equipped with the residual model is then used to control the marble in the maze environment using a model-predictive feedback over a receding horizon. We contrast the learning behavior against the time taken by humans to solve the problem to show comparable behavior. To the best of our knowledge, this is the first time that a hybrid model consisting of a full physics engine along with a statistical function approximator has been used to control a complex physical system in real-time using nonlinear model-predictive control (NMPC). Codes for the simulation environment can be downloaded here https://www.merl.com/research/license/CME . A video describing our method could be found here https://youtu.be/xaxNCXBovpc .
PLAS: Latent Action Space for Offline Reinforcement Learning
Zhou, Wenxuan, Bajracharya, Sujay, Held, David
The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Videos and code are available at https://sites.google.com/view/latent-policy.
Phoebe: Reuse-Aware Online Caching with Reinforcement Learning for Emerging Storage Models
With data durability, high access speed, low power efficiency and byte addressability, NVMe and SSD, which are acknowledged representatives of emerging storage technologies, have been applied broadly in many areas. However, one key issue with high-performance adoption of these technologies is how to properly define intelligent cache layers such that the performance gap between emerging technologies and main memory can be well bridged. To this end, we propose Phoebe, a reuse-aware reinforcement learning framework for the optimal online caching that is applicable for a wide range of emerging storage models. By continuous interacting with the cache environment and the data stream, Phoebe is capable to extract critical temporal data dependency and relative positional information from a single trace, becoming ever smarter over time. To reduce training overhead during online learning, we utilize periodical training to amortize costs. Phoebe is evaluated on a set of Microsoft cloud storage workloads. Experiment results show that Phoebe is able to close the gap of cache miss rate from LRU and a state-of-the-art online learning based cache policy to the Belady's optimal policy by 70.3% and 52.6%, respectively.
DeepMind Lab2D
Beattie, Charles, Köppe, Thomas, Duéñez-Guzmán, Edgar A., Leibo, Joel Z.
We present DeepMind Lab2D, a scalable environment simulator for artificial intelligence research that facilitates researcher-led experimentation with environment design. DeepMind Lab2D was built with the specific needs of multi-agent deep reinforcement learning researchers in mind, but it may also be useful beyond that particular subfield.
Robotic self-representation improves manipulation skills and transfer learning
Nguyen, Phuong D. H., Eppe, Manfred, Wermter, Stefan
Cognitive science suggests that the self-representation is critical for learning and problem-solving. However, there is a lack of computational methods that relate this claim to cognitively plausible robots and reinforcement learning. In this paper, we bridge this gap by developing a model that learns bidirectional action-effect associations to encode the representations of body schema and the peripersonal space from multisensory information, which is named multimodal BidAL. Through three different robotic experiments, we demonstrate that this approach significantly stabilizes the learning-based problem-solving under noisy conditions and that it improves transfer learning of robotic manipulation skills.
Reinforcement Learning: 10 Real Reward & Punishment Applications
In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones. Various papers have proposed Deep Reinforcement Learning for autonomous driving. In self-driving cars, there are various aspects to consider, such as speed limits at various places, drivable zones, avoiding collisions -- just to mention a few.
Baseline for Policy Gradients that All Deep Learning Enthusists Must Know
Deep reinforcement learning has a variety of different algorithms that solves many types of complex problems in various situations, one class of these algorithms is policy gradient (PG), which applies to a wide range of problems in both discrete and continuous action spaces, but applying it naively is inefficient, because of its poor sample complexity and high variance, which result in slower learning, to mitigate this we can use a baseline. The cause of the high variance problem is the reward scale, we think of policy gradient as it increases the probability of taking good actions and decreases it for bad actions, but mostly this is not the case, imagine a situation where the "good" episode return was 10 and the "bad" one was 5, then both probabilities of the actions in those episodes will be increased, which is not what we want, this problem is what baselines can solve. Mathematically, a baseline is a function when added to an expectation, does not change the expected value (or does not introduce bias), but at the same time, it can significantly affect the variance. Following this definition, we want a baseline for the policy gradient that can reduce its high variance and does not change its direction, a natural thing to do is to take the actions that are better than average, increase their probability, and decrease the probability of the actions that are worse than average, this is implemented by calculating the average reward over the trajectory and subtract it from the reward at the current timestep, this kind of baselines is called the average reward baseline. Now, we will show how baselines do not change the expected value, and we can choose any baselines we want.