Goto

Collaborating Authors

 Reinforcement Learning


Sim-to-Real Transfer of Robot Learning with Variable Length Inputs

arXiv.org Machine Learning

Current end-to-end deep Reinforcement Learning (RL) approaches require jointly learning perception, decision-making and low-level control from very sparse reward signals and high-dimensional inputs, with little capability of incorporating prior knowledge. This results in prohibitively long training times for use on real-world robotic tasks. Existing algorithms capable of extracting task-level representations from high-dimensional inputs, e.g. object detection, often produce outputs of varying lengths, restricting their use in RL methods due to the need for neural networks to have fixed length inputs. In this work, we propose a framework that combines deep sets encoding, which allows for variable-length abstract representations, with modular RL that utilizes these representations, decoupling high-level decision making from low-level control. We successfully demonstrate our approach on the robot manipulation task of object sorting, showing that this method can learn effective policies within mere minutes of highly simplified simulation. The learned policies can be directly deployed on a robot without further training, and generalize to variations of the task unseen during training.


Model-based Reinforcement Learning for Predictions and Control for Limit Order Books

arXiv.org Artificial Intelligence

We build a profitable electronic trading agent with Reinforcement Learning that places buy and sell orders in the stock market. An environment model is built only with historical observational data, and the RL agent learns the trading policy by interacting with the environment model instead of with the real-market to minimize the risk and potential monetary loss. Trained in unsupervised and self-supervised fashion, our environment model learned a temporal and causal representation of the market in latent space through deep neural networks. We demonstrate that the trading policy trained entirely within the environment model can be transferred back into the real market and maintain its profitability. We believe that this environment model can serve as a robust simulator that predicts market movement as well as trade impact for further studies.


Multiple-objective Reinforcement Learning for Inverse Design and Identification

arXiv.org Artificial Intelligence

The aim of the inverse chemical design is to develop new molecules with given optimized molecular properties or objectives. Recently, generative deep learning (DL) networks are considered as the state-of-the-art in inverse chemical design and have achieved early success in generating molecular structures with desired properties in the pharmaceutical and material chemistry fields. However, satisfying a large number (larger than 10 objectives) of molecular objectives is a limitation of current generative models. To improve the model's ability to handle a large number of molecule design objectives, we developed a Reinforcement Learning (RL) based generative framework to optimize chemical molecule generation. Our use of Curriculum Learning (CL) to fine-tune the pre-trained generative network allowed the model to satisfy up to 21 objectives and increase the generative network's robustness. The experiments show that the proposed multiple-objective RL-based generative model can correctly identify unknown molecules with an 83 to 100 percent success rate, compared to the baseline approach of 0 percent. Additionally, this proposed generative model is not limited to just chemistry research challenges; we anticipate that problems that utilize RL with multiple-objectives will benefit from this framework.


Investigation on the generalization of the Sampled Policy Gradient algorithm

arXiv.org Artificial Intelligence

The Sampled Policy Gradient (SPG) algorithm is a new offline actor-critic variant that samples in the action space to approximate the policy gradient. It does so by using the critic to evaluate the sampled actions. SPG offers theoretical promise over similar algorithms such as DPG as it searches the action-Q-value space independently of the local gradient, enabling it to avoid local minima. This paper aims to compare SPG to two similar actor-critic algorithms, CACLA and DPG. The comparison is made across two different environments, two different network architectures, as well as training on on-policy transitions in contrast to using an experience buffer. Results seem to show that although SPG does often not perform the worst, it doesn't always match the performance of the best performing algorithm at a particular task. Further experiments are required to get a better estimate of the qualities of SPG.


Machine Learning Used To Uncover The Secrets Of Pompeii Scrolls

#artificialintelligence

The human brain often recalls past memories (seemingly) unprompted. As we go throughout our day, we have spontaneous flashes of memory from our lives. While this spontaneous conjuration of memories has long been of interest to neuroscientists, AI research company DeepMind recently published a paper detailing how an AI of theirs replicated this strange pattern of recall. The conjuration of memories in the brain, neural replay, is tightly linked with the hippocampus. The hippocampus is a seahorse-shaped formation in the brain that belongs to the limbic system, and it is associated with the formation of new memories, as well as the emotions that memories spark. Current theories on the role of the hippocampi (there is one in each hemisphere of the brain), state that different regions of the hippocampus are responsible for the handling of different types of memories.


Self-Paced Contextual Reinforcement Learning

arXiv.org Machine Learning

Generalization and adaptation of learned skills to novel situations is a core requirement for intelligent autonomous robots. Although contextual reinforcement learning provides a principled framework for learning and generalization of behaviors across related tasks, it generally relies on uninformed sampling of environments from an unknown, uncontrolled context distribution, thus missing the benefits of structured, sequential learning. We introduce a novel relative entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing for its gradual progression towards the target context distribution. Empirical evaluation shows that the proposed curriculum learning scheme drastically improves sample efficiency and enables learning in scenarios with both broad and sharp target context distributions in which classical approaches perform sub-optimally.


Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

arXiv.org Machine Learning

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.


Tactical Reward Shaping: Bypassing Reinforcement Learning with Strategy-Based Goals

arXiv.org Artificial Intelligence

Deep Reinforcement Learning (DRL) has shown its promising capabilities to learn optimal policies directly from trial and error. However, learning can be hindered if the goal of the learning, defined by the reward function, is "not optimal". We demonstrate that by setting the goal/target of competition in a counter-intuitive but intelligent way, instead of heuristically trying solutions through many hours the DRL simulation can quickly converge into a winning strategy. The ICRA-DJI RoboMaster AI Challenge is a game of cooperation and competition between robots in a partially observable environment, quite similar to the Counter-Strike game. Unlike the traditional approach to games, where the reward is given at winning the match or hitting the enemy, our DRL algorithm rewards our robots when in a geometric-strategic advantage, which implicitly increases the winning chances. Furthermore, we use Deep Q Learning (DQL) to generate multi-agent paths for moving, which improves the cooperation between two robots by avoiding the collision. Finally, we implement a variant A* algorithm with the same implicit geometric goal as DQL and compare results. We conclude that a well-set goal can put in question the need for learning algorithms, with geometric-based searches outperforming DQL in many orders of magnitude.


Combining No-regret and Q-learning

arXiv.org Artificial Intelligence

Combining No-regret and Q-learning Ian A. Kash University of Illinois, Chicago, IL Michael Sullins University of Illinois, Chicago, IL Katja Hofmann Microsoft Research, Cambridge, UK Abstract Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average. 1 Introduction V ersions of counterfactual regret minimization (CFR) [50] have found success in playing poker at human expert level [10, 41] as well as fully solving nontrivial versions of it [8]. CFR more generally can solve extensive form games of incomplete information. It works by using a no-regret algorithm to select actions. In particular, one copy of such an algorithm is used at each information set, which corresponds to the full history of play observed by a single agent. The resulting algorithm satisfies a global no-regret guarantee, so at least in two-player zero-sum games is guaranteed to converge to an optimal strategy through sufficient self-play. However, CFR does have limitations. It makes two strong assumptions which are natural for games such as poker, but limit applicability to further settings. First, it assumes that the agent has perfect recall, which in a more general context means that the state representation captures the full history of states visited (and so imposes a tree structure). Current RL domains may rarely repeat states due to their large state spaces, but they certainly do not encode the full history of states and actions. Second, it assumes that a terminal state is eventually reached and performs updates only after this occurs.


Policies Modulating Trajectory Generators

arXiv.org Artificial Intelligence

Abstract: We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4 -dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity. Keywords: Reinforcement Learning, Control, Locomotion 1 Introduction The recent success of Deep Learning (DL) on simulated robotic tasks has opened an exciting research direction. Nevertheless, many robotic tasks such as locomotion still remain an open problem for learning-based methods due to their complexity or dynamics. From a Deep Learning (DL) perspective, one way to tackle these complex problems is by using more and more complex policies (such as recurrent networks). Unfortunately, more complex policies are harder to train and require even more training data which is often problematic for robotics.