AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Sim-to-Real Transfer of Robot Learning with Variable Length Inputs

Dasagi, Vibhavari, Lee, Robert, Mou, Serena, Bruce, Jake, Sünderhauf, Niko, Leitner, Jürgen

arXiv.org Machine Learning, Oct-8-2019

Current end-to-end deep Reinforcement Learning (RL) approaches require jointly learning perception, decision-making and low-level control from very sparse reward signals and high-dimensional inputs, with little capability of incorporating prior knowledge. This results in prohibitively long training times for use on real-world robotic tasks. Existing algorithms capable of extracting task-level representations from high-dimensional inputs, e.g. object detection, often produce outputs of varying lengths, restricting their use in RL methods due to the need for neural networks to have fixed length inputs. In this work, we propose a framework that combines deep sets encoding, which allows for variable-length abstract representations, with modular RL that utilizes these representations, decoupling high-level decision making from low-level control. We successfully demonstrate our approach on the robot manipulation task of object sorting, showing that this method can learn effective policies within mere minutes of highly simplified simulation. The learned policies can be directly deployed on a robot without further training, and generalize to variations of the task unseen during training.
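The deep sets idea mentioned in the abstract can be illustrated with a minimal sketch: each detected object is embedded separately, the embeddings are summed, and the pooled vector is mapped to a fixed-length encoding. This is not the authors' implementation. The layer sizes, random weights, and the per-object features (x, y, class id) are all assumptions chosen for illustration.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (list of rows) for a tiny illustrative dense layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def dense_tanh(weights, x):
    """Apply a bias-free dense layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def deep_set_encode(items, phi_w, rho_w, embed_dim):
    """Encode a variable-length set of feature vectors into a fixed-length vector.

    Each item is embedded by phi, the embeddings are summed (which ignores
    order), and rho maps the pooled vector to the final encoding.
    """
    pooled = [0.0] * embed_dim
    for item in items:
        for i, v in enumerate(dense_tanh(phi_w, item)):
            pooled[i] += v
    return dense_tanh(rho_w, pooled)

rng = random.Random(0)
phi_w = make_layer(3, 8, rng)   # hypothetical per-object features: (x, y, class id)
rho_w = make_layer(8, 4, rng)   # hypothetical 4-dimensional output encoding

two_objects = [[0.1, 0.2, 1.0], [0.5, -0.3, 0.0]]
three_objects = two_objects + [[-0.4, 0.9, 1.0]]

enc_two = deep_set_encode(two_objects, phi_w, rho_w, 8)
enc_three = deep_set_encode(three_objects, phi_w, rho_w, 8)
enc_shuffled = deep_set_encode(list(reversed(three_objects)), phi_w, rho_w, 8)
```

Both encodings have the same length regardless of how many objects were detected, and reordering the objects leaves the encoding unchanged up to floating-point rounding, which is what lets a fixed-input policy network consume detector output of varying length.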

learning, reinforcement, robot, (14 more...)

arXiv.org Machine Learning

1809.0748

Country:

Oceania > Australia > Queensland > Brisbane (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Model-based Reinforcement Learning for Predictions and Control for Limit Order Books

Wei, Haoran, Wang, Yuanbo, Mangu, Lidia, Decker, Keith

arXiv.org Artificial Intelligence, Oct-8-2019

We build a profitable electronic trading agent with Reinforcement Learning that places buy and sell orders in the stock market. An environment model is built only with historical observational data, and the RL agent learns the trading policy by interacting with the environment model instead of with the real market, to minimize risk and potential monetary loss. Trained in an unsupervised and self-supervised fashion, our environment model learned a temporal and causal representation of the market in latent space through deep neural networks. We demonstrate that the trading policy trained entirely within the environment model can be transferred back into the real market and maintain its profitability. We believe that this environment model can serve as a robust simulator that predicts market movement as well as trade impact for further studies.

agent, environment model, rl agent, (13 more...)

arXiv.org Artificial Intelligence

1910.03743

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multiple-objective Reinforcement Learning for Inverse Design and Identification

Wei, Haoran, Olarte, Mariefel, Goh, Garrett B.

arXiv.org Artificial Intelligence, Oct-8-2019

The aim of inverse chemical design is to develop new molecules with given optimized molecular properties or objectives. Recently, generative deep learning (DL) networks have come to be considered the state of the art in inverse chemical design and have achieved early success in generating molecular structures with desired properties in the pharmaceutical and material chemistry fields. However, satisfying a large number of molecular objectives (more than 10) is a limitation of current generative models. To improve the model's ability to handle a large number of molecule design objectives, we developed a Reinforcement Learning (RL) based generative framework to optimize chemical molecule generation. Our use of Curriculum Learning (CL) to fine-tune the pre-trained generative network allowed the model to satisfy up to 21 objectives and increased the generative network's robustness. The experiments show that the proposed multiple-objective RL-based generative model can correctly identify unknown molecules with an 83 to 100 percent success rate, compared to 0 percent for the baseline approach. Additionally, the proposed generative model is not limited to chemistry research challenges; we anticipate that problems that utilize RL with multiple objectives will benefit from this framework.

agent model, constraint, molecule, (15 more...)

arXiv.org Artificial Intelligence

1910.03741

Country: Asia > Middle East > Republic of Türkiye > Bingoel Province > Bingol (0.05)

Genre: Research Report > New Finding (0.94)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Investigation on the generalization of the Sampled Policy Gradient algorithm

Ansó, Nil Stolt

arXiv.org Artificial Intelligence, Oct-8-2019

The Sampled Policy Gradient (SPG) algorithm is a new offline actor-critic variant that samples in the action space to approximate the policy gradient. It does so by using the critic to evaluate the sampled actions. SPG offers theoretical promise over similar algorithms such as DPG because it searches the action-Q-value space independently of the local gradient, enabling it to avoid local minima. This paper compares SPG to two similar actor-critic algorithms, CACLA and DPG. The comparison is made across two different environments and two different network architectures, and contrasts training on on-policy transitions with training from an experience buffer. The results suggest that although SPG is often not the worst performer, it does not always match the best-performing algorithm on a given task. Further experiments are required to get a better estimate of the qualities of SPG.
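The sampling step described in the abstract can be sketched as follows, under strong simplifying assumptions: a one-dimensional action, Gaussian perturbations around the actor's current output, and a hand-written toy critic in place of a learned one. The function names, the noise scale, and the step size of 0.5 are illustrative choices, not details taken from the paper.

```python
import random

def spg_target(state, actor_action, critic, rng, n_samples=8, sigma=0.3):
    """Return the best action found by sampling around the actor's output.

    The critic scores the actor's own action and each sampled candidate;
    the actor's action is kept unless a candidate scores strictly higher.
    """
    best_action = actor_action
    best_q = critic(state, actor_action)
    for _ in range(n_samples):
        candidate = actor_action + rng.gauss(0.0, sigma)
        q = critic(state, candidate)
        if q > best_q:
            best_action, best_q = candidate, q
    return best_action

def toy_critic(state, action):
    """Hypothetical critic whose value peaks where action equals state."""
    return -(action - state) ** 2

rng = random.Random(0)
state = 1.0
action = -1.0   # stand-in for the actor's output at this state
for _ in range(200):
    target = spg_target(state, action, toy_critic, rng)
    # Move the "actor" part of the way toward the sampled target.
    action += 0.5 * (target - action)
```

Because the target never scores worse than the current action, the update relies only on critic evaluations of sampled points rather than on the critic's local gradient, which is the property the abstract contrasts with DPG.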

agar, algorithm, transition, (15 more...)

arXiv.org Artificial Intelligence

1910.03728

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Machine Learning Used To Uncover The Secrets Of Pompeii Scrolls

#artificialintelligence, Oct-7-2019, 10:50:06 GMT

The human brain often recalls past memories (seemingly) unprompted. As we go throughout our day, we have spontaneous flashes of memory from our lives. While this spontaneous conjuration of memories has long been of interest to neuroscientists, AI research company DeepMind recently published a paper detailing how an AI of theirs replicated this strange pattern of recall. The conjuration of memories in the brain, neural replay, is tightly linked with the hippocampus. The hippocampus is a seahorse-shaped formation in the brain that belongs to the limbic system, and it is associated with the formation of new memories, as well as the emotions that memories spark. Current theories on the role of the hippocampi (there is one in each hemisphere of the brain), state that different regions of the hippocampus are responsible for the handling of different types of memories.

imagination replay method, reinforcement, replay method, (14 more...)

#artificialintelligence

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.30)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Self-Paced Contextual Reinforcement Learning

Klink, Pascal, Abdulsamad, Hany, Belousov, Boris, Peters, Jan

arXiv.org Machine Learning, Oct-7-2019

Generalization and adaptation of learned skills to novel situations is a core requirement for intelligent autonomous robots. Although contextual reinforcement learning provides a principled framework for learning and generalization of behaviors across related tasks, it generally relies on uninformed sampling of environments from an unknown, uncontrolled context distribution, thus missing the benefits of structured, sequential learning. We introduce a novel relative entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing for its gradual progression towards the target context distribution. Empirical evaluation shows that the proposed curriculum learning scheme drastically improves sample efficiency and enables learning in scenarios with both broad and sharp target context distributions in which classical approaches perform sub-optimally.

algorithm, context distribution, experiment, (16 more...)

arXiv.org Machine Learning

1910.02826

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)

Genre: Research Report (0.64)

Industry:

Education (0.48)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, Xue Bin, Kumar, Aviral, Zhang, Grace, Levine, Sergey

arXiv.org Machine Learning, Oct-7-2019

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.
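The policy step the abstract describes, regressing onto weighted target actions, can be sketched in a few lines. The abstract does not give the weighting; the exponentiated advantage with a temperature `beta` and an upper clip used here is an assumption for illustration, and the "policy" is reduced to a single constant action so the weighted regression has a closed form. The value-function regression step is not shown.

```python
import math

def awr_weights(returns, values, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights exp((R - V) / beta), clipped from above.

    beta and max_weight are illustrative hyperparameters (assumptions).
    """
    weights = []
    for ret, val in zip(returns, values):
        advantage = ret - val
        weights.append(min(math.exp(advantage / beta), max_weight))
    return weights

def weighted_mean_action(actions, weights):
    """Closed-form weighted least-squares fit for a constant, state-free policy."""
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, actions)) / total

# Hypothetical batch of three transitions from a replay buffer.
returns = [1.0, 3.0, 0.5]    # observed returns
values = [1.0, 1.0, 1.0]     # value-function estimates for the same states
actions = [0.0, 1.0, -1.0]   # actions that were taken

w = awr_weights(returns, values, beta=1.0)
target = weighted_mean_action(actions, w)
```

Actions that did better than the value estimate receive larger weights, so the fitted action is pulled toward them. With a neural-network policy the same weights would multiply a per-sample supervised loss instead of entering a closed-form mean.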

algorithm, proceedings, value function, (14 more...)

arXiv.org Machine Learning

1910.00177

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Tactical Reward Shaping: Bypassing Reinforcement Learning with Strategy-Based Goals

Zhang, Yizheng, Rosendo, Andre

arXiv.org Artificial Intelligence, Oct-7-2019

Deep Reinforcement Learning (DRL) has shown promising capabilities for learning optimal policies directly from trial and error. However, learning can be hindered if the goal of the learning, defined by the reward function, is "not optimal". We demonstrate that by setting the goal/target of competition in a counter-intuitive but intelligent way, the DRL simulation can quickly converge to a winning strategy instead of heuristically trying solutions for many hours. The ICRA-DJI RoboMaster AI Challenge is a game of cooperation and competition between robots in a partially observable environment, quite similar to the Counter-Strike game. Unlike the traditional approach to games, where the reward is given for winning the match or hitting the enemy, our DRL algorithm rewards our robots when they are in a geometric-strategic advantage, which implicitly increases the chances of winning. Furthermore, we use Deep Q Learning (DQL) to generate multi-agent paths for moving, which improves the cooperation between two robots by avoiding collisions. Finally, we implement a variant of the A* algorithm with the same implicit geometric goal as DQL and compare results. We conclude that a well-set goal can call into question the need for learning algorithms, with geometric-based searches outperforming DQL by many orders of magnitude.

algorithm, enemy robot, robot, (12 more...)

arXiv.org Artificial Intelligence

1910.03144

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States (0.04)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

Combining No-regret and Q-learning

Kash, Ian A., Sullins, Michael, Hofmann, Katja

arXiv.org Artificial Intelligence, Oct-7-2019

Ian A. Kash (University of Illinois, Chicago, IL); Michael Sullins (University of Illinois, Chicago, IL); Katja Hofmann (Microsoft Research, Cambridge, UK)

Abstract: Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.

1 Introduction

Versions of counterfactual regret minimization (CFR) [50] have found success in playing poker at human expert level [10, 41] as well as fully solving nontrivial versions of it [8]. CFR more generally can solve extensive form games of incomplete information. It works by using a no-regret algorithm to select actions. In particular, one copy of such an algorithm is used at each information set, which corresponds to the full history of play observed by a single agent. The resulting algorithm satisfies a global no-regret guarantee, so at least in two-player zero-sum games is guaranteed to converge to an optimal strategy through sufficient self-play. However, CFR does have limitations. It makes two strong assumptions which are natural for games such as poker, but limit applicability to further settings. First, it assumes that the agent has perfect recall, which in a more general context means that the state representation captures the full history of states visited (and so imposes a tree structure). Current RL domains may rarely repeat states due to their large state spaces, but they certainly do not encode the full history of states and actions. Second, it assumes that a terminal state is eventually reached and performs updates only after this occurs.
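The text says that CFR uses a no-regret algorithm to select actions at each information set. A common choice for that role is regret matching, sketched below for a single decision point. This is a generic illustration, not the LONR algorithm itself: the fixed `q_values` stand in for whatever action-value estimates a surrounding learner would supply, and all names are hypothetical.

```python
def regret_matching_policy(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    if total <= 0.0:
        # No action has positive regret yet: fall back to uniform play.
        return [1.0 / n] * n
    return [p / total for p in positive]

def update_regrets(cum_regret, q_values, policy):
    """Add each action's value minus the value of the current mixed policy."""
    expected = sum(p * q for p, q in zip(policy, q_values))
    return [r + (q - expected) for r, q in zip(cum_regret, q_values)]

q_values = [1.0, 0.0, 2.0]   # hypothetical fixed action-value estimates
regrets = [0.0, 0.0, 0.0]
for _ in range(50):
    policy = regret_matching_policy(regrets)
    regrets = update_regrets(regrets, q_values, policy)
policy = regret_matching_policy(regrets)
```

With fixed values the policy concentrates on the highest-valued action. In a learner of the kind the abstract describes, the values themselves would be updated by a Q-learning-like rule, so the regrets would be tracking a moving target.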

algorithm, convergence, no-regret algorithm, (13 more...)

arXiv.org Artificial Intelligence

1910.03094

Country:

North America > United States > Illinois > Cook County > Chicago (0.44)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.24)
North America > United States > Texas (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Policies Modulating Trajectory Generators

Iscen, Atil, Caluwaerts, Ken, Tan, Jie, Zhang, Tingnan, Coumans, Erwin, Sindhwani, Vikas, Vanhoucke, Vincent

arXiv.org Artificial Intelligence, Oct-7-2019

Abstract: We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4-dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity.

Keywords: Reinforcement Learning, Control, Locomotion

1 Introduction

The recent success of Deep Learning (DL) on simulated robotic tasks has opened an exciting research direction. Nevertheless, many robotic tasks such as locomotion still remain an open problem for learning-based methods due to their complexity or dynamics. From a Deep Learning (DL) perspective, one way to tackle these complex problems is by using more and more complex policies (such as recurrent networks). Unfortunately, more complex policies are harder to train and require even more training data which is often problematic for robotics.
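The idea of a policy modulating a trajectory generator can be sketched with a single sine-wave generator. This is a simplified illustration under assumptions, not the authors' controller: the generator form, the two-element observation (a speed command and a tilt reading), the hand-set policy coefficients, and the time step are all hypothetical.

```python
import math

def trajectory_generator(phase, amplitude, offset):
    """Open-loop periodic signal, e.g. a target for a single leg joint."""
    return offset + amplitude * math.sin(phase)

def pmtg_step(phase, observation, policy, dt=0.01):
    """One control step: the policy sets generator parameters and adds a residual."""
    frequency, amplitude, offset, residual = policy(observation, phase)
    # The phase persists between steps and acts as the controller's memory.
    phase = (phase + 2.0 * math.pi * frequency * dt) % (2.0 * math.pi)
    action = trajectory_generator(phase, amplitude, offset) + residual
    return phase, action

def linear_policy(observation, phase):
    """Hypothetical hand-set linear policy; a trained one would be learned."""
    speed_command, tilt = observation
    frequency = 1.0 + speed_command   # a faster command speeds up the gait cycle
    residual = 0.01 * tilt            # small direct correction to the action
    return frequency, 0.5, 0.0, residual

phase = 0.0
actions = []
for _ in range(100):
    phase, action = pmtg_step(phase, (0.5, 0.2), linear_policy)
    actions.append(action)
```

The generator supplies the periodic structure (the prior knowledge), so the policy only has to output a few modulation parameters and a correction term, which is consistent with the abstract's claim that a simple linear policy can suffice.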

architecture, controller, robot, (12 more...)

arXiv.org Artificial Intelligence

1910.02812

Country:

North America > United States > California > Santa Clara County > Mountain View (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
