AITopics | Reinforcement Learning


Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.


Amazon Wants You to Code the AI Brain for This Little Car

WIRED, Nov-28-2018, 21:07:52 GMT

Two years ago, Alphabet researchers made computing history when their artificial intelligence software AlphaGo defeated a world champion at the complex board game Go. Amazon now hopes to democratize the AI technique behind that milestone--with a pint-size self-driving car. The 1/18th-scale vehicle is called DeepRacer, and it can be preordered for $249; it will later cost $399. It's designed to make it easier for programmers to get started with reinforcement learning, the technique that powered AlphaGo's victory and is loosely inspired by how animals learn from feedback on their behavior. Although the approach has produced notable research stunts, such as bots that can play Go, chess, and complicated multiplayer electronic games, it isn't as widely used as the pattern-matching learning techniques used in speech recognition and image analysis.

amazon, machine learning, reinforcement learning, (10 more...)

WIRED

Industry:

Information Technology (1.00)
Leisure & Entertainment > Games > Go (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)


A Structure-aware Online Learning Algorithm for Markov Decision Processes

Roy, Arghyadip, Borkar, Vivek, Karandikar, Abhay, Chaporkar, Prasanna

arXiv.org Machine Learning, Nov-28-2018

To overcome the curse of dimensionality and curse of modeling in Dynamic Programming (DP) methods for solving classical Markov Decision Process (MDP) problems, Reinforcement Learning (RL) algorithms are popular. In this paper, we consider an infinite-horizon average reward MDP problem and prove the optimality of the threshold policy under certain conditions. Traditional RL techniques do not exploit the threshold nature of optimal policy while learning. In this paper, we propose a new RL algorithm which utilizes the known threshold structure of the optimal policy while learning by reducing the feasible policy space. We establish that the proposed algorithm converges to the optimal policy. It provides a significant improvement in convergence speed and computational and storage complexity over traditional RL algorithms. The proposed technique can be applied to a wide variety of optimization problems that include energy efficient data transmission and management of queues. We exhibit the improvement in convergence speed of the proposed algorithm over other RL algorithms through simulations.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1811.11646

Country: Europe (0.15)

Genre: Research Report (0.64)

Industry:

Energy (0.46)
Education > Educational Setting > Online (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
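
The abstract of "A Structure-aware Online Learning Algorithm for Markov Decision Processes" says that knowing the optimal policy has a threshold form shrinks the feasible policy space. The sketch below only illustrates that counting argument and is not the paper's algorithm. It assumes a hypothetical setting with states 0..N-1 and two actions, and it assumes action 1 is taken at or above the threshold (the direction is an illustrative choice). The names `threshold_policy` and `count_policies` are invented here.

```python
def threshold_policy(state, threshold):
    """Return action 1 when the state is at or above the threshold, else action 0.

    The direction of the threshold is an assumption made for illustration.
    """
    return 1 if state >= threshold else 0


def count_policies(num_states):
    """Compare the sizes of the two deterministic policy spaces.

    With two actions, an unrestricted deterministic policy picks an action
    independently in every state, giving 2 ** num_states policies.
    A threshold policy is fixed by one integer in 0..num_states, giving
    num_states + 1 policies (threshold == num_states means "never act").
    """
    return 2 ** num_states, num_states + 1


if __name__ == "__main__":
    unrestricted, thresholded = count_policies(10)
    print(unrestricted, thresholded)
    print([threshold_policy(s, 2) for s in range(4)])
```

Searching over a single threshold value instead of one action per state is the kind of reduction the abstract credits for faster convergence and lower storage cost.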


Understanding the impact of entropy on policy optimization

Ahmed, Zafarali, Roux, Nicolas Le, Norouzi, Mohammad, Schuurmans, Dale

arXiv.org Machine Learning, Nov-28-2018

Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with exploration by encouraging the selection of more stochastic policies. In this work, we analyze this claim and, through new visualizations of the optimization landscape, we observe that incorporating entropy in policy optimization serves as a regularizer. We show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. We qualitatively show that, in some environments, entropy regularization can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This manuscript presents new tools for understanding the underlying optimization landscape and highlights the challenge of designing general-purpose policy optimization algorithms in reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

1811.11214

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
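
For readers unfamiliar with the entropy regularization discussed in "Understanding the impact of entropy on policy optimization", here is a minimal sketch of the usual form of the objective: expected reward plus a temperature times the policy entropy. It assumes a one-step (bandit) problem with a softmax policy over logits. The function names and the temperature parameter `tau` are illustrative and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(logits, rewards, tau):
    """Expected reward under the softmax policy plus tau times its entropy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + tau * entropy(probs)


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    for tau in (0.0, 0.5):
        # Uniform policy versus a policy concentrated on the better action.
        print(tau,
              regularized_objective([0.0, 0.0], rewards, tau),
              regularized_objective([5.0, 0.0], rewards, tau))
```

With `tau = 0` the objective is the plain expected reward. A larger `tau` rewards more stochastic policies, which is the term whose effect on the optimization landscape the abstract examines.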


Experience Replay for Continual Learning

Rolnick, David, Ahuja, Arun, Schwarz, Jonathan, Lillicrap, Timothy P., Wayne, Greg

arXiv.org Artificial Intelligence, Nov-28-2018

Continual learning is the problem of learning new tasks or knowledge while protecting old knowledge and ideally generalizing from old experience to learn new tasks faster. Neural networks trained by stochastic gradient descent often degrade on old tasks when trained successively on new tasks with different data distributions. This phenomenon, referred to as catastrophic forgetting, is considered a major hurdle to learning with non-stationary data or sequences of new tasks, and prevents networks from continually accumulating knowledge and skills. We examine this issue in the context of reinforcement learning, in a setting where an agent is exposed to tasks in a sequence. Unlike most other work, we do not provide an explicit indication to the model of task boundaries, which is the most general circumstance for a learning agent exposed to continuous experience. While various methods to counteract catastrophic forgetting have recently been proposed, we explore a straightforward, general, and seemingly overlooked solution - that of using experience replay buffers for all past events - with a mixture of on- and off-policy learning, leveraging behavioral cloning. We show that this strategy can still learn new tasks quickly yet can substantially reduce catastrophic forgetting in both Atari and DMLab domains, even matching the performance of methods that require task identities. When buffer storage is constrained, we confirm that a simple mechanism for randomly discarding data allows a limited size buffer to perform almost as well as an unbounded one.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

1811.11682

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
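
The abstract of "Experience Replay for Continual Learning" mentions that randomly discarding data lets a limited-size buffer perform almost as well as an unbounded one. The abstract does not spell out the mechanism, so the sketch below assumes reservoir sampling, one standard way to keep a uniform random subset of a stream in fixed memory. The class name `ReservoirBuffer` and its methods are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity replay buffer that keeps a uniform sample of all items seen.

    Assumption: reservoir sampling stands in for the "random discarding"
    mentioned in the abstract; the paper's exact rule may differ.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        """Offer one experience; it may replace a randomly chosen stored one."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k):
        """Draw up to k stored experiences without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    buf = ReservoirBuffer(capacity=5, seed=0)
    for step in range(100):
        buf.add(step)
    print(len(buf.items), buf.seen)
```

Because every item ever offered has the same chance of remaining stored, old tasks stay represented in the buffer even after many new experiences arrive.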


Efficiently Combining Human Demonstrations and Interventions for Safe Training of Autonomous Systems in Real-Time

Goecks, Vinicius G., Gremillion, Gregory M., Lawhern, Vernon J., Valasek, John, Waytowich, Nicholas R.

arXiv.org Artificial Intelligence, Nov-28-2018

This paper investigates how to utilize different forms of human interaction to safely train autonomous systems in real-time by learning from both human demonstrations and interventions. We implement two components of the Cycle-of-Learning for Autonomous Systems, which is our framework for combining multiple modalities of human interaction. The current effort employs human demonstrations to teach a desired behavior via imitation learning, then leverages intervention data to correct for undesired behaviors produced by the imitation learner to teach novel tasks to an autonomous agent safely, after only minutes of training. We demonstrate this method in an autonomous perching task using a quadrotor with continuous roll, pitch, yaw, and throttle commands and imagery captured from a downward-facing camera in a high-fidelity simulated environment. Our method improves task completion performance for the same amount of human interaction when compared to learning from demonstrations alone, while also requiring on average 32% less data to achieve that performance. This provides evidence that combining multiple modes of human interaction can increase both the training speed and overall performance of policies for autonomous systems.

demonstration, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

1810.11545

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry: Government > Military > Army (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


Target Driven Visual Navigation with Hybrid Asynchronous Universal Successor Representations

Siriwardhana, Shamane, Weerasekera, Rivindu, Nanayakkara, Suranga

arXiv.org Artificial Intelligence, Nov-27-2018

Being able to navigate to a target with minimal supervision and prior knowledge is critical to creating human-like assistive agents. Prior work on map-based and map-less approaches has limited generalizability. In this paper, we present a novel approach, Hybrid Asynchronous Universal Successor Representations (HAUSR), which overcomes the problem of generalizability to new goals by adapting recent work on Universal Successor Representations with Asynchronous Actor-Critic Agents. We show that the agent was able to successfully reach novel goals and we were able to quickly fine-tune the network for adapting to new scenes. This opens up novel application scenarios where intelligent agents could learn from and adapt to a wide range of environments with minimal human input.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

1811.11312

Country: Oceania > New Zealand (0.15)

Genre:

Overview > Innovation (0.54)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Grammars and reinforcement learning for molecule optimization

Kraev, Egor

arXiv.org Machine Learning, Nov-27-2018

An important challenge in drug discovery is to find molecules with desired chemical properties. While ultimate usefulness as a drug can only be determined in a laboratory or clinical context, that process is expensive, and it is thus advantageous to pre-select likely candidates in software. While deep learning has been extensively investigated for molecular graph encoding ([Duvenaud et al., 2015], [Kearnes et al., 2016], [Gilmer et al., 2017]), molecule generation is still a subject of active research. The simplest natural approach to candidate molecule generation is to generate some sort of a linear representation, such as a string of characters in the SMILES format [Weininger, 1988], using an encoder-decoder network architecture similar to that used in machine translation, as done in [Gómez-Bombarelli et al., 2016]. This approach's performance was comparatively poor because a molecule's structure is not linear, but rather a graph which typically includes cycles, so it falls to the model to learn how to generate SMILES strings that correspond to chemically valid molecules - a nontrivial task that leaves the model with little spare capacity to additionally optimize a given chemical metric of the molecules produced. A way to partially remedy that involves generating not the actual SMILES strings, but a sequence of production rules of a context-free grammar (CFG) for SMILES, as done by [Kusner et al., 2017]. That guarantees that the SMILES strings produced are grammatically valid, putting less burden on the model to ensure validity and thereby achieving better metrics. However, [Kusner et al., 2017] give two reasons why this is still not guaranteed to produce chemically valid molecules: firstly, a grammatically valid SMILES string is not guaranteed to be chemically possible (because of atom valences being wrong, for example), and secondly, because a

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

1811.11222

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)


Unsupervised Control Through Non-Parametric Discriminative Rewards

Warde-Farley, David, Van de Wiele, Tom, Kulkarni, Tejas, Ionescu, Catalin, Hansen, Steven, Mnih, Volodymyr

arXiv.org Artificial Intelligence, Nov-27-2018

Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually-specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a co-operative game, giving rise to a learned reward function that reflects similarity in controllable aspects of the environment instead of distance in the space of observations. We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains -- Atari, the DeepMind Control Suite and DeepMind Lab.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

1811.11359

Country: North America (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Games (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Prioritizing Starting States for Reinforcement Learning

Tavakoli, Arash, Levdik, Vitaly, Islam, Riashat, Kormushev, Petar

arXiv.org Artificial Intelligence, Nov-27-2018

Online, off-policy reinforcement learning algorithms are able to use an experience memory to remember and replay past experiences. In prior work, this approach was used to stabilize training by breaking the temporal correlations of the updates and avoiding the rapid forgetting of possibly rare experiences. In this work, we propose a conceptually simple framework that uses an experience memory to help exploration by prioritizing the starting states from which the agent starts acting in the environment, importantly, in a fashion that is also compatible with on-policy algorithms. Given the capacity to restart the agent in states corresponding to its past observations, we achieve this objective by (i) enabling the agent to restart in states belonging to significant past experiences (e.g., nearby goals), and (ii) promoting faster coverage of the state space through starting from a more diverse set of states. While, using a good measure of priority to identify significant past transitions, we expect case (i) to more considerably help exploration in certain problems (e.g., sparse reward tasks), we hypothesize that case (ii) will generally be beneficial, even without any prioritization. We show empirically that our approach improves learning performance for both off-policy and on-policy deep reinforcement learning methods, with the most notable improvement in a significantly sparse reward task.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

1811.11298

Country: North America > Canada > Quebec > Montreal (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
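
The two cases described in the abstract of "Prioritizing Starting States for Reinforcement Learning" can be illustrated with a small sketch: restart from past states in proportion to a priority score (case i), or uniformly from the stored states to widen coverage (case ii). This is an assumption-laden illustration and not the authors' implementation. The function `sample_start_state`, the `priorities` argument, and the idea that priorities are plain non-negative weights are all hypothetical.

```python
import random


def sample_start_state(memory, priorities=None, rng=random):
    """Pick a state from the experience memory to restart the agent in.

    memory:     list of previously observed states (assumed restorable).
    priorities: optional non-negative weights, one per state; when omitted,
                states are drawn uniformly (the "diverse starts" case).
    rng:        the random module or a random.Random instance.
    """
    if not memory:
        raise ValueError("experience memory is empty")
    if priorities is None:
        return rng.choice(memory)
    if len(priorities) != len(memory):
        raise ValueError("need exactly one priority per stored state")
    return rng.choices(memory, weights=priorities, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    states = ["near_goal", "corridor", "start_room"]
    # Case (i): favour states tied to significant past experience.
    print(sample_start_state(states, priorities=[5.0, 1.0, 1.0], rng=rng))
    # Case (ii): uniform restarts over everything seen so far.
    print(sample_start_state(states, rng=rng))
```

How the priorities are computed is left open here; the abstract only says that a good measure of priority should identify significant past transitions.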


Scaling Configuration of Energy Harvesting Sensors with Reinforcement Learning

Fraternali, Francesco, Balaji, Bharathan, Gupta, Rajesh

arXiv.org Artificial Intelligence, Nov-27-2018

With the advent of the Internet of Things (IoT), an increasing number of energy harvesting methods are being used to supplement or supplant battery based sensors. Energy harvesting sensors need to be configured according to the application, hardware, and environmental conditions to maximize their usefulness. As of today, the configuration of sensors is either manual or heuristics based, requiring valuable domain expertise. Reinforcement learning (RL) is a promising approach to automate configuration and efficiently scale IoT deployments, but it is not yet adopted in practice. We propose solutions to bridge this gap: reduce the training phase of RL so that nodes are operational within a short time after deployment and reduce the computational requirements to scale to large deployments. We focus on configuration of the sampling rate of indoor solar panel based energy harvesting sensors. We created a simulator based on 3 months of data collected from 5 sensor nodes subject to different lighting conditions. Our simulation results show that RL can effectively learn energy availability patterns and configure the sampling rate of the sensor nodes to maximize the sensing data while ensuring that energy storage is not depleted. The nodes can be operational within the first day by using our methods. We show that it is possible to reduce the number of RL policies by using a single policy for nodes that share similar lighting conditions.

machine learning, node, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3279755.3279760

1811.11259

Country: North America > United States > California (0.46)

Genre:

Research Report > New Finding (0.34)
Research Report > Promising Solution (0.34)

Industry:

Energy > Energy Storage (1.00)
Energy > Renewable > Solar (0.49)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
