AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Zhang, Kaiqing, Kakade, Sham M., Başar, Tamer, Yang, Lin F.

arXiv.org Machine Learning, Oct-9-2020

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space and the action spaces of the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), which arises particularly in the multi-agent context.
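
To get a feel for the two rates quoted in this abstract, the following minimal sketch evaluates them numerically. It drops every constant and logarithmic factor hidden by the $\tilde O$ and $\tilde\Omega$ notation, so the outputs indicate scaling only and are not actual sample counts. The function and parameter names are illustrative assumptions and do not come from the paper.

```python
def model_based_upper_bound(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Scaling of |S||A||B| (1 - gamma)^-3 epsilon^-2, constants and logs dropped."""
    return num_states * num_actions_a * num_actions_b / ((1.0 - gamma) ** 3 * epsilon ** 2)


def reward_aware_lower_bound(num_states, num_actions_a, num_actions_b, gamma, epsilon):
    """Scaling of |S|(|A| + |B|) (1 - gamma)^-3 epsilon^-2, constants and logs dropped."""
    return num_states * (num_actions_a + num_actions_b) / ((1.0 - gamma) ** 3 * epsilon ** 2)


def action_space_gap(num_actions_a, num_actions_b):
    """Ratio of the two rates above: |A||B| / (|A| + |B|)."""
    return num_actions_a * num_actions_b / (num_actions_a + num_actions_b)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    upper = model_based_upper_bound(10, 4, 4, gamma=0.5, epsilon=0.5)
    lower = reward_aware_lower_bound(10, 4, 4, gamma=0.5, epsilon=0.5)
    print(upper, lower, upper / lower)
```

The ratio of the two rates depends only on the action-space sizes, which is the gap in the $|A|,|B|$ dependence that the abstract mentions for the reward-aware setting.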

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv:2007.07461

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Illinois (0.04)
North America > United States > Massachusetts (0.04)
(2 more...)

Genre: Research Report (0.84)

Industry:

Leisure & Entertainment > Games (0.67)
Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Parameterized Reinforcement Learning for Optical System Optimization

Wankerl, Heribert, Stern, Maike L., Mahdavi, Ali, Eichler, Christoph, Lang, Elmar W.

arXiv.org Artificial Intelligence, Oct-9-2020

Designing a multi-layer optical system with designated optical characteristics is an inverse design problem in which the resulting design is determined by several discrete and continuous parameters. In particular, we consider three design parameters to describe a multi-layer stack: each layer's dielectric material and thickness, as well as the total number of layers. Such a combination of discrete and continuous parameters poses a challenging optimization problem that often requires a computationally expensive search for an optimal system design. Hence, most methods merely determine the optimal thicknesses of the system's layers. To incorporate layer material and the total number of layers as well, we propose a method that considers the stacking of consecutive layers as parameterized actions in a Markov decision process. We propose an exponentially transformed reward signal that eases policy optimization and adapt a recent variant of Q-learning for inverse design optimization. We demonstrate that our method outperforms human experts and a naive reinforcement learning algorithm concerning the achieved optical characteristics. Moreover, the learned Q-values contain information about the optical properties of multi-layer optical systems, thereby allowing physical interpretation or what-if analysis.

machine learning, parameterized action, reinforcement learning, (17 more...)

arXiv:2010.05769

Country: Europe > Germany > Bavaria > Regensburg (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

LaND: Learning to Navigate from Disengagements

Kahn, Gregory, Abbeel, Pieter, Levine, Sergey

arXiv.org Artificial Intelligence, Oct-9-2020

Consistently testing autonomous mobile robots in real-world scenarios is a necessary aspect of developing autonomous navigation systems. Each time the human safety monitor disengages the robot's autonomy system due to the robot performing an undesirable maneuver, the autonomy developers gain insight into how to improve the autonomy system. However, we believe that these disengagements not only show where the system fails, which is useful for troubleshooting, but also provide a direct learning signal by which the robot can learn to navigate. We present a reinforcement learning approach for learning to navigate from disengagements, or LaND. LaND learns a neural network model that predicts which actions lead to disengagements given the current sensory observation, and then at test time plans and executes actions that avoid disengagements. Our results demonstrate LaND can successfully learn to navigate in diverse, real-world sidewalk environments, outperforming both imitation learning and reinforcement learning approaches. Videos, code, and other material are available on our website: https://sites.google.com/view/sidewalk-learning

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv:2010.04689

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > New Finding (0.69)

Industry: Transportation > Ground > Road (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Deep RL With Information Constrained Policies: Generalization in Continuous Control

Malloy, Tyler, Sims, Chris R., Klinger, Tim, Liu, Miao, Riemer, Matthew, Tesauro, Gerald

arXiv.org Artificial Intelligence, Oct-9-2020

Biological agents learn and act intelligently in spite of a highly limited capacity to process and store information. Many real-world problems involve continuous control, which represents a difficult task for artificial intelligence agents. In this paper we explore the potential learning advantages a natural constraint on information flow might confer onto artificial agents in continuous control tasks. We focus on the model-free reinforcement learning (RL) setting and formalize our approach in terms of an information-theoretic constraint on the complexity of learned policies. We show that our approach emerges in a principled fashion from the application of rate-distortion theory. We implement a novel Capacity-Limited Actor-Critic (CLAC) algorithm and situate it within a broader family of RL algorithms such as the Soft Actor Critic (SAC) and Mutual Information Reinforcement Learning (MIRL) algorithm. Our experiments using continuous control tasks show that compared to alternative approaches, CLAC offers improvements in generalization between training and modified test environments. This is achieved in the CLAC model while displaying the high sample efficiency of similar methods.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv:2010.04646

Country: North America > United States > New York > Rensselaer County > Troy (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Instance Weighted Incremental Evolution Strategies for Reinforcement Learning in Dynamic Environments

Wang, Zhi, Chen, Chunlin, Dong, Daoyi

arXiv.org Artificial Intelligence, Oct-9-2020

Evolution strategies (ES), a family of black-box optimization algorithms, have recently emerged as a scalable alternative to reinforcement learning (RL) approaches such as Q-learning or policy gradient, and are much faster when many central processing units (CPUs) are available due to better parallelization. In this paper, we propose a systematic incremental learning method for ES in dynamic environments. The goal is to adjust the previously learned policy to a new one incrementally whenever the environment changes. We incorporate an instance weighting mechanism with ES to facilitate its learning adaptation, while retaining the scalability of ES. During parameter updating, higher weights are assigned to instances that contain more new knowledge, thus encouraging the search distribution to move towards new promising areas of parameter space. We propose two easy-to-implement metrics to calculate the weights: instance novelty and instance quality. Instance novelty measures an instance's difference from the previous optimum in the original environment, while instance quality corresponds to how well an instance performs in the new environment. The resulting algorithm, Instance Weighted Incremental Evolution Strategies (IW-IES), is verified to achieve significantly improved performance on a suite of robot navigation tasks. This paper thus introduces a family of scalable ES algorithms for RL domains that enables rapid learning adaptation to dynamic environments.
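
The instance weighting idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: novelty is taken as the Euclidean distance from the previous optimum, quality as the return shifted so the worst instance scores zero, and the weights simply rescale each instance's contribution to a standard ES search gradient. This is not the IW-IES algorithm itself.

```python
import math


def instance_weights(candidates, returns, previous_optimum, metric="novelty"):
    """Normalized per-instance weights (an assumed, simplified scheme).

    novelty: Euclidean distance of a candidate from the previous optimum.
    quality: return in the new environment, shifted so the worst instance is 0.
    """
    if metric == "novelty":
        scores = [math.dist(c, previous_optimum) for c in candidates]
    elif metric == "quality":
        worst = min(returns)
        scores = [r - worst for r in returns]
    else:
        raise ValueError(f"unknown metric: {metric}")
    total = sum(scores)
    if total == 0.0:
        # All instances score the same: fall back to uniform weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_es_gradient(noises, returns, weights, sigma):
    """Weighted ES search gradient: sum_k w_k * R_k * eps_k / sigma."""
    dim = len(noises[0])
    grad = [0.0] * dim
    for noise, ret, weight in zip(noises, returns, weights):
        for i in range(dim):
            grad[i] += weight * ret * noise[i] / sigma
    return grad
```

With uniform weights the second function reduces to the usual ES gradient estimate, so the weighting can be read as a reallocation of influence towards instances that are far from the old optimum or that perform well in the new environment.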

dynamic environment, evolutionary algorithm, machine learning, (17 more...)

arXiv:2010.04605

Country:

Oceania > Australia > New South Wales (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Learning Intrinsic Symbolic Rewards in Reinforcement Learning

Sheikh, Hassam, Khadka, Shauharda, Miret, Santiago, Majumdar, Somdeb

arXiv.org Artificial Intelligence, Oct-9-2020

Learning effective policies for sparse objectives is a key challenge in Deep Reinforcement Learning (RL). A common approach is to design task-related dense rewards to improve task learnability. While such rewards are easily interpreted, they rely on heuristics and domain expertise. Alternate approaches that train neural networks to discover dense surrogate rewards avoid heuristics, but are high-dimensional, black-box solutions offering little interpretability. In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees, thus making them more tractable for analysis. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy gradient learning of a neural network policy. We test our method on continuous action spaces in MuJoCo and discrete action spaces in Atari and Pygame environments. We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks. Notably, we significantly outperform a widely used, contemporary neural-network-based reward-discovery algorithm in all environments considered.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv:2010.03694

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning

Ahmed, Ossama, Träuble, Frederik, Goyal, Anirudh, Neitz, Alexander, Wüthrich, Manuel, Bengio, Yoshua, Schölkopf, Bernhard, Bauer, Stefan

arXiv.org Machine Learning, Oct-8-2020

Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks, inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.

arxiv preprint arxiv, machine learning, reinforcement learning, (15 more...)

arXiv:2010.04296

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry:

Education (1.00)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.84)

Deep Bayesian Quadrature Policy Optimization

Tej, Akella Ravi, Azizzadenesheli, Kamyar, Ghavamzadeh, Mohammad, Anandkumar, Anima, Yue, Yisong

arXiv.org Machine Learning, Oct-8-2020

We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods are less scalable due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and (iii) the uncertainty in gradient estimation, which can be incorporated to further improve performance.
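
For context on the baseline this abstract refers to, here is a minimal sketch of plain Monte-Carlo policy gradient estimation with the score-function estimator. It is not DBQPG and contains no Bayesian quadrature. The toy problem (a one-step task with a Gaussian policy and a quadratic reward) and all names are assumptions chosen so that the exact gradient is known in closed form and the noise in the estimate at small sample sizes is easy to see.

```python
import random


def reward(action, target=1.0):
    """Toy one-step reward (hypothetical): peaks when action == target."""
    return -(action - target) ** 2


def mc_policy_gradient(mu, sigma, num_samples, rng):
    """Score-function estimate of d/dmu E[reward(a)] for a ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(num_samples):
        action = rng.gauss(mu, sigma)
        score = (action - mu) / sigma ** 2  # d/dmu log N(action; mu, sigma^2)
        total += reward(action) * score
    return total / num_samples


def exact_gradient(mu, target=1.0):
    """Closed form here, since E[reward] = -((mu - target)^2 + sigma^2)."""
    return -2.0 * (mu - target)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 1000, 20000):
        print(n, mc_policy_gradient(0.0, 1.0, n, rng), exact_gradient(0.0))
```

Running the script prints the estimate next to the exact value for increasing sample counts; the small-sample estimates typically scatter widely around the exact gradient, which is the variance problem the abstract describes.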

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv:2006.15637

Country:

Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

Trajectory Inspection: A Method for Iterative Clinician-Driven Design of Reinforcement Learning Studies

Ji, Christina X., Oberst, Michael, Kanjilal, Sanjat, Sontag, David

arXiv.org Artificial Intelligence, Oct-8-2020

Treatment policies learned via reinforcement learning (RL) from observational health data are sensitive to subtle choices in study design. We highlight a simple approach, trajectory inspection, to bring clinicians into an iterative design process for model-based RL studies. We inspect trajectories where the model recommends unexpectedly aggressive treatments or believes its recommendations would lead to much more positive outcomes. Then, we examine clinical trajectories simulated with the learned model and policy alongside the actual hospital course to uncover possible modeling issues. To demonstrate that this approach yields insights, we apply it to recent work on RL for inpatient sepsis management. We find that a design choice around maximum trajectory length leads to a model bias towards discharge, that the RL policy preference for high vasopressor doses may be linked to small sample sizes, and that the model has a clinically implausible expectation of discharge without weaning off vasopressors.

machine learning, reinforcement learning, trajectory, (18 more...)

arXiv:2010.04279

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Information-Driven Adaptive Sensing Based on Deep Reinforcement Learning

Murad, Abdulmajid, Kraemer, Frank Alexander, Bach, Kerstin, Taylor, Gavin

arXiv.org Artificial Intelligence, Oct-8-2020

In order to make better use of deep reinforcement learning in the creation of sensing policies for resource-constrained IoT devices, we present and study a novel reward function based on the Fisher information value. This reward function enables IoT sensor devices to learn to spend available energy on measurements at otherwise unpredictable moments, while conserving energy at times when measurements would provide little new information. This is a highly general approach, which allows for a wide range of use cases without significant human design effort or hyper-parameter tuning. We illustrate the approach in a scenario of workplace noise monitoring, where results show that the learned behavior outperforms a uniform sampling strategy and comes close to a near-optimal oracle solution.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

doi: 10.1145/3410992.3411001

arXiv:2010.04112

Country:

Europe > Sweden > Skåne County > Malmö (0.05)
Europe > Norway > Central Norway > Trøndelag > Trondheim (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Anne Arundel County > Annapolis (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Energy (1.00)
Information Technology (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
