Reinforcement Learning


Objective Mismatch in Model-based Reinforcement Learning

arXiv.org Machine Learning

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework, which we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step-ahead prediction and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate and, vice versa, globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step-ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.
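
The re-weighting idea mentioned above can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: the function names, the use of squared error as a stand-in for a fixed-variance Gaussian likelihood, and the choice of weights that decay with distance from a set of task-relevant reference states.

```python
import math


def reweighted_one_step_loss(predictions, targets, weights):
    """Weighted mean of squared one-step prediction errors.

    Squared error stands in for a Gaussian likelihood with fixed variance
    (an assumption of this sketch, not a detail from the abstract).
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    loss = 0.0
    for pred, target, w in zip(predictions, targets, weights):
        sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
        loss += w * sq_err
    return loss / total_weight


def distance_weights(states, reference_states, scale=1.0):
    """Hypothetical weighting: exp(-d / scale), where d is the Euclidean
    distance from each state to its nearest reference state."""
    weights = []
    for s in states:
        nearest = min(math.dist(s, r) for r in reference_states)
        weights.append(math.exp(-nearest / scale))
    return weights
```

With weights like these, transitions near the reference states dominate the loss, so the model is pushed toward local accuracy where the task needs it rather than toward uniform global accuracy.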


Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning

arXiv.org Machine Learning

Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare. In such settings, however, observed actions are often confounded with transitions by unobserved variables, rendering exact evaluation of new policies impossible, i.e., unidentifiable. We develop a robust approach that estimates sharp bounds on the (unidentifiable) value of a given policy in an infinite-horizon problem given data from another policy with unobserved confounding subject to a sensitivity model. We phrase the problem precisely as computing the support function of the set of all stationary state-occupancy ratios that agree with both the data and the sensitivity model. We show how to express this set using a new partially identified estimating equation and prove convergence to the sharp bounds, as we collect more confounded data. We prove that membership in the set can be checked by solving a linear program, while the support function is given by a difficult nonconvex optimization problem. We leverage an analytical solution for the finite-state-space case to develop approximations based on nonconvex projected gradient descent. We demonstrate the resulting bounds empirically.


Learning to Switch Between Machines and Humans

arXiv.org Machine Learning

Reinforcement learning algorithms have been mostly developed and evaluated under the assumption that they will operate in a fully autonomous manner, that is, that they will take all actions. However, in safety-critical applications, full autonomy faces a variety of technical, societal and legal challenges, which have precluded the use of reinforcement learning policies in real-world systems. In this work, our goal is to develop algorithms that, by learning to switch control between machines and humans, allow existing reinforcement learning policies to operate under different automation levels. More specifically, we first formally define the learning to switch problem using finite horizon Markov decision processes. Then, we show that, if the human policy is known, we can find the optimal switching policy directly by solving a set of recursive equations using backwards induction. However, in practice, the human policy is often unknown. To overcome this, we develop an algorithm that uses upper confidence bounds on the human policy to find a sequence of switching policies whose total regret with respect to the optimal switching policy is sublinear. Simulation experiments on two important tasks in autonomous driving, lane keeping and obstacle avoidance, demonstrate the effectiveness of the proposed algorithms and illustrate our theoretical findings.
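
The known-human-policy case described above can be sketched as a backward-induction pass over a finite-horizon MDP. This is a simplified illustration, not the paper's formulation: it assumes tabular states and actions, maximizes rewards, ignores any cost of switching control, and uses hypothetical names (`optimal_switching`, `policies`, `reward`, `transition`).

```python
def optimal_switching(n_states, horizon, policies, reward, transition):
    """Backward induction over which agent should control each state.

    policies:   dict mapping agent name -> policies[agent][s][a], the
                probability that the agent takes action a in state s
    reward:     reward[s][a], immediate reward
    transition: transition[s][a][s2], probability of moving from s to s2
    Returns (values, switch), where values[t][s] is the optimal value and
    switch[t][s] is the agent given control at time t in state s.
    """
    values = [[0.0] * n_states for _ in range(horizon + 1)]
    switch = [[None] * n_states for _ in range(horizon)]
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            best_agent, best_value = None, float("-inf")
            for agent, policy in policies.items():
                # Expected value of handing control to this agent at (t, s).
                q = 0.0
                for a, p_a in enumerate(policy[s]):
                    future = sum(
                        p * values[t + 1][s2]
                        for s2, p in enumerate(transition[s][a])
                    )
                    q += p_a * (reward[s][a] + future)
                if q > best_value:
                    best_agent, best_value = agent, q
            values[t][s] = best_value
            switch[t][s] = best_agent
    return values, switch
```

Ties are broken in favor of the agent listed first in `policies`. When the human policy is unknown, the abstract's approach replaces it with upper confidence bounds, which this sketch does not attempt.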


Hyper-Meta Reinforcement Learning with Sparse Reward

arXiv.org Artificial Intelligence

Despite their success, existing meta reinforcement learning methods still have difficulty in learning a meta policy effectively for RL problems with sparse reward. To this end, we develop a novel meta reinforcement learning framework, Hyper-Meta RL (HMRL), for sparse reward RL problems. It consists of meta state embedding, meta reward shaping and meta policy learning modules: the cross-environment meta state embedding module constructs a common meta state space to adapt to different environments; the meta-state-based, environment-specific meta reward shaping effectively extends the original sparse reward trajectory by cross-environmental knowledge complementarity; as a consequence, the meta policy achieves better generalization and efficiency with the shaped meta reward. Experiments with sparse reward show the superiority of HMRL on both transferability and policy learning efficiency.



Deep Reinforcement Learning With TensorFlow 2.1 (Roman Ring)

#artificialintelligence

In this tutorial, I will give an overview of the TensorFlow 2.x features through the lens of deep reinforcement learning (DRL) by implementing an advantage actor-critic (A2C) agent, solving the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.x, I will do my best to make DRL approachable as well, including a bird's-eye overview of the field. In fact, since the main focus of the 2.x release is making life easier for developers, it's a great time to get into DRL with TensorFlow. For example, the source code for this blog post is under 150 lines, including comments! Code is available on GitHub here and as a notebook on Google Colab here.
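
As a rough illustration of the "advantage" part of A2C, the sketch below computes discounted returns and advantages for one rollout. It is plain Python rather than TensorFlow and is not taken from the tutorial's source; the function name, argument layout, and default discount factor are assumptions made for this example.

```python
def returns_and_advantages(rewards, values, dones, next_value, gamma=0.99):
    """Discounted returns and advantages for a single rollout.

    rewards[t], values[t] and dones[t] describe step t; next_value is the
    critic's estimate for the state after the last step (used to bootstrap).
    A done flag of 1 stops the return from leaking across episode ends.
    """
    returns = [0.0] * len(rewards)
    running = next_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    # Advantage: how much better the observed return was than the critic's guess.
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages
```

In an actor-critic update, the returns would typically serve as regression targets for the critic, while the advantages weight the log-probabilities of the actions taken in the actor's loss.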


Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI

#artificialintelligence

Learning and motivation are driven by internal and external rewards. Many of our day-to-day behaviours are guided by predicting, or anticipating, whether a given action will result in a positive (that is, rewarding) outcome. The study of how organisms learn from experience to correctly anticipate rewards has been a productive research field for well over a century, since Ivan Pavlov's seminal psychological work. In his most famous experiment, dogs were trained to expect food some time after a buzzer sounded. These dogs began salivating as soon as they heard the sound, before the food had arrived, indicating they'd learned to predict the reward.


Mastering Machine Learning Algorithms: Expert techniques for implementing popular machine learning algorithms, fine-tuning your models, and understanding how they work, 2nd Edition: Giuseppe Bonaccorso: 9781838820299: Amazon.com: Books

#artificialintelligence

Giuseppe Bonaccorso is Head of Data Science in a large multinational company. He received his M.Sc.Eng. in Electronics in 2005 from University of Catania, Italy, and continued his studies at University of Rome Tor Vergata, and University of Essex, UK. His main interests include machine/deep learning, reinforcement learning, big data, and bio-inspired adaptive systems. He is author of several publications including Machine Learning Algorithms and Hands-On Unsupervised Learning with Python, published by Packt.


Provably Efficient Adaptive Approximate Policy Iteration

arXiv.org Machine Learning

Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games and robotics. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys an O(T^{2/3}) regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of O(T^{3/4}) for the average-reward case with function approximation. Our algorithm and analysis rely on adversarial online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.


Proficiency Aware Multi-Agent Actor-Critic for Mixed Aerial and Ground Robot Teaming

arXiv.org Artificial Intelligence

Mixed cooperation and competition are the realistic scenarios in which multi-robot systems are deployed, such as multi-UAV/UGV teaming for tracking criminal vehicles and protecting important individuals. The types and total number of robots are important factors that influence the quality of mixed cooperation. Robot deployments are strongly influenced by the real-world environment, such as open space, forest, and urban building clusters, because different robots have different configurations suited to different environments. For example, UGVs are good at moving on urban roads and reaching forest areas, while UAVs are good at flying in open space and around clusters of tall buildings. However, it is challenging to design collective behaviors for robot cooperation that account for dynamic changes in robot capabilities, working status, and environmental constraints. To address this challenge, we propose a novel proficiency-aware mixed-environment multi-agent deep reinforcement learning method (Mix-DRL). In Mix-DRL, robot capability and environmental factors are formalized in the model used to update the policy, so as to capture the nonlinear relations between heterogeneous team deployment strategies and real-world environmental conditions. Mix-DRL can largely exploit robot capability while staying aware of environmental limitations. Validated with a heterogeneous team of 2 UAVs and 2 UGVs in tasks such as social security for criminal vehicle tracking, Mix-DRL achieved a 14.20% improvement in cooperation. Given its general setting, Mix-DRL can be used to guide the general cooperation of UAVs and UGVs for multi-target tracking.