Reinforcement Learning
Using Machine Teaching to Investigate Human Assumptions when Teaching Reinforcement Learners
Chuang, Yun-Shiuan, Zhang, Xuezhou, Ma, Yuzhe, Ho, Mark K., Austerweil, Joseph L., Zhu, Xiaojin
Successful teaching requires an assumption of how the learner learns - how the learner uses experiences from the world to update their internal states. We investigate what expectations people have about a learner when they teach them in an online manner using rewards and punishment. We focus on a common reinforcement learning method, Q-learning, and examine what assumptions people have using a behavioral experiment. To do so, we first establish a normative standard, by formulating the problem as a machine teaching optimization problem. To solve the machine teaching optimization problem, we use a deep learning approximation method which simulates learners in the environment and learns to predict how feedback affects the learner's internal states. What do people assume about a learner's learning and discount rates when they teach them an idealized exploration-exploitation task? In a behavioral experiment, we find that people can teach the task to Q-learners in a relatively efficient and effective manner when the learner uses a small value for its discounting rate and a large value for its learning rate. However, they still are suboptimal. We also find that providing people with real-time updates of how possible feedback would affect the Q-learner's internal states weakly helps them teach. Our results reveal how people teach using evaluative feedback and provide guidance for how engineers should design machine agents in a manner that is intuitive for people.
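The learning rate and discount rate mentioned in the abstract are the two parameters of the standard tabular Q-learning update. The snippet below is a minimal sketch of that update, not the authors' experimental code: the two-state table, the reward of 1.0, and the particular values of `alpha` (large) and `gamma` (small) are illustrative assumptions.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a list of rows, one row of action values per state.
    alpha is the learning rate, gamma the discount rate.
    """
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical example: 2 states, 2 actions, all values start at zero.
q = [[0.0, 0.0], [0.0, 0.0]]
# A teacher gives reward 1.0 for taking action 1 in state 0.
td_error = q_learning_update(q, state=0, action=1, reward=1.0,
                             next_state=1, alpha=0.9, gamma=0.1)
```

With a large learning rate, a single piece of feedback moves the value estimate most of the way to its target; with a small discount rate, the target depends mostly on the immediate reward rather than on later states.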
Critic Regularized Regression
Wang, Ziyu, Novikov, Alexander, Zolna, Konrad, Springenberg, Jost Tobias, Reed, Scott, Shahriari, Bobak, Siegel, Noah, Merel, Josh, Gulcehre, Caglar, Heess, Nicolas, de Freitas, Nando
Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
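The abstract does not spell out the loss, so the following is only a sketch of one common way to read "critic-regularized regression": behavior cloning on the logged actions, with each log-likelihood term weighted by a function of the critic's advantage estimate. The discrete action space, the function name `crr_style_policy_loss`, and the `mode` and `beta` parameters are assumptions made for illustration.

```python
import math


def crr_style_policy_loss(log_probs, q_values, actions, mode="exp", beta=1.0):
    """Advantage-weighted negative log-likelihood over logged transitions.

    log_probs[i][a] : log pi(a | s_i) under the current policy
    q_values[i][a]  : critic estimate Q(s_i, a)
    actions[i]      : action actually taken in the dataset
    In a gradient-based implementation the weights would be treated as
    constants (no gradient flows through them).
    """
    total = 0.0
    for lp, q, a in zip(log_probs, q_values, actions):
        # State value as the policy-weighted average of the critic.
        value = sum(math.exp(l) * qv for l, qv in zip(lp, q))
        advantage = q[a] - value
        if mode == "binary":
            weight = 1.0 if advantage > 0 else 0.0
        else:
            weight = math.exp(advantage / beta)
        total += weight * lp[a]
    return -total / len(actions)


# Hypothetical batch: uniform policy, action 0 is better in both states.
uniform = [math.log(0.5), math.log(0.5)]
loss = crr_style_policy_loss([uniform, uniform], [[1.0, 0.0], [1.0, 0.0]],
                             actions=[0, 1], mode="binary")
```

In the binary mode, only logged actions that the critic rates above the policy's average contribute to the loss, so the policy imitates the better part of the dataset.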
Analysis of Social Robotic Navigation approaches: CNN Encoder and Incremental Learning as an alternative to Deep Reinforcement Learning
Ferreira, Janderson, Júnior, Agostinho A. F., Castro, Letícia, Galvão, Yves M., Barros, Pablo, Fernandes, Bruno J. T.
Dealing with social tasks in robotic scenarios is difficult, as having humans in the learning loop is incompatible with most state-of-the-art machine learning algorithms. This is the case when exploring incremental learning models, in particular those involving reinforcement learning. In this work, we discuss this problem and possible solutions by analysing a previous study on adaptive convolutional encoders for a social navigation task.
A Hybrid PAC Reinforcement Learning Algorithm
Zehfroosh, Ashkan, Tanner, Herbert G.
This paper offers a new hybrid probably approximately correct (PAC) reinforcement learning (RL) algorithm for Markov decision processes (MDPs) that intelligently maintains favorable features of its parents. The designed algorithm, referred to as the Dyna-Delayed Q-learning (DDQ) algorithm, combines model-free and model-based learning approaches while outperforming both in most cases. The paper includes a PAC analysis of the DDQ algorithm and a derivation of its sample complexity. Numerical results that support the claim regarding the new algorithm's sample efficiency compared to its parents are showcased in a small grid-world example.
Using Reinforcement Learning to Design Missed Thrust Resilient Trajectories - ASC 2020 - Gereshes
Note: This post is adapted from my conference paper, which I presented at the Astrodynamics Specialists Conference in summer 2020. You can read the full paper here. From ion thrusters to solar sails, spacecraft continue to adopt new and more efficient forms of propulsion. As these low-thrust propulsion methods have become more prevalent, new challenges have arisen. Depending on the mission, low-thrust propulsion elements may need to thrust continuously for days or months.
Machine Learning Strategy and Intro to Reinforcement Learning
NOTE: This course is a continuation of XCS229i: Machine Learning. Though not strictly required, it is highly recommended to take XCS229i before enrolling in XCS229ii, as assignments assume knowledge of topics in the first course. As machine learning models grow in sophistication, it is increasingly important for its practitioners to be comfortable navigating their many tuning parameters. Through video lectures and hands-on exercises, this course will equip you with the knowledge to get the most out of your data. You will learn the concepts and techniques you need to guide teams of ML practitioners.
Policy Gradient Reinforcement Learning for Policy Represented by Fuzzy Rules: Application to Simulations of Speed Control of an Automobile
Ishihara, Seiji, Igarashi, Harukazu
A method has been proposed that fuses fuzzy inference with policy gradient reinforcement learning: it directly learns the parameters of a policy function represented by weighted fuzzy rules so as to maximize the expected reward per episode. A previous study applied this method to speed control of an automobile and obtained correct policies, some of which control the speed appropriately, while many others produce undesirable oscillation in speed. In general, a policy that causes sudden changes or oscillation in the output value over time is undesirable, and in many cases a policy that changes the output value smoothly is preferred. In this paper, we propose a fusion method whose objective function introduces defuzzification with a stochastically weighted center-of-gravity model and a constraint term for smoothness over time, in order to suppress sudden changes in the output value of the fuzzy controller. We then present the learning rule for this fusion and consider the effect of the reward function on fluctuation of the output value. Experiments applying our method to speed control of an automobile confirmed that it suppresses undesirable fluctuation in the time series of the output value. Moreover, they also showed that differences between reward functions might adversely affect the learning results.
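The two ingredients named in the abstract, center-of-gravity defuzzification and a smoothness term on the controller output, can be illustrated with a small sketch. This is not the paper's formulation: the deterministic weighting (the stochastic weighting is omitted), the squared-difference form of the penalty, and the names `cog_output`, `smoothness_penalty`, and `lam` are all assumptions chosen for illustration.

```python
def cog_output(memberships, weights, centers):
    """Weighted center-of-gravity defuzzification for a single input.

    memberships[i] : firing degree of fuzzy rule i for the current input
    weights[i]     : learnable weight of rule i
    centers[i]     : output value recommended by rule i
    Assumes at least one rule has non-zero strength.
    """
    strengths = [m * w for m, w in zip(memberships, weights)]
    return sum(s * c for s, c in zip(strengths, centers)) / sum(strengths)


def smoothness_penalty(outputs, lam):
    """Penalize squared changes between consecutive controller outputs."""
    return lam * sum((b - a) ** 2 for a, b in zip(outputs, outputs[1:]))


# Hypothetical usage: two rules fire equally, recommending outputs 0.0 and 2.0.
u = cog_output([0.5, 0.5], [1.0, 1.0], [0.0, 2.0])
# A jumpy output sequence is penalized more than a constant one.
penalty = smoothness_penalty([0.0, 1.0, 3.0], lam=0.5)
```

Subtracting a term like `penalty` from the episode reward is one simple way to trade off task performance against sudden changes in the output.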
Visualizing the Loss Landscape of Actor Critic Methods with Applications in Inventory Optimization
Bekci, Recep Yusuf, Gümüş, Mehmet
Continuous control is a widely applicable area of reinforcement learning. The main approaches in this area are actor-critic methods, which commonly use policy gradients of neural approximators. The focus of our study is to show the characteristics of the actor loss function, which is the essential part of the optimization. We exploit low-dimensional visualizations of the loss function and compare the loss landscapes of various algorithms. Furthermore, we apply our approach to multi-store dynamic inventory control, a notoriously difficult problem in supply chain operations, and explore the shape of the loss function associated with the optimal policy. We modelled and solved the problem using reinforcement learning while obtaining a loss landscape that favors optimality.
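A common way to obtain a low-dimensional view of a loss function is to evaluate it along a line through the current parameters in a random direction. The sketch below shows that general idea only; whether the authors use this exact procedure is not stated in the abstract. The quadratic `toy_loss` is a hypothetical stand-in for an actor loss, and `loss_slice` is an illustrative helper name.

```python
import math
import random


def loss_slice(loss_fn, theta, direction, alphas):
    """Evaluate loss_fn at theta + alpha * unit(direction) for each alpha."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [loss_fn([t + a * u for t, u in zip(theta, unit)]) for a in alphas]


def toy_loss(theta):
    """Hypothetical stand-in for an actor loss: a simple quadratic bowl."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]                            # parameters to inspect
direction = [rng.gauss(0.0, 1.0) for _ in theta]   # random probe direction
alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]               # step sizes along the line
values = loss_slice(toy_loss, theta, direction, alphas)
```

Plotting `values` against `alphas` gives a one-dimensional slice of the landscape; using two directions and a grid of step sizes gives the familiar two-dimensional surface plot.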
Modern Reinforcement Learning: Actor-Critic Methods
How to implement cutting-edge artificial intelligence research papers in the OpenAI Gym using the PyTorch framework. What you'll learn: how to code policy gradient methods in PyTorch; how to code Deep Deterministic Policy Gradients (DDPG) in PyTorch; how to code Twin Delayed Deep Deterministic Policy Gradients (TD3) in PyTorch; how to code actor critic algorithms in PyTorch; how to implement cutting-edge artificial intelligence research papers in Python. Description: In this advanced course on deep reinforcement learning, you will learn how to implement policy gradient, actor critic, deep deterministic policy gradient (DDPG), and twin delayed deep deterministic policy gradient (TD3) algorithms in a variety of challenging environments from the OpenAI Gym. The course begins with a practical review of the fundamentals of reinforcement learning, including topics such as: The Bellman Equation; Markov Decision Processes; Monte Carlo Prediction; Temporal Difference Prediction TD(0); Temporal Difference Control with Q Learning. It then moves straight into coding up our first agent: a blackjack-playing artificial intelligence. From there we will progress to teaching an agent to balance the cart pole using Q learning. After mastering the fundamentals, the pace quickens, and we move straight into an introduction to policy gradient methods. We cover the REINFORCE algorithm, and use it to teach an artificial intelligence to land on the moon in the lunar lander environment from the OpenAI Gym.
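REINFORCE weights the log-probability of each action by the discounted return that followed it, so computing returns-to-go is the first building block of any implementation. The snippet below is a minimal sketch in plain Python rather than the course's PyTorch code; the function name and the example rewards are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical three-step episode with reward 1.0 at every step.
episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

In a full implementation, each entry of `episode_returns` would multiply the negative log-probability of the corresponding action to form the policy loss.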