The ability to learn from off-policy data -- data generated from past interaction with the environment -- is essential to data-efficient reinforcement learning. Recent work has shown that using off-policy data not only allows the re-use of data but can even improve performance relative to on-policy reinforcement learning. In this work we investigate whether a recently proposed method for learning a better data-generation policy, commonly called a behavior policy, can also increase the data efficiency of policy gradient reinforcement learning. Empirical results demonstrate that with an appropriately selected behavior policy we can estimate the policy gradient more accurately. The results also motivate further work on methods that adapt the behavior policy as the policy being learned changes.
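The core mechanism behind estimating a policy gradient from data produced by a different behavior policy is importance weighting. The following is a minimal sketch on an assumed one-step toy problem (a two-action bandit with a sigmoid target policy); the names and setup are illustrative, not the paper's method:

```python
import math
import random

random.seed(1)

theta = 0.0  # target policy parameter: pi(a=1) = sigmoid(theta)

def pi(a, th):
    """Probability the target policy assigns to action a in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-th))
    return p1 if a == 1 else 1.0 - p1

behavior = [0.5, 0.5]      # behavior policy: uniform over actions {0, 1}
reward = {0: 0.0, 1: 1.0}  # action 1 yields reward 1

grads = []
for _ in range(10000):
    # sample an action from the behavior policy, not the target policy
    a = random.choices([0, 1], weights=behavior)[0]
    # importance weight corrects for the mismatch between policies
    w = pi(a, theta) / behavior[a]
    # grad of log pi(a | theta) for a sigmoid policy: a - sigmoid(theta)
    glogp = a - 1.0 / (1.0 + math.exp(-theta))
    grads.append(w * glogp * reward[a])

grad_est = sum(grads) / len(grads)
# at theta = 0 the true gradient is pi(1) * (1 - pi(1)) * 1 = 0.25
```

The variance of this estimator depends on the behavior policy, which is what makes choosing that policy well worthwhile.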
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family.
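The interpolation idea can be stated very compactly. Below is a hedged sketch with illustrative names (the mixing coefficient `nu` and the two gradient terms are assumptions, not the paper's notation): a convex combination of an on-policy likelihood-ratio gradient and an off-policy, value-function-based gradient.

```python
def interpolated_gradient(g_on_policy, g_off_policy, nu=0.2):
    """Mix on- and off-policy gradient estimates with coefficient nu.

    nu = 0 recovers the pure on-policy policy gradient update;
    nu = 1 recovers the pure off-policy (critic-based) update.
    """
    return [(1.0 - nu) * a + nu * b for a, b in zip(g_on_policy, g_off_policy)]
```

Special cases of this family then correspond to particular choices of `nu` and of the value-function estimator used in the off-policy term.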
So far, we have extended our reinforcement learning discussion from discrete to continuous state spaces and touched on applying tile coding to on-policy learning, that is, learning that follows the trajectory the agent actually takes. Now let's talk about off-policy learning in continuous settings. In discrete settings, on-policy learning generalises easily to off-policy learning (say, from Sarsa to Q-learning); in continuous settings, the generalisation can be troublesome and in some scenarios causes divergence. The most prominent consequence is that off-policy learning may not converge in continuous settings. The main reason is that the distribution of updates in the off-policy case does not follow the on-policy distribution: the state-action pairs used for updates may not be the ones the agent actually visits.
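To make the Sarsa-to-Q-learning distinction concrete, here is a minimal tabular sketch of both update rules (the states, actions, `alpha`, and `gamma` are toy values chosen only for illustration). The on-policy target bootstraps from the action actually taken next; the off-policy target bootstraps from the greedy action, whatever the behavior does:

```python
alpha, gamma = 0.5, 0.9
Q = {("s", "left"): 0.0, ("s", "right"): 0.0,
     ("s2", "left"): 1.0, ("s2", "right"): 4.0}

def sarsa_update(s, a, r, s2, a2):
    # on-policy: target uses the next action a2 the agent actually takes
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(s, a, r, s2):
    # off-policy: target uses the greedy action in s2,
    # regardless of what the behavior policy does there
    best = max(Q[(s2, b)] for b in ("left", "right"))
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

sarsa_update("s", "left", 0.0, "s2", "left")   # bootstraps from Q(s2, left) = 1
q_learning_update("s", "right", 0.0, "s2")     # bootstraps from max = 4
```

In the tabular case both converge; once `Q` is replaced by a function approximator such as tile coding, the mismatch between the update distribution and the on-policy distribution is exactly what can drive divergence.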
While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies. We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured). The challenge is that decision tree policies are difficult to train. We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole.
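A key idea in using the oracle's Q-function to guide imitation is to weight each visited state by how costly a wrong action there would be, then resample the training set by those weights before fitting the tree. The following is a minimal sketch of just that resampling step on a toy dataset (the states, Q-values, and the gap-based weight are illustrative assumptions, not the paper's exact loss):

```python
import random

random.seed(0)

# toy states labeled by a hypothetical oracle's Q-values over two actions
states = [0, 1, 2, 3]
oracle_q = {0: [1.0, 0.9], 1: [5.0, 0.0], 2: [0.5, 0.5], 3: [2.0, 1.0]}

# weight = Q-value gap: how much reward a wrong action forfeits there
weights = {s: max(q) - min(q) for s, q in oracle_q.items()}

# resample the dataset proportionally to the weights; the tree is then
# fit on the resampled states with the oracle's greedy actions as labels
total = sum(weights.values())
probs = [weights[s] / total for s in states]
resampled = random.choices(states, weights=probs, k=1000)
```

State 1, where the wrong action costs 5.0, dominates the resampled set, while state 2, where both actions are equally good, vanishes from it; the tree learner thus spends its limited capacity on the critical states.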
Applying reinforcement learning to large Markov Decision Processes (MDPs) is an important issue for solving very large problems. Since exact solution is often intractable, many approaches have been proposed to approximate the value function (for example, TD-Gammon (Tesauro 1995)) or to approximate the policy directly by gradient methods (Russell & Norvig 2002). Such approaches provide a policy over the whole state space, whereas classical reinforcement learning algorithms do not guarantee exploration of all states in finite time. However, these approaches often require a manual definition of the parameters of the approximation functions. Recently, (Lagoudakis & Parr 2003) introduced the problem of approximating the policy with a policy iteration algorithm that combines a rollout algorithm with Support Vector Machines (SVMs).