- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.93)
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
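Since the abstract above centers on a multi-step soft consistency error, the following is a minimal NumPy sketch of that quantity under the entropy-regularized setup it describes; the function name, shapes, and toy numbers are illustrative and not the authors' code.

```python
import numpy as np

def path_consistency_error(values, log_pis, rewards, gamma=0.99, tau=0.1):
    """Soft consistency error C for one length-d sub-trajectory.

    values:  [V(s_t), V(s_{t+d})]      -- state values at the two endpoints
    log_pis: [log pi(a_t | s_t), ...]  -- d log-probabilities along the path
    rewards: [r_t, ..., r_{t+d-1}]     -- d rewards along the path
    Names and shapes are illustrative, not the authors' code.
    """
    d = len(rewards)
    soft_return = sum(gamma ** i * (rewards[i] - tau * log_pis[i]) for i in range(d))
    return -values[0] + gamma ** d * values[1] + soft_return

# PCL minimizes 0.5 * C**2 over sub-trajectories drawn from both on-policy
# rollouts and a replay buffer of off-policy traces.
C = path_consistency_error(values=[1.2, 0.9],
                           log_pis=np.log([0.5, 0.4, 0.7]),
                           rewards=[0.0, 0.0, 1.0])
loss = 0.5 * C ** 2
```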
Reviews: Bridging the Gap Between Value and Policy Based Reinforcement Learning
SUMMARY: The paper considers the entropy regularized discounted Markov Decision Process (MDP) and shows the relation between the optimal value, action-value, and policy. Moreover, it shows that the optimal value function and policy satisfy a temporal consistency in the form of a Bellman-like equation (Theorem 1), which can also be extended to an n-step version (Corollary 2). The paper introduces Path Consistency Learning by enforcing this temporal consistency, which is essentially a Bellman residual minimization procedure (Section 5). SUMMARY OF EVALUATION: Quality: Parts of the paper are sound (Sections 3 and 4); parts are not (Section 5). Clarity: The paper is well-written. Originality: Some results seem to be novel, but similar ideas and analyses have been proposed before.
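For reference, the one-step consistency the review calls Theorem 1 and its n-step extension (Corollary 2) can be written as below, with discount gamma and entropy-regularization weight tau; this transcription assumes the paper's deterministic-dynamics setting.

```latex
% One-step consistency (Theorem 1): for any action a_t taken in state s_t,
V^*(s_t) - \gamma V^*(s_{t+1}) = r(s_t, a_t) - \tau \log \pi^*(a_t \mid s_t)

% n-step consistency (Corollary 2): along any action sequence a_t, \dots, a_{t+n-1},
V^*(s_t) - \gamma^n V^*(s_{t+n})
  = \sum_{i=0}^{n-1} \gamma^i \bigl( r(s_{t+i}, a_{t+i}) - \tau \log \pi^*(a_{t+i} \mid s_{t+i}) \bigr)
```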
Reviews: Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs
The paper describes a sampling method for learning agent behaviors in interactive POMDPs (I-POMDPs). In general, I-POMDPs are a multi-agent extension of POMDPs in which the belief space includes, in addition to a belief about the environment state, nested recursive beliefs about the other agents' models. I-POMDP solutions, including the one proposed in the paper, are largely approximated at a finite nesting depth using either intentional models of others (e.g., their nested beliefs, state transitions, optimality criterion, etc.) or subintentional models of others (essentially "summaries of behavior" such as fictitious play). The proposed approach uses samples of the other agent's models at a particular depth to compute its values and policy. Related work on an interactive particle filter assumed the full frame (b, S, A, Omega, T, R, OC) was known.
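To make the nesting concrete, here is a hypothetical, heavily simplified Python sketch of sampling another agent's behavior at a finite depth; intentional models are assumed to expose a best_response method (an invented interface used only for illustration), and depth zero falls back to a subintentional model.

```python
import random

def sample_other_agent_policy(depth, candidate_models):
    """Illustrative only: recursively sample a model of the other agent,
    solving it one level shallower, until a subintentional base case."""
    if depth == 0:
        # Subintentional "summary of behavior": act uniformly at random.
        return lambda obs, actions: random.choice(actions)
    model = random.choice(candidate_models)                   # sampled intentional model
    nested = sample_other_agent_policy(depth - 1, candidate_models)
    # The sampled model best-responds while itself modeling the other
    # agent with the shallower (depth - 1) policy.
    return lambda obs, actions: model.best_response(obs, actions, nested)
```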
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn
Deep neural networks provide Reinforcement Learning (RL) with powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of these challenges is that output predictions can churn, leading to uncontrolled changes after each batch update for states not included in the batch. Although such a churn phenomenon exists in every step of network training, how churn occurs and how it impacts RL remain under-explored. In this work, we start by characterizing churn through the lens of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn that leads to a cycle in which the churn in value estimation and policy improvement compounds and biases the learning dynamics throughout the iteration. Further, we concretize the study and focus on the learning issues caused by the chain effect in different settings, including greedy-action deviation in value-based methods, trust-region violation in proximal policy optimization, and dual bias of policy value in actor-critic methods. We then propose a method to reduce the chain effect across different settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. Our experiments demonstrate the effectiveness of our method in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.
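As a hedged illustration of the churn-reduction idea (not the paper's exact CHAIN loss), the sketch below penalizes how far a network's outputs on reference states outside the current batch drift from a frozen snapshot taken at a recent step; the names and weighting are assumptions.

```python
import torch

def churn_regularizer(net, ref_states, snapshot_out, beta=1.0):
    """Penalize drift of predictions on reference (out-of-batch) states,
    relative to a frozen snapshot of the network from a recent step."""
    cur_out = net(ref_states)
    return beta * ((cur_out - snapshot_out.detach()) ** 2).mean()

# Usage sketch inside a training step (snapshot_net is a frozen copy of
# the network kept from an earlier update):
#   with torch.no_grad():
#       snapshot_out = snapshot_net(ref_states)
#   loss = td_loss + churn_regularizer(net, ref_states, snapshot_out)
#   loss.backward()
#   optimizer.step()
```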
Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization
Kadokawa, Yuki, Zhu, Lingwei, Tsurumine, Yoshihisa, Matsubara, Takamitsu
Deep reinforcement learning with domain randomization learns a control policy in various simulations with randomized physical and sensor model parameters so that it transfers to the real world in a zero-shot setting. However, a huge number of samples is often required to learn an effective policy when the range of randomized parameters is extensive, due to the instability of policy updates. To alleviate this problem, we propose a sample-efficient method named cyclic policy distillation (CPD). CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one. Local policies are then learned while cyclically transitioning between sub-domains. CPD accelerates learning through knowledge transfer based on expected performance improvements. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfer. CPD's effectiveness and sample efficiency are demonstrated through simulations with four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from MuJoCo) and a real-robot ball-dispersal task. We published code and videos from our experiments at https://github.com/yuki-kadokawa/cyclic-policy-distillation.
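A minimal sketch of the cyclic training loop the abstract describes, assuming hypothetical learn_policy, make_env, and distill callables; the sub-domain splitting and the neighbor-based transfer rule are illustrative, not the authors' exact implementation.

```python
def cyclic_policy_distillation(param_range, n_subdomains, n_cycles,
                               make_env, learn_policy, distill):
    """Illustrative loop only: split the randomized-parameter range into
    sub-domains, train local policies cyclically with neighbor transfer,
    then distill them into one global policy."""
    low, high = param_range
    edges = [low + (high - low) * i / n_subdomains for i in range(n_subdomains + 1)]
    subdomains = list(zip(edges[:-1], edges[1:]))
    local_policies = [None] * n_subdomains

    for _ in range(n_cycles):
        for i, sub in enumerate(subdomains):
            # Knowledge transfer: initialize from the neighboring local policy
            # visited just before (None on the very first pass).
            neighbor = local_policies[i - 1]
            local_policies[i] = learn_policy(make_env(sub), init_from=neighbor)

    # Distill all local policies into a single global policy for sim-to-real transfer.
    return distill(local_policies)
```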