- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.93)
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
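Since the abstract above centers on a multi-step soft consistency error, the following is a minimal NumPy sketch of that quantity under the entropy-regularized setup it describes; the function name, shapes, and toy numbers are illustrative and not the authors' code.

```python
import numpy as np

def path_consistency_error(values, log_pis, rewards, gamma=0.99, tau=0.1):
    """Soft consistency error C for one length-d sub-trajectory.

    values:  [V(s_t), V(s_{t+d})]      -- state values at the two endpoints
    log_pis: [log pi(a_t | s_t), ...]  -- d log-probabilities along the path
    rewards: [r_t, ..., r_{t+d-1}]     -- d rewards along the path
    Names and shapes are illustrative, not the authors' code.
    """
    d = len(rewards)
    soft_return = sum(gamma ** i * (rewards[i] - tau * log_pis[i]) for i in range(d))
    return -values[0] + gamma ** d * values[1] + soft_return

# PCL minimizes 0.5 * C**2 over sub-trajectories drawn from both on-policy
# rollouts and a replay buffer of off-policy traces.
C = path_consistency_error(values=[1.2, 0.9],
                           log_pis=np.log([0.5, 0.4, 0.7]),
                           rewards=[0.0, 0.0, 1.0])
loss = 0.5 * C ** 2
```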
Reviews: Bridging the Gap Between Value and Policy Based Reinforcement Learning
SUMMARY: The paper considers the entropy regularized discounted Markov Decision Process (MDP) and shows the relation between the optimal value, action-value, and policy. Moreover, it shows that the optimal value function and policy satisfy a temporal consistency in the form of a Bellman-like equation (Theorem 1), which can also be extended to an n-step version (Corollary 2). The paper introduces Path Consistency Learning by enforcing this temporal consistency, which is essentially a Bellman residual minimization procedure (Section 5). SUMMARY OF EVALUATION: Quality: Parts of the paper are sound (Sections 3 and 4); parts are not (Section 5). Clarity: The paper is well-written. Originality: Some results seem to be novel, but similar ideas and analyses have been proposed before.
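For reference, the one-step consistency the review calls Theorem 1 and its n-step extension (Corollary 2) can be written as below, with discount gamma and entropy-regularization weight tau; this transcription assumes the paper's deterministic-dynamics setting.

```latex
% One-step consistency (Theorem 1): for any action a_t taken in state s_t,
V^*(s_t) - \gamma V^*(s_{t+1}) = r(s_t, a_t) - \tau \log \pi^*(a_t \mid s_t)

% n-step consistency (Corollary 2): along any action sequence a_t, \dots, a_{t+n-1},
V^*(s_t) - \gamma^n V^*(s_{t+n})
  = \sum_{i=0}^{n-1} \gamma^i \bigl( r(s_{t+i}, a_{t+i}) - \tau \log \pi^*(a_{t+i} \mid s_{t+i}) \bigr)
```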
Reviews: Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs
The paper describes a sampling method for learning agent behaviors in interactive POMDPs (I-POMDPs). In general, I-POMDPs are a multi-agent extension of POMDPs in which the belief space includes, in addition to a belief about the environment state, nested recursive beliefs about the other agents' models. I-POMDP solutions, including the one proposed in the paper, are largely approximated at a finite nesting depth using either intentional models of others (e.g., their nested beliefs, state transitions, optimality criterion, etc.) or subintentional models of others (essentially "summaries of behavior" such as fictitious play). The proposed approach uses samples of the other agent's models at a particular depth to compute its values and policy. Related work on an interactive particle filter assumed the full frame (b, S, A, Omega, T, R, OC) was known.
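To make the nesting concrete, here is a hypothetical, heavily simplified Python sketch of sampling another agent's behavior at a finite depth; intentional models are assumed to expose a best_response method (an invented interface used only for illustration), and depth zero falls back to a subintentional model.

```python
import random

def sample_other_agent_policy(depth, candidate_models):
    """Illustrative only: recursively sample a model of the other agent,
    solving it one level shallower, until a subintentional base case."""
    if depth == 0:
        # Subintentional "summary of behavior": act uniformly at random.
        return lambda obs, actions: random.choice(actions)
    model = random.choice(candidate_models)                   # sampled intentional model
    nested = sample_other_agent_policy(depth - 1, candidate_models)
    # The sampled model best-responds while itself modeling the other
    # agent with the shallower (depth - 1) policy.
    return lambda obs, actions: model.best_response(obs, actions, nested)
```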
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn
Deep neural networks provide Reinforcement Learning (RL) with powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of these challenges is that output predictions can churn, leading to uncontrolled changes after each batch update for states not included in the batch. Although such a churn phenomenon exists in every step of network training, how churn occurs and how it impacts RL remain under-explored. In this work, we start by characterizing churn through the lens of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn that leads to a cycle in which the churn in value estimation and policy improvement compounds and biases the learning dynamics throughout the iteration. Further, we concretize the study and focus on the learning issues caused by the chain effect in different settings, including greedy-action deviation in value-based methods, trust-region violation in proximal policy optimization, and dual bias of policy value in actor-critic methods. We then propose a method to reduce the chain effect across different settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. Our experiments demonstrate the effectiveness of our method in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.
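As a hedged illustration of the churn-reduction idea (not the paper's exact CHAIN loss), the sketch below penalizes how far a network's outputs on reference states outside the current batch drift from a frozen snapshot taken at a recent step; the names and weighting are assumptions.

```python
import torch

def churn_regularizer(net, ref_states, snapshot_out, beta=1.0):
    """Penalize drift of predictions on reference (out-of-batch) states,
    relative to a frozen snapshot of the network from a recent step."""
    cur_out = net(ref_states)
    return beta * ((cur_out - snapshot_out.detach()) ** 2).mean()

# Usage sketch inside a training step (snapshot_net is a frozen copy of
# the network kept from an earlier update):
#   with torch.no_grad():
#       snapshot_out = snapshot_net(ref_states)
#   loss = td_loss + churn_regularizer(net, ref_states, snapshot_out)
#   loss.backward()
#   optimizer.step()
```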
Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization
Kadokawa, Yuki, Zhu, Lingwei, Tsurumine, Yoshihisa, Matsubara, Takamitsu
Deep reinforcement learning with domain randomization learns a control policy in various simulations with randomized physical and sensor model parameters so that it transfers to the real world in a zero-shot setting. However, a huge number of samples is often required to learn an effective policy when the range of randomized parameters is extensive, due to the instability of policy updates. To alleviate this problem, we propose a sample-efficient method named cyclic policy distillation (CPD). CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one. Local policies are then learned while cyclically transitioning between sub-domains. CPD accelerates learning through knowledge transfer based on expected performance improvements. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfer. CPD's effectiveness and sample efficiency are demonstrated through simulations with four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from MuJoCo) and a real-robot ball-dispersal task. We published code and videos from our experiments at https://github.com/yuki-kadokawa/cyclic-policy-distillation.
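A minimal sketch of the cyclic training loop the abstract describes, assuming hypothetical learn_policy, make_env, and distill callables; the sub-domain splitting and the neighbor-based transfer rule are illustrative, not the authors' exact implementation.

```python
def cyclic_policy_distillation(param_range, n_subdomains, n_cycles,
                               make_env, learn_policy, distill):
    """Illustrative loop only: split the randomized-parameter range into
    sub-domains, train local policies cyclically with neighbor transfer,
    then distill them into one global policy."""
    low, high = param_range
    edges = [low + (high - low) * i / n_subdomains for i in range(n_subdomains + 1)]
    subdomains = list(zip(edges[:-1], edges[1:]))
    local_policies = [None] * n_subdomains

    for _ in range(n_cycles):
        for i, sub in enumerate(subdomains):
            # Knowledge transfer: initialize from the neighboring local policy
            # visited just before (None on the very first pass).
            neighbor = local_policies[i - 1]
            local_policies[i] = learn_policy(make_env(sub), init_from=neighbor)

    # Distill all local policies into a single global policy for sim-to-real transfer.
    return distill(local_policies)
```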