AITopics | off-policy reinforcement learning

Collaborating Authors

off-policy reinforcement learning

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

Neural Information Processing SystemsDec-24-2025, 12:05:51 GMT

In reinforcement learning, experience replay stores past samples for further reuse. Prioritized sampling is a promising technique to better utilize these samples. Previous criteria of prioritization include TD error, recentness and corrective feedback, which are mostly heuristically designed. In this work, we start from the regret minimization objective, and obtain an optimal prioritization strategy for Bellman update that can directly maximize the return of the policy. The theory suggests that data with higher hindsight TD error, better on-policiness and more accurate Q value should be assigned with higher weights during sampling. Thus most previous criteria only consider this strategy partially. We not only provide theoretical justifications for previous criteria, but also propose two new methods to compute the prioritization weight, namely ReMERN and ReMERT. ReMERN learns an error network, while ReMERT exploits the temporal ordering of states. Both methods outperform previous prioritized sampling algorithms in challenging RL benchmarks, including MuJoCo, Atari and Meta-World.

name change, off-policy reinforcement learning, regret minimization experience replay, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Add feedback

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Xi, Zhiheng, Guo, Xin, Nan, Yang, Zhou, Enyu, Shen, Junrui, Chen, Wenxiang, Liu, Jiaqi, Huang, Jixuan, Zhang, Zhihao, Guo, Honglin, Deng, Xun, Lei, Zhikai, Zheng, Miao, Wang, Guoteng, Zhang, Shuo, Sun, Peng, Zheng, Rui, Yan, Hang, Gui, Tao, Zhang, Qi, Huang, Xuanjing

arXiv.org Artificial IntelligenceOct-23-2025

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.18927

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Search-Based Adversarial Estimates for Improving Sample Efficiency in Off-Policy Reinforcement Learning

Malato, Federico, Hautamaki, Ville

arXiv.org Artificial IntelligenceFeb-3-2025

Sample inefficiency is a long-lasting challenge in deep reinforcement learning (DRL). Despite dramatic improvements have been made, the problem is far from being solved and is especially challenging in environments with sparse or delayed rewards. In our work, we propose to use Adversarial Estimates as a new, simple and efficient approach to mitigate this problem for a class of feedback-based DRL algorithms. Our approach leverages latent similarity search from a small set of human-collected trajectories to boost learning, using only five minutes of human-recorded experience. The results of our study show algorithms trained with Adversarial Estimates converge faster than their original version. Moreover, we discuss how our approach could enable learning in feedback-based algorithms in extreme scenarios with very sparse rewards.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2502.01558

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Portugal > Braga > Braga (0.04)
Europe > Finland > North Karelia > Joensuu (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.85)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

Neural Information Processing SystemsJan-17-2025, 12:52:07 GMT

criteria, off-policy reinforcement learning, regret minimization experience replay, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)

Add feedback

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

Zanette, Andrea

arXiv.org Artificial IntelligenceJun-5-2023

Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to successfully operate off-policy with function approximation, unless additional conditions are met. However, Bellman completeness is a requirement that is much stronger than realizability and that is deemed to be too strong to hold in practice. In this work, we relax this structural assumption and analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class. We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error, and that depend on the interplay of three factors. The first two are well known: they are the metric entropy of the function class and the concentrability coefficient that represents the cost of learning off-policy. The third factor is new, and it measures the violation of Bellman completeness, namely the mis-alignment between the chosen function class and its image through the Bellman operator. In essence, these error bounds establish that off-policy reinforcement learning remains statistically viable even in absence of Bellman completeness, and characterize the intermediate situation between the favorable Bellman complete setting and the worst-case scenario where exponential lower bounds are in force. Our analysis directly applies to the solution found by temporal difference algorithms when they converge.

artificial intelligence, bellman error, machine learning, (12 more...)

arXiv.org Artificial Intelligence

2211.05311

Country:

North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Research Papers based on Off-Policy based Reinforcement Learning

#artificialintelligenceMay-31-2022, 08:20:50 GMT

Abstract: We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work. In our setting, the global state, action, and reward are assumed to be fully observable, while the local policy is protected as privacy by each agent, and thus cannot be shared with others. There is a communication graph, among which the agents can exchange information with their neighbors. The agents make individual decisions and cooperate to reach a higher accumulated reward. Towards this end, we first propose a decentralized actor-critic (AC) setting.

algorithm, learning, reinforcement learning, (13 more...)

#artificialintelligence

Industry:

Transportation > Ground > Road (0.34)
Automobiles & Trucks (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning

Daley, Brett, Hickert, Cameron, Amato, Christopher

arXiv.org Artificial IntelligenceFeb-22-2021

Deep Reinforcement Learning (RL) methods rely on experience replay to approximate the minibatched supervised learning setting; however, unlike supervised learning where access to lots of training data is crucial to generalization, replay-based deep RL appears to struggle in the presence of extraneous data. Recent works have shown that the performance of Deep Q-Network (DQN) degrades when its replay memory becomes too large. This suggests that outdated experiences somehow impact the performance of deep RL, which should not be the case for off-policy methods like DQN. Consequently, we re-examine the motivation for sampling uniformly over a replay memory, and find that it may be flawed when using function approximation. We show that -- despite conventional wisdom -- sampling from the uniform distribution does not yield uncorrelated training samples and therefore biases gradients during training. Our theory prescribes a special non-uniform distribution to cancel this effect, and we propose a stratified sampling scheme to efficiently implement it.

experience replay, learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2102.11319

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback