Goto

Collaborating Authors

 Reinforcement Learning


Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

arXiv.org Artificial Intelligence

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $ฮณ< 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $ฮณ= 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $ฮณ= 1$) can be replaced with a new object that we call the transient visitation measure.


Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

arXiv.org Artificial Intelligence

Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where NashPG consistently attains higher Elo ratings.


Provably Optimal Reinforcement Learning under Safety Filtering

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP-when executed through the same filter-achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.


UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

arXiv.org Artificial Intelligence

W e present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts, multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction capabilities within a unified model. Our work defines six scenarios for unified model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation model. Our code is available at https://github.com/G-UN/UniRL.


Self-Evidencing Through Hierarchical Gradient Decomposition: A Dissipative System That Maintains Non-Equilibrium Steady-State by Minimizing Variational Free Energy

arXiv.org Artificial Intelligence

The Free Energy Principle (FEP) states that self-organizing systems must minimize variational free energy to persist (Friston, 2010, 2019), but the path from principle to implementable algorithm has remained unclear. We present a constructive proof that the FEP can be realized through exact local credit assignment. The system decomposes gradient computation hierarchically: spatial credit via feedback alignment, temporal credit via eligibility traces, and structural credit via a Trophic Field Map (TFM) that estimates expected gradient magnitude for each connection block. We prove these mechanisms are exact at their respective levels and validate the central claim empirically: the TFM achieves 0.9693 Pearson correlation with oracle gradients. This exactness produces emergent capabilities including 98.6% retention after task interference, autonomous recovery from 75% structural damage, self-organized criticality (spectral radius ฯ 1.0), and sample-efficient reinforcement learning on continuous control tasks without replay buffers. The architecture unifies Pri-gogine's dissipative structures (Prigogine, 1977), Fris-ton's free energy minimization (Friston, 2010), and Hopfield's attractor dynamics (Hopfield, 1982; Amit et al., 1985a,b), demonstrating that exact hierarchical inference over network topology can be implemented with local, biologically plausible rules.


DRL-Based Resource Allocation for Energy-Efficient IRS-Assisted UAV Spectrum Sharing Systems

arXiv.org Artificial Intelligence

Intelligent reflecting surface (IRS) assisted unmanned aerial vehicle (UAV) systems provide a new paradigm for reconfigurable and flexible wireless communications. To enable more energy efficient and spectrum efficient IRS assisted UAV wireless communications, this paper introduces a novel IRS-assisted UAV enabled spectrum sharing system with orthogonal frequency division multiplexing (OFDM). The goal is to maximize the energy efficiency (EE) of the secondary network by jointly optimizing the beamforming, subcarrier allocation, IRS phase shifts, and the UAV trajectory subject to practical transmit power and passive reflection constraints as well as UAV physical limitations. A physically grounded propulsion-energy model is adopted, with its tight upper bound used to form a tractable EE lower bound for the spectrum sharing system. To handle highly non convex, time coupled optimization problems with a mixed continuous and discrete policy space, we develop a deep reinforcement learning (DRL) approach based on the actor critic framework. Extended experiments show the significant EE improvement of the proposed DRL-based approach compared to several benchmark schemes, thus demonstrating the effectiveness and robustness of the proposed approach with mobility.


ProSh: Probabilistic Shielding for Model-free Reinforcement Learning

arXiv.org Artificial Intelligence

Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.


Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

arXiv.org Artificial Intelligence

Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence--particularly preference cycles (A B C A)--has been largely unaddressed. To address this gap, this work introduces an end-to-end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. Aligning large language models (LLMs) with human preferences, traditionally achieved through Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), is critical for safe AI deployment. However, the reliance on costly and slow human annotation has created a scalability bottleneck, pushing the field towards Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022; Lee et al., 2023). Within RLAIF, the pairwise comparison paradigm--where an LLM judge selects the better of two responses--has become the de facto standard, prized for its intuitive nature and fine-grained feedback signal that underpins many state-of-the-art alignment techniques (Song et al., 2024; Wang et al., 2024). Recent advances in pairwise methods include the Pairwise-RL framework (Xu et al., 2025), which addresses the fundamental misalignment between generative base models and discriminative reward tasks by unifying reward model training and reinforcement learning application in a consistent pairwise paradigm. This framework combines generative reward modeling with pairwise policy optimization, leveraging generative modeling techniques to improve reward model performance and score calibration. Consequently, our work focuses on the pairwise paradigm, building upon these foundational approaches. However, this scalability comes at a hidden cost: the erosion of logical consistency.


Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

arXiv.org Artificial Intelligence

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.


Towards Versatile Humanoid Table Tennis: Unified Reinforcement Learning with Prediction Augmentation

arXiv.org Artificial Intelligence

Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing -- capabilities that remain difficult for unified controllers. We propose a reinforcement learning framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy's observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate $\geq$ 96% and success rate $\geq$ 92%) in simulations. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward-backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT.