AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Value-Guided Decision Transformer: A Unified Reinforcement Learning Framework for Online and Offline Settings

Neural Information Processing Systems, Jun-12-2026, 14:03:57 GMT

The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in Reinforcement Learning (RL) tasks. However, most prior work applies CSM to either the online or the offline setting alone, and a general architecture covering both has rarely been explored. Additionally, existing methods primarily focus on deterministic trajectory modeling, overlooking the randomness of state transitions and the diversity of future trajectory distributions. Value-based methods offer a viable remedy for CSM and help bridge the gap between offline and online RL. In this paper, we propose Value-Guided Decision Transformer (VDT), which leverages value functions to perform advantage-weighting and behavior regularization on the Decision Transformer (DT), guiding the policy toward upper-bound optimal decisions during the offline training phase.

artificial intelligence, machine learning, reinforcement learning, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)
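
The advantage-weighting mentioned in the VDT abstract above can be illustrated with a generic advantage-weighted regression loss. This is a minimal sketch under stated assumptions, not the paper's objective: the function names, the temperature `beta`, and the clipping value `max_weight` are all hypothetical, and the behavior-regularization term is omitted.

```python
import numpy as np

def advantage_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Exponential advantage weights exp((Q - V) / beta), clipped for stability.

    beta and max_weight are illustrative hyperparameters, not values from the paper.
    """
    advantages = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    return np.minimum(np.exp(advantages / beta), max_weight)

def weighted_action_loss(pred_actions, data_actions, weights):
    """Weighted mean of per-sample squared errors between predicted and dataset actions."""
    pred = np.asarray(pred_actions, dtype=float)
    data = np.asarray(data_actions, dtype=float)
    per_sample = np.mean((pred - data) ** 2, axis=-1)
    return float(np.sum(weights * per_sample) / np.sum(weights))
```

Samples whose estimated advantage is higher contribute more to the imitation loss, which is the general idea behind steering a sequence model toward better-than-average actions in the data.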

BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces

Neural Information Processing Systems, Jun-12-2026, 13:50:08 GMT

Offline reinforcement learning in high-dimensional, discrete action spaces is challenging due to the exponential scaling of the joint action space with the number of sub-actions and the complexity of modeling sub-action dependencies. Existing methods either exhaustively evaluate the action space, making them computationally infeasible, or factorize Q-values, failing to represent joint sub-action effects. We propose Branch Value Estimation (BraVE), a value-based method that uses tree-structured action traversal to evaluate a linear number of joint actions while preserving dependency structure. BraVE outperforms prior offline RL methods by up to $20\times$ in environments with over four million actions.

artificial intelligence, proceedings, reinforcement learning, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.33)
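
To make the scaling argument in the BraVE abstract above concrete, the sketch below counts joint actions and contrasts exhaustive evaluation with a simple greedy, branch-by-branch traversal. It is an illustration only: the greedy rule, the `score` callback, and the choice of 22 binary sub-actions are assumptions made here, not the paper's algorithm or environments.

```python
from math import prod

def joint_action_count(cardinalities):
    """Size of the joint action space: the product of sub-action cardinalities."""
    return prod(cardinalities)

def greedy_branch_traversal(score, cardinalities):
    """Choose sub-actions one at a time, scoring each partial assignment.

    `score` is a hypothetical callable mapping a partial action tuple to a number.
    Returns the chosen joint action and the number of score() calls made.
    """
    chosen = []
    evaluations = 0
    for n_choices in cardinalities:
        best_choice, best_score = None, float("-inf")
        for choice in range(n_choices):
            candidate_score = score(tuple(chosen) + (choice,))
            evaluations += 1
            if candidate_score > best_score:
                best_choice, best_score = choice, candidate_score
        chosen.append(best_choice)
    return tuple(chosen), evaluations
```

With 22 binary sub-actions the joint space has 2^22 = 4,194,304 actions, while the traversal above makes only 2 × 22 = 44 evaluations; the price of such a greedy rule is that it can miss the best joint action when sub-actions interact.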

Adversarial Diffusion for Robust Reinforcement Learning

Neural Information Processing Systems, Jun-12-2026, 12:05:03 GMT

Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories all at once, mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)
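
The AD-RRL abstract above optimizes the Conditional Value at Risk (CVaR) of the cumulative return. As a reference point, here is a minimal sketch of the standard empirical lower-tail CVaR estimator: the mean of the worst `alpha` fraction of sampled returns. The function name and the default `alpha` are assumptions; the paper's diffusion-guided procedure is not shown.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of returns (lower-tail CVaR)."""
    returns = np.sort(np.asarray(returns, dtype=float))
    # Keep at least one sample so the estimate is defined for small batches.
    k = max(1, int(np.ceil(alpha * returns.size)))
    return float(returns[:k].mean())
```

With `alpha=1.0` the estimator reduces to the ordinary mean return; smaller values of `alpha` focus the objective on the worst outcomes.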

On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Neural Information Processing Systems, Jun-12-2026, 11:18:54 GMT

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain relatively underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\tilde{\mathcal{O}}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bilevel optimization settings with non-convex lower levels, where we achieve a state-of-the-art sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$, improving upon existing bounds of $\tilde{\mathcal{O}}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)
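
For readers unfamiliar with the Polyak-Łojasiewicz condition named in the bilevel RL abstract above, its standard textbook form is given below, followed by the arithmetic behind the improvement in rates. The symbols $f$, $f^{*}$, and $\mu$ are generic notation assumed here; how the paper applies the condition to its lower-level problem is not stated in the abstract.

```latex
% PL condition for a differentiable function f with minimum value f^*:
% there exists \mu > 0 such that, for all x,
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr).

% Ratio of the two sample complexity rates, ignoring constants and log factors:
\frac{\epsilon^{-6}}{\epsilon^{-3}} \;=\; \epsilon^{-3},
\qquad \text{e.g. } \epsilon = 0.1 \;\Rightarrow\; \frac{10^{6}}{10^{3}} = 10^{3}.
```

The PL condition does not require convexity, which is why it is a common tool for analyzing non-convex problems.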

CHPO: Constrained Hybrid-action Policy Optimization for Reinforcement Learning

Neural Information Processing Systems, Jun-12-2026, 10:06:01 GMT

Constrained hybrid-action reinforcement learning (RL) promises to learn a safe policy within a parameterized action space, which is particularly valuable for safety-critical applications involving discrete-continuous hybrid action spaces. However, existing hybrid-action RL algorithms primarily focus on reward maximization and therefore struggle on tasks that involve both cost constraints and hybrid action spaces. In this work, we propose a novel Constrained Hybrid-action Policy Optimization algorithm (CHPO) to address the problems of constrained hybrid-action RL. Concretely, we rethink the limitations of hybrid-action RL in handling safe tasks with parameterized action spaces and reframe the objective of constrained hybrid-action RL by introducing the concept of Constrained Parameterized-action Markov Decision Process (CPMDP). Subsequently, we present a constrained hybrid-action policy optimization algorithm to solve these constrained hybrid-action problems and conduct theoretical analyses demonstrating that CHPO converges to the optimal solution while satisfying safety constraints. Finally, extensive experiments demonstrate that CHPO achieves competitive performance across multiple experimental tasks.

artificial intelligence, machine learning, reinforcement learning, (10 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)
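
The two ingredients named in the CHPO abstract above, a discrete-continuous hybrid action and a cost constraint, can be sketched as follows. This is a generic Lagrangian-relaxation illustration under stated assumptions, not the CHPO algorithm: the `HybridAction` type, the function names, and the step size are hypothetical, and the abstract does not say whether CHPO uses a Lagrangian at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HybridAction:
    kind: int       # index of the discrete action type
    params: tuple   # continuous parameters attached to that action type

def lagrangian_objective(expected_return, expected_cost, cost_budget, multiplier):
    """Return penalized by the amount the expected cost exceeds the budget."""
    return expected_return - multiplier * (expected_cost - cost_budget)

def update_multiplier(multiplier, expected_cost, cost_budget, step_size=0.1):
    """Dual ascent step: raise the penalty when over budget, never go below zero."""
    return max(0.0, multiplier + step_size * (expected_cost - cost_budget))
```

For example, `HybridAction(kind=1, params=(0.3, -0.7))` could stand for "action type 1 with two continuous parameters", while the multiplier grows only when the policy's expected cost exceeds the budget.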

Adaptable Safe Policy Learning from Multi-task Data with Constraint Prioritized Decision Transformer

Neural Information Processing Systems, Jun-12-2026, 09:37:35 GMT

Learning safe reinforcement learning (RL) policies from offline multi-task datasets without direct environmental interaction is crucial for efficient and reliable deployment of RL agents. Benefiting from their scalability and strong in-context learning capabilities, recent approaches attempt to utilize Decision Transformer (DT) architectures for offline safe RL, demonstrating promising adaptability across varying safety budgets. However, these methods primarily focus on single-constraint scenarios and struggle with diverse constraint configurations across multiple tasks. Additionally, their reliance on heuristically defined Return-To-Go (RTG) inputs limits flexibility and reduces learning efficiency, particularly in complex multi-task environments. To address these limitations, we propose CoPDT, a novel DT-based framework designed to enhance adaptability to diverse constraints and varying safety budgets. Specifically, CoPDT introduces a constraint prioritized prompt encoder, which leverages sparse binary cost signals to accurately identify constraints, and a constraint prioritized Return-To-Go (CPRTG) token mechanism, which dynamically generates RTGs based on identified constraints and corresponding safety budgets. Extensive experiments on the OSRL benchmark demonstrate that CoPDT achieves superior efficiency and significantly enhanced safety compliance across diverse multi-task scenarios, surpassing state-of-the-art DT-based methods by satisfying safety constraints in more than twice as many tasks.

artificial intelligence, machine learning, reinforcement learning, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.60)
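
The CoPDT abstract above contrasts its generated CPRTG tokens with conventionally defined Return-To-Go (RTG) inputs. The sketch below shows the conventional quantities only: the undiscounted return-to-go at each step, and the safety budget remaining before each step. The function names are assumptions, and the paper's learned CPRTG mechanism is not reproduced here.

```python
import numpy as np

def returns_to_go(rewards):
    """Undiscounted sum of rewards from each time step to the end of the episode."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def remaining_cost_budget(costs, budget):
    """Safety budget left before each step, given per-step costs incurred so far."""
    costs = np.asarray(costs, dtype=float)
    spent_before = np.concatenate(([0.0], np.cumsum(costs)[:-1]))
    return budget - spent_before
```

For rewards `[1, 2, 3]` the return-to-go sequence is `[6, 5, 3]`; with binary costs `[1, 0, 1]` and a budget of 2, the remaining budget before each step is `[2, 1, 1]`.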

EVAAA: A Virtual Environment Platform for Essential Variables in Autonomous and Adaptive Agents

Neural Information Processing Systems, Jun-12-2026, 08:18:42 GMT

Reinforcement learning (RL) agents have demonstrated strong performance in structured environments, yet they continue to struggle in real-world settings where goals are ambiguous, conditions change dynamically, and external supervision is limited. These challenges stem not primarily from algorithmic limitations but from the characteristics of conventional training environments, which are usually static, task-specific, and externally defined. In contrast, biological agents develop autonomy and adaptivity by interacting with complex, dynamic environments, where most behaviors are ultimately driven by internal physiological needs. Inspired by these biological constraints, we introduce EVAAA (Essential Variables in Autonomous and Adaptive Agents), a 3D virtual environment for training and evaluating egocentric RL agents endowed with internal physiological state variables. In EVAAA, agents must maintain essential variables (EVs), such as satiation, hydration, body temperature, and tissue integrity (the level of damage), within viable bounds by interacting with environments that increase in difficulty at each stage.

artificial intelligence, machine learning, reinforcement learning, (10 more...)

Industry: Health & Medicine (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.83)
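
The viability requirement described in the EVAAA abstract above, keeping each essential variable within bounds, can be sketched as a simple data structure. The class, the function name, and the numeric bounds in the usage note are hypothetical illustrations and do not reflect the platform's actual API or parameter values.

```python
from dataclasses import dataclass

@dataclass
class EssentialVariable:
    name: str
    value: float
    low: float    # lower viable bound (illustrative)
    high: float   # upper viable bound (illustrative)

    def is_viable(self):
        """True while the variable stays inside its viable range."""
        return self.low <= self.value <= self.high

def agent_alive(variables):
    """An agent remains viable only if every essential variable is in range."""
    return all(v.is_viable() for v in variables)
```

For instance, an agent with `EssentialVariable("hydration", 0.4, 0.0, 1.0)` and `EssentialVariable("body_temperature", 41.5, 35.0, 40.0)` would fail the check because one variable has left its range, even though the other is fine.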

Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization

Neural Information Processing Systems, Jun-12-2026, 07:49:50 GMT

Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ's scaling behavior with higher UTD ratios. We identify challenges in the training dynamics that become more pronounced at higher UTD ratios. To address these, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, has been shown to prevent potential loss of plasticity, and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks on the DeepMind Control Suite and MyoSuite benchmarks, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a simple yet robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)
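
One common way to realize the weight normalization mentioned in the abstract above is to rescale each weight vector to unit norm after every gradient step. The sketch below shows that projection with NumPy; whether the paper uses exactly this row-wise form is an assumption, and the function name and `eps` guard are hypothetical.

```python
import numpy as np

def project_rows_to_unit_norm(weights, eps=1e-8):
    """Rescale each row of a weight matrix to unit Euclidean norm.

    eps guards against division by zero for all-zero rows.
    """
    weights = np.asarray(weights, dtype=float)
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.maximum(norms, eps)
```

When a layer is followed by a normalization layer, its output does not depend on the scale of its weights, so growth in the weight norm shrinks the effective step size; holding the norm fixed is one way to keep that effective learning rate from drifting.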

A Differential and Pointwise Control Approach to Reinforcement Learning

Neural Information Processing Systems, Jun-12-2026, 07:48:53 GMT

Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (dfPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $\mathcal{O}(K^{5/6})$. Empirically, dfPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)
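
The regret bound quoted in the Differential RL abstract above is sublinear, which is what makes it meaningful. The short derivation below spells this out, assuming (the abstract does not define it) that $K$ counts episodes or rounds of interaction.

```latex
% Dividing cumulative regret by the number of rounds K gives the average regret:
\frac{\mathcal{O}\!\left(K^{5/6}\right)}{K} \;=\; \mathcal{O}\!\left(K^{-1/6}\right)
\;\longrightarrow\; 0 \quad \text{as } K \to \infty .
```

In other words, the average per-round gap to the comparator vanishes as $K$ grows, although more slowly than under the $\mathcal{O}(\sqrt{K})$ cumulative rates known in some other settings.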

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Neural Information Processing Systems, Jun-12-2026, 06:51:45 GMT

Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce Reinforcement Learning from Rendering Feedback, an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward.

machine learning, natural language, reinforcement learning, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)
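
The reward computation described in the abstract above, comparing a rendered SVG roll-out with the input image, can be sketched with a simple pixel-level score. This is an assumption-laden illustration: the abstract does not specify the reward function, so negative mean squared error is used here as a stand-in, and the SVG rasterization step is assumed to have already produced arrays of equal shape.

```python
import numpy as np

def reconstruction_reward(rendered, target):
    """Negative mean squared pixel error; 0 is a perfect match, lower is worse."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    if rendered.shape != target.shape:
        raise ValueError("rendered and target images must have the same shape")
    return -float(np.mean((rendered - target) ** 2))
```

Because the score is computed on the rendered output rather than on the SVG tokens, it can be used as a reward signal even though the rendering step itself is not differentiable.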
