AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Regularized Q-Learning

Neural Information Processing Systems | Mar-22-2026, 19:10:35 GMT

Q-learning is a widely used algorithm in the reinforcement learning (RL) community. In the lookup-table setting, its convergence is well established. However, it is known to be unstable under linear function approximation. This paper develops a new Q-learning algorithm, called RegQ, that converges when linear function approximation is used. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. Its stability is established using a recent analysis tool based on switching system models. Moreover, we experimentally show that RegQ converges in environments where Q-learning with linear function approximation is known to diverge. An error bound on the solution to which the algorithm converges is also given.
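The abstract does not state the update rule, so the following is only a minimal sketch of what "adding a regularization term" to Q-learning with linear features could look like. It assumes a ridge-style shrinkage term on the weights; the function names and the parameters `alpha` (step size), `gamma` (discount) and `eta` (regularization strength) are hypothetical and not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def regularized_q_step(theta, phi_sa, reward, next_features,
                       alpha=0.1, gamma=0.9, eta=0.5):
    """One semi-gradient Q-learning step with linear features.

    theta         -- current weight vector
    phi_sa        -- feature vector of the visited state-action pair
    next_features -- feature vectors phi(s', a') for every action a'
    eta           -- assumed l2 shrinkage strength (eta = 0 gives plain Q-learning)
    """
    # Greedy bootstrap target under the current weights.
    best_next = max(dot(phi, theta) for phi in next_features)
    td_error = reward + gamma * best_next - dot(phi_sa, theta)
    # TD update plus a term that pulls every weight toward zero.
    return [w + alpha * (td_error * x - eta * w) for w, x in zip(theta, phi_sa)]
```

Setting `eta=0` recovers the unregularized update, which makes it easy to compare the two side by side on a small example.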

artificial intelligence, machine learning, reinforcement learning, (9 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

RoME: A Robust Mixed-Effects Bandit Algorithm for Optimizing Mobile Health Interventions

Neural Information Processing Systems | Mar-22-2026, 18:37:15 GMT

Mobile health leverages personalized and contextually tailored interventions optimized through bandit and reinforcement learning algorithms. In practice, however, challenges such as participant heterogeneity, nonstationarity, and nonlinear relationships hinder algorithm performance.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

Parseval Regularization for Continual Reinforcement Learning

Neural Information Processing Systems | Mar-22-2026, 18:36:30 GMT

Plasticity loss, trainability loss, and primacy bias have been identified as issues arising when training deep neural networks on sequences of tasks, all referring to the increased difficulty of training on new tasks. We propose to use Parseval regularization, which maintains orthogonality of weight matrices, to preserve useful optimization properties and improve training in a continual reinforcement learning setting. We show that it provides significant benefits to RL agents on a suite of gridworld, CARL and MetaWorld tasks. We conduct comprehensive ablations to identify the source of its benefits and investigate the effect of certain metrics associated with network trainability, including weight matrix rank, weight norms and policy entropy.
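To make "maintains orthogonality of weight matrices" concrete, here is a minimal sketch of one common form of an orthogonality penalty: the squared Frobenius norm of W Wᵀ − sI. The abstract does not give the exact penalty, so this form, the function name, and the `scale` parameter are assumptions for illustration only.

```python
def parseval_penalty(weight, scale=1.0):
    """Squared Frobenius norm of (W W^T - scale * I).

    weight -- matrix given as a list of rows
    scale  -- assumed target squared norm of each row
    The penalty is zero exactly when the rows are mutually orthogonal
    and each has squared norm equal to `scale`.
    """
    total = 0.0
    for i, row_i in enumerate(weight):
        for j, row_j in enumerate(weight):
            gram = sum(a * b for a, b in zip(row_i, row_j))  # entry (i, j) of W W^T
            target = scale if i == j else 0.0
            total += (gram - target) ** 2
    return total
```

In training, a term like this would typically be added to the loss with a small coefficient, so the optimizer is nudged toward orthogonal rows rather than forced onto them.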

artificial intelligence, proceedings, reinforcement learning, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.62)

Contextual Bilevel Reinforcement Learning for Incentive Alignment

Neural Information Processing Systems | Mar-22-2026, 18:35:31 GMT

The optimal policy in various real-world strategic decision-making problems depends both on the environmental configuration and exogenous events. For these settings, we introduce Contextual Bilevel Reinforcement Learning (CB-RL), a stochastic bilevel decision-making model, where the lower level consists of solving a contextual Markov Decision Process (CMDP). CB-RL can be viewed as a Stackelberg Game where the leader and a random context beyond the leader's control together decide the setup of many MDPs that potentially multiple followers best respond to. This framework extends beyond traditional bilevel optimization and finds relevance in diverse fields such as RLHF, tax design, reward shaping, contract theory and mechanism design. We propose a stochastic Hyper Policy Gradient Descent (HPGD) algorithm to solve CB-RL, and demonstrate its convergence. Notably, HPGD uses stochastic hypergradient estimates, based on observations of the followers' trajectories. Therefore, it allows followers to use any training procedure and the leader to be agnostic of the specific algorithm, which aligns with various real-world scenarios. We further consider the setting when the leader can influence the training of followers and propose an accelerated algorithm. We empirically demonstrate the performance of our algorithm for reward shaping and tax design.

artificial intelligence, machine learning, reinforcement learning, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

PUZZLES: A Benchmark for Neural Algorithmic Reasoning

Neural Information Processing Systems | Mar-22-2026, 18:34:50 GMT

Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work we introduce PUZZLES, a benchmark based on Simon Tatham's Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity, providing detailed information on the strengths and generalization capabilities of RL agents. Furthermore, we evaluate various RL algorithms on PUZZLES, providing baseline comparisons and demonstrating the potential for future research. All the software, including the environment, is available at this https url.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.99)

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Neural Information Processing Systems | Mar-22-2026, 18:07:57 GMT

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization or reward hacking, where the performance as measured by the learned proxy reward model increases, but the true model quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), have emerged as alternatives to the classical RLHF pipeline. However, despite not training a separate proxy reward model or using RL, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL-budgets, DAA algorithms exhibit similar degradation patterns to their classic RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL-budgets, but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
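As a point of reference for how a direct alignment objective avoids a separate reward model, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. The function and argument names are hypothetical, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model; `beta` is the usual temperature that trades preference fit against drift from the reference.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference on both responses, the margin is zero and the loss equals log 2; raising the chosen response's likelihood relative to the rejected one lowers it.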

large language model, machine learning, reinforcement learning, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.82)

From Text to Trajectory: Exploring Complex Constraint Representation and Decomposition in Safe Reinforcement Learning

Neural Information Processing Systems | Mar-22-2026, 18:06:55 GMT

Safe reinforcement learning (RL) requires the agent to finish a given task while obeying specific constraints. Giving constraints in natural language form has great potential for practical scenarios due to its flexible transfer capability and accessibility. Previous safe RL methods with natural language constraints typically need to design cost functions manually for each constraint, which requires domain expertise and lacks flexibility. In this paper, we harness the dual role of text in this task, using it not only to provide constraints but also as a training signal. We introduce the Trajectory-level Textual Constraints Translator (TTCT) to replace the manually designed cost function. Our empirical results demonstrate that TTCT effectively comprehends textual constraints and trajectories, and that policies trained with TTCT can achieve a lower violation rate than those trained with the standard cost function. Additional studies demonstrate that TTCT has zero-shot transfer capability to adapt to constraint-shift environments.

machine learning, natural language, reinforcement learning, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Identifying Latent State-Transition Processes for Individualized Reinforcement Learning

Neural Information Processing Systems | Mar-22-2026, 17:35:12 GMT

The application of reinforcement learning (RL) involving interactions with individuals has grown significantly in recent years. These interactions, influenced by factors such as personal preferences and physiological differences, causally influence state transitions, ranging from health conditions in healthcare to learning progress in education. As a result, different individuals may exhibit different state-transition processes. Understanding individualized state-transition processes is essential for optimizing individualized policies. In practice, however, identifying these state-transition processes is challenging, as individual-specific factors often remain latent. In this paper, we establish the identifiability of these latent factors and introduce a practical method that effectively learns these processes from observed state-action trajectories. Experiments on various datasets show that the proposed method can effectively identify latent state-transition processes and facilitate the learning of individualized RL policies.

artificial intelligence, proceedings, reinforcement learning, (5 more...)

Industry: Health & Medicine (0.99)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Neural Information Processing Systems | Mar-22-2026, 17:34:07 GMT

Aligning human preference and value is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. Such a reward model serves as a proxy for human preference, and it is critical for guiding the RL step towards improving the model quality. In this work, we argue that the SFT stage significantly benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Genre: Research Report (0.38)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.58)

Rethinking Optimal Transport in Offline Reinforcement Learning

Neural Information Processing Systems | Mar-22-2026, 17:32:56 GMT

We propose a novel algorithm for offline reinforcement learning using optimal transport. Typically, in offline reinforcement learning, the data is provided by various experts, some of whom can be sub-optimal. To extract an efficient policy, it is necessary to stitch together the best behaviors from the dataset. To address this problem, we rethink offline reinforcement learning as an optimal transportation problem. Based on this, we present an algorithm that aims to find a policy that maps states to a partial distribution of the best expert actions for each given state. We evaluate the performance of our algorithm on continuous control problems from the D4RL suite and demonstrate improvements over existing methods.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.81)