Goto

Collaborating Authors

 Reinforcement Learning


Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

arXiv.org Artificial Intelligence

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RL VR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. CS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. CS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.Figure 1: Left: Two major flaws of prior practice apply sequence-level length reward without control of training data. CS improves pass@1 of base models while reducing 60% token costs compared to the base model across 7 benchmarks. Experimental details are presented in Appendix E.5. Recent large reasoning models (LRM; Guo et al. (2025); OpenAI (2025); Qwen (2025)) trained with critic-free reinforcement learning (RL) algorithms, such as GRPO (Shao et al., 2024), DAPO (Y u et al., 2025), and REINFORCE++ (Hu et al., 2025a), have demonstrated impressive reasoning capabilities through verifiable outcome rewards.


Cooperative Autonomous Driving in Diverse Behavioral Traffic: A Heterogeneous Graph Reinforcement Learning Approach

arXiv.org Artificial Intelligence

Navigating heterogeneous traffic environments with diverse driving styles poses a significant challenge for autonomous vehicles (AVs) due to their inherent complexity and dynamic interactions. This paper addresses this challenge by proposing a heterogeneous graph reinforcement learning (GRL) framework enhanced with an expert system to improve AV decision-making performance. Initially, a heterogeneous graph representation is introduced to capture the intricate interactions among vehicles. Then, a heterogeneous graph neural network with an expert model (HGNN-EM) is proposed to effectively encode diverse vehicle features and produce driving instructions informed by domain-specific knowledge. Moreover, the double deep Q-learning (DDQN) algorithm is utilized to train the decision-making model. A case study on a typical four-way intersection, involving various driving styles of human vehicles (HVs), demonstrates that the proposed method has superior performance over several baselines regarding safety, efficiency, stability, and convergence rate, all while maintaining favorable real-time performance.


VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning

arXiv.org Artificial Intelligence

Reinforcement learning (RL) is a promising avenue for post-training vision-language-action (VLA) models, but practical deployment is hindered by sparse rewards and unstable training. This work mitigates these challenges by introducing an action chunk based on proximal policy optimization (PPO) with behavior cloning using self-collected demonstrations. Aggregating consecutive actions into chunks improves the temporal consistency of the policy and the density of informative feedback. In addition, an auxiliary behavior cloning loss is applied with a dynamically updated demonstration buffer that continually collects high-quality task trials during training. The relative weight between the action-chunked PPO objective and the self behavior clone auxiliary loss is adapted online to stabilize the post-training process. Experiments on the MetaWorld benchmark indicate improved performance over supervised fine-tuning, achieving a high success rate (0.93) and few steps to success (42.17). These results demonstrate the viability of RL for VLA post-training and help lay the groundwork for downstream VLA applications.


Deep Reinforcement Learning-Based Precoding for Multi-RIS-Aided Multiuser Downlink Systems with Practical Phase Shift

arXiv.org Artificial Intelligence

This study considers multiple reconfigurable intelligent surfaces (RISs)-aided multiuser downlink systems with the goal of jointly optimizing the transmitter precoding and RIS phase shift matrix to maximize spectrum efficiency. Unlike prior work that assumed ideal RIS reflectivity, a practical coupling effect is considered between reflecting amplitude and phase shift for the RIS elements. This makes the optimization problem non-convex. To address this challenge, we propose a deep deterministic policy gradient (DDPG)-based deep reinforcement learning (DRL) framework. The proposed model is evaluated under both fixed and random numbers of users in practical mmWave channel settings. Simulation results demonstrate that, despite its complexity, the proposed DDPG approach significantly outperforms optimization-based algorithms and double deep Q-learning, particularly in scenarios with random user distributions.


Safe In-Context Reinforcement Learning

arXiv.org Artificial Intelligence

In-context reinforcement learning (ICRL) is an emerging RL paradigm where the agent, after some pretraining procedure, is able to adapt to out-of-distribution test tasks without any parameter updates. The agent achieves this by continually expanding the input (i.e., the context) to its policy neural networks. For example, the input could be all the history experience that the agent has access to until the current time step. The agent's performance improves as the input grows, without any parameter updates. In this work, we propose the first method that promotes the safety of ICRL's adaptation process in the framework of constrained Markov Decision Processes. In other words, during the parameter-update-free adaptation process, the agent not only maximizes the reward but also minimizes an additional cost function. We also demonstrate that our agent actively reacts to the threshold (i.e., budget) of the cost tolerance. With a higher cost budget, the agent behaves more aggressively, and with a lower cost budget, the agent behaves more conservatively.


Polychromic Objectives for Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.


AIhub monthly digest: September 2025 – conference reviewing, soccer ball detection, and memory traces

AIHub

Welcome to our monthly digest, where you can catch up with any AIhub stories you may have missed, peruse the latest news, recap recent events, and more. This month, we hear about the latest research on soccer ball detection, learn about energy-based transformers, find out about memory traces in reinforcement learning, and explore some potential solutions to the problems with conference reviewing. Issues with the peer-review process, and pertaining to conferences in particular, are often discussed among authors, reviewers and conference chairs alike. However, coming up with potential solutions to the problem has proved challenging. Jaeho told us more in this interview .


Policy Shaping: Integrating Human Feedback with Reinforcement Learning

Neural Information Processing Systems

A long term goal of Interactive Reinforcement Learning is to incorporate non-expert human feedback to solve complex tasks. State-of-the-art methods have approached this problem by mapping human information to reward and value signals to indicate preferences and then iterating over them to compute the necessary control policy. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct labels on the policy. We compare Advise to state-of-the-art approaches and highlight scenarios where it outperforms them and importantly is robust to infrequent and inconsistent human feedback.


Projected Natural Actor-Critic

Neural Information Processing Systems

Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability - their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.


Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

Neural Information Processing Systems

Tetris is a popular video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A close look at the literature of this game shows that while ADP algorithms, that have been (almost) entirely based on approximating the value function (value function based), have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our extensive experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small $10\times 10$ and large $10\times 20$ boards. Although the CBMPI's results are similar to those achieved by the CE method in the large board, CBMPI uses considerably fewer (almost 1/10) samples (call to the generative model of the game) than CE.