Reinforcement Learning
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Wang, Jing, Liang, Jiajun, Liu, Jie, Liu, Henglin, Liu, Gongye, Zheng, Jun, Pang, Wanyuan, Ma, Ao, Xie, Zhenyu, Wang, Xintao, Wang, Meng, Wan, Pengfei, Liang, Xiaodan
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
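The clipping failure described above can be illustrated with a small sketch. The abstract does not give the normalization formula, so the centering-and-rescaling rule below, the function names, the `target_std` knob, and the sample numbers are all assumptions made for illustration, not the authors' method. The sketch only shows why a left-shifted ratio distribution keeps positive-advantage samples out of the clipped region, and how re-centering the log-ratios lets the standard PPO clip act again.

```python
import math
from statistics import mean, pstdev


def normalize_ratios(log_ratios, target_std=0.2):
    """Center log-ratios at 0 and rescale them to a common spread.

    Hypothetical normalization for illustration; target_std is an assumed knob.
    """
    mu = mean(log_ratios)
    sigma = pstdev(log_ratios)
    scale = target_std / sigma if sigma > 0 else 1.0
    return [math.exp((lr - mu) * scale) for lr in log_ratios]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def positive_update_is_clipped(ratio, eps=0.2):
    """With a positive advantage, the clip only binds once ratio > 1 + eps."""
    return ratio > 1.0 + eps


# Left-shifted log-ratios from one denoising timestep (made-up numbers).
log_ratios = [-0.5, -0.4, -0.3, -0.45, -0.35]
raw = [math.exp(lr) for lr in log_ratios]
normalized = normalize_ratios(log_ratios)

# Count how many positive-advantage samples the clip would constrain.
raw_clipped = sum(positive_update_is_clipped(r) for r in raw)
norm_clipped = sum(positive_update_is_clipped(r) for r in normalized)
print(raw_clipped, norm_clipped)
```

With every raw ratio below 1, no positive-advantage sample can reach the upper clip bound, which matches the failure mode the abstract describes. After centering, the ratios are balanced around 1 and the upper bound can bind again.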
Computational Hardness of Reinforcement Learning with Partial $q^\pi$-Realizability
This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^\pi$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that all value functions for policies in $\Pi$ are linearly realizable. The assumptions of this framework are weaker than those in $q^\pi$-realizability but stronger than those in $q^*$-realizability, providing a practical model where function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy policy set (argmax) and show that, unless NP = RP, an exponential lower bound (in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^*$-realizability and suggest computational difficulty persists even when $\Pi$ is expanded beyond the optimal policy. To establish this, we reduce from two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT(b), to instances of GLinear-$\kappa$-RL (greedy policy) and SLinear-$\kappa$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable in partial $q^\pi$-realizability, in contrast to $q^\pi$-realizability under a generative access model.
SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
Chen, Qianzhong, Yu, Justin, Schwager, Mac, Abbeel, Pieter, Shentu, Yide, Wu, Philipp
Large-scale robot learning has recently shown promise in enabling robots to perform complex tasks by integrating perception, control, and optionally, language understanding into a unified framework. However, such systems continue to struggle with long-horizon, contact-rich manipulation tasks, such as the handling of deformable objects, where supervision from demonstrations is often inconsistent in quality. In such settings, reward modeling offers a natural solution: by providing grounded progress signals, it can transform noisy demonstrations into stable supervision that generalizes across diverse trajectories. In this work, we introduce a stage-aware, video-based reward modeling framework that jointly predicts the high-level task stage and fine-grained progress within each stage. Reward labels are automatically derived from natural language subtask annotations, enabling consistent progress estimation across variable-length and heterogeneous demonstrations. This design overcomes the limitations of frame-index-based labeling, which collapses in long, variable-duration tasks such as folding a T-shirt. Our reward model demonstrates robustness to demonstration variability, generalization to out-of-distribution scenarios, and strong utility for downstream policy training. Building upon this reward model, we propose the Reward-Aligned Behavior Cloning (RA-BC) framework, which selectively filters high-quality data and reweights training samples according to reward estimates. Extensive experiments demonstrate that the reward model outperforms baselines on out-of-distribution real robot policy rollouts and human demonstration validation. Our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state, dramatically surpassing vanilla behavior cloning, which attains only 8% and 0% success, respectively, under the same training dataset.
Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon robotic manipulation. The long-standing vision of enabling robots to seamlessly assist humans in household chores has inspired decades of research in robotics. From tidying living spaces to preparing meals, such capabilities hold the promise of freeing up human time and improving quality of life.
Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation
Gan, Feichen, Lu, Youcun, Zhang, Yingying, Liu, Yukun
Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
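As a rough illustration of the two ingredients named above, the following sketch builds a pseudo-return from a truncated rollout and calibrates a symmetric interval with plain split conformal prediction. This is a simplified stand-in: it uses the standard unweighted split-conformal quantile rather than the paper's time-aware, weighted calibration, and the function names, the bootstrap value estimate, and the absolute-residual score are assumptions chosen for the example.

```python
import math


def pseudo_return(rewards, bootstrap_value, gamma):
    """Discounted sum over a truncated rollout plus a bootstrapped tail.

    bootstrap_value stands in for a learned estimate of the return from the
    state where the rollout was cut off (an assumption of this sketch).
    """
    head = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return head + (gamma ** len(rewards)) * bootstrap_value


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction, scores, alpha):
    """Symmetric interval around a point prediction of the return."""
    q = conformal_quantile(scores, alpha)
    return (point_prediction - q, point_prediction + q)


# Calibration scores: |pseudo-return - predicted return| on held-out data
# (made-up numbers).
calibration_scores = [3, 1, 2, 5, 4, 9, 7, 8, 6]
interval = prediction_interval(10.0, calibration_scores, alpha=0.5)
print(pseudo_return([1, 1, 1], bootstrap_value=10, gamma=0.5), interval)
```

The finite-sample correction in `conformal_quantile` is what gives split conformal prediction its marginal coverage guarantee under exchangeability; the abstract's weighted subsampling addresses the cases where exchangeability does not hold exactly.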
Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
Tamassia, Isidoro, Böhmer, Wendelin
The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision
Xia, Yuyang, Liang, Zibo, Deng, Liwei, Zhao, Yan, Su, Han, Zheng, Kai
Autonomous driving is an emerging technology that is expected to bring significant social, economic, and environmental benefits. However, these benefits come with rising energy consumption by computation engines, limiting the driving range of vehicles, especially electric ones. Perception computing is typically the most power-intensive component, as it relies on large-scale deep learning models to extract environmental features. Recently, numerous studies have employed model compression techniques, such as sparsification, quantization, and distillation, to reduce computational consumption. However, these methods often result in either a substantial model size or a significant drop in perception accuracy compared to high-computation models. To address these challenges, we propose an energy-efficient autonomous driving framework, called EneAD, which includes an adaptive perception module and a robust decision module. In the adaptive perception module, a perception optimization strategy is designed from the perspective of data management and tuning. First, we manage multiple perception models with different computational consumption and adjust the execution framerate dynamically. Then, we define them as knobs and design a transferable tuning method based on Bayesian optimization to identify promising knob values that achieve low computation while maintaining desired accuracy. To adaptively switch the knob values in various traffic scenarios, a lightweight classification model is proposed to distinguish the perception difficulty in different scenarios. In the robust decision module, we propose a decision model based on reinforcement learning and design a regularization term to enhance driving stability in the face of perturbed perception results. EneAD can reduce perception consumption by 1.9× to 3.5× and thus improve driving range by 3.9% to 8.5%. Autonomous driving has gained broad attention from the public during the last few years [1], [2].
With intelligence, the autonomous vehicle can have a more comprehensive perception of the surrounding traffic environment and make more reasonable driving decisions compared to human drivers. As a result, it is expected to bring society a large number of benefits, including improved mobility and a significant reduction in collisions. For example, the computing platform using the Nvidia AGX Orin SoC [4] has a Thermal Design Power (TDP) of 800 W. These power demands can also increase the thermal demands on a vehicle's climate-control system.
Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
Kong, He, Hu, Die, Ge, Jingguo, Li, Liangxiong, Li, Hui, Li, Tong
Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework's effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. Ablation studies confirm that the synergy of both training stages is critical to its success.
Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing
Choo, Davin, Pan, Yuqi, Wang, Tonghan, Tambe, Milind, van Heerden, Alastair, Johnson, Cheryl
We study a sequential decision-making problem on an $n$-node graph $\mathcal{G}$ where each node has an unknown label from a finite set $\mathbf{\Omega}$, drawn from a joint distribution $\mathcal{P}$ that is Markov with respect to $\mathcal{G}$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $\mathcal{G}$ is a forest. Our implementation runs in $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|^2)$ time while using $\mathcal{O}(n \cdot |\mathbf{\Omega}|^2)$ oracle calls to $\mathcal{P}$ and $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Wang, Jiaqi, Lin, Kevin Qinghong, Cheng, James, Shou, Mike Zheng
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases while improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
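The 'thought dropout' step in stage (i) can be sketched in a few lines. The abstract only says that reasoning traces are randomly replaced with empty thoughts, so the dictionary layout, the `<think>`/`<answer>` tag format, the `drop_prob` parameter, and the function names below are assumptions for illustration rather than the released implementation.

```python
import random


def thought_dropout(example, drop_prob, rng):
    """Return a copy of the example, with its reasoning trace emptied at random.

    example is assumed to be a dict with 'thought' and 'answer' keys
    (a hypothetical layout for this sketch).
    """
    if rng.random() < drop_prob:
        return {**example, "thought": ""}
    return dict(example)


def format_target(example):
    """Render the SFT target; an empty thought yields the 'not think' format."""
    return f"<think>{example['thought']}</think><answer>{example['answer']}</answer>"


rng = random.Random(0)
example = {"thought": "2 apples plus 3 apples is 5 apples.", "answer": "5"}

always_dropped = thought_dropout(example, drop_prob=1.0, rng=rng)
never_dropped = thought_dropout(example, drop_prob=0.0, rng=rng)
print(format_target(always_dropped))
print(format_target(never_dropped))
```

Because the answer is kept in both cases, the SFT data exposes the model to both output shapes, which gives the later GRPO stage something to choose between.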
Federated Deep Reinforcement Learning for Privacy-Preserving Robotic-Assisted Surgery
Hafeez, Sana, Mulkana, Sundas Rafat, Imran, Muhammad Ali, Sevegnani, Michele
The integration of Reinforcement Learning (RL) into robotic-assisted surgery (RAS) holds significant promise for advancing surgical precision, adaptability, and autonomous decision-making. However, the development of robust RL models in clinical settings is hindered by key challenges, including stringent patient data privacy regulations, limited access to diverse surgical datasets, and high procedural variability. To address these limitations, this paper presents a Federated Deep Reinforcement Learning (FDRL) framework that enables decentralized training of RL models across multiple healthcare institutions without exposing sensitive patient information. A central innovation of the proposed framework is its dynamic policy adaptation mechanism, which allows surgical robots to select and tailor patient-specific policies in real time, thereby ensuring personalized and optimized interventions. To uphold rigorous privacy standards while facilitating collaborative learning, the FDRL framework incorporates secure aggregation, differential privacy, and homomorphic encryption techniques. Experimental results demonstrate a 60% reduction in privacy leakage compared to conventional methods, with surgical precision maintained within a 1.5% margin of a centralized baseline. This work establishes a foundational approach for adaptive, secure, and patient-centric AI-driven surgical robotics, offering a pathway toward clinical translation and scalable deployment across diverse healthcare environments.
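One common way to combine federated averaging with differential privacy is to clip each institution's model update and add Gaussian noise to the average. The sketch below shows that generic recipe only; it is not the FDRL protocol, it omits secure aggregation and homomorphic encryption entirely, and the function names, `max_norm`, and `noise_std` are assumptions introduced for the example. The noise scale is not calibrated to any particular privacy budget.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def private_average(updates, max_norm, noise_std, rng):
    """Average clipped client updates and add Gaussian noise per coordinate."""
    clipped = [clip_update(u, max_norm) for u in updates]
    dim = len(clipped[0])
    avg = [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]
    return [a + rng.gauss(0.0, noise_std) for a in avg]


rng = random.Random(0)
# Policy-parameter updates from two hypothetical hospitals.
updates = [[3.0, 4.0], [0.0, 0.0]]

noiseless = private_average(updates, max_norm=5.0, noise_std=0.0, rng=rng)
noisy = private_average(updates, max_norm=1.0, noise_std=0.1, rng=rng)
print(noiseless, noisy)
```

Clipping bounds how much any single institution can move the shared model, which is what makes the added noise meaningful as a privacy mechanism.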