Reinforcement Learning
Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step
Yang, Jingyi, Chen, Guanxu, Hu, Xuhao, Shao, Jing
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding are non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
Continual Learning to Generalize Forwarding Strategies for Diverse Mobile Wireless Networks
Park, Cheonjin, Manfredi, Victoria, Zhang, Xiaolan, Liu, Chengyi, Wolfe, Alicia P, Song, Dongjin, Tasneem, Sarah, Wang, Bing
Deep reinforcement learning (DRL) has been successfully used to design forwarding strategies for multi-hop mobile wireless networks. While such strategies can be used directly for networks with varied connectivity and dynamic conditions, developing generalizable approaches that are effective on scenarios significantly different from the training environment remains largely unexplored. In this paper, we propose a framework to address the challenge of generalizability by (i) developing a generalizable base model considering diverse mobile network scenarios, and (ii) using the generalizable base model for new scenarios, and when needed, fine-tuning the base model using a small amount of data from the new scenarios. To support this framework, we first design new features to characterize network variation and feature quality, thereby improving the information used in DRL-based forwarding decisions. We then develop a continual learning (CL) approach able to train DRL models across diverse network scenarios without ``catastrophic forgetting.'' Using extensive evaluation, including real-world scenarios in two cities, we show that our approach is generalizable to unseen mobility scenarios. Compared to a state-of-the-art heuristic forwarding strategy, it leads to up to 78% reduction in delay, 24% improvement in delivery rate, and comparable or slightly higher number of forwards.
Integrated Communication and Control for Energy-Efficient UAV Swarms: A Multi-Agent Reinforcement Learning Approach
Sun, Tianjiao, Guo, Ningyan, Gu, Haozhe, Peng, Yanyan, Feng, Zhiyong
The deployment of unmanned aerial vehicle (UAV) swarm-assisted communication networks has become an increasingly vital approach for remediating coverage limitations in infrastructure-deficient environments, with especially pressing applications in temporary scenarios, such as emergency rescue, military and security operations, and remote area coverage. However, complex geographic environments lead to unpredictable and highly dynamic wireless channel conditions, resulting in frequent interruptions of air-to-ground (A2G) links that severely constrain the reliability and quality of service in UAV swarm-assisted mobile communications. To improve the quality of UAV swarm-assisted communications in complex geographic environments, we propose an integrated communication and control co-design mechanism. Given the stringent energy constraints inherent in UAV swarms, our proposed mechanism is designed to optimize energy efficiency while maintaining an equilibrium between equitable communication rates for mobile ground users (GUs) and UAV energy expenditure. We formulate the joint resource allocation and 3D trajectory control problem as a Markov decision process (MDP), and develop a multi-agent reinforcement learning (MARL) framework to enable real-time coordinated actions across the UAV swarm. To optimize the action policy of UAV swarms, we propose a novel multi-agent hybrid proximal policy optimization with action masking (MAHPPO-AM) algorithm, specifically designed to handle complex hybrid action spaces. The algorithm incorporates action masking to enforce hard constraints in high-dimensional action spaces. Experimental results demonstrate that our approach achieves a fairness index of 0.99 while reducing energy consumption by up to 25% compared to baseline methods.
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
Li, Pengxiang, Hu, Zechen, Shang, Zirui, Wu, Jingrong, Liu, Yang, Liu, Hui, Gao, Zhi, Shi, Chenrui, Zhang, Bofei, Zhang, Zihao, Shi, Xiaochuan, YU, Zedong, Wu, Yuwei, Wu, Xinxiao, Jia, Yunde, Xiang, Liuyu, He, Zhaofeng, Li, Qing
Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving the system efficiency: 1.6*GPU utilization for rollout, 1.9* training throughput, and 5.5* environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling for policy mismatch between policy rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning
Luan, Yao, Mu, Ni, Yang, Yiqin, Xu, Bo, Jia, Qing-Shan
Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: Comparing segments from mismatched stages, such as movement versus manipulation, results in uninformative feedback, thus hindering policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address this issue, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance, then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages, without predefined task knowledge, and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR's superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.
GES-UniGrasp: A Two-Stage Dexterous Grasping Strategy With Geometry-Based Expert Selection
Xu, Fangting, Zhu, Jilin, Gu, Xiaoming, Tang, Jianzhong
Robust and human-like dexterous grasping of general objects is a critical capability for advancing intelligent robotic manipulation in real-world scenarios. However, existing reinforcement learning methods guided by grasp priors often result in unnatural behaviors. In this work, we present \textit{ContactGrasp}, a robotic dexterous pre-grasp and grasp dataset that explicitly accounts for task-relevant wrist orientation and thumb-index pinching coordination. The dataset covers 773 objects in 82 categories, providing a rich foundation for training human-like grasp strategies. Building upon this dataset, we perform geometry-based clustering to group objects by shape, enabling a two-stage Geometry-based Expert Selection (GES) framework that selects among specialized experts for grasping diverse object geometries, thereby enhancing adaptability to diverse shapes and generalization across categories. Our approach demonstrates natural grasp postures and achieves high success rates of 99.4\% and 96.3\% on the train and test sets, respectively, showcasing strong generalization and high-quality grasp execution.
Zero-shot Whole-Body Manipulation with a Large-Scale Soft Robotic Torso via Guided Reinforcement Learning
Johnson, Curtis C., Alessi, Carlo, Falotico, Egidio, Killpack, Marc D.
Whole-body manipulation is a powerful yet underexplored approach that enables robots to interact with large, heavy, or awkward objects using more than just their end-effectors. Soft robots, with their inherent passive compliance, are particularly well-suited for such contact-rich manipulation tasks, but their uncertainties in kinematics and dynamics pose significant challenges for simulation and control. In this work, we address this challenge with a simulation that can run up to 350x real time on a single thread in MuJoCo and provide a detailed analysis of the critical tradeoffs between speed and accuracy for this simulation. Using this framework, we demonstrate a successful zero-shot sim-to-real transfer of a learned whole-body manipulation policy, achieving an 88% success rate on the Baloo hardware platform. We show that guiding RL with a simple motion primitive is critical to this success where standard reward shaping methods struggled to produce a stable and successful policy for whole-body manipulation. Furthermore, our analysis reveals that the learned policy does not simply mimic the motion primitive. It exhibits beneficial reactive behavior, such as re-grasping and perturbation recovery. We analyze and contrast this learned policy against an open-loop baseline to show that the policy can also exhibit aggressive over-corrections under perturbation. To our knowledge, this is the first demonstration of forceful, six-DoF whole-body manipulation using two continuum soft arms on a large-scale platform (10 kg payloads), with zero-shot policy transfer.
Solve Smart, Not Often: Policy Learning for Costly MILP Re-solving
Ai, Rui, Barbalho, Hugo De Oliveira, Li, Sirui, Robsky, Alexei, Simchi-Levi, David, Menache, Ishai
A common challenge in real-time operations is deciding whether to re-solve an optimization problem or continue using an existing solution. While modern data platforms may collect information at high frequencies, many real-time operations require repeatedly solving computationally intensive optimization problems formulated as Mixed-Integer Linear Programs (MILPs). Determining when to re-solve is, therefore, an economically important question. This problem poses several challenges: 1) How to characterize solution optimality and solving cost; 2) How to detect environmental changes and select beneficial samples for solving the MILP; 3) Given the large time horizon and non-MDP structure, vanilla reinforcement learning (RL) methods are not directly applicable and tend to suffer from value function explosion. Existing literature largely focuses on heuristics, low-data settings, and smooth objectives, with little focus on common NP-hard MILPs. We propose a framework called Proximal Policy Optimization with Change Point Detection (POC), which systematically offers a solution for balancing performance and cost when deciding appropriate re-solving times. Theoretically, we establish the relationship between the number of re-solves and the re-solving cost. To test our framework, we assemble eight synthetic and real-world datasets, and show that POC consistently outperforms existing baselines by 2%-17%. As a side benefit, our work fills the gap in the literature by introducing real-time MILP benchmarks and evaluation criteria.
Space Robotics Bench: Robot Learning Beyond Earth
Orsula, Andrej, Geist, Matthieu, Olivares-Mendez, Miguel, Martinez, Carol
The growing ambition for space exploration demands robust autonomous systems that can operate in unstructured environments under extreme extraterrestrial conditions. The adoption of robot learning in this domain is severely hindered by the prohibitive cost of technology demonstrations and the limited availability of data. To bridge this gap, we introduce the Space Robotics Bench, an open-source simulation framework for robot learning in space. It offers a modular architecture that integrates on-demand procedural generation with massively parallel simulation environments to support the creation of vast and diverse training distributions for learning-based agents. To ground research and enable direct comparison, the framework includes a comprehensive suite of benchmark tasks that span a wide range of mission-relevant scenarios. We establish performance baselines using standard reinforcement learning algorithms and present a series of experimental case studies that investigate key challenges in generalization, end-to-end learning, adaptive control, and sim-to-real transfer. Our results reveal insights into the limitations of current methods and demonstrate the utility of the framework in producing policies capable of real-world operation. These contributions establish the Space Robotics Bench as a valuable resource for developing, benchmarking, and deploying the robust autonomous systems required for the final frontier.
Online Dynamic Goal Recognition in Gym Environments
Matan, Shamir, Osher, Elhadad, Ben, Nageris, Reuth, Mirsky
Goal Recognition (GR) is the task of inferring an agent's intended goal from partial observations of its behavior, typically in an online and one-shot setting. Despite recent advances in model-free GR, particularly in applications such as human-robot interaction, surveillance, and assistive systems, the field remains fragmented due to inconsistencies in benchmarks, domains, and evaluation protocols. To address this, we introduce gr-libs (https://github.com/MatanShamir1/gr_libs) and gr-envs (https://github.com/MatanShamir1/gr_envs), two complementary open-source frameworks that support the development, evaluation, and comparison of GR algorithms in Gym-compatible environments. gr-libs includes modular implementations of MDP-based GR baselines, diagnostic tools, and evaluation utilities. gr-envs provides a curated suite of environments adapted for dynamic and goal-directed behavior, along with wrappers that ensure compatibility with standard reinforcement learning toolkits. Together, these libraries offer a standardized, extensible, and reproducible platform for advancing GR research. Both packages are open-source and available on GitHub and PyPI.