Reinforcement Learning
MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving
Hu, Jia, Lian, Zhexi, Yan, Xuerun, Bi, Ruiang, Shen, Dou, Ruan, Yu, Wang, Haoran
Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding of the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated as a discrete state-space representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-loop evaluation on the NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approaches. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Closed-loop experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit via an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
A. Research motivation
In recent years, Autonomous Driving (AD) has demonstrated significant progress within transportation systems [1] [2]. However, AD vehicles still face significant challenges in exhibiting human-like behavior in highly dynamic and interactive traffic scenarios such as off-ramps and unprotected left turns [3] [4]. One critical reason is that AD vehicles lack an understanding of the underlying mechanisms of social interaction between surrounding vehicles.
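The ADE figure reported in the MPCFormer abstract above is an average displacement error. As a minimal sketch, assuming the common definition (mean Euclidean distance between predicted and ground-truth positions at matching time steps), it could be computed as follows. The function name and the toy trajectories are illustrative assumptions, not taken from the paper.

```python
import math


def average_displacement_error(predicted, actual):
    """Mean Euclidean distance between matched (x, y) points of two trajectories."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)


# Hypothetical toy data: the prediction stays in lane while the vehicle drifts sideways.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(predicted, actual)  # (0 + 1 + 2) / 3 = 1.0
```

Averaging over every step of the horizon, rather than only the final one, is what distinguishes ADE from a final displacement error.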
Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control
Stewart, Kenneth, Chapin, Samantha, Leontie, Roxana, Henshaw, Carl Glen
Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements. Future ISAM missions require increasingly autonomous robotic systems capable of adapting to the dynamic and uncertain conditions of space.
Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing
Chapin, Samantha, Stewart, Kenneth, Leontie, Roxana, Henshaw, Carl Glen
The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27, 2025, the APIARY team conducted what is, to our knowledge, the first RL control of a free-flyer in space, using the NASA Astrobee robot on board the International Space Station (ISS). A robust 6-degree-of-freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
ContactRL: Safe Reinforcement Learning based Motion Planning for Contact based Human Robot Collaboration
Mulkana, Sundas Rafat, Yu, Ronyu, Guha, Tanaya, Li, Emma
In collaborative human-robot tasks, safety requires not only avoiding collisions but also ensuring safe, intentional physical contact. We present ContactRL, a reinforcement learning (RL) based framework that directly incorporates contact safety into the reward function through force feedback. This enables a robot to learn adaptive motion profiles that minimize human-robot contact forces while maintaining task efficiency. In simulation, ContactRL achieves a low safety violation rate of 0.2% with a high task success rate of 87.7%, outperforming state-of-the-art constrained RL baselines. To guarantee deployment safety, we augment the learned policy with a kinetic energy based Control Barrier Function (eCBF) shield. Real-world experiments on a UR3e robotic platform performing small object handovers from a human hand across 360 trials confirm safe contact, with measured normal forces consistently below 10 N. These results demonstrate that ContactRL enables safe and efficient physical collaboration, thereby advancing the deployment of collaborative robots in contact-rich tasks.
A Learning-based Control Methodology for Transitioning VTOL UAVs
Lin, Zexin, Zhong, Yebin, Wan, Hanwen, Cheng, Jiu, Sun, Zhenglong, Ji, Xiaoqiang
Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. Current control methods decouple the control of altitude and position, which leads to significant vibration and limits both adaptability and the consideration of interaction between the two. In this study, we propose a novel coupled transition control methodology based on a reinforcement learning (RL) driven controller. In addition, in contrast to the conventional phase-transition approach, the ST3M method offers a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, with outstanding trajectory tracking and reduced vibrations during the transition process.
Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Yang, Guang, Yang, Tianpei, Qiao, Jingwen, Wu, Yanqing, Huo, Jing, Chen, Xingguo, Gao, Yang
Communication is one of the effective means to improve the learning of cooperative policies in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning methods with communication, owing to their limited scalability and robustness, struggle to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. We then use this model as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework that quantifies the impact of communication messages in the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
Multimodal Reinforcement Learning with Agentic Verifier for AI Agents
Tan, Reuben, Peng, Baolin, Yang, Zhengyuan, Cheng, Hao, Mees, Oier, Zhao, Theodore, Tupini, Andrea, Meijier, Isar, Wu, Qianhui, Yang, Yuncong, Liden, Lars, Gu, Yu, Zhang, Sheng, Liu, Xiaodong, Wang, Lijuan, Pollefeys, Marc, Lee, Yong Jae, Gao, Jianfeng
Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed from the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes, since different samples may require different scoring functions and teacher models may also provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning and visual hallucination, as well as robotics and embodied AI benchmarks. Critically, we demonstrate that relying only on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward hacking in MMRL. Finally, we provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.
A Multi-Agent, Policy-Gradient approach to Network Routing
Tao, Nigel, Baxter, Jonathan, Weaver, Lex
Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
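To make the policy-gradient idea in the routing abstract above concrete, here is a minimal sketch of an OLPOMDP-style update as it is commonly stated: an eligibility trace z accumulates the gradient of the log-policy with discount beta, and the parameters move along reward times z. The single-router toy problem, the fixed link delays, the softmax policy, and all names are illustrative assumptions and do not reproduce the network models used in the paper.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]


def train_router(delays, steps=5000, alpha=0.01, beta=0.5, seed=0):
    """One router learns which outgoing link to use; reward is the negative delay."""
    rng = random.Random(seed)
    theta = [0.0] * len(delays)   # one preference per outgoing link
    trace = [0.0] * len(delays)   # eligibility trace z
    for _ in range(steps):
        probs = softmax(theta)
        link = rng.choices(range(len(delays)), weights=probs)[0]
        reward = -delays[link]
        # z <- beta * z + grad log pi(link | theta); for softmax this is e_link - probs
        for i in range(len(theta)):
            grad_log = (1.0 if i == link else 0.0) - probs[i]
            trace[i] = beta * trace[i] + grad_log
        # theta <- theta + alpha * reward * z
        for i in range(len(theta)):
            theta[i] += alpha * reward * trace[i]
    return softmax(theta)


# Hypothetical delays: link 0 is fast (1 time unit), link 1 is slow (3 time units).
final_probs = train_router([1.0, 3.0])
```

With these toy settings the probability mass should shift toward the faster link. In a multi-router setting each router would run the same update on its own parameters using the shared reward signal, which is how cooperation can emerge without explicit messages.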
GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding
Gaber, Johannes, Alharbi, Meshal, Gammelli, Daniele, Zardini, Gioele
Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. On congested warehouse benchmarks from the League of Robot Runners (LRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
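The final stage described in the GRAND abstract above solves small, local assignment problems. As a rough sketch of what one such problem looks like, the snippet below matches free agents to tasks inside a single region by exhaustive search over permutations with Manhattan-distance cost. The brute-force search, the cost function, and all names are assumptions chosen for brevity and are only practical for a handful of agents; the abstract does not say which solver is used.

```python
from itertools import permutations


def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign_locally(agents, tasks):
    """Return (task index per agent, total cost) minimizing summed Manhattan distance.

    Brute force over all injective agent-to-task mappings; requires
    len(tasks) >= len(agents).
    """
    if len(tasks) < len(agents):
        raise ValueError("need at least as many tasks as agents")
    best_choice, best_cost = None, float("inf")
    for choice in permutations(range(len(tasks)), len(agents)):
        cost = sum(manhattan(agents[i], tasks[t]) for i, t in enumerate(choice))
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return best_choice, best_cost


# Hypothetical region with two free agents and two pending pickups.
agents = [(0, 0), (5, 5)]
tasks = [(5, 4), (1, 0)]
assignment, total_cost = assign_locally(agents, tasks)
# agent 0 takes task 1 (distance 1), agent 1 takes task 0 (distance 1)
```

Keeping each problem confined to one region is what keeps the search small; a production scheduler would more likely use a polynomial-time assignment solver.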
Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL
Qi, Jiaju, Lei, Lei, Jonsson, Thorsteinn, Niyato, Dusit
The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to promote sustainable and low-carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging, especially under real-world conditions, where uncertainties in PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multi-source uncertainties. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision-making. We develop a novel HDRL algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization Lagrangian (DAC-MAPPO-Lagrangian), which integrates Lagrangian relaxation into the Double Actor-Critic (DAC) framework. At the high level, we adopt a centralized PPO-Lagrangian algorithm to learn safe charger allocation policies. At the low level, we incorporate MAPPO-Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real-world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance, while maintaining fast convergence speed. Recent advances in sustainable transportation have emphasized the critical role of Electric Buses (EBs) in mitigating urban pollution, reducing greenhouse gas emissions, and improving public transit comfort [1], [2]. However, the electrification of bus fleets introduces significant challenges, including increased strain on local power infrastructures and rising charging costs.
To address these issues, two key approaches have gained substantial attention in recent years.
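The Lagrangian relaxation mentioned in the electric bus abstract above is typically realized by folding the safety cost into the objective with a multiplier that is adapted by dual ascent. The sketch below shows only that generic mechanism under stated assumptions: the function names, the learning rate, and the per-episode cost values are hypothetical, and the paper's actual DAC-MAPPO-Lagrangian update is not reproduced here.

```python
def penalized_return(reward_return, cost_return, multiplier):
    """Objective the policy is trained on: reward minus weighted safety cost."""
    return reward_return - multiplier * cost_return


def update_multiplier(multiplier, episode_cost, cost_limit, lr=0.5):
    """Dual ascent step: raise the multiplier when the constraint is violated,
    lower it when there is slack, and never let it go negative."""
    return max(0.0, multiplier + lr * (episode_cost - cost_limit))


# Hypothetical safety costs (e.g. battery-depletion events) over three episodes.
cost_limit = 1.0
multiplier = 0.0
history = []
for episode_cost in [3.0, 2.0, 0.5]:
    multiplier = update_multiplier(multiplier, episode_cost, cost_limit)
    history.append(multiplier)
# history == [1.0, 1.5, 1.25]: the penalty grows while the limit is exceeded
# and relaxes once the episode cost falls below it.
```

The policy update itself (PPO in the abstract) would then maximize `penalized_return` using the current multiplier, so safety pressure tightens automatically while violations persist.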