Reinforcement Learning
Incorporating Quality of Life in Climate Adaptation Planning via Reinforcement Learning
Costa, Miguel, Vandervoort, Arthur, Drews, Martin, Morrissey, Karyn, Pereira, Francisco C.
Urban flooding is expected to increase in frequency and severity as a consequence of climate change, causing wide-ranging impacts that include a decrease in urban Quality of Life (QoL). Meanwhile, policymakers must devise adaptation strategies that can cope with the uncertain nature of climate change and the complex and dynamic nature of urban flooding. Reinforcement Learning (RL) holds significant promise in tackling such complex, dynamic, and uncertain problems. Because of this, we use RL to identify which climate adaptation pathways lead to a higher QoL in the long term. We do this using an Integrated Assessment Model (IAM) which combines a rainfall projection model, a flood model, a transport accessibility model, and a quality of life index. Our preliminary results suggest that this approach can be used to learn optimal adaptation measures and it outperforms other realistic and real-world planning strategies.
Collaborative Assembly Policy Learning of a Sightless Robot
Zhang, Zeqing, Lu, Weifeng, Yang, Lei, Jing, Wei, Tang, Bowei, Pan, Jia
This paper explores a physical human-robot collaboration (pHRC) task involving the joint insertion of a board into a frame by a sightless robot and a human operator. While admittance control is commonly used in pHRC tasks, it can be challenging to measure the force/torque applied by the human for accurate human intent estimation, limiting the robot's ability to assist in the collaborative task. Other methods that attempt to solve pHRC tasks using reinforcement learning (RL) are also unsuitable for the board-insertion task due to its safety constraints and sparse rewards. Therefore, we propose a novel RL approach that utilizes a human-designed admittance controller to facilitate more active robot behavior and reduce human effort. Through simulation and real-world experiments, we demonstrate that our approach outperforms admittance control in terms of success rate and task completion time. Additionally, we observed a significant reduction in measured force/torque when using our proposed approach compared to admittance control. The video of the experiments is available at https://youtu.be/va07Gw6YIog.
Optimizing Earth-Moon Transfer and Cislunar Navigation: Integrating Low-Energy Trajectories, AI Techniques and GNSS-R Technologies
Muhammad, Arsalan, Ahmed, Wasiu Akande, Ojonugwa, Omada Friday, Biswas, Paul Puspendu
The rapid growth of cislunar activities, including lunar landings, the Lunar Gateway, and in-space refueling stations, requires advances in cost-efficient trajectory design and reliable integration of navigation and remote sensing. Traditional Earth-Moon transfers suffer from rigid launch windows and high propellant demands, while Earth-based GNSS systems provide little to no coverage beyond geostationary orbit. This limits autonomy and environmental awareness in cislunar space. This review compares four major transfer strategies by evaluating velocity requirements, flight durations, and fuel efficiency, and by identifying their suitability for both crewed and robotic missions. The emerging role of artificial intelligence and machine learning is highlighted: convolutional neural networks support automated crater recognition and digital terrain model generation, while deep reinforcement learning enables adaptive trajectory refinement during descent and landing to reduce risk and decision latency. The study also examines how GNSS-Reflectometry and advanced Positioning, Navigation, and Timing architectures can extend navigation capabilities beyond current limits. GNSS-R can act as a bistatic radar for mapping lunar ice, soil properties, and surface topography, while PNT systems support autonomous rendezvous, Lagrange point station-keeping, and coordinated satellite swarm operations. Combining these developments establishes a scalable framework for sustainable cislunar exploration and long-term human and robotic presence.
Learning Natural and Robust Hexapod Locomotion over Complex Terrains via Motion Priors based on Deep Reinforcement Learning
Liu, Xin, Wu, Jinze, Li, Yinghui, Qi, Chenkun, Xue, Yufei, Gao, Feng
Abstract-- Multi-legged robots offer enhanced stability to navigate complex terrains with their multiple legs interacting with the environment. However, how to effectively coordinate the multiple legs in a larger action exploration space to generate natural and robust movements is a key issue. In this paper, we introduce a motion prior-based approach, successfully applying deep reinforcement learning algorithms to a real hexapod robot. We generate a dataset of optimized motion priors, and train an adversarial discriminator based on the priors to guide the hexapod robot to learn natural gaits. The learned policy is then successfully transferred to a real hexapod robot, and demonstrate natural gait patterns and remarkable robustness without visual information in complex terrains. This is the first time that a reinforcement learning controller has been used to achieve complex terrain walking on a real hexapod robot.
Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments
Li, Gangyang, Shi, Qing, Hu, Youhao, Hu, Jincheng, Wang, Zhongyuan, Wang, Xinlong, Luo, Shaqi
Humanoids hold great potential for service, industrial, and rescue applications, in which robots must sustain whole-body stability while performing intense, contact-rich interactions with the environment. However, enabling humanoids to generate human-like, adaptive responses under such conditions remains a major challenge. To address this, we propose Thor, a humanoid framework for human-level whole-body reactions in contact-rich environments. Based on the robot's force analysis, we design a force-adaptive torso-tilt (FAT2) reward function to encourage humanoids to exhibit human-like responses during force-interaction tasks. To mitigate the high-dimensional challenges of humanoid control, Thor introduces a reinforcement learning architecture that decouples the upper body, waist, and lower body. Each component shares global observations of the whole body and jointly updates its parameters. Finally, we deploy Thor on the Unitree G1, and it substantially outperforms baselines in force-interaction tasks. Specifically, the robot achieves a peak pulling force of 167.7 N (approximately 48% of the G1's body weight) when moving backward and 145.5 N when moving forward, representing improvements of 68.9% and 74.7%, respectively, compared with the best-performing baseline. Moreover, Thor is capable of pulling a loaded rack (130 N) and opening a fire door with one hand (60 N). These results highlight Thor's effectiveness in enhancing humanoid force-interaction capabilities.
Toward Humanoid Brain-Body Co-design: Joint Optimization of Control and Morphology for Fall Recovery
Yue, Bo, Xu, Sheng, Jia, Kui, Liu, Guiliang
Humanoid robots represent a central frontier in embodied intelligence, as their anthropomorphic form enables natural deployment in humans' workspace. Brain-body co-design for humanoids presents a promising approach to realizing this potential by jointly optimizing control policies and physical morphology. Within this context, fall recovery emerges as a critical capability. It not only enhances safety and resilience but also integrates naturally with locomotion systems, thereby advancing the autonomy of humanoids. In this paper, we propose RoboCraft, a scalable humanoid co-design framework for fall recovery that iteratively improves performance through the coupled updates of control policy and morphology. A shared policy pretrained across multiple designs is progressively finetuned on high-performing morphologies, enabling efficient adaptation without retraining from scratch. Concurrently, morphology search is guided by human-inspired priors and optimization algorithms, supported by a priority buffer that balances reevaluation of promising candidates with the exploration of novel designs. Experiments show that RoboCraft achieves an average performance gain of 44.55% on seven public humanoid robots, with morphology optimization drives at least 40% of improvements in co-designing four humanoid robots, underscoring the critical role of humanoid co-design.
Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning
Zeng, Jack, Gimenez, Andreu Matoses, Vinitsky, Eugene, Alonso-Mora, Javier, Sun, Sihao
This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial-manipulation-marl
Towards Formalizing Reinforcement Learning Theory
In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.
Directional-Clamp PPO
Karpel, Gilad, Zhou, Ruida, Sabach, Shoham, Ghavamzadeh, Mohammad
Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move to the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move to the "wrong" direction during the PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes the actions going to the strict "wrong" direction regions, where the advantage is positive (negative) and importance ratio falls below (above) $1 - β$ ($1+β$), for a tunable parameter $β\in (0, 1)$. The penalty is by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing on modifying the objective's behavior in the "right" direction, across various MuJoCo environments, using different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning
Castanet, Nicolas, Sigaud, Olivier, Lamprier, Sylvain
Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $β$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training nor prior environment knowledge.