Reinforcement Learning
A short methodological review on social robot navigation benchmarking
Chhetri, Pranup, Torrejon, Alejandro, Eslava, Sergio, Manso, Luis J.
Social Robot Navigation is the skill that allows robots to move efficiently in human-populated environments while ensuring safety, comfort, and trust. Unlike other areas of research, the scientific community has not yet achieved an agreement on how Social Robot Navigation should be benchmarked. This is notably important, as the lack of a de facto standard to benchmark Social Robot Navigation can hinder the progress of the field and may lead to contradicting conclusions. Motivated by this gap, we contribute with a short review focused exclusively on benchmarking trends in the period from January 2020 to July 2025. Of the 130 papers identified by our search using IEEE Xplore, we analysed the 85 papers that met the criteria of the review. This review addresses the metrics used in the literature for benchmarking purposes, the algorithms employed in such benchmarks, the use of human surveys for benchmarking, and how conclusions are drawn from the benchmarking results, when applicable.
Reinforcement learning-guided optimization of critical current in high-temperature superconductors
Cheng, Mouyang, Wan, Qiwei, Yu, Bowen, Rha, Eunbi, Landry, Michael J, Li, Mingda
High-temperature superconductors are essential for next-generation energy and quantum technologies, yet their performance is often limited by the critical current density ($J_c$), which is strongly influenced by microstructural defects. Optimizing $J_c$ through defect engineering is challenging due to the complex interplay of defect type, density, and spatial correlation. Here we present an integrated workflow that combines reinforcement learning (RL) with time-dependent Ginzburg-Landau (TDGL) simulations to autonomously identify optimal defect configurations that maximize $J_c$. In our framework, TDGL simulations generate current-voltage characteristics to evaluate $J_c$, which serves as the reward signal that guides the RL agent to iteratively refine defect configurations. We find that the agent discovers optimal defect densities and correlations in two-dimensional thin-film geometries, enhancing vortex pinning and $J_c$ relative to the pristine thin-film, approaching 60\% of theoretical depairing limit with up to 15-fold enhancement compared to random initialization. This RL-driven approach provides a scalable strategy for defect engineering, with broad implications for advancing HTS applications in fusion magnets, particle accelerators, and other high-field technologies.
A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems
Khaniki, Mohammad Ali Labbaf, Taroodi, Fateme, Safizadeh, Benyamin
Controlling high-dimensional stochastic systems, critical in robotics, autonomous vehicles, and hyperchaotic systems, faces the curse of dimensionality, lacks temporal abstraction, and often fails to ensure stochastic stability. To overcome these limitations, this study introduces the Multi-Timescale Lyapunov-Constrained Hierarchical Reinforcement Learning (MTLHRL) framework. MTLHRL integrates a hierarchical policy within a semi-Markov Decision Process (SMDP), featuring a high-level policy for strategic planning and a low-level policy for reactive control, which effectively manages complex, multi-timescale decision-making and reduces dimensionality overhead. Stability is rigorously enforced using a neural Lyapunov function optimized via Lagrangian relaxation and multi-timescale actor-critic updates, ensuring mean-square boundedness or asymptotic stability in the face of stochastic dynamics. The framework promotes efficient and reliable learning through trust-region constraints and decoupled optimization. Extensive simulations on an 8D hyperchaotic system and a 5-DOF robotic manipulator demonstrate MTLHRL's empirical superiority. It significantly outperforms baseline methods in both stability and performance, recording the lowest error indices (e.g., Integral Absolute Error (IAE): 3.912 in hyperchaotic control and IAE: 1.623 in robotics), achieving faster convergence, and exhibiting superior disturbance rejection. MTLHRL offers a theoretically grounded and practically viable solution for robust control of complex stochastic systems.
BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles
Miangoleh, Seyed Ahmad Hosseini, Aghdasian, Amin Jalal, Abdollahi, Farzaneh
In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available.
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
Liu, Tianhui, Pang, Hetian, Zhang, Xin, Feng, Jie, Li, Yong, Hui, Pan
Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce \textbf{CityRiSE}, a novel framework for \textbf{R}eason\textbf{i}ng urban \textbf{S}ocio-\textbf{E}conomic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
Right Place, Right Time: Market Simulation-based RL for Execution Optimisation
Olby, Ollie, Bacalum, Andreea, Baggott, Rory, Stillman, Namid
Execution algorithms are vital to modern trading, they enable market participants to execute large orders while minimising market impact and transaction costs. As these algorithms grow more sophisticated, optimising them becomes increasingly challenging. In this work, we present a reinforcement learning (RL) framework for discovering optimal execution strategies, evaluated within a reactive agent-based market simulator. This simulator creates reactive order flow and allows us to decompose slippage into its constituent components: market impact and execution risk. We assess the RL agent's performance using the efficient frontier based on work by Almgren and Chriss, measuring its ability to balance risk and cost. Results show that the RL-derived strategies consistently outperform baselines and operate near the efficient frontier, demonstrating a strong ability to optimise for risk and impact. These findings highlight the potential of reinforcement learning as a powerful tool in the trader's toolkit.
Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics
Magnino, Lorenzo, Shao, Kai, Wu, Zida, Shen, Jiacheng, Lauriรจre, Mathieu
Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large-scale multi-agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs, existing methods are typically limited to finite spaces or stationary models, hindering their applicability to real-world problems. This paper introduces a novel deep reinforcement learning (DRL) algorithm specifically designed for non-stationary continuous MFGs. The proposed approach builds upon a Fictitious Play (FP) methodology, leveraging DRL for best-response computation and supervised learning for average policy representation. Furthermore, it learns a representation of the time-dependent population distribution using a Conditional Normalizing Flow. To validate the effectiveness of our method, we evaluate it on three different examples of increasing complexity. By addressing critical limitations in scalability and density approximation, this work represents a significant advancement in applying DRL techniques to complex MFG problems, bringing the field closer to real-world multi-agent systems.
When UAV Swarm Meets IRS: Collaborative Secure Communications in Low-altitude Wireless Networks
Li, Jiahui, Liang, Xinyue, Sun, Geng, Kang, Hui, Wang, Jiacheng, Niyato, Dusit, Mao, Shiwen, Jamalipour, Abbas
Abstract--Low-altitude wireless networks (LA WNs) represent a promising architecture that integrates unmanned aerial vehicles (UA Vs) as aerial nodes to provide enhanced coverage, reliability, and throughput for diverse applications. However, these networks face significant security vulnerabilities from both known and potential unknown eavesdroppers, which may threaten data confidentiality and system integrity. T o solve this critical issue, we propose a novel secure communication framework for LA WNs where the selected UA Vs within a swarm function as a virtual antenna array (V AA), complemented by intelligent reflecting surface (IRS) to create a robust defense against eavesdropping attacks. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the secrecy rate while minimizing the maximum sidelobe level and total energy consumption, requiring joint optimization of UA V excitation current weights, flight trajectories, and IRS phase shifts. This problem presents significant difficulties due to the dynamic nature of the system and heterogeneous components. Thus, we first transform the problem into a heterogeneous Markov decision process (MDP). Then, we propose a heterogeneous multi-agent control approach (HMCA) that integrates a dedicated IRS control policy with a multi-agent soft actor-critic framework for UA V control, which enables coordinated operation across heterogeneous network elements. Simulation results show that the proposed HMCA achieves superior performance compared to baseline approaches in terms of secrecy rate improvement, sidelobe suppression, and energy efficiency. Furthermore, we find that the collaborative and passive beamforming synergy between V AA and IRS creates robust security guarantees when the number of UA Vs increases. Jiahui Li, Xinyue Liang, and Hui Kang are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (E-mails: lijiahui@jlu.edu.cn; Geng Sun is with the College of Computer Science and Technology, Jilin University, Changchun 130012, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China. He is also with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 (E-mail: sungeng@jlu.edu.cn).
STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks
Liang, Xinyue, Kang, Hui, Che, Junwei, Li, Jiahui, Sun, Geng, Wu, Qingqing, Wang, Jiacheng, Niyato, Dusit
Abstract--While low-altitude wireless networks (LA WNs) based on uncrewed aerial vehicles (UA Vs) offer high mobility, flexibility, and coverage for urban communications, they face severe signal attenuation in dense environments due to obstructions. T o address this critical issue, we consider introducing collaborative beamforming (CB) of UA Vs and omnidirectional reconfigurable beamforming (ORB) of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (ST AR-RIS) to enhance the signal quality and directionality. On this basis, we formulate a joint rate and energy optimization problem (JREOP) to maximize the transmission rate of the overall system, while minimizing the energy consumption of the UA V swarm. Due to the non-convex and NP-hard nature of JREOP, we propose a heterogeneous multi-agent collaborative dynamic (HMCD) optimization framework, which has two core components. The first component is a simulated annealing (SA)-based ST AR-RIS control method, which dynamically optimizes reflection and transmission coefficients to enhance signal propagation. The second component is an improved multi-agent deep reinforcement learning (MADRL) control method, which incorporates a self-attention evaluation mechanism to capture interactions between UA Vs and an adaptive velocity transition mechanism to enhance training stability. Simulation results demonstrate that HMCD outperforms various baselines in terms of convergence speed, average transmission rate, and energy consumption. Further analysis reveals that the average transmission rate of the overall system scales positively with both UA V count and ST AR-RIS element numbers. Index T erms--UA V, ST AR-RIS, secure communications, collaborative beamforming, multi-agent deep reinforcement learning. Xinyue Liang, Hui Kang, Junwei Che, and Jiahui Li are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (e-mails: xyliang25@mails.jlu.edu.cn; Geng Sun is with the College of Computer Science and Technology, Jilin University, Changchun 130012, China, and with Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; he is also affiliated with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 (e-mail: sungeng@jlu.edu.cn).
Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability
Kuo, Po-Chen, Hou, Han, Dabney, Will, Walker, Edgar Y.
Learning a compact representation of history is critical for planning and generalization in partially observable environments. While meta-reinforcement learning (RL) agents can attain near Bayes-optimal policies, they often fail to learn the compact, interpretable Bayes-optimal belief states. This representational inefficiency potentially limits the agent's adaptability and generalization capacity. Inspired by predictive coding in neuroscience--which suggests that the brain predicts sensory inputs as a neural implementation of Bayesian inference--and by auxiliary predictive objectives in deep RL, we investigate whether integrating self-supervised predictive coding modules into meta-RL can facilitate learning of Bayes-optimal representations. Through state machine simulation, we show that meta-RL with predictive modules consistently generates more interpretable representations that better approximate Bayes-optimal belief states compared to conventional meta-RL across a wide variety of tasks, even when both achieve optimal policies. In challenging tasks requiring active information seeking, only meta-RL with predictive modules successfully learns optimal representations and policies, whereas conventional meta-RL struggles with inadequate representation learning. Finally, we demonstrate that better representation learning leads to improved generalization. Our results strongly suggest the role of predictive learning as a guiding principle for effective representation learning in agents navigating partial observability.