Reinforcement Learning
Connections between reinforcement learning with feedback,test-time scaling, and diffusion guidance: An anthology
Jiao, Yuchen, Chen, Yuxin, Li, Gen
The rise of large language models (LLMs) has catalyzed a diverse array of post-training techniques that are reshaping the frontiers of artificial intelligence. Approaches such as reinforcement learning with feedback -- e.g., reinforcement learning with human feedback (RLHF) (Kaufmann et al., 2023) and reinforcement learning with internal feedback (RLIF) (Zhao et al., 2025) -- and test-time scaling methods (e.g., best-of-N sampling) (Snell et al., 2024) have become pivotal in enhancing model performance. Historically, these approaches have evolved along nearly independent trajectories, with their mathematical underpinnings developed largely in isolation. This technical note reflects on some basic, yet intricate, connections among these paradigms, offering some unified perspectives that might inform and improve the design of each.
Hybrid Reinforcement Learning and Search for Flight Trajectory Planning
Luise, Alberto, Lombardi, Michele, Koenigsbuch, Florent Teichteil
This paper explores the combination of Reinforcement Learning (RL) and search-based path planners to speed up the optimization of flight paths for airliners, where in case of emergency a fast route re-calculation can be crucial. The fundamental idea is to train an RL Agent to pre-compute near-optimal paths based on location and atmospheric data and use those at runtime to constrain the underlying path planning solver and find a solution within a certain distance from the initial guess. The approach effectively reduces the size of the solver's search space, significantly speeding up route optimization. Although global optimality is not guaranteed, empirical results conducted with Airbus aircraft's performance models show that fuel consumption remains nearly identical to that of an unconstrained solver, with deviations typically within 1%. At the same time, computation speed can be improved by up to 50% as compared to using a conventional solver alone.
Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables
Chen, Yang, Lin, Xiao, Yan, Bo, Zhang, Libo, Liu, Jiamou, Tan, Neset รzkan, Witbrock, Michael
Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs.
Avoidance of an unexpected obstacle without reinforcement learning: Why not using advanced control-theoretic tools?
This communication on collision avoidance with unexpected obstacles is motivated by some critical appraisals on reinforcement learning (RL) which "requires ridiculously large numbers of trials to learn any new task" (Yann LeCun). We use the classic Dubins' car in order to replace RL with flatness-based control, combined with the HEOL feedback setting, and the latest model-free predictive control approach. The two approaches lead to convincing computer experiments where the results with the model-based one are only slightly better. They exhibit a satisfactory robustness with respect to randomly generated mismatches/disturbances, which become excellent in the model-free case. Those properties would have been perhaps difficult to obtain with today's popular machine learning techniques in AI. Finally, we should emphasize that our two methods require a low computational burden.
A Comprehensive Review of Multi-Agent Reinforcement Learning in Video Games
Li, Zhengyang, Ji, Qijin, Ling, Xinghong, Liu, Quan
Recent advancements in multi-agent reinforcement learning (MARL) have demonstrated its application potential in modern games. Beginning with foundational work and progressing to landmark achievements such as AlphaStar in StarCraft II and OpenAI Five in Dota 2, MARL has proven capable of achieving superhuman performance across diverse game environments through techniques like self-play, supervised learning, and deep reinforcement learning. With its growing impact, a comprehensive review has become increasingly important in this field. This paper aims to provide a thorough examination of MARL's application from turn-based two-agent games to real-time multi-agent video games including popular genres such as Sports games, First-Person Shooter (FPS) games, Real-Time Strategy (RTS) games and Multiplayer Online Battle Arena (MOBA) games. We further analyze critical challenges posed by MARL in video games, including nonstationary, partial observability, sparse rewards, team coordination, and scalability, and highlight successful implementations in games like Rocket League, Minecraft, Quake III Arena, StarCraft II, Dota 2, Honor of Kings, etc. This paper offers insights into MARL in video game AI systems, proposes a novel method to estimate game complexity, and suggests future research directions to advance MARL and its applications in game development, inspiring further innovation in this rapidly evolving field.
AutoGrid AI: Deep Reinforcement Learning Framework for Autonomous Microgrid Management
Guo, Kenny, Eckhert, Nicholas, Chhajer, Krish, Abeykoon, Luthira, Schell, Lorne
--We present a deep reinforcement learning-based framework for autonomous microgrid management. Using deep reinforcement learning and time-series forecasting models, we optimize microgrid energy dispatch strategies to minimize costs and maximize the utilization of renewable energy sources such as solar and wind. Our approach integrates the transformer architecture for forecasting of renewable generation and a proximal-policy optimization (PPO) agent to make decisions in a simulated environment. Our experimental results demonstrate significant improvements in both energy efficiency and operational resilience when compared to traditional rule-based methods. This work contributes to advancing smart-grid technologies in pursuit of zero-carbon energy systems. We finally provide an open-source framework for simulating several microgrid environments.
Diffusion-RL Based Air Traffic Conflict Detection and Resolution Method
Li, Tonghe, Liu, Jixin, Zeng, Weili, Jiang, Hao
In the context of continuously rising global air traffic, efficient and safe Conflict Detection and Resolution (CD&R) is paramount for air traffic management. Although Deep Reinforcement Learning (DRL) offers a promising pathway for CD&R automation, existing approaches commonly suffer from a "unimodal bias" in their policies. This leads to a critical lack of decision-making flexibility when confronted with complex and dynamic constraints, often resulting in "decision deadlocks." To overcome this limitation, this paper pioneers the integration of diffusion probabilistic models into the safety-critical task of CD&R, proposing a novel autonomous conflict resolution framework named Diffusion-AC. Diverging from conventional methods that converge to a single optimal solution, our framework models its policy as a reverse denoising process guided by a value function, enabling it to generate a rich, high-quality, and multimodal action distribution. This core architecture is complemented by a Density-Progressive Safety Curriculum (DPSC), a training mechanism that ensures stable and efficient learning as the agent progresses from sparse to high-density traffic environments. Extensive simulation experiments demonstrate that the proposed method significantly outperforms a suite of state-of-the-art DRL benchmarks. Most critically, in the most challenging high-density scenarios, Diffusion-AC not only maintains a high success rate of 94.1% but also reduces the incidence of Near Mid-Air Collisions (NMACs) by approximately 59% compared to the next-best-performing baseline, significantly enhancing the system's safety margin. This performance leap stems from its unique multimodal decision-making capability, which allows the agent to flexibly switch to effective alternative maneuvers.
AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models
Yeh, Cheng-Kai, Lee, Hsing-Wang, Kuo, Chung-Hung, Huang, Hen-Hsen
Abstraction--the ability to recognize and distill essential computational patterns from complex problem statements--is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. Experimental results demonstrate that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.
First Order Model-Based RL through Decoupled Backpropagation
Amigo, Joseph, Khorrambakht, Rooholla, Chane-Sane, Elliot, Mansard, Nicolas, Righetti, Ludovic
There is growing interest in reinforcement learning (RL) methods that leverage the simulator's derivatives to improve learning efficiency. While early gradient-based approaches have demonstrated superior performance compared to derivative-free methods, accessing simulator gradients is often impractical due to their implementation cost or unavailability. Model-based RL (MBRL) can approximate these gradients via learned dynamics models, but the solver efficiency suffers from compounding prediction errors during training rollouts, which can degrade policy performance. We propose an approach that decouples trajectory generation from gradient computation: trajectories are unrolled using a simulator, while gradients are computed via backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization, even when simulator gradients are unavailable, as well as learning a critic from simulation rollouts, which is more accurate. Our method achieves the sample efficiency and speed of specialized optimizers such as SHAC, while maintaining the generality of standard approaches like PPO and avoiding ill behaviors observed in other first-order MBRL methods. We empirically validate our algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot, across both quadrupedal and bipedal locomotion tasks.
HITTER: A HumanoId Table TEnnis Robot via Hierarchical Planning and Learning
Su, Zhi, Zhang, Bike, Rahmanian, Nima, Gao, Yuman, Liao, Qiayuan, Regan, Caitlin, Sreenath, Koushil, Sastry, S. Shankar
Our system enables both humanoid-humanoid (left) and humanoid-human (right) matches, achieving rallies of up to 106 consecutive shots against a human opponent. Abstract -- Humanoid robots have recently achieved impressive progress in locomotion and whole-body control, yet they remain constrained in tasks that demand rapid interaction with dynamic environments through manipulation. T able tennis exemplifies such a challenge: with ball speeds exceeding 5 m/s, players must perceive, predict, and act within sub-second reaction times, requiring both agility and precision. T o address this, we present a hierarchical framework for humanoid table tennis that integrates a model-based planner for ball trajectory prediction and racket target planning with a reinforcement learning-based whole-body controller . The planner determines striking position, velocity and timing, while the controller generates coordinated arm and leg motions that mimic human strikes and maintain stability and agility across consecutive rallies. Moreover, to encourage natural movements, human motion references are incorporated during training. We validate our system on a general-purpose humanoid robot, achieving up to 106 consecutive shots with a human opponent and sustained exchanges against another humanoid.