Reinforcement Learning
Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning
Jayasinghe, Oshada, Choudhury, Farhana, Tanin, Egemen, Karunasekera, Shanika
With travel demand higher than ever, traffic congestion has become a major concern in most urban areas. Allocating space for on-street parking further hinders traffic flow by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized by dynamically configuring on-street parking spaces. To that end, we formulate dynamic on-street parking space configuration as an optimization problem and, given the nature of the problem, follow a data-driven approach. Our proposed solution comprises a two-layer multi-agent reinforcement learning framework that is inherently scalable to large road networks. The lane-level agents decide the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture that uses long short-term memory networks and graph attention networks to capture the spatio-temporal correlations in the problem. The block-level agents control the actions of the lane-level agents and maintain a sufficient level of parking around the block. We conduct a comprehensive set of experiments using SUMO on both synthetic data and real-world data from the city of Melbourne. Our experiments show that the proposed framework can significantly reduce the average travel time loss of vehicles, by up to 47%, with a negligible increase in the walking distance for parking.
Synthetic Error Injection Fails to Elicit Self-Correction In Language Models
Wu, David X., Kapur, Shreyas, Sahai, Anant, Russell, Stuart
Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.
Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection
This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose CT-RS-q, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.
FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data
Qiao, Nan, Yue, Sheng, Ren, Ju, Zhang, Yaoxue
Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Despite some progress, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, when the logging behaviors (offline data) are collected by policies of varying quality across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a vote mechanism to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. In addition, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
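For readers unfamiliar with advantage-weighted regression, the sketch below shows the generic AWR idea that the abstract builds on: each logged action's log-likelihood is weighted by the exponentiated advantage, so higher-advantage actions dominate the policy update. This is not FOVA's objective, which the abstract does not spell out. The function names, the temperature `beta`, and the clipping value `max_weight` are illustrative assumptions.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Return exp(A / beta) for each advantage A, clipped at max_weight.

    beta and max_weight are hypothetical hyperparameters chosen for
    illustration; they do not come from the abstract.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood averaged over a batch of logged actions."""
    weights = awr_weights(advantages, beta, max_weight)
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)
```

A smaller `beta` concentrates the update on the highest-advantage actions, while a larger `beta` moves the loss toward plain behavior cloning.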
Improved Training Mechanism for Reinforcement Learning via Online Model Selection
We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent with the right configuration. Our goal is to establish the improved efficiency and performance gains achieved by integrating online model selection methods into reinforcement learning training procedures. We examine the theoretical characterizations that are effective for identifying the right configuration in practice, and address three practical criteria from a theoretical perspective: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. Our theoretical results are accompanied by empirical evidence from various model selection tasks in reinforcement learning, including neural architecture selection, step-size selection, and self model selection.
How Market Volatility Shapes Algorithmic Collusion: A Comparative Analysis of Learning-Based Pricing Algorithms
Sravon, Aheer, Ibrahim, Md., Mazumder, Devdyuti, Aziz, Ridwan Al
The rapid diffusion of autonomous pricing algorithms has reshaped competitive dynamics in digital marketplaces, raising important economic and policy questions about their potential for collusive behavior. A substantial body of research demonstrates that reinforcement-learning (RL) agents can autonomously coordinate on supracompetitive outcomes even in the absence of explicit communication. Foundational contributions--including the work in [1]--show that algorithmic agents may systematically learn tacitly collusive strategies across multiple market structures, with Q-learning in particular generating prices above competitive levels in Logit, Hotelling, and linear demand environments. These concerns are reinforced by seminal work such as [2], which demonstrates that simple Q-learning agents reliably sustain collusion through structured punishment and reward cycles in repeated pricing games, as well as by [3], who document how algorithmic systems may generate sudden price spikes in response to high-impact, low-probability events (HILP), unintentionally coordinating on elevated prices. The study of [4] establishes a robust empirical and computational foundation demonstrating that pricing algorithms may autonomously learn to collude. A complementary line of research focuses specifically on Q-learning's capacity to learn collusive equilibria, as documented in papers [2], [5], and [6]. These findings are consistent with the theoretical properties of Q-learning established by [7], who show that the algorithm incrementally learns long-run discounted value-maximizing strategies in sequential decision problems. More recent studies further reveal that deep reinforcement-learning (deep RL) algorithms--including DDQN and SAC--may also display collusive tendencies. For instance, [8] documents that modern RL systems can coordinate on higher-than-competitive prices under a variety of market configurations.
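The Q-learning agents discussed above rely on the standard tabular update, which moves the value of a state-action pair toward the observed reward plus the discounted best value of the next state. The sketch below is a minimal illustration of that rule only. The state and action labels, the learning rate `alpha`, and the discount `gamma` are assumptions, not settings from the cited studies.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning step and return the updated value.

    q maps (state, action) pairs to value estimates; missing entries
    default to 0.0 when q is a defaultdict(float).
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical pricing example: states and actions are price levels.
q_table = defaultdict(float)
price_levels = ["low", "high"]
q_update(q_table, "low", "high", 1.0, "high", price_levels)
```

In a repeated pricing game, the state would typically encode the prices both sellers charged in the previous period, which is what lets the learned values depend on a rival's past behavior.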
Reinforcement Learning for Robotic Safe Control with Force Sensing
Lin, Nan, Zhang, Linrui, Chen, Yuxuan, Chen, Zhenrui, Zhu, Yujun, Chen, Ruoxi, Wu, Peichen, Chen, Xiaoping
For complicated manipulation tasks in unstructured environments, traditional hand-coded methods are ineffective, while reinforcement learning can provide more general and useful policies. Although reinforcement learning can achieve impressive results, its stability and reliability are hard to guarantee, which poses potential safety threats. Moreover, transfer from simulation to the real world can lead to unpredictable situations. To enhance the safety and reliability of robots, we introduce force and haptic perception into reinforcement learning. We demonstrate that the force-based reinforcement learning method adapts better to the environment, especially in sim-to-real transfer. Experimental results show that, in an object-pushing task, our strategy is safer and more efficient in both simulation and the real world, and thus holds promise for a wide variety of robotic applications.
Learning Massively Multitask World Models for Continuous Control
Hansen, Nicklas, Su, Hao, Wang, Xiaolong
General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.
Multi-agent In-context Coordination via Decentralized Memory Retrieval
Jiang, Tao, Lin, Zichuan, Li, Lihe, Li, Yi-Chen, Guan, Cong, Yuan, Lei, Zhang, Zongzhang, Yu, Yang, Ye, Deheng
Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at https://github.com/LAMDA-RL/MAICC.
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents
Lee, Jane H., Saglam, Baturay, Pougkakiotis, Spyridon, Karbasi, Amin, Kalogerias, Dionysis
Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.
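To make the notion of an optimized certainty equivalent concrete, recall the standard definition OCE_u(X) = sup over eta of { eta + E[u(X - eta)] } for a concave utility u. Choosing u(t) = min(t, 0) / alpha recovers the lower-tail conditional value-at-risk (CVaR) of a reward X. The sketch below evaluates that special case on an empirical sample. It illustrates the general definition only, not the paper's algorithm, and the function name and sample values are assumptions.

```python
def cvar_via_oce(rewards, alpha):
    """Empirical lower-tail CVaR at level alpha, computed as an OCE.

    Maximizes eta - E[max(eta - X, 0)] / alpha over eta. The objective is
    concave and piecewise linear with kinks at the sample values, so it is
    enough to evaluate it at each sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(rewards)

    def objective(eta):
        shortfall = sum(max(eta - x, 0.0) for x in rewards) / n
        return eta - shortfall / alpha

    return max(objective(eta) for eta in rewards)


# Hypothetical returns: with alpha = 0.2 the result averages the worst 20%.
sample_returns = [float(v) for v in range(1, 11)]
tail_value = cvar_via_oce(sample_returns, alpha=0.2)
```

With `alpha = 1.0` the same function returns the plain sample mean, which shows how the risk-neutral expected-reward objective sits inside the OCE family as a limiting case.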