Goto

Collaborating Authors

 Reinforcement Learning


Feasibility-Guided Fair Adaptive Offline Reinforcement Learning for Medicaid Care Management

arXiv.org Artificial Intelligence

Decision support for care coordination can benefit from offline RL, yet concerns about safety and equity limit deployment. We build on recent safety-aware (e.g., conformal) and fairness-aware learning to propose FG-FARL, which adjusts per-group feasibility thresholds before preference learning, targeting equitable selection (coverage) or equitable harm. Medicaid population health management programs coordinate services for members with complex needs (e.g., chronic conditions, behavioral health, social risks). Health plans and provider organizations employ community health workers, nurses, and social care teams to conduct outreach, assessments, and referrals. Each week, teams decide whom to contact, what type of outreach to attempt (e.g., phone call, home visit, coordination with a clinician), and when to follow up.


Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

arXiv.org Artificial Intelligence

In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/


Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning

arXiv.org Artificial Intelligence

Constrained Reinforcement Learning (RL) aims to maximize the return while adhering to predefined constraint limits, which represent domain-specific safety requirements. In continuous control settings, where learning agents govern system actions, balancing the trade-off between reward maximization and constraint satisfaction remains a significant challenge. Policy optimization methods often exhibit instability near constraint boundaries, resulting in suboptimal training performance. To address this issue, we introduce a novel approach that integrates an adaptive incentive mechanism in addition to the reward structure to stay within the constraint bound before approaching the constraint boundary. Building on this insight, we propose Incrementally Penalized Proximal Policy Optimization (IP3O), a practical algorithm that enforces a progressively increasing penalty to stabilize training dynamics. Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms. Furthermore, we provide theoretical guarantees by deriving a bound on the worst-case error of the optimality achieved by our algorithm.


Narrative-Guided Reinforcement Learning: A Platform for Studying Language Model Influence on Decision Making

arXiv.org Machine Learning

We present a preliminary experimental platform that explores how narrative elements might shape AI decision-making by combining reinforcement learning (RL) with language model reasoning. While AI systems can now both make decisions and engage in narrative reasoning, these capabilities have mostly been studied separately. Our platform attempts to bridge this gap using a dual-system architecture to examine how narrative frameworks could influence reward-based learning. The system comprises a reinforcement learning policy that suggests actions based on past experience, and a language model that processes these suggestions through different narrative frameworks to guide decisions. This setup enables initial experimentation with narrative elements while maintaining consistent environment and reward structures. We implement this architecture in a configurable gridworld environment, where agents receive both policy suggestions and information about their surroundings. The platform's modular design facilitates controlled testing of environmental complexity, narrative parameters, and the interaction between reinforcement learning and narrative-based decisions. Our logging system captures basic decision metrics, from RL policy values to language model reasoning to action selection patterns. While preliminary, this implementation provides a foundation for studying how different narrative frameworks might affect reward-based decisions and exploring potential interactions between optimization-based learning and symbolic reasoning in AI systems.


Data-driven generative simulation of SDEs using diffusion models

arXiv.org Machine Learning

This paper introduces a new approach to generating sample paths of unknown stochastic differential equations (SDEs) using diffusion models, a class of generative AI models commonly employed in image and video applications. Unlike the traditional Monte Carlo methods for simulating SDEs, which require explicit specifications of the drift and diffusion coefficients, our method takes a model-free, data-driven approach. Given a finite set of sample paths from an SDE, we utilize conditional diffusion models to generate new, synthetic paths of the same SDE. To demonstrate the effectiveness of our approach, we conduct a simulation experiment to compare our method with alternative benchmark ones including neural SDEs. Furthermore, in an empirical study we leverage these synthetically generated sample paths to enhance the performance of reinforcement learning algorithms for continuous-time mean-variance portfolio selection, hinting promising applications of diffusion models in financial analysis and decision-making.


Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

arXiv.org Artificial Intelligence

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale-up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogenous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.


A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator

arXiv.org Artificial Intelligence

Autonomous-driving research has recently embraced deep Reinforcement Learning (RL) as a promising framework for data-driven decision making, yet a clear picture of how these algorithms are currently employed, benchmarked and evaluated is still missing. This survey fills that gap by systematically analysing around 100 peer-reviewed papers that train, test or validate RL policies inside the open-source CARLA simulator. We first categorize the literature by algorithmic family model-free, model-based, hierarchical, and hybrid and quantify their prevalence, highlighting that more than 80% of existing studies still rely on model-free methods such as DQN, PPO and SAC. Next, we explain the diverse state, action and reward formulations adopted across works, illustrating how choices of sensor modality (RGB, LiDAR, BEV, semantic maps, and carla kinematics states), control abstraction (discrete vs. continuous) and reward shaping are used across various literature. We also consolidate the evaluation landscape by listing the most common metrics (success rate, collision rate, lane deviation, driving score) and the towns, scenarios and traffic configurations used in CARLA benchmarks. Persistent challenges including sparse rewards, sim-to-real transfer, safety guarantees and limited behaviour diversity are distilled into a set of open research questions, and promising directions such as model-based RL, meta-learning and richer multi-agent simulations are outlined. By providing a unified taxonomy, quantitative statistics and a critical discussion of limitations, this review aims to serve both as a reference for newcomers and as a roadmap for advancing RL-based autonomous driving toward real-world deployment.


Quadrotor Navigation using Reinforcement Learning with Privileged Information

arXiv.org Artificial Intelligence

This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.


RAPID Quantum Detection and Demodulation of Covert Communications: Breaking the Noise Limit with Solid-State Spin Sensors

arXiv.org Artificial Intelligence

We introduce a comprehensive framework for the detection and demodulation of covert electromagnetic signals using solid-state spin sensors. Our approach, named RAPID, is a two-stage hybrid strategy that leverages nitrogen-vacancy (NV) centers to operate below the classical noise floor employing a robust adaptive policy via imitation and distillation. We first formulate the joint detection and estimation task as a unified stochastic optimal control problem, optimizing a composite Bayesian risk objective under realistic physical constraints. The RAPID algorithm solves this by first computing a robust, non-adaptive baseline protocol grounded in the quantum Fisher information matrix (QFIM), and then using this baseline to warm-start an online, adaptive policy learned via deep reinforcement learning (Soft Actor-Critic). This method dynamically optimizes control pulses, interrogation times, and measurement bases to maximize information gain while actively suppressing non-Markovian noise and decoherence. Numerical simulations demonstrate that the protocol achieves a significant sensitivity gain over static methods, maintains high estimation precision in correlated noise environments, and, when applied to sensor arrays, enables coherent quantum beamforming that achieves Heisenberg-like scaling in precision. This work establishes a theoretically rigorous and practically viable pathway for deploying quantum sensors in security-critical applications such as electronic warfare and covert surveillance.


Reward function compression facilitates goal-dependent reinforcement learning

arXiv.org Artificial Intelligence

Reinforcement learning agents learn from rewards, but humans can uniquely assign value to novel, abstract outcomes in a goal-dependent manner. However, this flexibility is cognitively costly, making learning less efficient. Here, we propose that goal-dependent learning is initially supported by a capacity-limited working memory system. With consistent experience, learners create a "compressed" reward function (a simplified rule defining the goal) which is then transferred to long-term memory and applied automatically upon receiving feedback. This process frees up working memory resources, boosting learning efficiency. We test this theory across six experiments. Consistent with our predictions, our findings demonstrate that learning is parametrically impaired by the size of the goal space, but improves when the goal space structure allows for compression. We also find faster reward processing to correlate with better learning performance, supporting the idea that as goal valuation becomes more automatic, more resources are available for learning. We leverage computational modeling to support this interpretation. Our work suggests that efficient goal-directed learning relies on compressing complex goal information into a stable reward function, shedding light on the cognitive mechanisms of human motivation. These findings generate new insights into the neuroscience of intrinsic motivation and could help improve behavioral techniques that support people in achieving their goals.