Goto

Collaborating Authors

 Reinforcement Learning


OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement learning algorithms typically utilize an interactive simulator (i.e., environment) with a predefined reward function for policy training. Developing such simulators and manually defining reward functions, however, is often time-consuming and labor-intensive. To address this, we propose an Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework, to emulate environmental dynamics and reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without further interaction with the real environment. Additionally, we introduce OffSim$^+$, an extension that incorporates a marginal reward for multi-dataset settings to enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.


Procedural Game Level Design with Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Abstract--Procedural content generation (PCG) has become an increasingly popular technique in game development, allowing developers to generate dynamic, replayable, and scalable environments with reduced manual effort. In this study, a novel method for procedural level design using Deep Reinforcement Learning (DRL) within a Unity-based 3D environment is propesed. The system comprises two agents: a hummingbird agent, acting as a solver, and a floating island agent, responsible for generating and placing collectible objects (flowers) on the terrain in a realistic and context-aware manner . The hummingbird is trained using the Proximal Policy Optimization (PPO) algorithm from the Unity ML-Agents toolkit. It learns to navigate through the terrain efficiently, locate flowers, and collect them while adapting to the ever-changing procedural layout of the island. The island agent is also trained using the Proximal Policy Optimization (PPO) algorithm. It learns to generate flower layouts based on observed obstacle positions, the hummingbird's initial state, and performance feedback from previous episodes. The interaction between these agents leads to emergent behavior and robust generalization across various environmental configurations. The results demonstrate that the approach not only produces effective and efficient agent behavior but also opens up new opportunities for autonomous game level design driven by machine learning. This work highlights the potential of DRL in enabling intelligent agents to both generate and solve content in virtual environments, pushing the boundaries of what AI can contribute to creative game development processes.


VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

arXiv.org Artificial Intelligence

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2$\times$ speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.


MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning

arXiv.org Artificial Intelligence

We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which push the policy toward actions with high target-domain Q-values, rather than high source domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.


Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information

arXiv.org Artificial Intelligence

Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real-world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi-Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high-level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.


Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

arXiv.org Artificial Intelligence

Autonomous vehicles (AV) offer a cost-effective solution for scientific missions such as underwater tracking. Recently, reinforcement learning (RL) has emerged as a powerful method for controlling AVs in complex marine environments. However, scaling these techniques to a fleet--essential for multi-target tracking or targets with rapid, unpredictable motion--presents significant computational challenges. Multi-Agent Reinforcement Learning (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide 100x faster-than-real-time single-robot simulations, they offer no significant speedup for multi-vehicle scenarios, making MARL training impractical. To address these limitations, we propose an iterative distillation method that transfers high-fidelity simulations into a simplified, GPU-accelerated environment while preserving high-level dynamics. This approach achieves up to a 30,000x speedup over Gazebo through parallelization, enabling efficient training via end-to-end GPU acceleration. Additionally, we introduce a novel Transformer-based architecture (TransfMAPPO) that learns multi-agent policies invariant to the number of agents and targets, significantly improving sample efficiency. Following large-scale curriculum learning conducted entirely on GPU, we perform extensive evaluations in Gazebo, demonstrating that our method maintains tracking errors below 5 meters over extended durations, even in the presence of multiple fast-moving targets. This work bridges the gap between large-scale MARL training and high-fidelity deployment, providing a scalable framework for autonomous fleet control in real-world sea missions.


FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement

arXiv.org Artificial Intelligence

Quantum computing has the potential to revolutionize fields like quantum optimization and quantum machine learning. However, current quantum devices are hindered by noise, reducing their reliability. A key challenge in gate-based quantum computing is improving the reliability of quantum circuits, measured by process fidelity, during the transpilation process, particularly in the routing stage. In this paper, we address the Fidelity Maximization in Routing Stage (FMRS) problem by introducing FIDDLE, a novel learning framework comprising two modules: a Gaussian Process-based surrogate model to estimate process fidelity with limited training samples and a reinforcement learning module to optimize routing. Our approach is the first to directly maximize process fidelity, outperforming traditional methods that rely on indirect metrics such as circuit depth or gate count. We rigorously evaluate FIDDLE by comparing it with state-of-the-art fidelity estimation techniques and routing optimization methods. The results demonstrate that our proposed surrogate model is able to provide a better estimation on the process fidelity compared to existing learning techniques, and our end-to-end framework significantly improves the process fidelity of quantum circuits across various noise models.


RLAF: Reinforcement Learning from Automaton Feedback

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories that are used to learn a reward function, eliminating the need for manual reward engineering. Our framework introduces a static approach that uses the learned reward function directly for policy optimization and a dynamic approach that involves continuous refining of the reward function and policy through iterative updates until convergence. Our experiments in both discrete and continuous environments demonstrate that our approach enables the RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. We also provide a convergence guarantee showing that under standard assumptions our automaton-guided preference-based framework learns a policy that is near-optimal with respect to the true non-Markovian objective.


Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

arXiv.org Artificial Intelligence

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.


HEADER: Hierarchical Robot Exploration via Attention-Based Deep Reinforcement Learning with Expert-Guided Reward

arXiv.org Artificial Intelligence

Abstract--This work pushes the boundaries of learning-based methods in autonomous robot exploration in terms of environmental scale and exploration efficiency. HEADER follows existing conventional methods to construct hierarchical representations for the robot belief/map, but further designs a novel community-based algorithm to construct and update a global graph, which remains fully incremental, shape-adaptive, and operates with linear complexity. Building upon attention-based networks, our planner finely reasons about the nearby belief within the local range while coarsely leveraging distant information at the global scale, enabling next-best-viewpoint decisions that consider multi-scale spatial dependencies. Beyond novel map representation, we introduce a parameter-free privileged reward that significantly improves model performance and produces near-optimal exploration behaviors, by avoiding training objective bias caused by handcrafted reward shaping. In simulated challenging, large-scale exploration scenarios, HEADER demonstrates better scalability than most existing learning and non-learning methods, while achieving a significant improvement in exploration efficiency (up to 20%) over state-of-the-art baselines. N autonomous exploration, a mobile robot is tasked with exploring and mapping an unknown environment as fast as possible. By planning and executing its exploration path, the robot classifies unknown areas into free or obstacle areas based on its accumulated sensor measurements. In this work, we focus on tasks where a ground robot is equipped with an omnidirectional 3D LiDAR to obtain long-range, low-noise, and dense point cloud measurements. Recent advancements in LiDAR odometry have enabled accurate and robust localization and mapping in large-scale environments [1]-[3], allowing recent planners to focus on exploring the environment without concerns about mapping/localization accuracy [4]- [9]. Despite this, few planners support exploration at large scale in real-world environments [5], [10], mainly due to the complexity that comes with long-term, real-time path planning requirements. That is, to achieve efficient exploration, the planner must actively react to belief and map updates at a high frequency by (re-)reasoning about the full partial belief, to replan a long-term, non-myopic exploration path. Authors are with the Department of Mechanical Engineering, College of Design and Engineering, National University of Singapore. Example hierarchical graph constructed by HEADER during its autonomous exploration of our campus.