Goto

Collaborating Authors

 Reinforcement Learning


Towards Agent-based Test Support Systems: An Unsupervised Environment Design Approach

arXiv.org Artificial Intelligence

Modal testing plays a critical role in structural analysis by providing essential insights into dynamic behaviour across a wide range of engineering industries. In practice, designing an effective modal test campaign involves complex experimental planning, comprising a series of interdependent decisions that significantly influence the final test outcome. Traditional approaches to test design are typically static-focusing only on global tests without accounting for evolving test campaign parameters or the impact of such changes on previously established decisions, such as sensor configurations, which have been found to significantly influence test outcomes. These rigid methodologies often compromise test accuracy and adaptability. To address these limitations, this study introduces an agent-based decision support framework for adaptive sensor placement across dynamically changing modal test environments. The framework formulates the problem using an underspecified partially observable Markov decision process, enabling the training of a generalist reinforcement learning agent through a dual-curriculum learning strategy. A detailed case study on a steel cantilever structure demonstrates the efficacy of the proposed method in optimising sensor locations across frequency segments, validating its robustness and real-world applicability in experimental settings.


SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning

arXiv.org Artificial Intelligence

Generating physically realistic humanoid-object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy, where the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid-object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks.


Beyond Fixed Morphologies: Learning Graph Policies with Trust Region Compensation in Variable Action Spaces

arXiv.org Artificial Intelligence

Trust region-based optimization methods have become foundational reinforcement learning algorithms that offer stability and strong empirical performance in continuous control tasks. Growing interest in scalable and reusable control policies translate also in a demand for morphological generalization, the ability of control policies to cope with different kinematic structures. Graph-based policy architectures provide a natural and effective mechanism to encode such structural differences. However, while these architectures accommodate variable morphologies, the behavior of trust region methods under varying action space dimensionality remains poorly understood. To this end, we conduct a theoretical analysis of trust region-based policy optimization methods, focusing on both Trust Region Policy Optimization (TRPO) and its widely used first-order approximation, Proximal Policy Optimization (PPO). The goal is to demonstrate how varying action space dimensionality influence the optimization landscape, particularly under the constraints imposed by KL-divergence or policy clipping penalties. Complementing the theoretical insights, an empirical evaluation under morphological variation is carried out using the Gymnasium Swimmer environment. This benchmark offers a systematically controlled setting for varying the kinematic structure without altering the underlying task, making it particularly well-suited to study morphological generalization.


Physics-Informed Reward Machines

arXiv.org Artificial Intelligence

Reward machines (RMs) provide a structured way to specify non-Markovian rewards in reinforcement learning (RL), thereby improving both expressiveness and programmability. Viewed more broadly, they separate what is known about the environment, captured by the reward mechanism, from what remains unknown and must be discovered through sampling. This separation supports techniques such as counterfactual experience generation and reward shaping, which reduce sample complexity and speed up learning. We introduce physics-informed reward machines (pRMs), a symbolic machine designed to express complex learning objectives and reward structures for RL agents, thereby enabling more programmable, expressive, and efficient learning. We present RL algorithms capable of exploiting pRMs via counterfactual experiences and reward shaping. Our experimental results show that these techniques accelerate reward acquisition during the training phases of RL. We demonstrate the expressiveness and effectiveness of pRMs through experiments in both finite and continuous physical environments, illustrating that incorporating pRMs significantly improves learning efficiency across several control tasks.


CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search

arXiv.org Artificial Intelligence

Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN


Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning

arXiv.org Artificial Intelligence

Imitation learning is a promising approach for enabling generalist capabilities in humanoid robots, but its scaling is fundamentally constrained by the scarcity of high-quality expert demonstrations. This limitation can be mitigated by leveraging suboptimal, open-ended play data, often easier to collect and offering greater diversity. This work builds upon recent advances in generative modeling, specifically Flow Matching, an alternative to Diffusion models. We introduce a method for estimating the minimum or maximum of the learned distribution by leveraging the unique properties of Flow Matching, namely, deterministic transport and support for arbitrary source distributions. We apply this method to develop several goal-conditioned imitation and reinforcement learning algorithms based on Flow Matching, where policies are conditioned on both current and goal observations. We explore and compare different architectural configurations by combining core components, such as critic, planner, actor, or world model, in various ways. We evaluated our agents on the OGBench benchmark and analyzed how different demonstration behaviors during data collection affect performance in a 2D non-prehensile pushing task. Furthermore, we validated our approach on real hardware by deploying it on the Talos humanoid robot to perform complex manipulation tasks based on high-dimensional image observations, featuring a sequence of pick-and-place and articulated object manipulation in a realistic kitchen environment. Experimental videos and code are available at: https://hucebot.github.io/extremum_flow_matching_website/


Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

arXiv.org Artificial Intelligence

Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is we demonstrate that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at \href{https://github.com/yyht/openrlhf_async_pipline}{https://github.com/yyht/openrlhf\_async\_pipline}.


An Improved Multi-Agent Algorithm for Cooperative and Competitive Environments by Identifying and Encouraging Cooperation among Agents

arXiv.org Artificial Intelligence

We propose an improved algorithm by identifying and encouraging cooperative behavior in multi-agent environments. First, we analyze the shortcomings of existing algorithms in addressing multi-agent reinforcement learning problems. Then, based on the existing algorithm MADDPG, we introduce a new parameter to increase the reward that an agent can obtain when cooperative behavior among agents is identified. Finally, we compare our improved algorithm with MADDPG in environments from PettingZoo. The results show that the new algorithm helps agents achieve both higher team rewards and individual rewards.


Efficient Environment Design for Multi-Robot Navigation via Continuous Control

arXiv.org Artificial Intelligence

Multi-robot navigation and path planning in continuous state and action spaces with uncertain environments remains an open challenge. Deep Reinforcement Learning (RL) is one of the most popular paradigms for solving this task, but its real-world application has been limited due to sample inefficiency and long training periods. Moreover, the existing works using RL for multi-robot navigation lack formal guarantees while designing the environment. In this paper, we introduce an efficient and highly customizable environment for continuous-control multi-robot navigation, where the robots must visit a set of regions of interest (ROIs) by following the shortest paths. The task is formally modeled as a Markov Decision Process (MDP). We describe the multi-robot navigation task as an optimization problem and relate it to finding an optimal policy for the MDP. We crafted several variations of the environment and measured the performance using both gradient and non-gradient based RL methods: A2C, PPO, TRPO, TQC, CrossQ and ARS. To show real-world applicability, we deployed our environment to a 3-D agricultural field with uncertainties using the CoppeliaSim robot simulator and measured the robustness by running inference on the learned models. We believe our work will guide the researchers on how to develop MDP-based environments that are applicable to real-world systems and solve them using the existing state-of-the-art RL methods with limited resources and within reasonable time periods.


No More Marching: Learning Humanoid Locomotion for Short-Range SE(2) Targets

arXiv.org Artificial Intelligence

Humanoids operating in real-world workspaces must frequently execute task-driven, short-range movements to SE(2) target poses. To be practical, these transitions must be fast, robust, and energy efficient. While learning-based locomotion has made significant progress, most existing methods optimize for velocity-tracking rather than direct pose reaching, resulting in inefficient, marching-style behavior when applied to short-range tasks. In this work, we develop a reinforcement learning approach that directly optimizes humanoid locomotion for SE(2) targets. Central to this approach is a new constellation-based reward function that encourages natural and efficient target-oriented movement. To evaluate performance, we introduce a benchmarking framework that measures energy consumption, time-to-target, and footstep count on a distribution of SE(2) goals. Our results show that the proposed approach consistently outperforms standard methods and enables successful transfer from simulation to hardware, highlighting the importance of targeted reward design for practical short-range humanoid locomotion.