
Collaborating Authors

Zhou, Hongyi


Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

arXiv.org Artificial Intelligence

Unlike traditional methods that focus on evaluating single state-action pairs or apply action chunking in the actor network, this approach feeds chunked actions directly into the critic. Leveraging the Transformer's strength in processing sequential data, the proposed architecture achieves more robust value estimation. Empirical evaluations demonstrate that this method leads to efficient and stable training, particularly excelling in environments with sparse rewards or multi-phase tasks.

Contributions:
1. We present a novel critic architecture for SAC that leverages Transformers to process sequential information, resulting in more accurate value estimation. Context: Transformer-Based Critic Network.
2. We introduce a method for incorporating N-step returns into the critic network in a stable and efficient manner, effectively mitigating the variance and importance-sampling challenges commonly associated with N-step returns. Context: Stable Integration of N-Step Returns.
3. We shift action chunking from the actor to the critic, demonstrating that enhanced temporal reasoning at the critic level, beyond traditional actor-side exploration, drives performance improvements in sparse and multi-phase tasks. Context: Unlike previous approaches that focus on actor-side chunking for exploration, our Transformer-based critic network produces a smooth value surface that is highly responsive to dataset variations, eliminating the need for additional exploration enhancements.
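The abstract does not spell out the N-step target itself. As a minimal, generic sketch (function name and hyperparameters are illustrative, not taken from the paper), a discounted N-step return with a bootstrapped critic value looks like:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=5):
    """Discounted n-step return: sum_{k<n} gamma^k * r_k + gamma^n * V(s_n).

    `rewards` holds the first n rewards after the current state and
    `bootstrap_value` is the critic's estimate at the n-th next state.
    """
    assert len(rewards) == n
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** n) * bootstrap_value
```

For example, `n_step_return([1.0, 1.0, 1.0], 0.0, gamma=0.5, n=3)` yields 1 + 0.5 + 0.25 = 1.75. Evaluating the whole action chunk in the critic is what lets such multi-step targets be used without the per-step importance corrections that off-policy N-step methods usually require, which is the stability issue the second contribution addresses.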


IRIS: An Immersive Robot Interaction System

arXiv.org Artificial Intelligence

This paper introduces IRIS, an Immersive Robot Interaction System leveraging Extended Reality (XR), designed for robot data collection and interaction across multiple simulators, benchmarks, and real-world scenarios. While existing XR-based data collection systems provide efficient and intuitive solutions for large-scale data collection, they are often challenging to reproduce and reuse. This limitation arises because current systems are highly tailored to simulator-specific use cases and environments. IRIS is a novel, easily extendable framework that already supports multiple simulators, benchmarks, and even headsets. Furthermore, IRIS is able to include additional information from real-world sensors, such as point clouds captured through depth cameras. A unified scene specification is generated directly from simulators or real-world sensors and transmitted to XR headsets, creating identical scenes in XR. This specification allows IRIS to support any of the objects, assets, and robots provided by the simulators. In addition, IRIS introduces shared spatial anchors and a robust communication protocol that links simulations between multiple XR headsets. This feature enables multiple XR headsets to share a synchronized scene, facilitating collaborative and multi-user data collection. IRIS can be deployed on any device that supports the Unity Framework, encompassing the vast majority of commercially available headsets. In this work, IRIS was deployed and tested on the Meta Quest 3 and the HoloLens 2. IRIS showcased its versatility across a wide range of real-world and simulated scenarios, using current popular robot simulators such as MuJoCo, IsaacSim, CoppeliaSim, and Genesis. In addition, a user study evaluates IRIS on a data collection task for the LIBERO benchmark. The study shows that IRIS significantly outperforms the baseline in both objective and subjective metrics.
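The paper's unified scene specification is not reproduced in the abstract. Purely as an illustration of the idea, a simulator-agnostic scene description serialized for transmission to a headset could be a plain JSON document; every field name below is hypothetical, not IRIS's actual wire format:

```python
import json

# Hypothetical, simulator-agnostic scene entry: one rigid body with a
# mesh asset reference and pose, plus one robot with joint positions.
# Field names are illustrative only.
scene_spec = {
    "objects": [
        {"name": "table", "asset": "assets/table.obj",
         "position": [0.0, 0.0, 0.4], "quaternion": [1, 0, 0, 0]},
    ],
    "robots": [
        {"name": "panda", "joint_positions": [0.0] * 7},
    ],
}

payload = json.dumps(scene_spec)   # serialize for transmission to the headset
restored = json.loads(payload)     # headset-side parse recreates the scene
```

A text-based spec like this is what makes the same scene reproducible across simulators and headsets: any backend that can emit the document, and any frontend that can parse it, interoperates.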


X-IL: Exploring the Design Space of Imitation Learning Policies

arXiv.org Artificial Intelligence

Designing modern imitation learning (IL) policies requires making numerous decisions, including the selection of feature encoding, architecture, policy representation, and more. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies. In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework's modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., Score-matching, Flow-matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks. Our experiments demonstrate not only significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves as both a practical reference for practitioners and a foundation for guiding future research in imitation learning.
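The seamless swapping of policy components described above is typically realized with a small component registry keyed by config strings. A minimal sketch follows; the class and function names are hypothetical, not X-IL's actual API:

```python
BACKBONES = {}

def register_backbone(name):
    """Decorator that records a backbone class under a string key."""
    def wrap(cls):
        BACKBONES[name] = cls
        return cls
    return wrap

@register_backbone("transformer")
class TransformerBackbone:
    def __init__(self, dim):
        self.dim = dim

@register_backbone("xlstm")
class XLSTMBackbone:
    def __init__(self, dim):
        self.dim = dim

def build_policy(backbone: str, dim: int = 256):
    """Instantiate a policy backbone purely from a config string,
    so sweeping the design space is a matter of changing one field."""
    return BACKBONES[backbone](dim)
```

With this pattern, a design-space sweep over backbones (Transformer, Mamba, xLSTM) or policy representations reduces to iterating over registry keys in a config file.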


Towards Fusing Point Cloud and Visual Representations for Imitation Learning

arXiv.org Artificial Intelligence

Learning for manipulation requires using policies that have access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structures, making them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches for fusing both modalities assign 2D image features to point clouds. However, such approaches often lose global contextual information from the original images. In this work, we propose a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments on the challenging RoboCasa benchmark, we demonstrate the limitations of relying on either modality alone and show that our method achieves state-of-the-art performance across all tasks.
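Adaptive layer-norm conditioning, as named above, modulates normalized point-cloud features with a scale and shift regressed from the image tokens. A minimal numpy sketch (the weight matrices here are stand-ins for learned projections, and the pooling of image tokens into one vector is an assumption):

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive LayerNorm: normalize x per token, then modulate with a
    scale/shift predicted from the conditioning vector `cond`.

    x:       (tokens, dim)     point-cloud features
    cond:    (cond_dim,)       pooled image-token embedding
    w_scale: (cond_dim, dim)   stand-in for a learned projection
    w_shift: (cond_dim, dim)   stand-in for a learned projection
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    scale = cond @ w_scale          # (dim,) per-feature scale
    shift = cond @ w_shift          # (dim,) per-feature shift
    return (1.0 + scale) * x_hat + shift
```

Because the scale and shift apply to every point token, the global image context reaches the entire point cloud, which is exactly what per-point feature painting tends to lose.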


Uniform-in-time weak propagation of chaos for consensus-based optimization

arXiv.org Artificial Intelligence

We study the uniform-in-time weak propagation of chaos for the consensus-based optimization (CBO) method on a bounded search domain. We apply the methodology for studying long-time behaviors of interacting particle systems developed in the work of Delarue and Tse (ArXiv:2104.14973). Our work shows that the weak error has order $O(N^{-1})$ uniformly in time, where $N$ denotes the number of particles. The main strategy behind the proofs is the decomposition of the weak error using the linearized Fokker-Planck equations and the exponential decay of their Sobolev norms. Consequently, our result leads to the joint convergence of the empirical distribution of the CBO particle system to the Dirac-delta distribution at the global minimizer in population size and running time in Wasserstein-type metrics.
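The particle system under study is the standard CBO dynamics; the paper analyzes its mean-field limit rather than the algorithm itself. For readers unfamiliar with CBO, a minimal numpy sketch of the usual Euler-Maruyama discretization (all hyperparameters illustrative) is:

```python
import numpy as np

def cbo_step(x, f, lam=1.0, sigma=0.3, alpha=30.0, dt=0.1, rng=None):
    """One CBO update: drift toward the Gibbs-weighted consensus point,
    plus noise scaled by each particle's distance from consensus."""
    if rng is None:
        rng = np.random.default_rng(0)
    fx = f(x)
    w = np.exp(-alpha * (fx - fx.min()))   # stabilized Gibbs weights
    m = (w * x).sum() / w.sum()            # consensus point
    noise = rng.standard_normal(x.shape)
    x_new = x - lam * dt * (x - m) + sigma * np.sqrt(dt) * np.abs(x - m) * noise
    return x_new, m

# Minimize f(x) = (x - 1)^2 with N = 50 particles on [-3, 3].
f = lambda x: (x - 1.0) ** 2
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=50)
for _ in range(200):
    x, m = cbo_step(x, f, rng=rng)
```

As the number of particles N grows, the empirical distribution of such a system approaches the mean-field law; the paper's contribution is that this weak error is O(1/N) uniformly in time, so the approximation does not degrade over long horizons.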


BMP: Bridging the Gap between B-Spline and Movement Primitives

arXiv.org Artificial Intelligence

This work introduces B-spline Movement Primitives (BMPs), a new Movement Primitive (MP) variant that leverages B-splines for motion representation. B-splines are a well-known concept in motion planning due to their ability to generate complex, smooth trajectories with only a few control points while satisfying boundary conditions, i.e., passing through a specified desired position with a desired velocity. However, current usages of B-splines tend to ignore the higher-order statistics of trajectory distributions, which limits their use in imitation learning (IL) and reinforcement learning (RL), where modeling trajectory distributions is essential. In contrast, MPs are commonly used in IL and RL for their capacity to capture trajectory likelihoods and correlations. However, MPs are limited in their ability to satisfy boundary conditions and usually need extra terms in their learning objectives to satisfy velocity constraints. By reformulating B-splines as MPs, represented through basis functions and weight parameters, BMPs combine the strengths of both approaches, allowing B-splines to capture higher-order statistics while retaining their ability to satisfy boundary conditions. Empirical results in IL and RL demonstrate that BMPs broaden the applicability of B-splines in robot learning and offer greater expressiveness compared to existing MP variants.
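Viewing a B-spline as an MP means writing the trajectory as basis functions times weights, y(t) = sum_i N_i(t) w_i. A minimal Cox-de Boor sketch with a clamped knot vector (so the curve meets its endpoint boundary conditions), not taken from the paper's implementation:

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion for the i-th degree-p B-spline basis at t."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    val = 0.0
    d = knots[i + p] - knots[i]
    if d > 0:
        val += (t - knots[i]) / d * bspline_basis(i, p - 1, t, knots)
    d = knots[i + p + 1] - knots[i + 1]
    if d > 0:
        val += (knots[i + p + 1] - t) / d * bspline_basis(i + 1, p - 1, t, knots)
    return val

def bspline_traj(t, weights, p=3):
    """Trajectory value at t in [0, 1) as a weighted basis combination."""
    n = len(weights)
    # Clamped knots: the curve starts at weights[0] and ends at weights[-1],
    # which is how the boundary conditions are met by construction.
    knots = np.concatenate([np.zeros(p), np.linspace(0, 1, n - p + 1), np.ones(p)])
    return sum(bspline_basis(i, p, t, weights and knots) * weights[i] for i in range(n)) if False else sum(bspline_basis(i, p, t, knots) * weights[i] for i in range(n))
```

Once the trajectory is linear in the weights, placing a Gaussian over the weight vector induces a distribution over trajectories with full covariance, which is the MP-style higher-order statistics the abstract refers to.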


A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust, Reliable, and Safe Learning Techniques for Real-world Robotics

arXiv.org Artificial Intelligence

Machine learning methods have had a groundbreaking impact in many application domains, but their application on real robotic platforms is still limited. Despite the many challenges associated with combining machine learning technology with robotics, robot learning remains one of the most promising directions for enhancing the capabilities of robots. When deploying learning-based approaches on real robots, extra effort is required to address the challenges posed by various real-world factors. To investigate the key factors influencing real-world deployment and to encourage original solutions from different researchers, we organized the Robot Air Hockey Challenge at the NeurIPS 2023 conference. We selected the air hockey task as a benchmark, encompassing low-level robotics problems and high-level tactics. Different from other machine learning-centric benchmarks, participants needed to tackle practical challenges in robotics, such as the sim-to-real gap, low-level control issues, safety problems, real-time requirements, and the limited availability of real-world data. Furthermore, we focused on a dynamic environment, removing the typical assumption of quasi-static motions made by other real-world benchmarks. The competition's results show that solutions combining learning-based approaches with prior knowledge outperform those relying solely on data when real-world deployment is challenging. Our ablation study reveals which real-world factors may be overlooked when building a learning-based solution. The successful real-world air hockey deployment of the best-performing agents sets the foundation for future competitions and follow-up research directions.


HIRO: Heuristics Informed Robot Online Path Planning Using Pre-computed Deterministic Roadmaps

arXiv.org Artificial Intelligence

Dividing robot environments into static and dynamic elements, we use the static part for initializing a deterministic roadmap, which provides a lower bound of the final path cost as informed heuristics for fast path-finding. These heuristics guide a search tree to explore the roadmap during runtime. The search tree examines edges using fuzzy collision checking against the dynamic environment. Finally, the search tree exploits knowledge fed back from the fuzzy collision checking module and updates the lower bound for the path cost. As we demonstrate in real-world experiments, the closed loop formed by these three components significantly accelerates the planning procedure. An additional backtracking step ensures the feasibility of the resulting paths. Experiments in simulation and the real world show that HIRO can find collision-free paths considerably faster than baseline methods with and without prior knowledge of the environment.
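The core idea above, precomputing exact costs on the static roadmap and using them as an admissible lower bound while the online search skips dynamically blocked edges, can be sketched as follows (the graph, names, and blocking mechanism are illustrative, not HIRO's implementation):

```python
import heapq

def dijkstra(adj, goal):
    """Exact cost-to-goal on the static roadmap. Since dynamic obstacles
    can only remove edges, these costs lower-bound the true online cost,
    i.e. they are an admissible heuristic."""
    dist = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def astar(adj, start, goal, h, blocked=frozenset()):
    """A* over the roadmap guided by the precomputed lower bound `h`,
    skipping edges currently blocked by dynamic obstacles."""
    pq = [(h.get(start, float("inf")), 0.0, start, [start])]
    seen = set()
    while pq:
        _, g, u, path = heapq.heappop(pq)
        if u == goal:
            return g, path
        if u in seen:
            continue
        seen.add(u)
        for v, w in adj[u]:
            if (u, v) in blocked or (v, u) in blocked:
                continue
            heapq.heappush(pq, (g + w + h.get(v, float("inf")), g + w, v, path + [v]))
    return None

# Toy undirected roadmap: A-B (1), B-C (1), A-C (3).
adj = {"A": [("B", 1.0), ("C", 3.0)],
       "B": [("A", 1.0), ("C", 1.0)],
       "C": [("B", 1.0), ("A", 3.0)]}
h = dijkstra(adj, "C")                 # static lower bounds, computed offline
cost, path = astar(adj, "A", "C", h)   # online query with no obstacles
```

When the collision checker later reports that an edge is blocked, re-running the online search with that edge in `blocked` reuses the same precomputed heuristic, which is where the speedup over planning from scratch comes from.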


TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

arXiv.org Artificial Intelligence

This work introduces a novel off-policy Reinforcement Learning (RL) algorithm that utilizes a transformer-based architecture for predicting the state-action values for a sequence of actions. These returns are effectively used to update the policy, which predicts a smooth trajectory instead of a single action at each decision step. Predicting a whole trajectory of actions is commonly done in episodic RL (ERL) (Kober & Peters, 2008) and differs conceptually from conventional step-based RL (SRL) methods like PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018a), where an action is sampled at each time step. The action selection concept in ERL is promising, as shown in recent works in RL (Otto et al., 2022; Li et al., 2024). Similar insights have emerged in the field of Imitation Learning, where predicting action sequences instead of single actions has led to great success (Zhao et al., 2023; Reuss et al., 2024). Additionally, decision-making in ERL aligns with human decision-making, where a person generally does not decide at every single time step but rather performs a whole sequence of actions to complete a task, for instance, swinging an arm to play tennis without overthinking each per-step movement. Episodic RL is a distinct family of RL that emphasizes the maximization of returns over entire episodes, typically lasting several seconds, rather than optimizing the intermediate states during environment interactions (Whitley et al., 1993; Igel, 2003; Peters & Schaal, 2008). Unlike SRL, ERL shifts the solution search from per-step actions to a parameterized trajectory space, leveraging techniques like Movement Primitives (MPs) (Schaal, 2006; Paraschos et al., 2013) for generating action sequences. This approach enables a broader exploration horizon (Kober & Peters, 2008), captures temporal and degrees of freedom (DoF) correlations (Li et al., 2024), and ensures smooth transitions between re-planning phases (Otto et al., 2023).


Variational Distillation of Diffusion Policies into Mixture of Experts

arXiv.org Artificial Intelligence

This work introduces Variational Diffusion Distillation (VDD), a novel method that distills denoising diffusion policies into Mixtures of Experts (MoE) through variational inference. Diffusion Models are the current state-of-the-art in generative modeling due to their exceptional ability to accurately learn and represent complex, multi-modal distributions. This ability allows Diffusion Models to replicate the inherent diversity in human behavior, making them the preferred models in behavior learning such as Learning from Human Demonstrations (LfD). However, diffusion models come with some drawbacks, including the intractability of likelihoods and long inference times due to their iterative sampling process. The inference times, in particular, pose a significant challenge to real-time applications such as robot control. In contrast, MoEs effectively address the aforementioned issues while retaining the ability to represent complex distributions, but are notoriously difficult to train. VDD is the first method that distills pre-trained diffusion models into MoE models, and hence combines the expressiveness of Diffusion Models with the benefits of Mixture Models. Specifically, VDD leverages a decompositional upper bound of the variational objective that allows each expert to be trained separately, resulting in a robust optimization scheme for MoEs. Across nine complex behavior learning tasks, VDD demonstrates that it is able to: i) accurately distill complex distributions learned by the diffusion model, ii) outperform existing state-of-the-art distillation methods, and iii) surpass conventional methods for training MoEs.
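VDD's actual variational updates cannot be reconstructed from the abstract. As a loose illustration of the per-expert decomposition idea only, here is a plain EM fit of a two-expert Gaussian mixture to samples from a bimodal "teacher"; EM stands in for, and is not, VDD's objective:

```python
import numpy as np

def fit_moe(samples, n_experts=2, iters=50, var=0.25):
    """Fit a 1-D mixture of fixed-variance Gaussian 'experts' to teacher
    samples. Note the M-step: each expert is updated independently from
    its own responsibilities, mirroring the separate per-expert training
    that VDD's decompositional upper bound enables."""
    means = np.array([samples.min(), samples.max()], dtype=float)
    weights = np.full(n_experts, 1.0 / n_experts)
    for _ in range(iters):
        # E-step: responsibility of each expert for each sample.
        logp = -0.5 * (samples[:, None] - means[None, :]) ** 2 / var
        logp += np.log(weights)[None, :]
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: independent per-expert updates.
        means = (r * samples[:, None]).sum(axis=0) / r.sum(axis=0)
        weights = r.mean(axis=0)
    return means, weights

# Teacher: bimodal behavior with modes at -2 and +2 (e.g. two valid ways
# to solve the same manipulation task).
rng = np.random.default_rng(1)
teacher = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
means, weights = fit_moe(teacher)
```

After fitting, each expert covers one behavior mode, and sampling the mixture is a single forward pass, which is the inference-time advantage over iterative diffusion sampling that motivates the distillation.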