 visual demonstration


Visual Adversarial Imitation Learning using Variational Models

Neural Information Processing Systems

Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail


InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Neural Information Processing Systems

The goal of imitation learning is to mimic expert behavior without access to an explicit reward signal. Expert demonstrations provided by humans, however, often show significant variability due to latent factors that are typically not explicitly modeled. In this paper, we propose a new algorithm that can infer the latent structure of expert demonstrations in an unsupervised way. Our method, built on top of Generative Adversarial Imitation Learning, can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations. In the driving domain, we show that a model learned from human demonstrations is able to both accurately reproduce a variety of behaviors and accurately anticipate human actions using raw visual inputs. Compared with various baselines, our method can better capture the latent structure underlying expert demonstrations, often recovering semantically meaningful factors of variation in the data.


Robot Learning from a Physical World Model

Mao, Jiageng, He, Sicheng, Wu, Hao-Ning, You, Yang, Sun, Shuyang, Wang, Zhicheng, Bao, Yanan, Chen, Huizhong, Guibas, Leonidas, Guizilini, Vitor, Zhou, Howard, Wang, Yue

arXiv.org Artificial Intelligence

We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage at https://pointscoder.github.io/PhysWorld_Web/ for details.


Force-Based Robotic Imitation Learning: A Two-Phase Approach for Construction Assembly Tasks

You, Hengxu, Ye, Yang, Zhou, Tianyu, Du, Jing

arXiv.org Artificial Intelligence

Robots have shown enormous potential to relieve human workers of repetitive and dangerous tasks, such as assembly, infrastructure inspection, material handling, and heavy rigging [4-6]. Integrating an artificial intelligence (AI) agent with a physical robotic system could, with competent training, further improve the precision, reliability, and consistency of operations [7, 8]. While AI-enabled robots excel at repetitive, predefined tasks, dexterous and complex tasks such as welding and pipe insertion still pose significant difficulty [9, 10]. Training a robot to perform these dexterous tasks demands delicate manipulation and adaptive force control, which widens the range of possible actions, substantially increases the complexity of the learning process, and results in slow convergence or a failure to converge [11]. To tackle the challenges of learning in high-dimensional action spaces, Imitation Learning (IL) based methods use demonstrations from human experts as a form of instruction, reducing the size of the action space that needs to be explored [12-14]. Generative Adversarial Imitation Learning (GAIL) [15] could further address some key limitations of traditional IL by mitigating distributional shift, thus enabling better exploration and performance in unseen states and better generalization to new tasks [15].


ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

Chen, Letian, Gombolay, Matthew

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to better align robot behavior with user intentions. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback loop through self-reflection to improve feature, reward, and policy learning. Our experimental results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in learning from demonstration (LfD).


Reviews: InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Neural Information Processing Systems

Paper Summary: This paper focuses on using GANs for imitation learning from expert trajectories. The authors extend the GAIL (Generative Adversarial Imitation Learning) framework by including a term in the objective function to incorporate latent structure (similar to InfoGAN). The authors then show that with their framework, which they call InfoGAIL, they are able to learn interpretable latent structure when the expert policy has multiple modes, and that in some settings this robustness allows them to outperform current methods. Paper Overview: The paper is generally well written. I appreciated that the authors first demonstrated how the mechanism works on a toy 2D plane example before moving on to a more complex driving simulation environment. This helped illustrate the core concept of conditioning the learned policy on a latent variable in a minimalistic setting before moving on to a more complex 3D driving simulation.
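The objective structure the review describes (a GAIL adversarial loss plus an InfoGAN-style latent term) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the function names, the weight `lam`, and the convention that the discriminator outputs values near 1 for policy samples are all assumptions made for the example.

```python
import numpy as np

def discriminator_loss(d_policy, d_expert, eps=1e-8):
    """Binary cross-entropy for a GAIL-style discriminator.

    Assumed convention: D(s, a) should be near 1 on policy samples
    and near 0 on expert samples.
    """
    return -(np.mean(np.log(d_policy + eps))
             + np.mean(np.log(1.0 - d_expert + eps)))

def latent_code_term(q_probs, codes, eps=1e-8):
    """InfoGAN-style term: mean log Q(c | s, a) for the latent code c
    that the policy was conditioned on when producing each sample."""
    picked = q_probs[np.arange(len(codes)), codes]
    return np.mean(np.log(picked + eps))

def policy_objective(d_policy, q_probs, codes, lam=0.1, eps=1e-8):
    """Quantity the policy minimizes: look expert-like to the
    discriminator while keeping the latent code recoverable.
    `lam` is a hypothetical weighting hyperparameter."""
    adversarial = np.mean(np.log(d_policy + eps))
    return adversarial - lam * latent_code_term(q_probs, codes, eps)

# Toy usage: two policy samples generated under codes 0 and 1.
d_policy = np.array([0.6, 0.7])
codes = np.array([0, 1])
informative_q = np.array([[0.9, 0.1], [0.2, 0.8]])
uninformative_q = np.array([[0.5, 0.5], [0.5, 0.5]])
print(policy_objective(d_policy, informative_q, codes))
print(policy_objective(d_policy, uninformative_q, codes))
```

In this sketch, a posterior that recovers the code more reliably lowers the policy objective, which is the pressure that would encourage different codes to map to distinguishable behaviors.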


Learning from Visual Demonstrations through Differentiable Nonlinear MPC for Personalized Autonomous Driving

Acerbo, Flavia Sofia, Swevers, Jan, Tuytelaars, Tinne, Son, Tong Duy

arXiv.org Artificial Intelligence

Human-like autonomous driving controllers have the potential to enhance passenger perception of autonomous vehicles. This paper proposes DriViDOC: a model for Driving from Vision through Differentiable Optimal Control, and its application to learn personalized autonomous driving controllers from human demonstrations. DriViDOC combines the automatic inference of relevant features from camera frames with the properties of nonlinear model predictive control (NMPC), such as constraint satisfaction. Our approach leverages the differentiability of parametric NMPC, allowing for end-to-end learning of the driving model from images to control. The model is trained on an offline dataset comprising various driving styles collected on a motion-base driving simulator. During online testing, the model demonstrates successful imitation of different driving styles, and the interpreted NMPC parameters provide insights into the achievement of specific driving behaviors. Our experimental results show that DriViDOC outperforms other methods involving NMPC and neural networks, exhibiting an average improvement of 20% in imitation scores.


Learning from Demonstration Framework for Multi-Robot Systems Using Interaction Keypoints and Soft Actor-Critic Methods

Venkatesh, Vishnunandan L. N., Min, Byung-Cheol

arXiv.org Artificial Intelligence

Learning from Demonstration (LfD) is a promising approach to enable Multi-Robot Systems (MRS) to acquire complex skills and behaviors. However, the intricate interactions and coordination challenges in MRS pose significant hurdles for effective LfD. In this paper, we present a novel LfD framework specifically designed for MRS, which leverages visual demonstrations to capture and learn from robot-robot and robot-object interactions. Our framework introduces the concept of Interaction Keypoints (IKs) to transform the visual demonstrations into a representation that facilitates the inference of various skills necessary for the task. The robots then execute the task using sensorimotor actions and reinforcement learning (RL) policies when required. A key feature of our approach is the ability to handle unseen contact-based skills that emerge during the demonstration. In such cases, RL is employed to learn the skill using a classifier-based reward function, eliminating the need for manual reward engineering and ensuring adaptability to environmental changes. We evaluate our framework across a range of mobile robot tasks, covering both behavior-based and contact-based domains. The results demonstrate the effectiveness of our approach in enabling robots to learn complex multi-robot tasks and behaviors from visual demonstrations.


CasIL: Cognizing and Imitating Skills via a Dual Cognition-Action Architecture

Chen, Zixuan, Ji, Ze, Liu, Shuyang, Huo, Jing, Chen, Yiyu, Gao, Yang

arXiv.org Artificial Intelligence

Enabling robots to effectively imitate expert skills in long-horizon tasks, such as locomotion and manipulation, poses a long-standing challenge. Existing imitation learning (IL) approaches for robots still grapple with sub-optimal performance in complex tasks. In this paper, we consider how this challenge can be addressed using human cognitive priors. Heuristically, we extend the usual notion of action to a dual Cognition (high-level)-Action (low-level) architecture by introducing intuitive human cognitive priors, and propose a novel skill IL framework through human-robot interaction, called Cognition-Action-based Skill Imitation Learning (CasIL), for the robotic agent to effectively cognize and imitate the critical skills from raw visual demonstrations. CasIL enables both cognition and action imitation, while high-level skill cognition explicitly guides low-level primitive actions, providing robustness and reliability to the entire skill IL process. We evaluated our method on MuJoCo and RLBench benchmarks, as well as on the obstacle avoidance and point-goal navigation tasks for quadrupedal robot locomotion. Experimental results show that CasIL consistently achieves competitive and robust skill imitation compared to its counterparts in a variety of long-horizon robotic tasks.