AITopics | visual demonstration

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.59)

Neural Information Processing Systems, Nov-21-2025, 14:42:48 GMT

InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

The goal of imitation learning is to mimic expert behavior without access to an explicit reward signal. Expert demonstrations provided by humans, however, often show significant variability due to latent factors that are typically not explicitly modeled. In this paper, we propose a new algorithm that can infer the latent structure of expert demonstrations in an unsupervised way. Our method, built on top of Generative Adversarial Imitation Learning, can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations. In the driving domain, we show that a model learned from human demonstrations is able to both accurately reproduce a variety of behaviors and accurately anticipate human actions using raw visual inputs. Compared with various baselines, our method can better capture the latent structure underlying expert demonstrations, often recovering semantically meaningful factors of variation in the data.

demonstration, interpretable imitation learning, name change, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)

arXiv.org Artificial Intelligence, Nov-11-2025

Robot Learning from a Physical World Model

Mao, Jiageng, He, Sicheng, Wu, Hao-Ning, You, Yang, Sun, Shuyang, Wang, Zhicheng, Bao, Yanan, Chen, Huizhong, Guibas, Leonidas, Guizilini, Vitor, Zhou, Howard, Wang, Yue

We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage (https://pointscoder.github.io/PhysWorld_Web/) for details.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2511.07416

Country: Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)

arXiv.org Artificial Intelligence, Jan-24-2025

Force-Based Robotic Imitation Learning: A Two-Phase Approach for Construction Assembly Tasks

You, Hengxu, Ye, Yang, Zhou, Tianyu, Du, Jing

Robots have shown enormous potential to relieve human workers of repetitive and dangerous tasks, such as assembly, infrastructure inspection, material handling, and heavy rigging [4-6]. Integrating an artificial intelligence (AI) agent with a physical robotic system could, with competent training, further improve the precision, reliability, and consistency of operations [7, 8]. While AI-enabled robots excel at repetitive, predefined tasks, dexterous and complex tasks such as welding and pipe insertion still pose significant difficulty [9, 10]. Training a robot to perform these dexterous tasks demands delicate manipulation and adaptive force control, which introduces a diverse set of potential actions, substantially increasing the complexity of the learning process and resulting in slow convergence or a lack of convergence [11]. To tackle the challenges of learning in high-dimensional action spaces, Imitation Learning (IL) based methods use demonstrations from human experts as a form of instruction, reducing the size of the action space that needs to be explored [12-14]. Generative Adversarial Imitation Learning (GAIL) [15] could further address some key limitations of traditional IL by mitigating distributional shift, thus enabling better exploration and performance in unseen states and better generalization to new tasks [15].

demonstration, machine learning, reinforcement learning, (14 more...)

2501.14942

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry:

Education (1.00)
Construction & Engineering (0.94)
Energy > Oil & Gas > Upstream (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.87)

Chen, Letian, Gombolay, Matthew

ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

arXiv.org Artificial Intelligence, Dec-5-2024

Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to better align robot behavior with user intentions. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback loop through self-reflection to improve feature, reward, and policy learning. Our experimental results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in learning from demonstration (LfD).
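The idea of using IRL to balance feature weights against demonstrations can be illustrated with a minimal sketch. It assumes a reward that is linear in hand-picked features and uses the classic feature-expectation matching update from linear-reward IRL; it is not ELEMENTAL's actual implementation, and the function names, discount factor, and learning rate are all hypothetical.

```python
import numpy as np

def feature_expectation(trajectories, gamma=0.99):
    """Average discounted feature sum over trajectories.

    Each trajectory is an array of shape (T, n_features).
    """
    totals = []
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)
        discounts = gamma ** np.arange(len(traj))
        totals.append((discounts[:, None] * traj).sum(axis=0))
    return np.mean(totals, axis=0)

def irl_weight_step(weights, demo_trajs, policy_trajs, lr=0.1):
    """One update: raise weights on features the demos show more of
    than the current policy, lower them on features shown less."""
    grad = feature_expectation(demo_trajs) - feature_expectation(policy_trajs)
    return np.asarray(weights, dtype=float) + lr * grad

def reward(weights, features):
    """Linear reward: weighted sum of state features (an assumption)."""
    return float(np.dot(weights, features))

# Toy usage: demos exhibit feature 0, the current policy exhibits feature 1.
demo = [np.array([[1.0, 0.0], [1.0, 0.0]])]
policy = [np.array([[0.0, 1.0], [0.0, 1.0]])]
w = irl_weight_step(np.zeros(2), demo, policy)
```

In a full loop, the policy would be re-optimized under the updated reward and the step repeated until the two feature expectations roughly agree.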

demonstration, elemental, reward function, (14 more...)

2411.18825

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry: Education > Educational Setting > Online (0.70)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing Systems, Oct-9-2024, 15:40:58 GMT

Visual Adversarial Imitation Learning using Variational Models

variational model, visual adversarial imitation learning, visual demonstration, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.62)

Neural Information Processing Systems, Oct-7-2024, 16:30:10 GMT

Reviews: InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Paper Summary: This paper focuses on using GANs for imitation learning from expert trajectories. The authors extend the GAIL (Generative Adversarial Imitation Learning) framework by including a term in the objective function to incorporate latent structure (similar to InfoGAN). The authors then show that with their framework, which they call InfoGAIL, they are able to learn interpretable latent structure when the expert policy has multiple modes, and that in some settings this robustness allows them to outperform current methods.

Paper Overview: The paper is generally well written. I appreciated that the authors first demonstrated how the mechanism works on a toy 2D plane example before moving on to a more complex driving simulation environment. This helped illustrate the core concept of conditioning the learned policy on a latent variable in a minimalistic setting before moving on to a more complex 3D driving simulation.
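For orientation, the extra objective term described in the summary can be sketched as follows. This is written from the general form of the GAIL objective plus an InfoGAN-style variational lower bound on mutual information; the notation (policy \pi, expert policy \pi_E, discriminator D, approximate posterior Q, latent code c, trajectory \tau, weights \lambda_1 and \lambda_2) is assumed here and does not come from the review text.

```latex
\min_{\pi, Q} \max_{D} \;
  \mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
  + \mathbb{E}_{\pi_E}\!\left[\log\bigl(1 - D(s,a)\bigr)\right]
  - \lambda_1 L_I(\pi, Q)
  - \lambda_2 H(\pi),
\qquad
L_I(\pi, Q)
  = \mathbb{E}_{c \sim p(c),\, a \sim \pi(\cdot \mid s, c)}\!\left[\log Q(c \mid \tau)\right] + H(c)
  \;\le\; I(c; \tau).
```

Maximizing L_I encourages trajectories generated under different latent codes c to be distinguishable, which is what lets the policy separate the modes of a multi-modal expert.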

infogail, latent variable, reward augmentation, (12 more...)

Technology:

Information Technology > Artificial Intelligence > Robots (0.83)
Information Technology > Artificial Intelligence > Machine Learning (0.83)

Acerbo, Flavia Sofia, Swevers, Jan, Tuytelaars, Tinne, Son, Tong Duy

Learning from Visual Demonstrations through Differentiable Nonlinear MPC for Personalized Autonomous Driving

arXiv.org Artificial Intelligence, Jul-8-2024

Human-like autonomous driving controllers have the potential to enhance passenger perception of autonomous vehicles. This paper proposes DriViDOC: a model for Driving from Vision through Differentiable Optimal Control, and its application to learn personalized autonomous driving controllers from human demonstrations. DriViDOC combines the automatic inference of relevant features from camera frames with the properties of nonlinear model predictive control (NMPC), such as constraint satisfaction. Our approach leverages the differentiability of parametric NMPC, allowing for end-to-end learning of the driving model from images to control. The model is trained on an offline dataset comprising various driving styles collected on a motion-base driving simulator. During online testing, the model demonstrates successful imitation of different driving styles, and the interpreted NMPC parameters provide insights into the achievement of specific driving behaviors. Our experimental results show that DriViDOC outperforms other methods involving NMPC and neural networks, exhibiting an average improvement of 20% in imitation scores.

artificial intelligence, drividoc, machine learning, (17 more...)

2403.15102

Country: Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.70)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Venkatesh, Vishnunandan L. N., Min, Byung-Cheol

Learning from Demonstration Framework for Multi-Robot Systems Using Interaction Keypoints and Soft Actor-Critic Methods

arXiv.org Artificial Intelligence, Apr-2-2024

Learning from Demonstration (LfD) is a promising approach to enable Multi-Robot Systems (MRS) to acquire complex skills and behaviors. However, the intricate interactions and coordination challenges in MRS pose significant hurdles for effective LfD. In this paper, we present a novel LfD framework specifically designed for MRS, which leverages visual demonstrations to capture and learn from robot-robot and robot-object interactions. Our framework introduces the concept of Interaction Keypoints (IKs) to transform the visual demonstrations into a representation that facilitates the inference of various skills necessary for the task. The robots then execute the task using sensorimotor actions and reinforcement learning (RL) policies when required. A key feature of our approach is the ability to handle unseen contact-based skills that emerge during the demonstration. In such cases, RL is employed to learn the skill using a classifier-based reward function, eliminating the need for manual reward engineering and ensuring adaptability to environmental changes. We evaluate our framework across a range of mobile robot tasks, covering both behavior-based and contact-based domains. The results demonstrate the effectiveness of our approach in enabling robots to learn complex multi-robot tasks and behaviors from visual demonstrations.
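A classifier-based reward of the kind mentioned above can be sketched in a few lines. The example assumes a logistic-regression success classifier over low-dimensional state features, with the predicted success probability used directly as the RL reward; the abstract does not specify the classifier or its inputs, so every name and hyperparameter here is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_success_classifier(states, labels, lr=0.5, epochs=500):
    """Fit logistic regression by gradient descent.

    labels: 1 for states where the skill succeeded, 0 otherwise.
    """
    X = np.asarray(states, dtype=float)
    X = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)  # gradient of the mean log loss
    return w

def classifier_reward(w, state):
    """Reward = classifier's estimated probability of success."""
    x = np.append(np.asarray(state, dtype=float), 1.0)
    return float(sigmoid(x @ w))

# Toy usage: a 1-D feature (e.g. normalized progress) labeled by success.
states = [[0.0], [0.1], [0.9], [1.0]]
labels = [0, 0, 1, 1]
w = train_success_classifier(states, labels)
r_far = classifier_reward(w, [0.0])
r_near = classifier_reward(w, [1.0])
```

Because the reward comes from labeled examples rather than a hand-written formula, it can be retrained when the environment changes, which is the adaptability the abstract points to.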

demonstration, interaction keypoint, robot, (9 more...)

2404.02324

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Robots > Robots in the Workplace (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

arXiv.org Artificial Intelligence, Sep-28-2023

CasIL: Cognizing and Imitating Skills via a Dual Cognition-Action Architecture

Chen, Zixuan, Ji, Ze, Liu, Shuyang, Huo, Jing, Chen, Yiyu, Gao, Yang

Enabling robots to effectively imitate expert skills in long-horizon tasks such as locomotion, manipulation, and more, poses a long-standing challenge. Existing imitation learning (IL) approaches for robots still grapple with sub-optimal performance in complex tasks. In this paper, we consider how this challenge can be addressed by drawing on human cognitive priors. Heuristically, we extend the usual notion of action to a dual Cognition (high-level)-Action (low-level) architecture by introducing intuitive human cognitive priors, and propose a novel skill IL framework through human-robot interaction, called Cognition-Action-based Skill Imitation Learning (CasIL), for the robotic agent to effectively cognize and imitate the critical skills from raw visual demonstrations. CasIL enables both cognition and action imitation, while high-level skill cognition explicitly guides low-level primitive actions, providing robustness and reliability to the entire skill IL process. We evaluated our method on MuJoCo and RLBench benchmarks, as well as on the obstacle avoidance and point-goal navigation tasks for quadrupedal robot locomotion. Experimental results show that CasIL consistently achieves competitive and robust skill imitation capability compared to its counterparts in a variety of long-horizon robotic tasks.

artificial intelligence, casil, machine learning, (18 more...)

2309.16299

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Europe > United Kingdom > Wales > Cardiff (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.89)