cliport
Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning
Abuelsamen, Luai, Adebanjo, Temitope Lukman
This paper examines the theoretical foundations of multimodal imitation learning through the lens of statistical learning theory. We analyze how multimodal perception (RGB-D, proprioception, language) affects sample complexity and optimization landscapes in imitation policies. Building on recent advances in multimodal learning theory, we show that properly integrated multimodal policies can achieve tighter generalization bounds and more favorable optimization landscapes than their unimodal counterparts. We provide a comprehensive review of theoretical frameworks that explain why multimodal architectures like PerAct and CLIPort achieve superior performance, connecting these empirical results to fundamental concepts in Rademacher complexity, PAC learning, and information theory.
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback
Meng, Yuan, Yao, Xiangtong, Ye, Haihui, Zhou, Yirui, Zhang, Shengqiang, Bing, Zhenshan, Knoll, Alois
Our framework demonstrates state-of-the-art performance across diverse long-horizon tasks, achieving strong generalization in both simulated and real-world scenarios. Videos and code are available at https://ghiara.github.io/DAHLIA/. I. INTRODUCTION Language-conditioned robotic manipulation is an emerging field at the intersection of robotics, natural language processing, and computer vision, which aims to enable robots to interpret human commands and perform complex tasks using multi-modal sensing [1]. Imitation learning (IL) and reinforcement learning (RL) have traditionally been the dominant approaches for training robotic manipulation policies. However, recent IL and RL methods are often constrained to narrow task distributions, leading to sampling inefficiency and high sensitivity to distributional shifts, which limits their ability to generalize to diverse and complex scenarios. Additionally, both IL and RL are data-driven, requiring large-scale expert demonstrations, yet Internet-scale data collection for embodied AI remains a substantial challenge. In contrast, the natural language processing domain has seen state-of-the-art (SOT A) LLMs like GPT [2] and Llama [3] achieve humanlike semantic understanding and common sense reasoning by training on massive datasets. Within embodied AI, LLMs offer a promising solution to bridge the gap between high-level language instructions and low-level robotic control, 1 Y uan Meng, Xiangtong Y ao, Haihui Y e, Yirui Zhou, and Alois Knoll are with the School of Computation, Information and Technology, Technical University of Munich, Germany. 2 Shengqiang Zhang is with the Center for Information and Language Processing, Ludwig Maximilian University of Munich, Germany. 3 Zhenshan Bing is with the State Key Laboratory for Novel Software Technology, Nanjing University, China.
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.44)
- Asia > China > Jiangsu Province > Nanjing (0.24)
Uncertainty-Aware Deployment of Pre-trained Language-Conditioned Imitation Learning Policies
Wu, Bo, Lee, Bruce D., Daniilidis, Kostas, Bucher, Bernadette, Matni, Nikolai
Large-scale robotic policies trained on data from diverse tasks and robotic platforms hold great promise for enabling general-purpose robots; however, reliable generalization to new environment conditions remains a major challenge. Toward addressing this challenge, we propose a novel approach for uncertainty-aware deployment of pre-trained language-conditioned imitation learning agents. Specifically, we use temperature scaling to calibrate these models and exploit the calibrated model to make uncertainty-aware decisions by aggregating the local information of candidate actions. We implement our approach in simulation using three such pre-trained models, and showcase its potential to significantly enhance task completion rates. The accompanying code is accessible at the link: https://github.com/BobWu1998/uncertainty_quant_all.git
- North America > United States > Pennsylvania (0.04)
- North America > United States > Michigan (0.04)
- Asia > Middle East > Jordan (0.04)
GenSim: Generating Robotic Simulation Tasks via Large Language Models
Wang, Lirui, Ling, Yiyang, Yuan, Zhecheng, Shridhar, Mohit, Bao, Chen, Qin, Yuzhe, Wang, Bailin, Xu, Huazhe, Wang, Xiaolong
Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language models' (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks. Furthermore, we observe that LLMs-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
ROSO: Improving Robotic Policy Inference via Synthetic Observations
Miyashita, Yusuke, Gahtidis, Dimitris, La, Colin, Rabinowicz, Jeremy, Leitner, Jurgen
In this paper, we propose the use of generative artificial intelligence (AI) to improve zero-shot performance of a pre-trained policy by altering observations during inference. Modern robotic systems, powered by advanced neural networks, have demonstrated remarkable capabilities on pre-trained tasks. However, generalizing and adapting to new objects and environments is challenging, and fine-tuning visuomotor policies is time-consuming. To overcome these issues we propose Robotic Policy Inference via Synthetic Observations (ROSO). ROSO uses stable diffusion to pre-process a robot's observation of novel objects during inference time to fit within its distribution of observations of the pre-trained policies. This novel paradigm allows us to transfer learned knowledge from known tasks to previously unseen scenarios, enhancing the robot's adaptability without requiring lengthy fine-tuning. Our experiments show that incorporating generative AI into robotic inference significantly improves successful outcomes, finishing up to 57% of tasks otherwise unsuccessful with the pre-trained policy.
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
Gkanatsios, Nikolaos, Jain, Ayush, Xian, Zhou, Zhang, Yunchu, Atkeson, Christopher, Fragkiadaki, Katerina
Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets are publicly available on our website: https://ebmplanner.github.io.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Austria > Vienna (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Programmatically Grounded, Compositionally Generalizable Robotic Manipulation
Wang, Renhao, Mao, Jiayuan, Hsu, Joy, Zhao, Hang, Wu, Jiajun, Gao, Yang
Robots operating in the real world require both rich manipulation skills as well as the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose ProgramPort, a modular approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters to general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage at: \url{https://progport.github.io}.
Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning
Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but typically cannot handle complex instructions involving spatial relations among multiple objects. which require both reasoning object-level spatial relations and learning precise pixel-level manipulation affordances. We take an initial step to this challenge with a decoupled two-stage solution. In the first stage, we propose an object-centric semantic-spatial reasoner to select which objects are relevant for the language instructed task. The segmentation of selected objects are then fused as additional input to the affordance learning stage. Simply incorporating the inductive bias of relevant objects to a vision-language affordance learning agent can effectively boost its performance in a custom testbed designed for object manipulation with spatial-related language instructions.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Greater London > London (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- (9 more...)
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement
Zhao, Zirui, Lee, Wee Sun, Hsu, David
We present a new method, PARsing And visual GrOuNding (ParaGon), for grounding natural language in object placement tasks. Natural language generally describes objects and spatial relations with compositionality and ambiguity, two major obstacles to effective language grounding. For compositionality, ParaGon parses a language instruction into an object-centric graph representation to ground objects individually. For ambiguity, ParaGon uses a novel particle-based graph neural network to reason about object placements with uncertainty. Essentially, ParaGon integrates a parsing algorithm into a probabilistic, data-driven learning framework. It is fully differentiable and trained end-to-end from data for robustness against complex, ambiguous language input.
OpenD: A Benchmark for Language-Driven Door and Drawer Opening
Zhao, Yizhou, Gao, Qiaozi, Qiu, Liang, Thattai, Govind, Sukhatme, Gaurav S.
We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instruction. To solve the task, we propose a multi-step planner composed of a deep neural network and rule-base controllers. The network is utilized to capture spatial relationships from images and understand semantic meaning from language instructions. Controllers efficiently execute the plan based on the spatial and semantic understanding. We evaluate our system by measuring its zero-shot performance in test data set. Experimental results demonstrate the effectiveness of decision planning by our multi-step planner for different hands, while suggesting that there is significant room for developing better models to address the challenge brought by language understanding, spatial reasoning, and long-term manipulation. We will release OPEND and host challenges to promote future research in this area.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > Mexico > Querétaro > Santiago de Querétaro (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)