CLIPort


Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning

Abuelsamen, Luai, Adebanjo, Temitope Lukman

arXiv.org Artificial Intelligence

This paper examines the theoretical foundations of multimodal imitation learning through the lens of statistical learning theory. We analyze how multimodal perception (RGB-D, proprioception, language) affects sample complexity and optimization landscapes in imitation policies. Building on recent advances in multimodal learning theory, we show that properly integrated multimodal policies can achieve tighter generalization bounds and more favorable optimization landscapes than their unimodal counterparts. We provide a comprehensive review of theoretical frameworks that explain why multimodal architectures like PerAct and CLIPort achieve superior performance, connecting these empirical results to fundamental concepts in Rademacher complexity, PAC learning, and information theory.
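As background for the bounds the abstract alludes to, the standard Rademacher-complexity generalization guarantee from statistical learning theory can be stated as follows. This is the textbook form for a loss bounded in [0, 1], not the paper's specific multimodal result:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% every hypothesis h in the class \mathcal{H} satisfies
R(h) \;\le\; \widehat{R}_S(h) \;+\; 2\,\mathfrak{R}_n(\mathcal{H}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}
% where R(h) is the expected risk, \widehat{R}_S(h) the empirical risk on S,
% and \mathfrak{R}_n(\mathcal{H}) the Rademacher complexity of the class \mathcal{H}.
```

Read this way, the abstract's claim is that a properly integrated multimodal policy class can shrink the complexity term without inflating the empirical risk, which is what tightens the bound.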


Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback

Meng, Yuan, Yao, Xiangtong, Ye, Haihui, Zhou, Yirui, Zhang, Shengqiang, Bing, Zhenshan, Knoll, Alois

arXiv.org Artificial Intelligence

Our framework demonstrates state-of-the-art performance across diverse long-horizon tasks, achieving strong generalization in both simulated and real-world scenarios. Videos and code are available at https://ghiara.github.io/DAHLIA/.

I. INTRODUCTION

Language-conditioned robotic manipulation is an emerging field at the intersection of robotics, natural language processing, and computer vision, which aims to enable robots to interpret human commands and perform complex tasks using multi-modal sensing [1]. Imitation learning (IL) and reinforcement learning (RL) have traditionally been the dominant approaches for training robotic manipulation policies. However, recent IL and RL methods are often constrained to narrow task distributions, leading to sampling inefficiency and high sensitivity to distributional shifts, which limits their ability to generalize to diverse and complex scenarios. Additionally, both IL and RL are data-driven, requiring large-scale expert demonstrations, yet Internet-scale data collection for embodied AI remains a substantial challenge. In contrast, the natural language processing domain has seen state-of-the-art (SOTA) LLMs like GPT [2] and Llama [3] achieve human-like semantic understanding and common-sense reasoning by training on massive datasets. Within embodied AI, LLMs offer a promising solution to bridge the gap between high-level language instructions and low-level robotic control.

(Author affiliations: Yuan Meng, Xiangtong Yao, Haihui Ye, Yirui Zhou, and Alois Knoll are with the School of Computation, Information and Technology, Technical University of Munich, Germany; Shengqiang Zhang is with the Center for Information and Language Processing, Ludwig Maximilian University of Munich, Germany; Zhenshan Bing is with the State Key Laboratory for Novel Software Technology, Nanjing University, China.)


Uncertainty-Aware Deployment of Pre-trained Language-Conditioned Imitation Learning Policies

Wu, Bo, Lee, Bruce D., Daniilidis, Kostas, Bucher, Bernadette, Matni, Nikolai

arXiv.org Artificial Intelligence

Large-scale robotic policies trained on data from diverse tasks and robotic platforms hold great promise for enabling general-purpose robots; however, reliable generalization to new environment conditions remains a major challenge. Toward addressing this challenge, we propose a novel approach for uncertainty-aware deployment of pre-trained language-conditioned imitation learning agents. Specifically, we use temperature scaling to calibrate these models and exploit the calibrated model to make uncertainty-aware decisions by aggregating the local information of candidate actions. We implement our approach in simulation using three such pre-trained models, and showcase its potential to significantly enhance task completion rates. The accompanying code is accessible at the link: https://github.com/BobWu1998/uncertainty_quant_all.git
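Temperature scaling itself is a simple post-hoc calibration method: a single scalar T is fit on held-out data to rescale the policy's logits before the softmax. The sketch below shows the generic technique in PyTorch; the logit/label interface is a placeholder, not the actual interface of the pre-trained policies used in the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=200, lr=0.01):
    """Fit a single temperature T on a held-out calibration set by minimizing NLL.

    logits: (N, num_actions) raw policy logits on the calibration set
    labels: (N,) indices of the expert / ground-truth actions
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

def calibrated_probs(logits, temperature):
    """Apply the fitted temperature before the softmax at deployment time."""
    return F.softmax(logits / temperature, dim=-1)
```

The paper's contribution lies in how the calibrated probabilities are then used, aggregating local information over candidate actions to make uncertainty-aware decisions; that aggregation step is not reproduced here.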


GenSim: Generating Robotic Simulation Tasks via Large Language Models

Wang, Lirui, Ling, Yiyang, Yuan, Zhecheng, Shridhar, Mohit, Bao, Chen, Qin, Yuzhe, Wang, Bailin, Xu, Huazhe, Wang, Xiaolong

arXiv.org Artificial Intelligence

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding abilities. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT-4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT-4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
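To make the exploratory mode concrete, the following is a minimal, hypothetical sketch of the loop it describes: an LLM is repeatedly prompted with previously accepted task code and asked to propose a new task, which is kept only if it verifies in simulation. The `propose_task_with_llm` and `verify_in_sim` helpers are illustrative placeholders, not GenSim's actual code or prompts.

```python
def exploratory_generation(seed_tasks, propose_task_with_llm, verify_in_sim, rounds=50):
    """Hypothetical sketch of GenSim-style exploratory task generation.

    seed_tasks: list of (task_name, task_code) pairs already known to work
    propose_task_with_llm: callable(examples) -> (task_name, task_code),
                           e.g. an LLM call with the examples as few-shot context
    verify_in_sim: callable(task_code) -> bool, runs the generated task and its
                   scripted oracle in simulation and checks that demos succeed
    """
    task_pool = list(seed_tasks)
    for _ in range(rounds):
        # Bootstrap from previously accepted tasks (few-shot examples in the prompt).
        name, code = propose_task_with_llm(task_pool)
        # Only keep tasks whose generated code runs and whose demonstrations succeed.
        if verify_in_sim(code):
            task_pool.append((name, code))
    return task_pool
```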


ROSO: Improving Robotic Policy Inference via Synthetic Observations

Miyashita, Yusuke, Gahtidis, Dimitris, La, Colin, Rabinowicz, Jeremy, Leitner, Jurgen

arXiv.org Artificial Intelligence

In this paper, we propose the use of generative artificial intelligence (AI) to improve the zero-shot performance of a pre-trained policy by altering observations during inference. Modern robotic systems, powered by advanced neural networks, have demonstrated remarkable capabilities on pre-trained tasks. However, generalizing and adapting to new objects and environments is challenging, and fine-tuning visuomotor policies is time-consuming. To overcome these issues we propose Robotic Policy Inference via Synthetic Observations (ROSO). ROSO uses Stable Diffusion to pre-process a robot's observations of novel objects during inference time so that they fit within the observation distribution of the pre-trained policy. This novel paradigm allows us to transfer learned knowledge from known tasks to previously unseen scenarios, enhancing the robot's adaptability without requiring lengthy fine-tuning. Our experiments show that incorporating generative AI into robotic inference significantly improves successful outcomes, completing up to 57% of tasks that are otherwise unsuccessful with the pre-trained policy.
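As a rough illustration of the idea (not ROSO's actual pipeline), an observation containing a novel object can be edited at inference time with an off-the-shelf inpainting model so that the edited image falls back inside the training distribution of the frozen policy. The sketch below uses the Hugging Face diffusers inpainting pipeline; the checkpoint, prompt, and mask source are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Illustrative checkpoint; ROSO's exact model and prompting are not reproduced here.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def synthesize_observation(rgb: Image.Image, novel_object_mask: Image.Image,
                           prompt: str = "a red block on a table") -> Image.Image:
    """Replace the masked novel object with an in-distribution object, then
    hand the edited observation to the frozen pre-trained policy."""
    edited = pipe(prompt=prompt, image=rgb, mask_image=novel_object_mask).images[0]
    return edited
```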


Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

Gkanatsios, Nikolaos, Jain, Ayush, Xian, Zhou, Zhang, Yunchu, Atkeson, Christopher, Fragkiadaki, Katerina

arXiv.org Artificial Intelligence

Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
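The core computational step described here, generating a goal configuration by gradient descent on a sum of per-predicate energies, can be sketched in a few lines. The toy energies below (e.g., a "left of" constraint on 2-D object centers) are illustrative stand-ins, not the learned energy functions from the paper.

```python
import torch

def left_of(xy_a, xy_b, margin=0.05):
    # Toy energy: low when object a is at least `margin` to the left of object b.
    return torch.relu(xy_a[0] - xy_b[0] + margin) ** 2

def near(xy_a, xy_b, dist=0.10):
    # Toy energy: low when objects a and b are roughly `dist` apart.
    return (torch.norm(xy_a - xy_b) - dist) ** 2

def infer_goal_poses(init_xy, constraints, steps=500, lr=0.01):
    """Minimize the summed energy of all grounded predicates over object positions.

    init_xy: (num_objects, 2) tensor of current object centers
    constraints: list of (energy_fn, idx_a, idx_b) produced by the language parser
    """
    xy = init_xy.clone().requires_grad_(True)
    opt = torch.optim.Adam([xy], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = sum(fn(xy[i], xy[j]) for fn, i, j in constraints)
        energy.backward()
        opt.step()
    return xy.detach()  # goal locations handed to local vision-based policies
```

For an instruction like "put the mug left of the plate, near the bowl", the parser would (in this toy setup) emit constraints = [(left_of, 0, 1), (near, 0, 2)], and the optimizer descends their sum.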


Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Wang, Renhao, Mao, Jiayuan, Hsu, Joy, Zhao, Hang, Wu, Jiajun, Gao, Yang

arXiv.org Artificial Intelligence

Robots operating in the real world require both rich manipulation skills and the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose ProgramPort, a modular approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters to general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage at: https://progport.github.io.
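The modular execution the abstract describes can be pictured as interpreting a small program whose leaves are grounded by vision and whose output parameterizes a manipulation primitive. The sketch below is a schematic interpreter with hypothetical module names and program structure, not ProgramPort's actual parser or module set.

```python
from typing import Any, Callable, Dict

# Hypothetical functional modules; in ProgramPort each combines deterministic
# computation with learnable networks grounded on vision and action.
Modules = Dict[str, Callable[..., Any]]

def execute(program: dict, modules: Modules, observation) -> Any:
    """Recursively evaluate a parsed program such as
    {"op": "pick_place",
     "args": [{"op": "filter_color", "args": ["red", {"op": "scene", "args": []}]},
              {"op": "filter_name",  "args": ["bowl", {"op": "scene", "args": []}]}]}
    The final value parameterizes a general pick-and-place primitive."""
    args = [execute(a, modules, observation) if isinstance(a, dict) else a
            for a in program["args"]]
    return modules[program["op"]](observation, *args)
```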


Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Luo, Qian, Li, Yunfei, Wu, Yi

arXiv.org Artificial Intelligence

Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but typically cannot handle complex instructions involving spatial relations among multiple objects, which require both reasoning about object-level spatial relations and learning precise pixel-level manipulation affordances. We take an initial step toward this challenge with a decoupled two-stage solution. In the first stage, we propose an object-centric semantic-spatial reasoner to select which objects are relevant for the language-instructed task. The segmentations of the selected objects are then fused as an additional input to the affordance learning stage. Simply incorporating this inductive bias about relevant objects into a vision-language affordance learning agent can effectively boost its performance in a custom testbed designed for object manipulation with spatially related language instructions.
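A minimal way to picture the "fused as an additional input" step is to concatenate the first-stage relevance segmentation as an extra channel of the visual input to the affordance network. The sketch below assumes an RGB observation and a binary relevance mask; the network itself is a placeholder, not the paper's affordance model.

```python
import torch
import torch.nn as nn

class AffordanceWithRelevance(nn.Module):
    """Toy affordance head that consumes RGB plus a first-stage relevance mask."""
    def __init__(self):
        super().__init__()
        # 3 RGB channels + 1 channel for the fused segmentation of relevant objects.
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # per-pixel affordance logits (e.g., pick location)
        )

    def forward(self, rgb, relevance_mask):
        x = torch.cat([rgb, relevance_mask], dim=1)  # (B, 4, H, W)
        return self.net(x)
```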


Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement

Zhao, Zirui, Lee, Wee Sun, Hsu, David

arXiv.org Artificial Intelligence

We present a new method, PARsing And visual GrOuNding (ParaGon), for grounding natural language in object placement tasks. Natural language generally describes objects and spatial relations with compositionality and ambiguity, two major obstacles to effective language grounding. For compositionality, ParaGon parses a language instruction into an object-centric graph representation to ground objects individually. For ambiguity, ParaGon uses a novel particle-based graph neural network to reason about object placements with uncertainty. Essentially, ParaGon integrates a parsing algorithm into a probabilistic, data-driven learning framework. It is fully differentiable and trained end-to-end from data for robustness against complex, ambiguous language input.
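The "particle-based" reasoning can be read as maintaining a set of candidate placement samples whose positions are re-weighted and refined under the parsed object-centric graph, so that ambiguous instructions yield multi-modal placement distributions. The following is a heavily simplified, hypothetical sketch of that idea, not ParaGon's graph neural network.

```python
import torch

def particle_placement(score_fn, graph, num_particles=256, iters=10, noise=0.02):
    """Hypothetical particle refinement for an object placement under uncertainty.

    score_fn: callable(graph, particles) -> (K,) unnormalized log-scores, e.g. a
              graph network conditioned on the parsed instruction graph
    graph: object-centric representation produced by the language parser
    """
    particles = torch.rand(num_particles, 2)  # candidate (x, y) placements in [0, 1]^2
    for _ in range(iters):
        weights = torch.softmax(score_fn(graph, particles), dim=0)
        idx = torch.multinomial(weights, num_particles, replacement=True)  # resample
        particles = particles[idx] + noise * torch.randn_like(particles)   # perturb
    return particles  # samples approximating the placement distribution
```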


OpenD: A Benchmark for Language-Driven Door and Drawer Opening

Zhao, Yizhou, Gao, Qiaozi, Qiu, Liang, Thattai, Govind, Sukhatme, Gaurav S.

arXiv.org Artificial Intelligence

We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instructions. To solve the task, we propose a multi-step planner composed of a deep neural network and rule-based controllers. The network is utilized to capture spatial relationships from images and understand semantic meaning from language instructions. The controllers efficiently execute the plan based on this spatial and semantic understanding. We evaluate our system by measuring its zero-shot performance on the test set. Experimental results demonstrate the effectiveness of decision planning by our multi-step planner for different hands, while suggesting that there is significant room for developing better models to address the challenges brought by language understanding, spatial reasoning, and long-term manipulation. We will release OPEND and host challenges to promote future research in this area.
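The division of labor described here, a network for perception and language grounding plus rule-based controllers for execution, can be summarized as a short plan-and-execute loop. All names below are hypothetical placeholders rather than the OPEND implementation.

```python
def open_with_plan(perceive_net, controllers, instruction, get_image, max_steps=4):
    """Hypothetical multi-step plan-and-execute loop in the spirit of OPEND.

    perceive_net: callable(image, instruction) -> (step_name, params), e.g. which
                  handle to grasp and where, inferred from pixels and language
    controllers: dict mapping step names ("reach", "grasp", "pull", ...) to
                 rule-based controllers that execute the step given its parameters
    """
    for _ in range(max_steps):
        step_name, params = perceive_net(get_image(), instruction)
        if step_name == "done":
            return True
        controllers[step_name](params)  # execute one step of the plan
    return False
```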