Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
Xu, Kechun, Xia, Xunlong, Wang, Kaixuan, Yang, Yifei, Mao, Yunxuan, Deng, Bing, Xiong, Rong, Wang, Yue
–arXiv.org Artificial Intelligence
We study the task of language-conditioned pick and place in clutter, where a robot must grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. The alignment formulation enables our policy to train with less data while preserving zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance of each task, and we introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, generalizing effectively to unseen objects and language instructions. Videos and code are available at the project page.

The ability to pick and place objects is essential for robotic manipulation [1]-[6]. Consider a scenario where a robot is commanded with language instructions to grasp a target object in open clutter and move it to a specified place. The target object may be partially or fully occluded, posing challenges for object grounding and grasping. In such scenarios, multiple pick and place actions may be needed to clear obstacles for object rearrangement. A common way to construct a policy for such tasks is to predict 6-DoF actions directly from raw sensory information, as in classic end-to-end policies. Recently, these policies have achieved promising performance by incorporating features of pre-trained foundation models, e.g., vision-language models (VLMs) and large language models (LLMs) [7]-[12].
However, these policies require large amounts of demonstration data for policy learning, particularly for tasks involving cluttered environments. In addition, generalization issues must be addressed before deploying them in real-world applications.

Kechun Xu is with Zhejiang University, Hangzhou, China, and Alibaba Cloud, Hangzhou, China. Xunlong Xia and Bing Deng are with Alibaba Cloud, Hangzhou, China. Kaixuan Wang, Yifei Yang, Yunxuan Mao, Rong Xiong, and Yue Wang are with Zhejiang University, Hangzhou, China.
Mar-12-2025