DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

Taeyeop Lee, Gyuree Kang, Bowen Wen, Youngho Kim, Seunghyeok Back, In So Kweon, David Hyunchul Shim, Kuk-Jin Yoon

arXiv.org Artificial Intelligence 

Abstract-- Despite the prevalence of transparent object interactions in everyday human life, research on robotic manipulation of transparent objects remains limited to short-horizon tasks and basic grasping capabilities. Although some methods have partially addressed these issues, most generalize poorly to novel objects and are insufficient for precise long-horizon robot manipulation. To address these limitations, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects guided by natural language task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. Additionally, we present a task planner that refines the VLM-generated plan to account for the constraints of a single-arm, eye-in-hand robot in long-horizon object manipulation tasks. Through comprehensive evaluation, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation.

I. INTRODUCTION

Transparent objects are prevalent across real-world environments, including laboratories, kitchens, and manufacturing facilities. However, conventional depth sensors often fail to perceive these objects accurately.