Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT

Muttaqien, Muhammad A., Motoda, Tomohiro, Hanai, Ryo, Domae, Yukiyasu

arXiv.org Artificial Intelligence 

Embodied AI Research Team, National Institute of AIST, Tokyo, Japan (muha.muttaqien@aist.go.jp)
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan (tomohiro.motoda@aist.go.jp)
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan (ryo.hanai@aist.go.jp)

Abstract: Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.

Robotic pick-and-place tasks are essential in various industrial and retail applications, particularly in convenience stores where robots must handle a diverse range of products with different shapes, sizes, textures, and colors, as shown in Figure 1. However, real-world pick-and-place scenarios pose significant challenges due to dense object arrangements, frequent occlusions, and the need for precise grasping and placement.
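
The annotation-guided visual prompting described in the abstract, where bounding boxes mark the object to pick and the location to place it, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `draw_box` and `annotate_observation`, the `(x0, y0, x1, y1)` pixel box format, and the red/blue color choice are all assumptions made for illustration. The idea is simply that the boxes are drawn onto the camera frame before it is passed to the policy.

```python
import numpy as np


def draw_box(image, box, color, thickness=2):
    """Return a copy of `image` with a rectangular outline drawn on it.

    `box` is (x0, y0, x1, y1) in pixels, with x1/y1 exclusive.
    This box format is an assumption, not taken from the paper.
    """
    out = image.copy()
    h, w = out.shape[:2]
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds.
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    t = thickness
    out[y0:y0 + t, x0:x1] = color            # top edge
    out[max(y0, y1 - t):y1, x0:x1] = color   # bottom edge
    out[y0:y1, x0:x0 + t] = color            # left edge
    out[y0:y1, max(x0, x1 - t):x1] = color   # right edge
    return out


def annotate_observation(image, pick_box, place_box):
    """Overlay a pick box (red) and a place box (blue) on an RGB frame.

    The colors are hypothetical; any two distinguishable colors would do.
    """
    out = draw_box(image, pick_box, (255, 0, 0))
    out = draw_box(out, place_box, (0, 0, 255))
    return out


# Example: a blank 160x120 RGB frame standing in for a camera image.
frame = np.zeros((120, 160, 3), dtype=np.uint8)
prompted = annotate_observation(frame, (20, 30, 60, 80), (100, 40, 150, 100))
```

In a pipeline of this kind, `prompted` rather than `frame` would serve as the image observation for the imitation learning policy, so the spatial guidance reaches the network through the pixels themselves.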
