Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT
Muttaqien, Muhammad A., Motoda, Tomohiro, Hanai, Ryo, Domae, Yukiyasu
arXiv.org Artificial Intelligence
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan, muha.muttaqien@aist.go.jp
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan, tomohiro.motoda@aist.go.jp
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan, ryo.hanai@aist.go.jp

Abstract: Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.

Robotic pick-and-place tasks are essential in various industrial and retail applications, particularly in convenience stores where robots must handle a diverse range of products with different shapes, sizes, textures, and colors, as shown in Figure 1. However, real-world pick-and-place scenarios pose significant challenges due to dense object arrangements, frequent occlusions, and the need for precise grasping and placement.
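To make the idea of annotation-guided visual prompting concrete, the following is a minimal sketch of how bounding box annotations for the pick target and the placement location might be drawn onto a camera frame before it is passed to a policy. It is an illustration only and not the authors' implementation: the function names `draw_box` and `annotate_observation`, the `(x_min, y_min, x_max, y_max)` box format, and the red/green color choice are all assumptions introduced here.

```python
import numpy as np

# Hypothetical colors (RGB); the paper excerpt does not specify any.
PICK_COLOR = (255, 0, 0)    # assumed: red outline for the object to pick
PLACE_COLOR = (0, 255, 0)   # assumed: green outline for the placement location


def draw_box(image, box, color, thickness=2):
    """Return a copy of `image` with a rectangular outline drawn on it.

    `box` is (x_min, y_min, x_max, y_max) in pixels, with the max
    coordinates exclusive. Coordinates are clipped to the image bounds.
    """
    out = image.copy()
    height, width = out.shape[:2]
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    t = thickness
    out[y0:min(y0 + t, y1), x0:x1] = color   # top edge
    out[max(y1 - t, y0):y1, x0:x1] = color   # bottom edge
    out[y0:y1, x0:min(x0 + t, x1)] = color   # left edge
    out[y0:y1, max(x1 - t, x0):x1] = color   # right edge
    return out


def annotate_observation(image, pick_box, place_box):
    """Overlay both annotations so a policy sees them as a visual prompt."""
    prompted = draw_box(image, pick_box, PICK_COLOR)
    return draw_box(prompted, place_box, PLACE_COLOR)


if __name__ == "__main__":
    frame = np.zeros((100, 120, 3), dtype=np.uint8)  # placeholder camera frame
    prompted = annotate_observation(frame, (10, 20, 40, 60), (70, 30, 110, 80))
    print(prompted.shape, prompted.dtype)
```

In a sketch like this, the annotated frame would replace the raw frame as the image input during both demonstration recording and policy rollout, so the spatial guidance is available to the imitation learner in the same form at training and test time.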
Aug-13-2025