PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation
Daqi Huang, Zhehao Cai, Yuzhi Hao, Zechen Li, Chee-Meng Chew
arXiv.org Artificial Intelligence
Figure 1: PRISM is a visual imitation learning algorithm that combines 3D visual representations with diffusion policies, achieving surprising effectiveness in diverse simulation and real-world tasks at a practical inference speed.

Abstract: Robust imitation learning for robot manipulation requires comprehensive 3D perception, yet many existing methods struggle in cluttered environments. Fixed-camera-view approaches are vulnerable to perspective changes, and 3D point cloud techniques often limit themselves to keyframe predictions, reducing their efficacy in dynamic, contact-intensive tasks. To address these challenges, we propose PRISM, an end-to-end framework that learns directly from raw point cloud observations and robot states, eliminating the need for pre-trained models or external datasets. PRISM comprises three main components: a segmentation embedding unit that partitions the raw point cloud into distinct object clusters and encodes local geometric details; a cross-attention component that merges these visual features with processed robot joint states to highlight relevant targets; and a diffusion module that translates the fused representation into smooth robot actions. Code and demos are available at https://github.com/czknuaa/PRISM.

With advancements in robotics, the application scenarios for robotic arms are becoming increasingly diverse. As robotic arms are required to interact with numerous objects in complex and dynamic environments, manipulation has emerged as one of the most crucial aspects of robotic systems [1]-[3].
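To make the cross-attention component named in the abstract more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes a single attention head in which an embedded robot-state vector acts as the query and per-cluster point cloud embeddings act as keys and values; the abstract does not say which side plays which role. All names (`cross_attention`, `state_token`, `cluster_feats`, `w_q`, `w_k`, `w_v`) and all dimensions are hypothetical, and the weights are random rather than learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Row vector (length d_in) times matrix (d_in x d_out) -> list of length d_out."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def cross_attention(state_token, cluster_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative assumption).

    state_token:   embedded robot joint state, length d_state (hypothetical)
    cluster_feats: one embedding per segmented cluster, each of length d_feat (hypothetical)
    Returns the fused vector (length d_model) and one attention weight per cluster.
    """
    q = matvec(state_token, w_q)                       # query from the robot state
    keys = [matvec(f, w_k) for f in cluster_feats]     # one key per cluster
    values = [matvec(f, w_v) for f in cluster_feats]   # one value per cluster
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)                          # larger weight = more relevant cluster
    fused = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(q))]
    return fused, weights


def rand_matrix(rng, rows, cols):
    """Random stand-in for a learned projection matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 7 joint values, 4 clusters, 16-dim cluster features, 8-dim model.
rng = random.Random(0)
d_state, d_feat, d_model, n_clusters = 7, 16, 8, 4
state_token = [rng.gauss(0.0, 1.0) for _ in range(d_state)]
cluster_feats = rand_matrix(rng, n_clusters, d_feat)
w_q = rand_matrix(rng, d_state, d_model)
w_k = rand_matrix(rng, d_feat, d_model)
w_v = rand_matrix(rng, d_feat, d_model)

fused, weights = cross_attention(state_token, cluster_feats, w_q, w_k, w_v)
print(len(fused), len(weights))
```

Under these assumptions, `weights` is a probability distribution over clusters, which is one way to read "highlight relevant targets", and `fused` stands in for the representation that a downstream action generator such as a diffusion policy would be conditioned on.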
Jul-8-2025